Identify problematic string operations for Unicode-readiness

Last updated on June 21, 2024

Find guidance and resources to identify and resolve potentially problematic string operations for Unicode-readiness.

Investigate system information processing 

When preparing a system to support Unicode characters, it’s important to know how information is processed. In particular, you should review how the system handles text strings.

In computer programming, text strings are used to communicate information from a computer program to the person using the program. Examples of string operations include:

  • Searching for a client using their name to find their account records
  • Storing and sorting text records

These operations become more complex with Unicode data. It's important that applications are tested with data in Indigenous languages to ensure the outcome is consistent and applications continue to work as expected.

Sorting and searching data

When sorting text string data, collation rules determine the resulting sort order for the character set the data belongs to. There are generally two types of collation rules:

  • Binary collation: The sort order is determined by the numeric order of the character encoding (the way in which characters are stored as binary numbers)
  • Language-aware collation: The sort order follows the conventions of a language or region rather than the numeric encoding, which makes it possible to place ‘é’ in a more logical position in the ordering. For example, Excel follows “European ordering rules”, ensuring that é (with the accent) and e (no accent) appear in sequence
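The difference between the two collation types can be sketched in Python. This is a minimal illustration, not a full collation implementation: the `accent_insensitive_key` helper is a hypothetical name, and it only approximates language-aware ordering by ignoring accents. A production system would more likely rely on the collation support of its database or a dedicated collation library.

```python
import unicodedata

def accent_insensitive_key(text):
    """Hypothetical sort key: compare base letters first, original text as tie-breaker."""
    decomposed = unicodedata.normalize("NFD", text)
    # Drop combining marks (such as accents) so 'é' sorts alongside 'e'.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), text)

names = ["zebra", "\u00e9tude", "eagle"]  # "\u00e9" is 'é'

# Binary-style ordering: Python compares code point values, so 'é' sorts after 'z'.
binary_order = sorted(names)

# Approximate language-aware ordering: 'é' is placed next to 'e'.
aware_order = sorted(names, key=accent_insensitive_key)

print(binary_order)  # ['eagle', 'zebra', 'étude']
print(aware_order)   # ['eagle', 'étude', 'zebra']
```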

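Some Unicode characters with diacritics can be represented in more than one way, for example as a single precomposed character or as a base letter followed by a combining accent. Each representation has a different UTF-8 encoding, which makes searching for such characters challenging. To address this issue, Unicode provides normalization forms that can eliminate the ambiguity during searching.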
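As an illustration, the following Python sketch shows two representations of the same word failing a direct comparison and search, and then succeeding once both sides are normalized. The sample word and the choice of the NFC form are assumptions made for the example; the important point is that both the stored data and the search term use the same normalization form.

```python
import unicodedata

precomposed = "caf\u00e9"    # 'é' stored as one code point (U+00E9)
decomposed = "cafe\u0301"    # 'e' followed by a combining acute accent (U+0301)

def normalize(text):
    # NFC is assumed here; any form works if it is applied consistently.
    return unicodedata.normalize("NFC", text)

# Without normalization, the two strings look the same but do not match.
raw_equal = precomposed == decomposed
found_raw = "\u00e9" in decomposed

# After normalization, comparison and search behave as a person would expect.
normalized_equal = normalize(precomposed) == normalize(decomposed)
found_normalized = normalize("\u00e9") in normalize(decomposed)

print(raw_equal, found_raw)                # False False
print(normalized_equal, found_normalized)  # True True
```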

String manipulation

There are several common operations that can be applied to text strings. These may behave unexpectedly when applied to text strings that contain Unicode characters.        

 

String length

With Unicode data, the number of characters can be less than the number of bytes required for the encoding, which can be up to 4 bytes per character in UTF-8. The functionality you are programming may need you to know the length of a text string in terms of the number of:

  • Bytes: to determine the amount of storage required when storage is allocated by number of bytes
  • Characters: to determine the amount of storage required when storage is allocated by number of characters
  • Graphemes: to allocate screen space

Some graphemes, such as the “e” combined with an acute accent (‘é’) are made up of multiple characters. In such a case, the number of graphemes in a string is less than the number of characters, and the number of characters is less than the number of bytes.

Diagram illustrating the character encoding of the name L'Oréal, which contains an 'e' with an accent. This name contains 7 graphemes consisting of 8 characters and using 9 bytes of storage. The 'e' with the accent is a composite grapheme, composed of two characters. The 'e' character occupies one byte, while the accent character uses two bytes, totaling three bytes for the 'e' with the accent.
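The three lengths described above can be reproduced with a short Python sketch. It assumes the accent is stored as a separate combining character (U+0301) and that the data is encoded as UTF-8. Python's standard library does not count graphemes directly, so the grapheme count below is an approximation that counts every character that is not a combining mark; it is adequate for this example but not for all scripts.

```python
import unicodedata

# "L'Oréal" with the accent stored as a separate combining character.
name = "L'Ore\u0301al"

byte_count = len(name.encode("utf-8"))  # storage in bytes (UTF-8 assumed)
char_count = len(name)                  # Unicode characters (code points)

# Approximate grapheme count: characters that are not combining marks.
grapheme_count = sum(1 for ch in name if not unicodedata.combining(ch))

print(byte_count, char_count, grapheme_count)  # 9 8 7
```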

 

 

String comparison 

With non-ASCII Unicode text strings, checking that two text strings are equal is complicated because two logically identical strings might have different encodings, as described under “Sorting and searching data” above. The logical order of two text strings is also not always clear.

 

Position in string 

For ASCII data, the position of a particular character in a text string does not depend on whether the measurement is done in terms of bytes or characters. The nth byte and the encoding of the nth character in an ASCII text string are the same.

For non-ASCII Unicode data, this is not the case. In the diagram above, the ‘a’ is the 6th grapheme, the 7th character, and the 8th byte.
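The following Python sketch shows how the position of ‘a’ differs depending on the unit of measurement. It assumes the accent is stored as a combining character and that the data is encoded as UTF-8. Python positions start at 0, so 1 is added to each result to match the counting used in the text. The grapheme position is approximated by skipping combining marks.

```python
import unicodedata

name = "L'Ore\u0301al"  # "L'Oréal" with a combining acute accent

# Position measured in characters (code points).
char_index = name.index("a")

# Position measured in bytes of the UTF-8 encoding.
byte_index = name.encode("utf-8").index(b"a")

# Approximate position measured in graphemes: ignore combining marks before 'a'.
grapheme_index = sum(
    1 for ch in name[:char_index] if not unicodedata.combining(ch)
)

# Convert from 0-based to 1-based positions.
print(grapheme_index + 1, char_index + 1, byte_index + 1)  # 6 7 8
```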

 

Substring 

The “substring” operation can be viewed as positioning in a text string, then extracting a specific length of data. As such, the complexities of performing the substring operation on non-ASCII Unicode data are a combination of those involved in string length and position in string (see above).
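To illustrate, the Python sketch below extracts a prefix of a name in three ways. Cutting by character count can separate an accent from its base letter, and cutting by byte count can split a character in the middle of its encoding. The `grapheme_safe_prefix` helper is a hypothetical name for this example; it only handles combining marks and is not a complete grapheme segmentation.

```python
import unicodedata

name = "L'Ore\u0301al"  # "L'Oréal" with a combining acute accent

# Cutting by characters: the accent is left behind, giving "L'Ore".
naive_prefix = name[:5]

def grapheme_safe_prefix(text, length):
    """Hypothetical helper: extend the cut so trailing combining marks are kept."""
    end = length
    while end < len(text) and unicodedata.combining(text[end]):
        end += 1
    return text[:end]

safe_prefix = grapheme_safe_prefix(name, 5)  # keeps the accent with the 'e'

# Cutting by bytes: the 2-byte accent is split, so the result is not valid UTF-8.
encoded = name.encode("utf-8")
try:
    encoded[:6].decode("utf-8")
    byte_slice_valid = True
except UnicodeDecodeError:
    byte_slice_valid = False

print(len(naive_prefix), len(safe_prefix), byte_slice_valid)  # 5 6 False
```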

 

Encryption and decryption 

There are various methods for encrypting data before storing or transmitting it, and then decrypting it upon retrieval. Encryption and decryption methods that work well when the subject data can be viewed as a string of fixed-size, single-byte characters may not work when the characters have multi-byte or varying length encodings as non-ASCII Unicode characters do.

The Cryptography with International Character Sets guidance provides two principles to keep in mind when encrypting/decrypting non-ASCII Unicode data: 

  • Work with bytes, not text strings 
  • Do not store encrypted data in a string type

Different Unicode encodings, such as UTF-8 and UTF-16, produce different binary representations of the same text. Hence, any system decrypting data from another system must know the encoding used in the original system.
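These points can be sketched in Python as follows. The `toy_transform` function is a hypothetical placeholder that stands in for a real cipher; it is not secure and is included only to show where encoding and decoding happen. The text is converted to bytes with an explicitly named encoding before the transformation, the transformed data stays in a bytes type, and the same encoding is used to recover the text.

```python
ENCODING = "utf-8"  # assumed to be agreed on by both systems

def toy_transform(data, key=0x5A):
    """Hypothetical stand-in for a real cipher. NOT secure; illustration only."""
    return bytes(b ^ key for b in data)

plaintext = "L'Or\u00e9al"

# Work with bytes, not text strings: encode explicitly before encrypting.
plain_bytes = plaintext.encode(ENCODING)
cipher_bytes = toy_transform(plain_bytes)  # kept as bytes, never as a string

# Decrypt to bytes first, then decode using the same encoding.
recovered = toy_transform(cipher_bytes).decode(ENCODING)

# The same text has different binary representations in different encodings.
utf8_bytes = plaintext.encode("utf-8")
utf16_bytes = plaintext.encode("utf-16-le")

# Decoding with the wrong encoding silently produces different text.
misread = utf8_bytes.decode("latin-1")

print(recovered == plaintext)              # True
print(len(utf8_bytes), len(utf16_bytes))   # 8 14
print(misread == plaintext)                # False
```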

Complete your Unicode-readiness assessment

Once you've ensured your string operations are able to support Unicode, the final step is to test your dataflows for Unicode-readiness.