Foundational IM/IT concepts for Unicode-readiness

Last updated on June 21, 2024

The first step in assessing your systems for Unicode-readiness is understanding the terminology for characters and encodings. Find the information you need to get started on this page.

A character set is a list of characters, and an encoding scheme represents them in the system as ones and zeros (binary data). When storing text as binary data, you must specify the encoding for that text. Systems that exchange data must also agree on the encoding scheme, so that the receiving system interprets the bytes the way the sending system intended.
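As a minimal sketch of why the encoding must be specified, the following Python snippet encodes one sample character under two schemes and then reads the bytes back with the right and the wrong scheme. The sample character and the variable names are chosen purely for illustration.

```python
text = "\u00e9"  # é, a sample character chosen for illustration

# Encoding turns characters into bytes; the bytes depend on the scheme used.
utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA9
latin1_bytes = text.encode("latin-1")  # one byte: 0xE9

# Decoding with the same scheme restores the original text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with a different scheme raises no error here,
# but it produces the wrong characters.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes.hex(), latin1_bytes.hex(), round_trip, misread)
```

The last value shows the kind of corruption that appears when two systems do not agree on the encoding of the text they exchange.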

Unicode/UTF-8 standard

Unicode is an international encoding standard created in the early 1990s. Its goal was to include all the characters used in any of the world’s living languages. Since then, it has been revised and expanded many times.

To process Unicode data, all the system’s data stores need to be configured to store data in Unicode's standard encodings. In B.C. government systems, the standard encoding is UTF-8. Database companies such as Oracle provide utilities for converting non-Unicode databases to Unicode/UTF-8.
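The core idea behind such a conversion can be sketched in a few lines of Python: decode the stored bytes using the encoding they were actually written in, then re-encode them as UTF-8. This is only an illustration and not how any vendor utility works; the function name `transcode_to_utf8` and the sample data are hypothetical, and a real migration also has to deal with matters such as column sizes and data whose encoding is unknown.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a legacy encoding as UTF-8.

    source_encoding must be the encoding the data was really written in.
    A wrong guess may raise UnicodeDecodeError or, worse, succeed and
    quietly yield corrupted text.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Hypothetical sample: the name "François" stored as ISO-8859-1.
legacy = b"Fran\xe7ois"
converted = transcode_to_utf8(legacy, "iso-8859-1")

print(len(legacy), len(converted))
print(converted.decode("utf-8"))
```

The converted value is one byte longer than the original because ç occupies two bytes in UTF-8, which is why storage limits need review during a conversion.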

Understanding language processing

There are many things that impact how IM/IT systems process languages. The following is a data architecture entity relationship diagram showing how terms like "byte", "font", "encoding", "grapheme", "glyph" and "character set" relate to one another: 

 

This diagram illustrates the relationship between character sets, characters, encodings of characters, compositions of characters into graphemes, and the display of graphemes by a font. A character set consists of characters. Examples of character sets are ASCII, ISO-8859-1 and Unicode. Characters are digitally encoded as ones and zeros using one of the encoding mechanisms for the particular character set. Examples of encodings for Unicode are UTF-8 and UTF-16. What a person perceives as a single character may actually be several characters superimposed upon one another. For example, a c cédille is composed of the Latin character c with a superimposed cedilla accent. This composition is called a grapheme. How a grapheme appears when displayed on the screen or paper is governed by the font used. BC Sans is an example of a font.

 

Brief history of character sets

A character set consists of characters. Examples of character sets are ASCII, ISO-8859-1 and Unicode. Characters are digitally encoded as ones and zeros using one of the encoding mechanisms for their character set. UTF-8 and UTF-16 are examples of encodings for Unicode.

What we see as a single character could actually be several characters superimposed upon one another, creating a grapheme. For example, "c cédille" (ç) can be formed by combining the Latin character 'c' with a superimposed cedilla accent.
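The c cédille example can be checked with Python's standard `unicodedata` module. This is a small sketch with illustrative variable names: it builds the same grapheme in two ways, as one precomposed character and as a base letter followed by a combining accent, and shows that normalization converts between the two forms.

```python
import unicodedata

composed = "\u00e7"     # ç stored as a single precomposed character
decomposed = "c\u0327"  # c followed by U+0327 COMBINING CEDILLA

# Both display as the same grapheme, but they are different sequences.
print(len(composed), len(decomposed))
print(composed == decomposed)

# Normalization converts between the two forms.
nfc = unicodedata.normalize("NFC", decomposed)  # compose where possible
nfd = unicodedata.normalize("NFD", composed)    # decompose fully
print(nfc == composed, nfd == decomposed)
```

Systems that compare, search or sort text usually need to normalize it to one form first, otherwise two visually identical values may not match.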

How a grapheme appears when displayed on the screen or on paper is determined by the font used. BC Sans is an example of a font.

Limitations of current character sets

Many older IM/IT systems allowed users to type using only the characters available on the US ASCII keyboard. Complex systems, in particular, have often not been modernized because of the risk of service delivery issues such as data loss, data corruption or security vulnerabilities. Modernizing them successfully requires careful planning and execution to reduce errors.

Some of our applications use a z/OS® (mainframe) operating system. The data in these applications is encoded in Extended Binary Coded Decimal Interchange Code (EBCDIC), which came before American Standard Code for Information Interchange (ASCII) became commonly used.

Most of our current systems use ASCII or a limited extended version of ASCII such as ISO-8859-1 (Latin-1) or Windows-1252. These consume one byte of storage for each character when digitally encoded.

Potential issues with existing programs:

  • Some programs assume the one-byte property
  • Challenges arise when these programs encounter Unicode data, which in UTF-8 can use up to four bytes to store a single character
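The following Python sketch illustrates the one-byte assumption breaking down. It measures how many bytes a few sample characters need in UTF-8, then shows one possible way to cut text to a byte limit without splitting a character. The helper names `utf8_length` and `truncate_utf8` are hypothetical and used only for this illustration.

```python
def utf8_length(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to fit max_bytes of UTF-8 without splitting a character."""
    clipped = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a trailing, incomplete multi-byte sequence.
    return clipped.decode("utf-8", errors="ignore")


# Sample characters: A, ç, € and an emoji.
for ch in ["A", "\u00e7", "\u20ac", "\U0001F600"]:
    print(repr(ch), utf8_length(ch))

# Three characters, six bytes: a 3-byte limit keeps only one whole character.
print(truncate_utf8("\u00e7\u00e7\u00e7", 3))
```

A program that sizes fields by character count, or that cuts byte strings at a fixed position, needs this kind of review. Note that this sketch only protects individual characters; it can still separate a base letter from a combining accent that follows it.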

EBCDIC and ASCII limitations:

  • EBCDIC allows for 256 possible values
  • ASCII allows for 128 possible values
  • Both include numeric digits, letters (both lower and upper case), punctuation, and control symbols
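To make the difference between the two encodings concrete, this Python sketch encodes the same letters in ASCII and in one EBCDIC code page. It assumes code page 037 (`cp037` in Python's standard codecs); EBCDIC exists in several code pages, and a given mainframe application may use a different one. The helper `is_ascii` is hypothetical.

```python
letter = "A"
ascii_byte = letter.encode("ascii")   # 0x41 in ASCII
ebcdic_byte = letter.encode("cp037")  # 0xC1 in EBCDIC code page 037 (assumed)

# The same text produces different bytes in EBCDIC, so reading those bytes
# with an ASCII-based encoding yields unrelated characters.
ebcdic_hello = "HELLO".encode("cp037")
misread = ebcdic_hello.decode("latin-1")
restored = ebcdic_hello.decode("cp037")


def is_ascii(data: bytes) -> bool:
    """True when every byte falls within ASCII's 128 values (0 to 127)."""
    return all(b < 128 for b in data)


print(ascii_byte.hex(), ebcdic_byte.hex())
print(misread, restored)
print(is_ascii(b"HELLO"), is_ascii(ebcdic_hello))
```

Data moving between mainframe and non-mainframe systems therefore has to be converted explicitly, using the code page that the source system really uses.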

ASCII and 8-bit extended character sets don’t cover all characters used in Indigenous languages in B.C. Unicode is the only character set that includes characters for Indigenous languages in B.C.
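One way to see this gap is to test whether a piece of text can be stored in a given encoding at all. The Python sketch below uses a hypothetical helper, `can_encode`, and a sample word written with escape sequences. The sample is assumed to be the language name hən̓q̓əmin̓əm̓ and was chosen only for illustration; any text containing characters outside the 8-bit sets behaves the same way.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Assumed sample word, built from Latin letters, the schwa (U+0259)
# and the combining comma above (U+0313).
sample = "h\u0259n\u0313q\u0313\u0259min\u0313\u0259m\u0313"

for encoding in ("ascii", "latin-1", "cp1252", "utf-8"):
    print(encoding, can_encode(sample, encoding))

# Characters in the sample versus bytes needed to store it in UTF-8.
print(len(sample), len(sample.encode("utf-8")))
```

A check like this can help locate the components of a system that would reject or corrupt such text before the data store is converted.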

Start Unicode-readiness assessment

Use the terminology you have learned to complete your Unicode-readiness assessment. The next step in your assessment is to review system components.