Foundational IM/IT concepts for Unicode-readiness

Last updated on July 9, 2024

The first step in assessing your systems for Unicode-readiness is understanding the terminology for characters and encodings. Find the information you need to get started on this page.

Character set and encoding

A character set is a list of characters, and an encoding scheme represents them in the system as ones and zeroes (binary data). When storing text as binary data, you must specify the encoding for that text. An encoding scheme is also necessary to transfer data between systems.
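As a minimal sketch of why the encoding must be specified, the Python snippet below turns the same text into bytes under two different encodings. The sample word is an illustrative assumption and is not taken from any B.C. government system.

```python
# Illustrative sample text (an assumption for this sketch); \u00e9 is 'é'.
text = "caf\u00e9"

# The same characters become different bytes under different encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'

# Reading the bytes back with the wrong encoding garbles the text
# without raising any error.
garbled = utf8_bytes.decode("iso-8859-1")
print(garbled)       # cafÃ©
```

The last line shows what can happen when one system writes UTF-8 and another reads it as ISO-8859-1: the data arrives, but the text is wrong.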

Unicode/UTF-8 standard

Unicode is an international encoding standard created in the early 1990s. Its goal was to include all the characters used in any of the world’s living languages. Since then, it has undergone significant changes.

To process Unicode data, all the system’s data stores need to be configured to store data in Unicode's standard encodings. In B.C. government systems, the standard encoding is UTF-8. Database companies such as Oracle provide utilities for converting non-Unicode databases to Unicode/UTF-8.
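To illustrate what such a conversion does at the level of a single value, here is a hedged Python sketch. The helper name `to_utf8` and the sample bytes are hypothetical; this is not how any vendor utility is implemented, and it assumes the source encoding is already known.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known legacy encoding into UTF-8.

    Hypothetical helper for illustration only.
    """
    # errors="strict" raises an exception on bytes that are invalid in the
    # source encoding instead of silently replacing them.
    return raw.decode(source_encoding, errors="strict").encode("utf-8")


# 'François' as it would be stored in Windows-1252 (one byte per character).
legacy = b"Fran\xe7ois"
converted = to_utf8(legacy, "cp1252")

print(len(legacy), len(converted))  # 8 9
print(converted)                    # b'Fran\xc3\xa7ois'
```

The converted value is one byte longer than the original, which is one reason column sizes may need review when a data store moves to UTF-8.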

Understanding language processing

There are many things that impact how IM/IT systems process languages. The following is a data architecture entity relationship diagram showing how terms like "byte", "font", "encoding", "grapheme", "glyph" and "character set" relate to one another:

Brief history of character sets

A character set is a defined collection of characters. ASCII, ISO-8859-1, and Unicode are examples of character sets. An encoding mechanism digitally encodes the characters in a set as ones and zeros. UTF-8 and UTF-16 are examples of encodings for Unicode.

What we see as a single character could actually be several characters superimposed upon one another to create a grapheme. For example, the character "c cédille" (ç) can be stored as the Latin character 'c' followed by a combining cedilla accent.
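The following Python sketch shows the two ways "c cédille" can be stored, using the standard `unicodedata` module. The variable names are illustrative assumptions.

```python
import unicodedata

precomposed = "\u00e7"  # 'ç' as a single code point
combined = "c\u0327"    # 'c' followed by U+0327 COMBINING CEDILLA

# Both display as the same grapheme but differ in length and do not
# compare equal as stored.
print(len(precomposed), len(combined))  # 1 2
print(precomposed == combined)          # False

# Normalization converts between the two forms.
composed = unicodedata.normalize("NFC", combined)
decomposed = unicodedata.normalize("NFD", precomposed)
print(composed == precomposed)   # True
print(decomposed == combined)    # True
```

Systems that search, sort or compare text may need to normalize it first so that visually identical values match.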

How a grapheme appears when displayed on the screen or paper is determined by the font used. BC Sans, for example, is a font influencing the visual representation of graphemes.

Limitations of current character sets

Many older IM/IT systems allowed users to type using only the characters available on the US ASCII keyboard. Complex systems, in particular, have often not been modernized because of the risk of service delivery issues such as data loss, data corruption or security vulnerabilities. Careful planning and execution are needed to reduce errors and modernize successfully.

Some of our applications use a z/OS® (mainframe) operating system. The data in these applications is encoded in Extended Binary Coded Decimal Interchange Code (EBCDIC), which came into use before American Standard Code for Information Interchange (ASCII) became common.

Most of our current systems use ASCII or a limited extended version of ASCII such as ISO-8859-1 (Latin1) or Windows-1252. These consume one byte of storage for each character when digitally encoded.

Potential issues with existing programs:

Some programs assume the one-byte property
Challenges arise when encountering Unicode data, which can use up to four bytes to store a single character
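The one-byte assumption can be checked with a short Python sketch. The helper name `utf8_length` and the sample characters are illustrative assumptions.

```python
def utf8_length(ch: str) -> int:
    """Return the number of bytes needed to store `ch` in UTF-8."""
    return len(ch.encode("utf-8"))


# One sample character for each possible UTF-8 length.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
lengths = [utf8_length(ch) for ch in samples]
for ch, n in zip(samples, lengths):
    print(f"U+{ord(ch):04X} needs {n} byte(s)")

# Character count and byte count diverge as soon as non-ASCII text appears.
text = "caf\u00e9"
print(len(text))                  # 4
print(len(text.encode("utf-8")))  # 5
```

A program that sizes buffers or fixed-width fields by character count would truncate or reject the second value if it allowed only four bytes.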

EBCDIC and ASCII limitations:

EBCDIC allows for 256 possible values
ASCII allows for 128 possible values
Both include numeric digits, letters (both lower and upper case), punctuation, and control symbols
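As a rough illustration, the Python sketch below compares how the letter "A" is stored in ASCII and in EBCDIC. It assumes code page `cp037`, one of several EBCDIC code pages; the code page used by a given mainframe application may differ.

```python
letter = "A"

ascii_byte = letter.encode("ascii")
ebcdic_byte = letter.encode("cp037")  # assumed EBCDIC code page

# The same letter has a different numeric value in each character set.
print(ascii_byte.hex())   # 41
print(ebcdic_byte.hex())  # c1

# ASCII defines only the values 0 to 127, so byte value 128 is rejected.
try:
    bytes([128]).decode("ascii")
    ascii_accepts_128 = True
except UnicodeDecodeError:
    ascii_accepts_128 = False
print(ascii_accepts_128)  # False
```

Because the byte values differ, data moved between a mainframe and other systems has to be converted, not just copied.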

ASCII and 8-bit extended character sets don’t cover all characters used in Indigenous languages in B.C. Unicode is the only character set that includes characters for Indigenous languages in B.C.
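The sketch below tests whether a sample name can be stored in several encodings. The name W̱SÁNEĆ is chosen here only for illustration, its underlined W is assumed to be encoded as "W" followed by U+0331, and `can_encode` is a hypothetical helper.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# W̱SÁNEĆ written with escapes so the code points are explicit.
name = "W\u0331S\u00c1NE\u0106"

results = {
    enc: can_encode(name, enc)
    for enc in ("ascii", "iso-8859-1", "cp1252", "utf-8")
}
for enc, ok in results.items():
    print(f"{enc}: {ok}")
```

Only the UTF-8 check succeeds. The 8-bit encodings contain "Á" but neither the combining mark under the "W" nor "Ć".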

Start Unicode-readiness assessment

Use the terminology you have learned to complete your Unicode-readiness assessment. The next step in your assessment is to review system components.

The B.C. Public Service acknowledges the territories of First Nations around B.C. and is grateful to carry out our work on these lands. We acknowledge the rights, interests, priorities, and concerns of all Indigenous Peoples - First Nations, Métis, and Inuit - respecting and acknowledging their distinct cultures, histories, rights, laws, and governments.
