Introduction
Unicode, formally The Unicode Standard, is a text encoding standard maintained by the Unicode Consortium and designed to support the use of text in all of the world's writing systems that can be digitized. Version 16.0 of the standard defines 154,998 characters and 168 scripts used in various ordinary, literary, academic, and technical contexts.
(This short introduction was pulled from Wikipedia).
Please see the Wikipedia article about Unicode for in-depth and exhaustive general information. The Debian-specific content follows.
Since 1993, Unicode has been subject to very similar but not completely identical standardization by ISO/IEC as the Universal Coded Character Set in ISO/IEC 10646. The standard was most recently updated and amended in 2020 as ISO/IEC 10646:2020, covering Unicode 13.0, 15.0, and 16.0.
The Unicode Consortium is a non-profit organization in the USA, founded in January 1991. Since then, Unicode has evolved and grown. It is widely believed that the major software ecosystems, the world of Free Software being one of them, have been able to put Unicode to productive use since roughly the turn of the millennium. Unicode and its encodings have obsoleted the older character encodings for international characters such as ISO/IEC 8859, KOI8, the various code pages, and Windows-1252.
This page is intended to compile some knowledge about Unicode and its encodings to form a common base of communication inside the Debian Project. Of course, it might be used outside Debian as well. Feel free to add your knowledge here. The perluniintro manual page, which gives its own introduction to Unicode, can be a reference in case of ambiguities.
The Unicode definition includes scripts containing emoticons, simple graphics, and symbols, and of course many languages. It supports languages that are not written from left to right, and it has ranges of deliberately unassigned code points, the so-called "Private Use Areas", that can be used locally to include characters that are not part of the official definition. Probably the most prominent use of this Private Use Area is the definition of the letters of the Klingon language endorsed by the Klingon Language Institute.
Unicode also defines various properties for the characters, like "uppercase" or "lowercase", "decimal digit", or "punctuation"; these properties are independent of the names of the characters. Furthermore, various operations on the characters like uppercasing, lowercasing, and collating (sorting) are defined.
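For illustration, these properties and operations are exposed by most modern languages; a minimal sketch using Python's standard unicodedata module:

    import unicodedata

    # Character properties are independent of the character's name:
    print(unicodedata.category("A"))   # "Lu" - letter, uppercase
    print(unicodedata.category("٣"))   # "Nd" - number, decimal digit
    print(unicodedata.decimal("٣"))    # 3    - its numeric value

    # Operations like case mapping are defined by the standard as well:
    print("straße".upper())            # "STRASSE" - ß uppercases to SS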
Common Encodings of Unicode in Software
Unicode strings stored in IT systems are nowadays usually encoded in one of the Unicode Transformation Formats (UTF). Other formats do exist but do not see significant usage.
The most widely used UTF variants are UTF-8 and UTF-32. Other more or less serious variants have been suggested but seldom see practical use.
Wikipedia has an exhaustive Comparison of Unicode encodings available.
UTF-8
UTF-8 is by far the most common encoding these days for strings that go beyond US-ASCII. UTF-8 is dominant for all countries and languages on the Internet, is used in most standards (often as the only allowed encoding), and is supported by all modern operating systems and programming languages. On the IETF side, it is standardized as RFC 3629 - UTF-8, a transformation format of ISO 10646.
UTF-8 is backwards compatible with US-ASCII. The first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that a UTF-8-encoded file using only those characters is identical to an ASCII file.
UTF-8 is a variable length encoding: Only non-ASCII characters are encoded into more than a single byte.
- UTF-8 is space-efficient in storage.
- A UTF-8-encoded string must be read from the beginning to access a character at a given index.
- UTF-8 strings work without a BOM, but a BOM might be present at the beginning of a UTF-8-encoded stream.
- Not all sequences of bytes are valid UTF-8.
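A minimal Python sketch illustrating these properties (byte values shown in the comments):

    # One byte for ASCII, two bytes for a character like Ω (U+03A9):
    print("A".encode("utf-8"))    # b'A'
    print("Ω".encode("utf-8"))    # b'\xce\xa9'

    # Not all byte sequences are valid UTF-8; a stray continuation
    # byte makes the decoder fail:
    try:
        b"\xa9".decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)    # invalid start byte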
UTF-32
UTF-32, sometimes called UCS-4, is a fixed-length encoding of Unicode code points that uses exactly 32 bits per code point. Each 32-bit value in UTF-32 represents one Unicode code point and is exactly equal to that code point's numerical value.
- UTF-32 is the only Unicode Transformation Format that is not a variable-length encoding. A character at a given index can thus be accessed directly.
- UTF-32 is not space-efficient.
- Both UTF-32-BE (big-endian) and UTF-32-LE (little-endian) exist. A BOM is rarely used to distinguish the two.
UTF-32 is used in situations where the desire for computing efficiency trumps the desire for storage efficiency. Recent Python versions, for example, internally represent each string with a fixed-width encoding and fall back to UTF-32 (UCS-4) when the string contains characters beyond the Basic Multilingual Plane.
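A small sketch of the fixed-width property, using Python's built-in codecs:

    s = "Ω!"                               # two code points

    data = s.encode("utf-32-be")           # big-endian, without a BOM
    print(len(data))                       # 8 - exactly 4 bytes per code point

    # Direct access: code point i occupies bytes [4*i : 4*i + 4]
    print(hex(int.from_bytes(data[4:8], "big")))   # 0x21, i.e. "!"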
UTF-16 (Honorable Mention Only)
The original idea behind UTF-16 was the same as for UTF-32. That idea lost its point when it became impossible to encode all Unicode code points in 16 bits, forcing UTF-16 to become a variable-length encoding. Some authors say that UTF-16 should have been deprecated when the set of Unicode code points grew beyond 2^16.
UTF-16 is rarely used and should be avoided for new implementations.
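For illustration, the variable length is easy to observe with Python's built-in codecs: characters beyond U+FFFF need a surrogate pair, i.e. two 16-bit units:

    print(len("A".encode("utf-16-be")))    # 2 - one 16-bit unit
    print(len("😀".encode("utf-16-be")))   # 4 - a surrogate pair (U+1F600)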
Normalization: How to sort and compare
Among the most frequent operations applied to strings is comparison. We can compare for equality and for order (needed, for example, to sort). Searching is easier when the set to be searched can be sorted. Software implementing Unicode string search and comparison must take into account that different code points can be equivalent or even equal.
Among the challenges are:
- Accented characters are usually available as dedicated code points but can also be combined from the base character and the accent: É (U+00C9) looks like an E combined with an acute accent, but is its own code point.
- It is possible to have more than one accent ("combining mark") on a single character. In that case, the order of the combining marks matters for comparison, but not for presentation.
- There is a dedicated code point for the Ohm sign (U+2126) which is different from the Greek capital omega (U+03A9). Similar situations exist for other physical units, currencies, etc.
- Font variants can cause glyphs of different characters to look alike.
- There are different code points that express different kinds of whitespace.
- There are positional, circled, width, rotated, superscript, subscript, and squared variants, as well as fractions and ligatures.
See Unicode Equivalence and Unicode Annex #15 for a more detailed discussion.
The Annex describes two different kinds of equivalence: canonical and compatibility equivalence. For example, the accented character is canonically equivalent to the combination of base character and accent, and the Ohm sign is canonically equivalent to the Greek capital omega. On the other hand, a non-breaking space is compatibly equivalent to a breaking space, the double-struck R (used for the set of real numbers, for example) is compatibly equivalent to the capital Latin letter R, and i² is compatibly equivalent to i2.
The Annex defines four normalization forms:
- Normalization Form D (NFD): Canonical Decomposition
- Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition
- Normalization Form KD (NFKD): Compatibility Decomposition
- Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition
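All four forms are implemented, for example, by Python's standard unicodedata module; a small sketch of how their results differ:

    from unicodedata import normalize

    s = "e\u0301"                  # "e" + COMBINING ACUTE ACCENT

    print(normalize("NFD", s))     # stays decomposed: "e" + U+0301
    print(normalize("NFC", s))     # composed into the single code point U+00E9 (é)

    lig = "ﬁ²"                     # LATIN SMALL LIGATURE FI + SUPERSCRIPT TWO
    print(normalize("NFC", lig))   # unchanged - canonical forms keep it
    print(normalize("NFKC", lig))  # "fi2" - the compatibility mapping loses the
                                   # ligature and the superscript information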
It must be noted that normalization may lose semantic information that was implicitly present in the string before normalization was applied. For example, an author might write 10 Å, using U+212B on purpose to reflect that this is a physical length measured in Ångström. Losing that information in normalization may not be desirable, but you still might want Å (the length unit) sorted between A (Ampere) and C (degrees Celsius). Applications must decide whether they want to accept this loss, especially when deciding whether to store normalized or non-normalized strings.
It might be desirable to store just the non-normalized variant (as entered by the user) and do (and re-do, over and over) the normalization when searching, sorting, and testing for equality. It might even be more efficient (from the CPU's point of view) to store the normalized form in addition to the non-normalized form, so that the normalization does not have to be done over and over again.
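A minimal sketch of the latter approach (the function and variable names here are hypothetical, not any standard API), keeping the string as entered while indexing it under its normalized form:

    from unicodedata import normalize

    index = {}    # normalized key -> original strings, as entered

    def add(original):
        # Store the original, but file it under its NFC form so that
        # canonically equivalent inputs share one key.
        index.setdefault(normalize("NFC", original), []).append(original)

    def lookup(query):
        return index.get(normalize("NFC", query), [])

    add("re\u0301sume\u0301")    # decomposed "résumé", as the user typed it
    print(lookup("résumé"))      # found via the shared normalized key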
For example, NFC would solve both the Accent and the Ohm issue:
- Both U+00E9 (é) and the sequence U+0065 U+0301 (e + combining acute accent) are NFC-normalized to U+00E9,
- Both U+2126 (Ohm sign) and U+03A9 (Greek capital omega) are NFC-normalized to U+03A9.
NFC normalization alone will not, however, solve homograph collisions: a (U+0061 Latin small letter a) and а (U+0430 Cyrillic small letter a) are NFC-normalized to different code points.
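These claims are easy to verify, for example in Python:

    from unicodedata import normalize

    print(normalize("NFC", "\u0065\u0301") == "\u00e9")    # True
    print(normalize("NFC", "\u2126") == "\u03a9")          # True

    # Homographs survive: Latin a and Cyrillic а remain distinct
    print(normalize("NFC", "\u0061") == normalize("NFC", "\u0430"))    # False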
Normalization is done with the help of tables that are part of the standard, but full coverage is not guaranteed: users of normalization might find cases where similar-looking things do not have the same binary representation and sort differently even after normalization.
This chapter hopefully outlines that comparing Unicode strings is by no means a trivial thing.
Common Problems with Unicode
The Homoglyph Problem
Even in US-ASCII, there are glyphs that look alike to a casual human reader (commonly called "confusables"). The most common examples are 1, I, and l (the number one, the capital I, and the lower-case L), or 0 and O (the number zero and the capital O). This is quite commonly misused by adversaries and attackers to fool humans into believing that one string is equal to another (as in typosquatting or phishing).
In Unicode, this is multiplied by the fact that there are code points that map to identical glyphs ("homoglyphs"), for example:
- Accented characters and the combination of the unaccented character with an accent
- Physical units like the Ohm sign (U+2126), which is different from the Greek capital omega (U+03A9).
- In many typefaces, the Greek letter Α, the Cyrillic letter А and the Latin letter A are visually identical, as are the Latin letter a and the Cyrillic letter а.
The Double-UTF-8 Problem
Especially when converting between systems, it can happen that a UTF-8-encoded string gets displayed as ISO-8859-1 or some other encoding, or that an already UTF-8-encoded string is passed through the encoding step a second time. This can be seen in various places on the Internet, usually when a non-ASCII character is presented as two weird characters.
This is a quite common software bug; it can also be the result of text being passed between different systems or pieces of software with one of the sides improperly configured.
This is commonly referred to as "double UTF-8" and mocked as "WTF-8", a play on an exclamation of astonishment ("What the...") and the fact that W is commonly spelled out as "Double-U" in English.
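The effect is easy to reproduce; a minimal Python sketch:

    s = "é"

    # Mis-decoding: the two UTF-8 bytes read as ISO-8859-1 become two characters
    mojibake = s.encode("utf-8").decode("iso-8859-1")
    print(mojibake)                    # "Ã©"

    # Encoding the result again yields the "double UTF-8" byte sequence
    print(mojibake.encode("utf-8"))    # b'\xc3\x83\xc2\xa9'

    # As long as no byte was lost, the damage can be reverted
    print(mojibake.encode("iso-8859-1").decode("utf-8"))    # "é"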
Overlong Encodings
In UTF-8 at least, it is possible to encode a character that is normally encoded in n bytes using more than n bytes, while common libraries can still decode it to the same character. This is a security problem because it allows the same code point to be encoded in multiple, different ways. Overlong encodings have been used to bypass security validations in various software products in the past.
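For example, "/" (U+002F) is normally encoded as the single byte 0x2F; the two-byte sequence 0xC0 0xAF would decode to the same code point under a naive decoder and must therefore be rejected, as Python's decoder does:

    print(b"\x2f".decode("utf-8"))    # "/" - the only valid encoding

    try:
        b"\xc0\xaf".decode("utf-8")   # overlong two-byte encoding of "/"
    except UnicodeDecodeError as e:
        print(e)                      # rejected: 0xC0 is never a valid start byte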
Other possible security issues
Unicode contains a vast number of non-printing characters that need special attention in software. Since Unicode also supports languages that are not usually written from left to right, there are control characters that change the writing direction.
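As an illustration, here is a minimal sketch (the function name and hand-picked character list are ours, not part of any standard API) that flags the explicit bidirectional formatting characters, which have been abused for filename spoofing:

    import unicodedata

    # The explicit bidirectional formatting characters (all category "Cf")
    BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e"
                        "\u2066\u2067\u2068\u2069")

    def flag_bidi_controls(text):
        return [unicodedata.name(c) for c in text if c in BIDI_CONTROLS]

    # Displayed as "invoice_exe.txt" in a naive UI, but really ends in ".exe"
    print(flag_bidi_controls("invoice_\u202etxt.exe"))   # ['RIGHT-TO-LEFT OVERRIDE']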
Glossary
(Borrowed heavily from Wikipedia)
Byte Order Mark
The byte-order mark (BOM) is a particular usage of the special Unicode character U+FEFF ZERO WIDTH NO-BREAK SPACE, whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text.
ASCII, US-ASCII
ASCII, the American Standard Code for Information Interchange, is among the oldest character encoding standards for electronic communication. Originating in the 1960s, ASCII codes represent text in computers, telecommunications equipment, and other devices. ASCII has just 128 code points, of which only 95 are printable characters, which severely limits its scope.
Most notable is the absence of any international characters. ASCII originally used only 7 bits; the second half of the space opened up by 8-bit representations was used in various mutually incompatible variants to incorporate international characters (most languages using the Latin alphabet had their own variant) or simple character graphics.
Character
In computing and telecommunications, a character is a unit of information that roughly corresponds to a grapheme, grapheme-like unit, or symbol, such as in an alphabet or syllabary in the written form of a natural language.
Code Point
A code point is a particular position in a table, where the position has been assigned a meaning. In Unicode, code points are normally assigned to abstract characters. An abstract character is not a graphical glyph but a unit of textual data. However, code points may also be left reserved for future assignment (most of the Unicode code space is unassigned), or given other designated functions.
The distinction between a code point and the corresponding abstract character is not pronounced in Unicode but is evident for many other encoding schemes, where numerous code pages may exist for a single code space.
Composition
see pre-composed characters
Confusables
Some code points are visually similar and thus can cause confusion among humans. Such characters are often called "confusable characters" or "confusables". See RFC 8264 Section 12.5.
Decomposition
See pre-composed characters
Encoding
Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using computers. The numerical values that make up a character encoding are known as code points and collectively comprise a code space, a code page, or a character map.
Glyph
A glyph is any kind of purposeful mark. In typography, a glyph is "the specific shape, design, or representation of a character". It is a particular graphical representation, in a particular typeface, of an element of written language. A grapheme, or part of a grapheme (such as a diacritic), or sometimes several graphemes in combination (a composed glyph) can be represented by a glyph. A glyph can occupy more than one column.
Grapheme
In linguistics, a grapheme is the smallest functional unit of a writing system.
pre-composed character
While Unicode generally treats accented characters as the composition of the base character with the respective accent, some already accented characters have been grandfathered in from older standards. These are called "pre-composed characters", and they can be decomposed into the base character and the accent.
Script
In Unicode, a script is a collection of letters and other written signs used to represent textual information in one or more writing systems. Some scripts support one and only one writing system and language, for example, Armenian. Other scripts support many different writing systems; for example, the Latin script supports English, French, German, Italian, Vietnamese, Latin itself, and several other languages. Some languages make use of multiple alternate writing systems and thus also use several scripts; for example, in Turkish, the Arabic script was used before the 20th century but transitioned to Latin in the early part of the 20th century. More or less complementary to scripts are symbols and Unicode control characters.
Literature
Chapter 7 of nick black's free book Hacking the Planet with Notcurses: A Guide to TUIs and Character Semigraphics contains a good history of character encoding from ASCII up to Unicode and its different encodings. This page would not have been possible without this source.
nick black's rant about Unicode encodings on debian-devel.
Relevant standards and RFCs
- Unicode Technical Report #36 - Unicode Security Considerations
- Unicode Technical Standard #39 - Unicode Security Mechanisms
- RFC 6943 - Issues in Identifier Comparison for Security Purposes
- The Perl documentation gives good basic explanations: perlunicode, perluniintro, perluniprops