Unicode is a computing standard that supports text written in a large number of modern and ancient writing systems. Among other things, the standard defines a code mapping and various character encoding schemes. The code mapping assigns each character a unique number, called its code point. The following table lists some examples.
Unicode Value | Decimal Value | Hex Value | Character |
---|---|---|---|
U+0041 | 65 | 0041 | A |
U+00C6 | 198 | 00C6 | Æ |
U+27A8 | 10152 | 27A8 | ➨ |
U+BBBB | 48059 | BBBB | 뮻 |
The full Unicode code mapping is documented in a number of charts located at The Unicode Code Charts.
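The mapping in the table above can be reproduced with a short Python sketch. The helper name `describe` is an illustrative assumption; `ord()` and `chr()` are Python built-ins that convert between a character and its code point.

```python
def describe(ch: str) -> str:
    """Return 'U+XXXX | decimal | hex | character' for a single character."""
    cp = ord(ch)  # the character's Unicode code point as an integer
    return f"U+{cp:04X} | {cp} | {cp:04X} | {ch}"


# The four example characters from the table, written as escapes.
for ch in ["A", "\u00C6", "\u27A8", "\uBBBB"]:
    print(describe(ch))

# chr() goes the other way: code point to character.
print(chr(65))
```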
Encoding
A character's encoding defines the format of the bytes used to store the character either in memory or on disk. ASCII encoding is the simplest. With ASCII encoding, a character at position 65 (decimal) in the code chart is stored as a single byte with the value 65 (or 01000001 in binary). Unfortunately, ASCII encoding cannot represent more than 128 distinct characters.
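As a rough illustration of the ASCII case, the sketch below encodes one character with Python's built-in `ascii` codec and shows the resulting byte in binary. The function name `ascii_bits` is a made-up helper, not part of any library.

```python
def ascii_bits(ch: str) -> str:
    """Encode one character as ASCII and return its single byte as 8 binary digits."""
    # str.encode("ascii") raises UnicodeEncodeError for code points above 127.
    (byte,) = ch.encode("ascii")
    return f"{byte:08b}"


print(ascii_bits("A"))  # 01000001

try:
    ascii_bits("\u00C6")  # U+00C6 lies outside the 128-character ASCII range
except UnicodeEncodeError:
    print("U+00C6 cannot be stored in ASCII")
```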
To address ASCII encoding's limitations a number of different encoding schemes were tried. The following table lists some popular schemes in use before Unicode encodings became popular.
Encoding Scheme | Word Length (bits) | Number of Bytes per Character | Number of Characters | Notes |
---|---|---|---|---|
ASCII | 7 | 1 | 128 | |
ISO-8859 | 8 | 1 | 256 | |
OEM Code Pages | 8 | 1 | 256 | |
Windows Code Pages | 8 | 1 | 256 | |
Once Unicode appeared a number of newer encoding schemes were created. Some of the more popular ones are listed below.
Introduced in Unicode Version | Encoding Scheme | Word Length (bits) | Number of Bytes per Character | Uses a Byte Order Mark (BOM) | Notes |
---|---|---|---|---|---|
1.0 | UCS-2 | 16 | 2 | No | |
2.0 | UTF-16 | 16 | 2 or 4 | Yes: U+FEFF, stored as bytes FF FE (Little Endian) or FE FF (Big Endian) | |
2.0 | UTF-16LE | 16 | 2 or 4 | No | |
2.0 | UTF-16BE | 16 | 2 or 4 | No | |
3.0 | CESU-8 | 8 | 1, 2, 3, or 6 | No | |
3.0 (1, 2, 3 word characters); 3.1 (4 word characters) | UTF-8 | 8 | 1, 2, 3, or 4 | Optional: bytes EF BB BF (U+FEFF encoded in UTF-8) | |
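The byte counts in the table can be checked for a few sample characters. This is a minimal sketch using Python's standard codecs; the helper `encoded_lengths` and the choice of sample characters are assumptions made for illustration. CESU-8 and UCS-2 are left out because Python's standard library has no codecs for them.

```python
import codecs

ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be")


def encoded_lengths(ch: str) -> dict:
    """Map each encoding name to the number of bytes it uses for one character."""
    return {name: len(ch.encode(name)) for name in ENCODINGS}


# One ASCII character, one other BMP character, one character above the BMP.
for ch in ["A", "\u27A8", "\U0001D11E"]:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# The generic "utf-16" codec prepends a BOM in the machine's native byte order;
# the explicit LE/BE variants never write one.
with_bom = "A".encode("utf-16")
print(with_bom.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)))
print("A".encode("utf-16-le"))
```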
UTF-8
UTF-8 is a variable-width, 8-bit Unicode encoding that can address over 1 million distinct characters. It is backward compatible with ASCII and is the most popular encoding for Oracle Unicode databases, email, and web pages.
It is important to note that a Unicode character's position in the code chart does not identify the actual bit sequence used to encode the character. Take the Unicode character U+00C6 for example. The hex value C6 in binary is 11000110 but the Unicode character U+00C6 is actually encoded in UTF-8 as two bytes, 11000011 10000110.
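To make the U+00C6 example concrete, here is a hand-written sketch of the UTF-8 bit layout. The function name `utf8_encode` is hypothetical; in practice Python's built-in `str.encode("utf-8")` does the same job, and the sketch compares its result against it.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


encoded = utf8_encode(0x00C6)
print(" ".join(f"{b:08b}" for b in encoded))  # 11000011 10000110
print(encoded == "\u00C6".encode("utf-8"))    # True
```

The payload bits of the code point (00011 000110 for U+00C6) are spread across the bytes, and the leading bits of each byte mark it as a lead byte or a continuation byte. This is why the stored bytes differ from the code point's own binary value.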
Byte Order Marks
A Byte Order Mark (BOM) is a sequence of bytes placed at the start of some files to identify the file's encoding. A BOM may be found in some Windows files. Linux/UNIX files generally do not use BOMs. The following table lists three common BOMs.
Standard Encoding Name | Byte Order Mark (hexadecimal representation) |
---|---|
UTF-8 | EF BB BF |
UTF-16 (Little Endian) | FF FE |
UTF-16 (Big Endian) | FE FF |
According to the Unicode standard, a BOM is optional in UTF-8 files. However, some programs expect UTF-8 files to have no BOM. When such programs open a UTF-8 file that includes a BOM, they may incorrectly display the BOM as these three printable characters: ï»¿.
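A program can check for the three BOMs listed above before decoding. The sketch below is one possible approach, not a complete encoding detector: `detect_bom` is a hypothetical helper, while the `codecs.BOM_*` constants and the `utf-8-sig` codec are part of Python's standard library. Note that a UTF-32 little-endian BOM also begins with FF FE, so this simple check would report it as UTF-16.

```python
import codecs

# (BOM bytes, encoding name) pairs for the three BOMs in the table.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 (Little Endian)"),
    (codecs.BOM_UTF16_BE, "UTF-16 (Big Endian)"),
]


def detect_bom(data: bytes):
    """Return the encoding name implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


sample = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(detect_bom(sample))

# "utf-8-sig" strips a leading BOM if present; plain "utf-8" keeps it as U+FEFF.
print(repr(sample.decode("utf-8-sig")))
print(repr(sample.decode("utf-8")))

# Misreading the BOM bytes as Windows-1252 yields three printable characters.
print(repr(codecs.BOM_UTF8.decode("cp1252")))
```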
Planes
Unicode characters are grouped into a number of planes. Each plane contains 65,536 code points. The first 65,536 code points (0x0000-0xFFFF) are in Plane 0, which is also called the "Basic Multilingual Plane (BMP)". The BMP contains the characters for almost all modern languages and symbols used for general computing purposes. It is also the only plane of characters supported in early versions of the Unicode standard. Planes above the BMP contain code points for ancient scripts, mathematical and musical symbols, and rarely used Han ideographs.
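Because each plane holds exactly 65,536 (2^16) code points, a code point's plane is simply its value divided by 65,536. A minimal sketch, with `plane_of` and `in_bmp` as assumed helper names:

```python
def plane_of(cp: int) -> int:
    """Return the plane number (0 to 16) that contains the code point."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    return cp >> 16  # same as cp // 65536


def in_bmp(ch: str) -> bool:
    """True if the character lies in Plane 0, the Basic Multilingual Plane."""
    return plane_of(ord(ch)) == 0


print(plane_of(0x27A8))      # 0
print(plane_of(0x1D11E))     # 1
print(in_bmp("\U0001D11E"))  # False
```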
While the Unicode standards documentation gives significant support and attention to planes above the BMP, it is important to note that characters in these planes are rarely encountered in general computing. One reason for this is that current font technologies support at most 65,536 glyphs per font. Since characters in the BMP are essential in many applications, most fonts allocate their glyphs only to this plane.
Pan-Unicode Fonts
Even though any single font can hold up to 65,536 Unicode glyphs, in practice there are fewer than a dozen fonts (as of 2009-12) that hold anywhere near that number. Fonts with a large number of Unicode glyphs are known as "pan-Unicode fonts", "Unicode fonts", or "Unicode typefaces". Examples of pan-Unicode proportional fonts include:
- GNU FreeFont fonts
- Code 2000
- GNU Unifont
- Wen Quan Yi fonts
- Arial Unicode MS.
Monospace pan-Unicode fonts are even rarer than proportional ones. Notable free or shareware monospace fonts include:
- FreeMono - part of the GNU FreeFont family
- Everson Mono (shareware)
- Wen Quan Yi Zen Hei Mono - this font includes a large number of Chinese, Japanese, Korean, and Vietnamese glyphs; unfortunately Zen Hei Mono co-exists in the same .ttf file as its proportional cousin, Zen Hei, which appears to make Zen Hei Mono unavailable to some programs.
Font Substitution
To work around the limited availability of pan-Unicode fonts, many applications rely on a technique called font substitution. With this technique, when an application needs to display a character whose glyph is missing from the desired font, it searches other fonts, preferably from the same font family, for the missing glyph and then uses the first one it finds to render the character.
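The lookup described above can be sketched in a few lines. Everything here is hypothetical: the font names, their coverage sets, and the `pick_font` helper are invented for illustration, and real applications query the operating system's font APIs rather than a dictionary. The sketch also simplifies the "same font family" preference to the order in which the fallback fonts are listed.

```python
# Hypothetical fonts, each mapped to the set of code points it has glyphs for.
FONTS = {
    "Example Sans": set(range(0x0020, 0x0100)),       # Latin only
    "Example Sans Symbols": {0x27A8, 0x2713},         # a few symbols
    "Example Hangul": set(range(0xAC00, 0xD7A4)),     # Hangul syllables
}


def pick_font(ch: str, desired: str, fonts: dict):
    """Return the desired font if it covers ch, else the first fallback that does."""
    cp = ord(ch)
    if cp in fonts.get(desired, set()):
        return desired
    for name, coverage in fonts.items():  # fallbacks are tried in listed order
        if name != desired and cp in coverage:
            return name
    return None  # no installed font has a glyph for this character


for ch in ["A", "\u27A8", "\uBBBB", "\U0001D11E"]:
    print(f"U+{ord(ch):04X}", pick_font(ch, "Example Sans", FONTS))
```

An application without font substitution corresponds to stopping after the first check, which is why it shows a placeholder box for any character missing from the chosen font.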
Applications that use font substitution include Firefox, Notepad, Excel 2007, Internet Explorer (v7 and newer), Adobe Reader, and SQL Developer.
Applications that do not use font substitution include Internet Explorer (v6 and older), PuTTY, Windows XP's cmd.exe window, and applications like SQL*Plus that run inside cmd.exe windows.