SQL Features Tutorials: Unicode

Unicode is a computing standard that supports text written in a large number of modern and ancient writing systems. Among other things the standard defines a code mapping and various character encoding schemes. The code mapping assigns each character a unique number. The following table lists some examples.

Unicode Value	Decimal Value	Hex Value	Character
U+0041	65	0041	A
U+00C6	198	00C6	Æ
U+27A8	10152	27A8	➨
U+BBBB	48059	BBBB	뮻

The full Unicode code mapping is documented in a number of charts located at The Unicode Code Charts.
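As a small illustration of the code mapping, the following Python sketch converts between characters and their code points using the built-in `ord` and `chr` functions. The helper name `code_point_label` is made up for this example.

```python
def code_point_label(ch):
    """Return the U+XXXX label for a single character."""
    return "U+{:04X}".format(ord(ch))

# The four characters from the table above, written as escapes.
samples = ["\u0041", "\u00C6", "\u27A8", "\uBBBB"]

for ch in samples:
    # ord() gives the decimal code point; chr() reverses the mapping.
    print(code_point_label(ch), ord(ch), chr(ord(ch)) == ch)
```

Each line of output shows the Unicode value, the decimal value, and a check that the mapping round-trips.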

Encoding

A character's encoding defines the format of the bytes used to store the character either in memory or on disk. ASCII encoding is the simplest. With ASCII encoding, a character at position 65 (decimal) in the code chart is stored as a single byte with the value 65 (or 01000001 in binary). Unfortunately, ASCII encoding can represent only 128 distinct characters.
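A minimal Python sketch of the ASCII case described above. It encodes the character at position 65 and then checks whether other text fits within ASCII's 128 characters. The helper name `fits_in_ascii` is an assumption introduced for this example.

```python
# "A" sits at position 65 in the code chart.
encoded = "A".encode("ascii")

# The single stored byte, shown as eight binary digits.
bits = format(encoded[0], "08b")

def fits_in_ascii(text):
    """Return True if every character in text can be stored as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

print(list(encoded), bits)          # byte value and its bit pattern
print(fits_in_ascii("A"))           # inside the 128 character range
print(fits_in_ascii("\u00C6"))      # U+00C6 is outside the range
```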

To address ASCII encoding's limitations a number of different encoding schemes were tried. The following table lists some popular schemes in use before Unicode encodings became popular.

Encoding Scheme	Word Length (bits)	Number of Bytes per Character	Number of Characters	Notes
ASCII	7	1	128
ISO-8859	8	1	256	defines 15 different encodings, parts 1-16 (part 12 omitted)
OEM Code Pages	8	1	256	used by Windows character-based applications; also known as "DOS" or "IBM PC" code pages, e.g. cp437 (DOS US) and cp850 (DOS Western European); only one OEM code page is active on a system at any given time
Windows Code Pages	8	1	256	used by Windows Win32 applications; Windows also calls these "ANSI" code pages, though they do not necessarily map to any ANSI standards; e.g. 1252 is the Western European code page and 932 is the Japanese code page; only one Windows code page is active on a system at any given time; a system can have both an active OEM code page and an active Windows code page at the same time
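To make the practical consequence of these 8-bit schemes concrete, the sketch below decodes one and the same byte under several of the encodings listed in the table. It relies on the codec names that ship with Python (`cp437`, `cp850`, `cp1252`, `latin-1` for ISO-8859-1); the choice of byte is arbitrary.

```python
# One stored byte, hex C6.
raw = bytes([0xC6])

# Decode the same byte under several pre-Unicode encodings.
decoded = {
    name: raw.decode(name)
    for name in ("cp437", "cp850", "cp1252", "latin-1")
}

for name, ch in decoded.items():
    # Print the resulting code point rather than the glyph itself.
    print(name, "U+{:04X}".format(ord(ch)))
```

The byte only means U+00C6 when it is read with the Windows 1252 or ISO-8859-1 encodings; the OEM code pages map it to different characters, which is why text written under one code page can appear garbled under another.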

Once Unicode appeared a number of newer encoding schemes were created. Some of the more popular ones are listed below.

Introduced in Unicode Version	Encoding Scheme	Word Length (bits)	Number of Bytes per Character	Uses a Byte Order Mark (BOM)	Notes
1.0	UCS-2	16	2	No	obsolete
2.0	UTF-16	16	2 or 4	Yes; the BOM character U+FEFF is stored as bytes FF FE (Little Endian) or FE FF (Big Endian)	two different byte orders are supported, Little Endian and Big Endian; Little Endian is popular on Intel systems; Big Endian is popular on Motorola and SPARC systems
2.0	UTF-16LE	16	2 or 4	No	uses Little Endian byte order; in practice most software ignores any accidental BOMs at the start of data
2.0	UTF-16BE	16	2 or 4	No	uses Big Endian byte order; in practice most software ignores any accidental BOMs at the start of data
3.0	CESU-8	8	1,2,3, or 6	No	Oracle uses this encoding in its UTF8 character set, which exists for backward compatibility with Oracle 8 databases; the characters U+0000 - U+FFFF are encoded the same as those in UTF-8; characters above U+FFFF are encoded as two separate 3 byte characters
3.0 (1,2,3 word characters) 3.1 (4 word characters)	UTF-8	8	1,2,3, or 4	Optional; the BOM character U+FEFF is stored as bytes EF BB BF (UTF-8 BOM)
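The differences between these encodings are easiest to see on a character above U+FFFF. The Python sketch below encodes U+1D11E (a musical symbol) in several of the schemes from the table. Python has no built-in CESU-8 codec, so `cesu8_encode` is a hypothetical helper written for this example; it follows the table's description by encoding each UTF-16 surrogate as its own 3 byte sequence.

```python
import codecs

def cesu8_encode(text):
    """Encode text as CESU-8 (illustrative helper, not a library function)."""
    out = bytearray()
    for ch in text:
        if ord(ch) > 0xFFFF:
            # Split into two UTF-16 surrogates, then encode each as 3 bytes.
            utf16 = ch.encode("utf-16-be")
            for i in (0, 2):
                unit = int.from_bytes(utf16[i:i + 2], "big")
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            # U+0000 - U+FFFF are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)

clef = "\U0001D11E"  # a character above U+FFFF

utf8 = clef.encode("utf-8")         # 4 bytes
utf16le = clef.encode("utf-16-le")  # 4 bytes, no BOM
utf16be = clef.encode("utf-16-be")  # 4 bytes, no BOM
utf16 = clef.encode("utf-16")       # BOM first, then platform byte order
cesu8 = cesu8_encode(clef)          # 6 bytes

for name, data in [("UTF-8", utf8), ("UTF-16LE", utf16le),
                   ("UTF-16BE", utf16be), ("UTF-16", utf16),
                   ("CESU-8", cesu8)]:
    print(name, len(data), data.hex(" "))
```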

UTF-8

UTF-8 is a variable width, 8-bit Unicode encoding that can address over 1 million distinct characters. It is backward compatible with ASCII and is the most popular encoding for Oracle Unicode databases, email, and web pages.

It is important to note that a Unicode character's position in the code chart does not identify the actual bit sequence used to encode the character. Take the Unicode character U+00C6 for example. The hex value C6 in binary is 11000110 but the Unicode character U+00C6 is actually encoded in UTF-8 as two bytes, 11000011 10000110.
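The two-byte case above can be worked through by hand. In the following Python sketch, `utf8_two_byte` is a hypothetical helper that handles only code points U+0080 through U+07FF, the range UTF-8 stores in two bytes; it is meant to show the bit layout, not to replace the built-in encoder.

```python
def utf8_two_byte(code_point):
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 covers U+0080..U+07FF only")
    # First byte: marker bits 110, then the top 5 bits of the code point.
    lead = 0b11000000 | (code_point >> 6)
    # Second byte: marker bits 10, then the low 6 bits of the code point.
    trail = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, trail])

encoded = utf8_two_byte(0x00C6)
bit_strings = [format(b, "08b") for b in encoded]

print(bit_strings)
# The hand-built bytes match Python's own UTF-8 encoder.
print(encoded == "\u00C6".encode("utf-8"))
```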

Byte Order Marks

A Byte Order Mark (BOM) is a sequence of bytes placed at the start of some files to identify the file's encoding. A BOM may be found in some Windows files. Linux/UNIX files typically do not use BOMs. The following table lists three common BOMs.

Standard Encoding Name	Byte Order Mark (hexadecimal representation)
UTF-8	EF BB BF
UTF-16 (Little Endian)	FF FE
UTF-16 (Big Endian)	FE FF

According to the Unicode standard, a BOM is optional in UTF-8 files. However, some programs expect UTF-8 files to have no BOM. When such programs open a UTF-8 file that includes a BOM, they may incorrectly display the BOM as these three printable characters: ï»¿.
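A short Python sketch of how a program might check for the three BOMs in the table and avoid the display problem described above. The function name `detect_bom` is an assumption for this example, and it considers only these three BOMs; the `utf-8-sig` codec and the `codecs.BOM_*` constants are part of Python's standard library.

```python
import codecs

# The three BOMs from the table, paired with their encoding names.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 (Little Endian)"),
    (codecs.BOM_UTF16_BE, "UTF-16 (Big Endian)"),
]

def detect_bom(data):
    """Return the encoding name implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

# File contents as a BOM-writing program might produce them.
with_bom = codecs.BOM_UTF8 + "abc".encode("utf-8")

clean = with_bom.decode("utf-8-sig")  # strips the BOM if present
naive = with_bom.decode("utf-8")      # keeps U+FEFF as the first character
misread = with_bom.decode("cp1252")   # BOM bytes become three characters

print(detect_bom(with_bom), len(clean), len(naive), len(misread))
```

Decoding with `utf-8-sig` is a reasonable default when a file may or may not start with a BOM, since it accepts both forms.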

Planes

Unicode characters are grouped into a number of planes. Each plane contains 65,536 code points. The first 65,536 code points (0x0000-0xFFFF) are in Plane 0, which is also called the "Basic Multilingual Plane (BMP)". The BMP contains the characters for almost all modern languages and symbols used for general computing purposes. It is also the only plane of characters supported in early versions of the Unicode standard. Planes above the BMP contain code points for ancient scripts, mathematical and musical symbols, and rarely used Han ideographs.
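Because each plane holds 65,536 code points, a code point's plane number is simply its value divided by 65,536. The Python sketch below shows this; the helper names `plane_of` and `in_bmp` are made up for this example.

```python
def plane_of(code_point):
    """Return the plane number (0 = BMP) for a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a valid Unicode code point")
    # Shifting right by 16 bits is the same as dividing by 65,536.
    return code_point >> 16

def in_bmp(ch):
    """Return True if the character lies in the Basic Multilingual Plane."""
    return plane_of(ord(ch)) == 0

print(plane_of(0x0041))      # a BMP character
print(plane_of(0xFFFF))      # last code point in the BMP
print(plane_of(0x1D11E))     # a musical symbol above the BMP
print(in_bmp("\U0001D11E"))
```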

While significant support and attention are given to planes above the BMP in Unicode standards documentation, it is important to note that characters in these planes are rarely encountered in general computing. One reason for this is that widely used font formats such as TrueType and OpenType limit a single font to about 65,536 glyphs. Since characters in the BMP are essential in many applications, most fonts allocate their glyphs only to this plane.

Pan-Unicode Fonts

Even though any single font can hold up to 65,536 Unicode glyphs, in practice there are fewer than a dozen fonts (as of 2009-12) that hold anywhere near that number. Fonts with a large number of Unicode glyphs are known as "pan-Unicode fonts", "Unicode fonts", or "Unicode typefaces". Examples of pan-Unicode proportional fonts include:

GNU FreeFont fonts
Code2000
GNU Unifont
Wen Quan Yi fonts
Arial Unicode MS

Monospace pan-Unicode fonts are even rarer than proportional ones. Notable free or shareware monospace fonts include:

FreeMono - part of the GNU FreeFont family
Everson Mono (shareware)
Wen Quan Yi Zen Hei Mono - this font includes a large number of Chinese, Japanese, Korean, and Vietnamese glyphs; unfortunately Zen Hei Mono co-exists in the same .ttf file as its proportional cousin, Zen Hei, which appears to make Zen Hei Mono unavailable to some programs.

Font Substitution

To work around the limited availability of pan-Unicode fonts, many applications rely on a technique called font substitution. With this technique, when an application needs to display a character whose glyph is missing from the desired font, it searches other fonts, preferably from the same font family, for the missing glyph and then uses the first one it finds to render the character.
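The lookup described above can be sketched in a few lines of Python. This is only a simplified model, not how any particular application implements it: the font names, the `coverage` table, and the `pick_font` function are all hypothetical, and real systems read glyph coverage from the font files themselves.

```python
def pick_font(ch, desired, fallbacks, coverage):
    """Return the first font that has a glyph for ch, or None.

    coverage maps a font name to the set of code points it can render.
    fallbacks should list same-family fonts first, then others.
    """
    for name in [desired] + list(fallbacks):
        if ord(ch) in coverage.get(name, set()):
            return name
    return None

# Hypothetical fonts and the code points each one covers.
coverage = {
    "ExampleSans": set(range(0x20, 0x7F)),
    "ExampleSans Symbols": {0x27A8},
    "ExampleCJK": {0xBBBB},
}
fallbacks = ["ExampleSans Symbols", "ExampleCJK"]

for ch in ["A", "\u27A8", "\uBBBB", "\u00C6"]:
    chosen = pick_font(ch, "ExampleSans", fallbacks, coverage)
    print("U+{:04X}".format(ord(ch)), chosen)
```

A result of `None` corresponds to the case where no installed font has the glyph, in which case applications commonly show a placeholder box.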

Applications that use font substitution include Firefox, Notepad, Excel 2007, Internet Explorer (v7 and newer), Adobe Reader, and SQL Developer.

Applications that do not use font substitution include Internet Explorer (v6 and older), PuTTY, Windows XP's cmd.exe window, and applications like SQL*Plus that run inside cmd.exe windows.

SQL Snippets ™: Tutorials for Oracle Developers