Unicode: Unicode on Windows XP

Note: If you are unfamiliar with Unicode see the brief primer at this link before continuing.

Working with character sets on Windows XP is not trivial. Its commands and applications support different combinations of the encoding standards that have emerged over the years. In general, when working with Unicode data from an Oracle database (e.g. creating a SQL*Plus spool file), it is easiest either to use SQL Developer or to do as much work as you can on a UNIX system and then transfer the data to Windows as needed (e.g. to load the spool file into a spreadsheet).

If you cannot use SQL Developer or UNIX then the following information can help. The first chart summarizes character set support in a few common applications. Detailed information on each of the applications is presented in separate topics later in this tutorial.

Encoding	Byte Order Mark?	cmd.exe: TYPE	cmd.exe: Write to File	Powershell ISE: TYPE	Powershell ISE: Write to File	Notepad: Open File	Notepad: Save File	Excel 2007: Open File	Excel 2007: Save File	Internet Explorer 6: Open File	Firefox 3: Open File	SQL*Plus: Screen Output	SQL*Plus: Spool File	SQL Developer: Results Tab	SQL Developer: Export Files
cp850 (OEM or MS-DOS)		Y	Y		Y			Y	Y		Y	Y	Y	Y²	Y
windows-1252 (Windows "ANSI")		Y	Y	Y	Y	Y	Y	Y	Y	Y	Y	Y	Y	Y²	Y
UTF-8		Y¹	Y			Y		Y			Y	Y¹	Y	Y	Y
UTF-8	Y			Y	Y	Y	Y	Y		Y	Y				Y
UTF-16LE			Y			Y					Y				Y
UTF-16	LE	Y¹		Y	Y	Y	Y	Y	Y	Y	Y
UTF-16BE						Y					Y			Y²	Y
UTF-16	BE			Y	Y	Y	Y			Y	Y				Y
¹ Some characters may not display properly due to font limitations.
² Data in this format is converted to UTF-8 before it is displayed in SQL Developer.

Code Pages

Windows XP (as well as earlier versions of Windows and DOS) implements character sets in "code pages". Character based applications use whichever code page is set as the active "OEM" (aka "MS-DOS") code page and Win32 applications use whichever code page is set as the active "ANSI" code page. (Note that Windows "ANSI" code pages do not necessarily map to official ANSI standard character sets.)

The following table lists some useful code pages and their corresponding Oracle character sets. The code pages discussed later in this document are 850, 1252, 1200, 1201, and 65001.

Windows Code Pages
Code Page Type	Code Page Number	IANA Character Set Name	Oracle Character Set	Notes
OEM (MS-DOS)	437	cp437	US8PC437
	737		EL8PC737
	850	cp850	WE8PC850
	852	cp852	EE8PC852
	857	cp857	TR8PC857
	858	cp00858	WE8PC858
	861	cp861	IS8PC861
	862	cp862	IW8PC1507
	865	cp865	N8PC865
	866	cp866	RU8PC866
ANSI	874	TIS-620	TH8TISASCII
	932	SHIFT_JIS	JA16SJIS
	936	GBK	ZHS16GBK
	949	EUC-KR	KO16MSWIN949
	950	BIG5	ZHT16MSWIN950	except for Hong Kong
	1250	windows-1250	EE8MSWIN1250
	1251	windows-1251	CL8MSWIN1251
	1252	windows-1252	WE8MSWIN1252
	1253	windows-1253	EL8MSWIN1253
	1254	windows-1254	TR8MSWIN1254
	1255	windows-1255	IW8MSWIN1255
	1256	windows-1256	AR8MSWIN1256
	1257	windows-1257	BLT8MSWIN1257
	1258	windows-1258	VN8MSWIN1258
Other	1200	UTF-16LE	AL16UTF16LE	available only to managed applications
	1201	UTF-16BE	AL16UTF16	available only to managed applications
	65001	UTF-8	AL32UTF8	Oracle character set UTF8 (which is available for backward compatibility with Oracle 8) can also be used with characters U+0000 - U+FFFF

A full list of MS code pages is available at this link. In a Command Prompt window the current OEM code page can be viewed or set using the CHCP command.
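To make the code page table concrete, here is a minimal Python sketch showing that one character occupies different bytes under different code pages. The dictionary and the `encode_for_code_page` helper are illustrative names only; the codec names on the right are Python's own, not Windows or Oracle identifiers.

```python
# Map a few Windows code page numbers from the table above to Python codecs.
# This mapping is an assumption made for illustration, not a Windows API.
CODE_PAGE_TO_CODEC = {
    437: "cp437",
    850: "cp850",
    1252: "cp1252",
    1200: "utf-16-le",
    1201: "utf-16-be",
    65001: "utf-8",
}


def encode_for_code_page(text: str, code_page: int) -> bytes:
    """Return the bytes `text` occupies under the given code page."""
    return text.encode(CODE_PAGE_TO_CODEC[code_page])


# The same character is stored as different bytes in each code page.
assert encode_for_code_page("à", 850) == b"\x85"        # OEM
assert encode_for_code_page("à", 1252) == b"\xe0"       # ANSI
assert encode_for_code_page("à", 65001) == b"\xc3\xa0"  # UTF-8
assert encode_for_code_page("à", 1200) == b"\xe0\x00"   # UTF-16LE
assert encode_for_code_page("à", 1201) == b"\x00\xe0"   # UTF-16BE
```

Plain ASCII letters such as "abc" produce identical bytes in code pages 437, 850, 1252, and 65001, which is why encoding mismatches usually show up only in accented or non-Latin characters.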

Command Prompt

Displaying File Contents

To view the largest number of characters properly in a Command Prompt window (aka "DOS Window" or "cmd.exe") its font should be set to "Lucida Console". To do this right-click the title bar and choose Properties / Font and then highlight "Lucida Console". While "Lucida Console" does include more characters than the default Raster fonts, it still lacks a large number of Unicode characters (notably Asian characters). Characters that have no glyph in the Lucida Console font will appear as box characters, e.g.

⌷⌷⌷

Installing a font with more characters, like "Arial Unicode MS", will not resolve this issue because such fonts are not compatible with cmd.exe (see necessary criteria for fonts to be available in a command window). Users who absolutely need to see a wide variety of characters in a character mode window have two options.

copy an existing font and customize it to make it compatible with cmd.exe (requires digital font creation skills)
use Windows Powershell ISE (see the next section)

When displaying file contents using the TYPE command characters may not display properly if the session's code page does not match the file's encoding.

D:\Work\Unicode>chcp 850
Active code page: 850

D:\Work\Unicode>type cp850.txt
abc-àèìòù©                              <-- correct

D:\Work\Unicode>type cp1252-ansi.txt
abc-ÓÞý‗¨®                              <-- incorrect since file's encoding differs from session's

D:\Work\Unicode>chcp 1252
Active code page: 1252

D:\Work\Unicode>type cp1252-ansi.txt
abc-àèìòù©                              <-- now it displays properly

D:\Work\Unicode>chcp 65001
Active code page: 65001

D:\Work\Unicode>type utf-8-no-bom.txt
abc-àèìòù©-⌷⌷⌷                       <-- Lucida Console does not have glyphs 
                                            for the last three Korean characters; 
                                            however they will cut/paste into GUI
                                            applications correctly
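The garbled output in the transcript above can be reproduced outside of Windows. The following Python sketch is only a model of what TYPE does (it treats the console as a decoder for the session's code page); `type_file` is a hypothetical helper name.

```python
TEXT = "abc-àèìòù©"


def type_file(file_bytes: bytes, session_codec: str) -> str:
    """Decode file bytes the way a console using `session_codec` would."""
    return file_bytes.decode(session_codec, errors="replace")


cp1252_file = TEXT.encode("cp1252")

# chcp 850 with a windows-1252 file: each accented letter becomes another
# character (\u2017 is the double low line shown in the transcript).
assert type_file(cp1252_file, "cp850") == "abc-ÓÞý\u2017¨®"

# chcp 1252 with a windows-1252 file: the text round-trips cleanly.
assert type_file(cp1252_file, "cp1252") == TEXT
```

No bytes are lost in the mismatched case; the file is intact and only the interpretation is wrong, which is why switching the code page with CHCP fixes the display.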

The following chart lists test results observed by TYPE'ing files with various encodings.

CHCP Setting	Encoding	Byte Order Mark	Test Data	Result	Notes
850	cp850		abc-àèìòù©	Pass
1252	windows-1252		abc-àèìòù©	Pass
65001	UTF-8		abc-àèìòù©-뮻뮼뮽	Pass	replacement characters (boxes) appeared for Korean characters
(any)	UTF-8	UTF-8	abc-àèìòù©-뮻뮼뮽	Fail	BOM character was incorrectly displayed
(any)	UTF-16LE		abc-àèìòù©-뮻뮼뮽	Fail
(any)	UTF-16	LE	abc-àèìòù©-뮻뮼뮽	Pass	replacement characters (boxes) appeared for Korean characters
(any)	UTF-16BE		abc-àèìòù©-뮻뮼뮽	Fail
(any)	UTF-16	BE	abc-àèìòù©-뮻뮼뮽	Fail
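The BOM failure in the chart is easier to understand when the BOM is viewed as ordinary bytes. The Python sketch below shows the three BOMs used in this tutorial and, as an illustration only, how the UTF-8 BOM would look if its bytes were interpreted in code page 850 (the chart does not record exactly which characters appeared).

```python
import codecs

# Byte order marks are just fixed byte sequences at the start of a file.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"
assert codecs.BOM_UTF16_LE == b"\xff\xfe"
assert codecs.BOM_UTF16_BE == b"\xfe\xff"

utf8_bom_file = codecs.BOM_UTF8 + "abc".encode("utf-8")

# A reader that does not know about BOMs shows the three bytes as text.
# Under code page 850 they map to an accent and two box-drawing characters.
assert utf8_bom_file.decode("cp850") == "\u00b4\u2557\u2510abc"

# A plain UTF-8 decoder keeps the BOM as the character U+FEFF.
assert utf8_bom_file.decode("utf-8") == "\ufeffabc"

# Python's "utf-8-sig" codec strips the BOM, as a BOM-aware reader would.
assert utf8_bom_file.decode("utf-8-sig") == "abc"
```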

Saving Files

By default files created from a command line in Windows XP are saved using the current OEM code page. The current code page can be viewed and set using the CHCP command.

D:\Work\Unicode>chcp
Active code page: 850

D:\Work\Unicode>echo abc-àèìòù© > cp850-encoded.txt

D:\Work\Unicode>chcp 65001
Active code page: 65001

D:\Work\Unicode>echo abc-àèìòù©⌷⌷⌷ > utf-8-no-bom-encoded.txt

In the example above the echo'd text includes three Korean characters at the end of the string. Since Lucida Console has no glyphs for these characters they appear as boxes on the screen. However when utf-8-no-bom-encoded.txt is opened in a GUI application that supports font substitution and UTF-8 the characters are typically displayed properly.
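The reason the second file needs code page 65001 is that cp850 has no byte values for Korean characters at all. A small Python sketch of that check follows; `can_encode` is a hypothetical helper, and the sketch says nothing about the trailing space and line ending that echo also writes.

```python
KOREAN = "뮻뮼뮽"


def can_encode(text: str, codec: str) -> bool:
    """Report whether every character of `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# The accented test string fits in the OEM code page used above.
assert can_encode("abc-àèìòù©", "cp850")

# The Korean characters do not, so they cannot be saved under chcp 850.
assert not can_encode(KOREAN, "cp850")

# UTF-8 can represent them, using three bytes per character here.
assert can_encode(KOREAN, "utf-8")
assert len(KOREAN.encode("utf-8")) == 9
```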

When cmd.exe is started with the /U switch, files are created with UTF-16LE encoding (without a Byte Order Mark (BOM)).

cmd /u /c echo abc-àèìòù© > utf-16le-no-bom.txt
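As a rough picture of what such a file contains, the Python sketch below encodes the same text as UTF-16LE with and without a BOM. It models only the character data, not the trailing space or line ending that echo adds.

```python
import codecs

text = "abc-àèìòù©"

# UTF-16LE without a BOM: two bytes per character, low byte first.
no_bom = text.encode("utf-16-le")
assert no_bom[:8] == b"a\x00b\x00c\x00-\x00"
assert len(no_bom) == 2 * len(text)

# The same data with a little-endian BOM prepended.
with_bom = codecs.BOM_UTF16_LE + no_bom
assert with_bom[:2] == b"\xff\xfe"

# Without a BOM the reader has to be told the encoding explicitly.
assert no_bom.decode("utf-16-le") == text
```

This is consistent with the test charts in this tutorial, where BOM-less UTF-16 files fail in tools that rely on a BOM to recognize the encoding.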

Note that UTF-16 code pages are not supported in Command Prompt windows.

D:\Work\Unicode>chcp 1200
Invalid code page

D:\Work\Unicode>chcp 1201
Invalid code page

Cutting and Pasting Unicode Characters

Even though font limitations prevent us from seeing all Unicode characters in a cmd.exe window, we can still cut and paste characters into Unicode aware GUI clients like Notepad or Excel 2007. Unlike cmd.exe, which uses a single font to display all characters, applications like Notepad look in other fonts for glyphs that the current font lacks and display those when found (this is known as font substitution). Thus any characters that appeared missing, garbled, or were represented by boxes in cmd.exe can appear properly in Notepad, even when both applications are set to the same font.

Windows Powershell ISE

A GUI alternative to the character based Command Prompt window is available for Windows XP (and other versions of Windows). It is called Powershell ISE and is available as a component of Powershell v2 inside the Windows Management Framework Core. The Windows Powershell ISE tool displays three panes -- Script, Command, and Output. Here is a screen shot of these panes.

[Figure: Windows Powershell ISE screenshot]

Unlike Command Prompt windows, Windows Powershell ISE's Output Pane will correctly display most Unicode characters, even those that have no glyph in the default font.

Displaying File Contents

In Powershell the TYPE command is actually an alias for a cmdlet called Get-Content. Other aliases for this cmdlet are "cat" and "gc". It is important to note that Powershell's TYPE command behaves differently from cmd.exe's TYPE command when it comes to encodings. Here are some examples of what you can expect to see in the Output Pane.

PS D:\work\unicode> chcp
Active code page: 850

PS D:\work\unicode> type cp850.txt
abc-…Š�•—¸                                <-- fails in ISE, works in cmd.exe

PS D:\work\unicode> chcp 1252
Active code page: 1252

PS D:\work\unicode> type cp1252-ansi.txt
abc-àèìòù©                                 <-- works in both

PS D:\work\unicode> chcp 65001
Active code page: 65001

PS D:\work\unicode> type utf-8-no-bom.txt
abc-Ã Ã¨Ã¬Ã²Ã¹Â©-ë®»ë®¼ë®½                 <-- fails in ISE, works in cmd.exe

The following chart lists test results observed by TYPE'ing files with various encodings.

CHCP Setting	Encoding	Byte Order Mark	Test Data	Result
850	cp850		abc-àèìòù©	Fail
1252	windows-1252		abc-àèìòù©	Pass
65001	UTF-8		abc-àèìòù©-뮻뮼뮽	Fail
(any)	UTF-8	UTF-8	abc-àèìòù©-뮻뮼뮽	Pass
(any)	UTF-16LE		abc-àèìòù©-뮻뮼뮽	Fail
(any)	UTF-16	LE	abc-àèìòù©-뮻뮼뮽	Pass
(any)	UTF-16BE		abc-àèìòù©-뮻뮼뮽	Fail
(any)	UTF-16	BE	abc-àèìòù©-뮻뮼뮽	Pass
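The results in this chart are consistent with a reader that honors a BOM when one is present and otherwise falls back to the ANSI code page, ignoring the CHCP setting. That is an inference from the observed output, not documented behavior taken from this tutorial. The Python sketch below models it; `read_like_ise` is a hypothetical name and windows-1252 is assumed as the ANSI code page.

```python
import codecs

# BOMs checked in order; anything else is treated as ANSI text.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def read_like_ise(data: bytes, ansi_codec: str = "cp1252") -> str:
    """Decode bytes using the BOM if present, else the assumed ANSI codec."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(codec, errors="replace")
    return data.decode(ansi_codec, errors="replace")


TEXT = "abc-àèìòù©"

# windows-1252 file: Pass.
assert read_like_ise(TEXT.encode("cp1252")) == TEXT

# cp850 file: Fail, with one byte (0x8D) that windows-1252 does not define.
garbled = "abc-\u2026\u0160\ufffd\u2022\u2014\u00b8"
assert read_like_ise(TEXT.encode("cp850")) == garbled

# Files with a BOM: Pass, whatever the CHCP setting.
assert read_like_ise(codecs.BOM_UTF8 + TEXT.encode("utf-8")) == TEXT
assert read_like_ise(codecs.BOM_UTF16_BE + TEXT.encode("utf-16-be")) == TEXT

# UTF-8 without a BOM: Fail, because it is read as windows-1252.
assert read_like_ise(TEXT.encode("utf-8")) != TEXT
```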

Creating Files

When files are created using the ">" redirection operator in Powershell ISE they are created using a UTF-16 (with LE BOM) encoding regardless of the current code page setting. To create a file with a different encoding, data can be piped to the "Out-File" cmdlet, whose syntax is

Out-File [-FilePath] <string> [[-Encoding] <string>]

Here are some examples.

PS D:\work> echo abc-àèìòù© | Out-File a-cp850-file.txt -Encoding OEM 

PS D:\work> echo abc-àèìòù© | Out-File a-cp1252-file.txt -Encoding Default

PS D:\work> echo abc-àèìòù©-뮻뮼뮽 | Out-File a-utf-8-bom-file.txt -Encoding UTF8

PS D:\work> echo abc | Out-File a-ascii-file.txt -Encoding ASCII

Valid values for Out-File's -Encoding parameter are:

-Encoding <string>	Encoding	Default
Default	the system's current ANSI code page
OEM	the system's current OEM (aka MS-DOS) code page
ASCII	US-ASCII
UTF7	UTF-7
UTF8	UTF-8 (BOM)
Unicode	UTF-16 (LE BOM)	Y
BigEndianUnicode	UTF-16 (BE BOM)
UTF32	UTF-32 (LE BOM)
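To compare the byte layouts these options produce, the table can be mirrored in a short Python sketch. The mapping is an approximation built from the table above: windows-1252 and cp850 are assumed as the ANSI and OEM code pages, `encode_like_out_file` is a hypothetical helper, and the line terminator that Out-File appends is ignored.

```python
import codecs

# -Encoding value -> (BOM bytes written first, Python codec for the text).
OUT_FILE_ENCODINGS = {
    "Default": (b"", "cp1252"),           # assumed ANSI code page
    "OEM": (b"", "cp850"),                # assumed OEM code page
    "ASCII": (b"", "ascii"),
    "UTF7": (b"", "utf-7"),
    "UTF8": (codecs.BOM_UTF8, "utf-8"),
    "Unicode": (codecs.BOM_UTF16_LE, "utf-16-le"),
    "BigEndianUnicode": (codecs.BOM_UTF16_BE, "utf-16-be"),
    "UTF32": (codecs.BOM_UTF32_LE, "utf-32-le"),
}


def encode_like_out_file(text: str, encoding: str = "Unicode") -> bytes:
    """Return BOM plus encoded text; unencodable characters become '?'."""
    bom, codec = OUT_FILE_ENCODINGS[encoding]
    return bom + text.encode(codec, errors="replace")


# The default matches the ">" redirection behavior described above.
assert encode_like_out_file("abc") == b"\xff\xfea\x00b\x00c\x00"
assert encode_like_out_file("abc", "UTF8") == b"\xef\xbb\xbfabc"
assert encode_like_out_file("à", "OEM") == b"\x85"
assert encode_like_out_file("à", "Default") == b"\xe0"
```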

SQL Snippets ™: Tutorials for Oracle Developers
