Unicode

Unicode on Windows XP

Note: If you are unfamiliar with Unicode see the brief primer at this link before continuing.

Working with character sets on Windows XP is not trivial. Its commands and applications support different combinations of various encoding standards that have emerged over the years. In general when working with Unicode data from an Oracle database (e.g. creating a SQL*Plus spool file) it is easiest to either use SQL Developer or do as much work as you can on a UNIX system and then transfer the data to Windows as needed (e.g. to load the spool file into a spreadsheet).

If you cannot use SQL Developer or UNIX then the following information can help. The first chart summarizes character set support in a few common applications. Detailed information on each of the applications is presented in separate topics later in this tutorial.

In the chart, Y marks a supported combination. Y1 and Y2 refer to the following notes.

1 Some characters may not display properly due to font limitations.

2 Data in this format is converted to UTF-8 before it is displayed in SQL Developer.

Chart columns, in order (sub-columns are listed after each application):

  • Encoding
  • Byte Order Mark?
  • Command Prompt (cmd.exe): TYPE, Write to File
  • Powershell ISE: TYPE, Write to File
  • Notepad: Open File, Save File
  • Excel 2007: Open File, Save File
  • Internet Explorer 6: Open File
  • Firefox 3: Open File
  • SQL*Plus: Screen Output, Spool File
  • SQL Developer: Results Tab, Export Files

cp850 (OEM or MS-DOS)
Y
Y

Y


Y
Y

Y Y
Y Y2 Y
windows-1252
(Windows "ANSI")
Y
Y
Y
Y
Y
Y
Y
Y
Y Y Y
Y Y2 Y
UTF-8
Y1
Y


Y

Y
Y Y1
Y Y Y
UTF-8
Y

Y
Y
Y
Y
Y
Y Y

Y
UTF-16LE

Y


Y



Y

Y
UTF-16
LE Y1

Y
Y
Y
Y
Y
Y
Y Y

UTF-16BE




Y



Y

Y2 Y
UTF-16
BE

Y
Y
Y
Y


Y Y

Y

Code Pages

Windows XP (as well as earlier versions of Windows and DOS) implements character sets in "code pages". Character based applications use whichever code page is set as the active "OEM" (aka "MS-DOS") code page and Win32 applications use whichever code page is set as the active "ANSI" code page. (Note that Windows "ANSI" code pages do not necessarily map to official ANSI standard character sets.)

The following table lists some useful code pages and their corresponding Oracle character sets. Bold values identify character sets that will be discussed later in this document.

Windows Code Pages

Code Page Type  Code Page Number  IANA Character Set Name  Oracle Character Set  Notes
OEM (MS-DOS)    437               cp437                    US8PC437
OEM (MS-DOS)    737               -                        EL8PC737
OEM (MS-DOS)    850               cp850                    WE8PC850
OEM (MS-DOS)    852               cp852                    EE8PC852
OEM (MS-DOS)    857               cp857                    TR8PC857
OEM (MS-DOS)    858               cp00858                  WE8PC858
OEM (MS-DOS)    861               cp861                    IS8PC861
OEM (MS-DOS)    862               cp862                    IW8PC1507
OEM (MS-DOS)    865               cp865                    N8PC865
OEM (MS-DOS)    866               cp866                    RU8PC866
ANSI            874               TIS-620                  TH8TISASCII
ANSI            932               SHIFT_JIS                JA16SJIS
ANSI            936               GBK                      ZHS16GBK
ANSI            949               EUC-KR                   KO16MSWIN949
ANSI            950               BIG5                     ZHT16MSWIN950         except for Hong Kong
ANSI            1250              windows-1250             EE8MSWIN1250
ANSI            1251              windows-1251             CL8MSWIN1251
ANSI            1252              windows-1252             WE8MSWIN1252
ANSI            1253              windows-1253             EL8MSWIN1253
ANSI            1254              windows-1254             TR8MSWIN1254
ANSI            1255              windows-1255             IW8MSWIN1255
ANSI            1256              windows-1256             AR8MSWIN1256
ANSI            1257              windows-1257             BLT8MSWIN1257
ANSI            1258              windows-1258             VN8MSWIN1258
Other           1200              UTF-16LE                 AL16UTF16LE           available only to managed applications
Other           1201              UTF-16BE                 AL16UTF16             available only to managed applications
Other           65001             UTF-8                    AL32UTF8              Oracle character set UTF8 (which is available for backward compatibility with Oracle 8) can also be used with characters U+0000 - U+FFFF

A full list of MS code pages is available at this link. In a Command Prompt window the current OEM code page can be viewed or set using the CHCP command.
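To make the OEM versus ANSI distinction concrete, here is a minimal Python sketch that decodes the same byte under a few of the code pages listed above. The `CODE_PAGE_CODECS` mapping and the `decode_byte` helper are hypothetical names introduced for illustration, and the codec names are Python's own, not Windows or Oracle identifiers.

```python
# Hypothetical mapping from a Windows code page number to a Python codec name.
CODE_PAGE_CODECS = {
    437: "cp437",
    850: "cp850",
    1252: "cp1252",
    65001: "utf-8",
}


def decode_byte(value, code_page):
    """Decode one byte under the given code page.

    Bytes that are invalid in that code page come back as U+FFFD.
    """
    codec = CODE_PAGE_CODECS[code_page]
    return bytes([value]).decode(codec, errors="replace")


# The same byte is a different character in each code page.
for cp in (437, 850, 1252, 65001):
    print(cp, ascii(decode_byte(0xE9, cp)))
```

The byte 0xE9 is "é" only under windows-1252. Under the two OEM code pages it is a different character, and on its own it is not valid UTF-8 at all.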

Command Prompt

Displaying File Contents

To view the largest number of characters properly in a Command Prompt window (aka "DOS Window" or "cmd.exe") its font should be set to "Lucida Console". To do this right-click the title bar and choose Properties / Font and then highlight "Lucida Console". While "Lucida Console" does include more characters than the default Raster fonts, it still lacks a large number of Unicode characters (notably Asian characters). Characters that have no glyph in the Lucida Console font will appear as box characters, e.g.

⌷⌷⌷

Installing a font with more characters, like "Arial Unicode MS", will not resolve this issue because such fonts are not compatible with cmd.exe (see necessary criteria for fonts to be available in a command window). Users who absolutely need to see a wide variety of characters in a character mode window have two options.

  1. copy an existing font and customize it to make it compatible with cmd.exe (requires digital font creation skills)
  2. use Windows Powershell ISE (see the next section)

When displaying file contents using the TYPE command characters may not display properly if the session's code page does not match the file's encoding.

D:\Work\Unicode>chcp 850
Active code page: 850

D:\Work\Unicode>type cp850.txt
abc-àèìòù©                              <-- correct

D:\Work\Unicode>type cp1252-ansi.txt
abc-ÓÞý‗¨®                              <-- incorrect since file's encoding differs from session's

D:\Work\Unicode>chcp 1252
Active code page: 1252

D:\Work\Unicode>type cp1252-ansi.txt
abc-àèìòù©                              <-- now it displays properly

D:\Work\Unicode>chcp 65001
Active code page: 65001

D:\Work\Unicode>type utf-8-no-bom.txt
abc-àèìòù©-⌷⌷⌷                          <-- Lucida Console does not have glyphs for the last three Korean characters; however they will cut/paste into GUI applications correctly
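The mismatch in the transcript can be reproduced away from the console. The following Python sketch assumes that TYPE simply interprets the file's bytes using the session's code page; `type_as` is a hypothetical helper, not a Windows API.

```python
TEXT = "abc-àèìòù©"


def type_as(file_bytes, session_codec):
    """Decode file bytes the way a session using session_codec would show them."""
    return file_bytes.decode(session_codec, errors="replace")


oem_file = TEXT.encode("cp850")     # bytes of a cp850 encoded file
ansi_file = TEXT.encode("cp1252")   # bytes of a windows-1252 encoded file

print(type_as(oem_file, "cp850"))    # file and session agree
print(type_as(ansi_file, "cp850"))   # windows-1252 bytes read as cp850: garbled
print(type_as(ansi_file, "cp1252"))  # file and session agree again
```

The second call yields the same garbled string shown in the transcript, which suggests the bytes on disk are intact and only their interpretation differs.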

The following chart lists test results observed by TYPE'ing files with various encodings.

CHCP Setting  Encoding      Byte Order Mark  Test Data           Result  Notes
850           cp850         -                abc-àèìòù©          Pass
1252          windows-1252  -                abc-àèìòù©          Pass
65001         UTF-8         -                abc-àèìòù©-뮻뮼뮽   Pass    replacement characters (boxes) appeared for Korean characters
(any)         UTF-8         UTF-8            abc-àèìòù©-뮻뮼뮽   Fail    BOM character was incorrectly displayed
(any)         UTF-16LE      -                abc-àèìòù©-뮻뮼뮽   Fail
(any)         UTF-16        LE               abc-àèìòù©-뮻뮼뮽   Pass    replacement characters (boxes) appeared for Korean characters
(any)         UTF-16BE      -                abc-àèìòù©-뮻뮼뮽   Fail
(any)         UTF-16        BE               abc-àèìòù©-뮻뮼뮽   Fail

Saving Files

By default files created from a command line in Windows XP are saved using the current OEM code page. The current code page can be viewed and set using the CHCP command.

D:\Work\Unicode>chcp
Active code page: 850

D:\Work\Unicode>echo abc-àèìòù© > cp850-encoded.txt

D:\Work\Unicode>chcp 65001
D:\Work\Unicode>echo abc-àèìòù©⌷⌷⌷ > utf-8-no-bom-encoded.txt

In the example above the echo'd text includes three Korean characters at the end of the string. Since Lucida Console has no glyphs for these characters they appear as boxes on the screen. However when utf-8-no-bom-encoded.txt is opened in a GUI application that supports font substitution and UTF-8 the characters are typically displayed properly.

When cmd.exe is started with the /U switch, files are created with UTF-16LE encoding (without a Byte Order Mark (BOM)).

cmd /u /c echo abc-àèìòù© > utf-16le-no-bom.txt
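To see what "UTF-16LE without a BOM" means at the byte level, the following Python sketch encodes the same text with and without the little-endian BOM. It is only an illustration of the encoding and ignores any trailing space or line ending that echo adds to the real file.

```python
import codecs

TEXT = "abc-àèìòù©"

# UTF-16LE without a BOM, the form the text above attributes to cmd /u.
no_bom = TEXT.encode("utf-16-le")

# The same data preceded by the little-endian byte order mark (FF FE).
with_bom = codecs.BOM_UTF16_LE + no_bom

print(no_bom[:4].hex())    # first two characters, two bytes each
print(with_bom[:4].hex())  # BOM followed by the first character
```

Every character in this string occupies two bytes, and the only difference between the two forms is the leading FF FE pair.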

Note that UTF-16 code pages are not supported in Command Prompt windows.

D:\Work\Unicode>chcp 1200
Invalid code page

D:\Work\Unicode>chcp 1201
Invalid code page

Cutting and Pasting Unicode Characters

Even though font limitations prevent us from seeing all Unicode characters in a cmd.exe window we can still cut and paste characters into Unicode aware GUI clients like Notepad or Excel 2007. Unlike cmd.exe, which uses a single font to display all characters, applications like Notepad look in other fonts for glyphs that are missing from the current font and display those when found (this is known as font substitution). Thus any characters that appeared missing, garbled, or were represented by boxes in cmd.exe can appear properly in Notepad, even when both applications are set to the same font.

Windows Powershell ISE

A GUI alternative to the character based Command Prompt window is available for Windows XP (and other versions of Windows). It is called Powershell ISE and is available as a component of Powershell v2 inside the Windows Management Framework Core. The Windows Powershell ISE tool displays three panes: Script, Command, and Output. Here is a screen shot of these panes.

Windows Powershell ISE Screenshot

Unlike Command Prompt windows, Windows Powershell ISE's Output Pane will correctly display most Unicode characters, even those that have no glyph in the default font.

Displaying File Contents

In Powershell the TYPE command is actually an alias for a cmdlet called Get-Content. Other aliases for this cmdlet are "cat" and "gc". It is important to note that Powershell's TYPE command behaves differently from cmd.exe's TYPE command when it comes to encodings. Here are some examples of what you can expect to see in the Output Pane.

PS D:\work\unicode> chcp
Active code page: 850

PS D:\work\unicode> type cp850.txt
abc-…Š�•—¸                                <-- fails in ISE, works in cmd.exe

PS D:\work\unicode> chcp 1252
Active code page: 1252

PS D:\work\unicode> type cp1252-ansi.txt
abc-àèìòù©                                 <-- works in both

PS D:\work\unicode> chcp 65001
Active code page: 65001

PS D:\work\unicode> type utf-8-no-bom.txt
abc-à èìòù©-뮻뮼뮽                 <-- fails in ISE, works in cmd.exe

The following chart lists test results observed by TYPE'ing files with various encodings.

CHCP Setting  Encoding      Byte Order Mark  Test Data           Result
850           cp850         -                abc-àèìòù©          Fail
1252          windows-1252  -                abc-àèìòù©          Pass
65001         UTF-8         -                abc-àèìòù©-뮻뮼뮽   Fail
(any)         UTF-8         UTF-8            abc-àèìòù©-뮻뮼뮽   Pass
(any)         UTF-16LE      -                abc-àèìòù©-뮻뮼뮽   Fail
(any)         UTF-16        LE               abc-àèìòù©-뮻뮼뮽   Pass
(any)         UTF-16BE      -                abc-àèìòù©-뮻뮼뮽   Fail
(any)         UTF-16        BE               abc-àèìòù©-뮻뮼뮽   Pass
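The pattern in these results (every file with a BOM passes, and BOM-less files pass only when they are windows-1252) matches what a reader would produce if it chose the encoding from the BOM and otherwise fell back to the ANSI code page. The Python sketch below models that rule as an assumption; `decode_by_bom` is a hypothetical helper and does not claim to be how Get-Content is implemented.

```python
import codecs

# BOMs to look for, paired with the Python codec that consumes them.
# UTF-32 is left out of this sketch; its LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_by_bom(data, fallback="cp1252"):
    """Pick a codec from the leading BOM, else decode with the fallback code page."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return data.decode(codec)
    return data.decode(fallback, errors="replace")


TEXT = "abc-àèìòù©-뮻뮼뮽"
utf16_be_bom = codecs.BOM_UTF16_BE + TEXT.encode("utf-16-be")
utf16_le_no_bom = TEXT.encode("utf-16-le")

print(decode_by_bom(utf16_be_bom) == TEXT)     # BOM present: decoded correctly
print(decode_by_bom(utf16_le_no_bom) == TEXT)  # no BOM: falls back and garbles
```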

Creating Files

When files are created using the ">" redirection operator in Powershell ISE they are created using UTF-16 encoding (with an LE BOM) regardless of the current code page setting. To create a file with a different encoding, pipe the data to the "Out-File" cmdlet, whose syntax is

Out-File [-FilePath] <string> [[-Encoding] <string>]

Here are some examples.

PS D:\work> echo abc-àèìòù© | Out-File a-cp850-file.txt -Encoding OEM 

PS D:\work> echo abc-àèìòù© | Out-File a-cp1252-file.txt -Encoding Default

PS D:\work> echo abc-àèìòù©-뮻뮼뮽 | Out-File a-utf-8-bom-file.txt -Encoding UTF8

PS D:\work> echo abc | Out-File a-ascii-file.txt -Encoding ASCII

Valid values for Out-File's -Encoding parameter are:

-Encoding <string>  Encoding                                         Default
Default             the system's current ANSI code page
OEM                 the system's current OEM (aka MS-DOS) code page
ASCII               US-ASCII
UTF7                UTF-7
UTF8                UTF-8 (BOM)
Unicode             UTF-16 (LE BOM)                                  Y
BigEndianUnicode    UTF-16 (BE BOM)
UTF32               UTF-32 (LE BOM)
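As a rough cross-check of the table, the following Python sketch produces the bytes each listed encoding would be expected to yield for a short string. The `OUT_FILE_ENCODINGS` mapping and `encode_like_out_file` helper are hypothetical. "Default" and "OEM" are assumed to be windows-1252 and cp850 as on the system used in this tutorial, UTF7 is omitted, and the line ending that Out-File appends is ignored.

```python
import codecs

# Hypothetical mapping of Out-File -Encoding values to (BOM bytes, Python codec).
# "Default" and "OEM" depend on the system; cp1252 and cp850 are assumed here.
OUT_FILE_ENCODINGS = {
    "Default": (b"", "cp1252"),
    "OEM": (b"", "cp850"),
    "ASCII": (b"", "ascii"),
    "UTF8": (codecs.BOM_UTF8, "utf-8"),
    "Unicode": (codecs.BOM_UTF16_LE, "utf-16-le"),
    "BigEndianUnicode": (codecs.BOM_UTF16_BE, "utf-16-be"),
    "UTF32": (codecs.BOM_UTF32_LE, "utf-32-le"),
}


def encode_like_out_file(text, encoding):
    """Return the BOM (if any) plus the text; unencodable characters become '?'."""
    bom, codec = OUT_FILE_ENCODINGS[encoding]
    return bom + text.encode(codec, errors="replace")


for name in OUT_FILE_ENCODINGS:
    print(name, encode_like_out_file("abc", name).hex())
```

Comparing the hex output with a hex dump of a real file is a quick way to confirm which encoding and BOM a given command actually wrote.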
