Note: If you are unfamiliar with Unicode, see the brief primer at this link before continuing.
Working with character sets on Windows XP is not trivial. Its commands and
applications support different combinations of various encoding standards that
have emerged over the years. In general, when working with Unicode data from an
Oracle database (e.g. creating a SQL*Plus spool file), it is easiest to either
use SQL Developer or do as much work as you can on a UNIX system and then
transfer the data to Windows as needed (e.g. to load the spool file into a
spreadsheet).
If you cannot use SQL Developer or UNIX, the following information can
help. The first chart summarizes character set support in a few common
applications. Detailed information on each of the applications is
presented in separate topics later in this tutorial.
| Encoding | Byte Order Mark? | Command Prompt (cmd.exe): TYPE | Command Prompt (cmd.exe): Write to File | Powershell ISE: TYPE | Powershell ISE: Write to File | Notepad: Open File | Notepad: Save File | Excel 2007: Open File | Excel 2007: Save File | Internet Explorer 6: Open File | Firefox 3: Open File | SQL*Plus: Screen Output | SQL*Plus: Spool File | SQL Developer: Results Tab | SQL Developer: Export Files |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cp850 (OEM or MS-DOS) | | Y | Y | | Y | | | Y | Y | | Y | Y | Y | Y (2) | Y |
| windows-1252 (Windows "ANSI") | | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y (2) | Y |
| UTF-8 | | Y (1) | Y | | | Y | | Y | | | Y | Y (1) | Y | Y | Y |
| UTF-8 | Y | | | Y | Y | Y | Y | Y | | Y | Y | | | | Y |
| UTF-16LE | | | Y | | | Y | | | | | Y | | | | Y |
| UTF-16 | LE | Y (1) | | Y | Y | Y | Y | Y | Y | Y | Y | | | | |
| UTF-16BE | | | | | | Y | | | | | Y | | | Y (2) | Y |
| UTF-16 | BE | | | Y | Y | Y | Y | | | Y | Y | | | | Y |

(1) Some characters may not display properly due to font limitations.
(2) Data in this format is converted to UTF-8 before it is displayed in SQL Developer.
Code Pages
Windows XP (as well as earlier versions of Windows and DOS) implements
character sets in "code pages". Character based applications use whichever code
page is set as the active "OEM" (aka "MS-DOS") code page and Win32 applications
use whichever code page is set as the active "ANSI" code page. (Note that
Windows "ANSI" code pages do not necessarily map to official ANSI standard
character sets.)
The following table lists some useful code pages and their corresponding
Oracle character sets. Bold values identify character sets that will be
discussed later in this document.
Windows Code Pages

| Code Page Type | Code Page Number | IANA Character Set Name | Oracle Character Set | Notes |
|---|---|---|---|---|
| OEM (MS-DOS) | 437 | cp437 | US8PC437 | |
| OEM (MS-DOS) | 737 | | EL8PC737 | |
| OEM (MS-DOS) | 850 | cp850 | WE8PC850 | |
| OEM (MS-DOS) | 852 | cp852 | EE8PC852 | |
| OEM (MS-DOS) | 857 | cp857 | TR8PC857 | |
| OEM (MS-DOS) | 858 | cp00858 | WE8PC858 | |
| OEM (MS-DOS) | 861 | cp861 | IS8PC861 | |
| OEM (MS-DOS) | 862 | cp862 | IW8PC1507 | |
| OEM (MS-DOS) | 865 | cp865 | N8PC865 | |
| OEM (MS-DOS) | 866 | cp866 | RU8PC866 | |
| ANSI | 874 | TIS-620 | TH8TISASCII | |
| ANSI | 932 | SHIFT_JIS | JA16SJIS | |
| ANSI | 936 | GBK | ZHS16GBK | |
| ANSI | 949 | EUC-KR | KO16MSWIN949 | |
| ANSI | 950 | BIG5 | ZHT16MSWIN950 | |
| ANSI | 1250 | windows-1250 | EE8MSWIN1250 | |
| ANSI | 1251 | windows-1251 | CL8MSWIN1251 | |
| ANSI | 1252 | windows-1252 | WE8MSWIN1252 | |
| ANSI | 1253 | windows-1253 | EL8MSWIN1253 | |
| ANSI | 1254 | windows-1254 | TR8MSWIN1254 | |
| ANSI | 1255 | windows-1255 | IW8MSWIN1255 | |
| ANSI | 1256 | windows-1256 | AR8MSWIN1256 | |
| ANSI | 1257 | windows-1257 | BLT8MSWIN1257 | |
| ANSI | 1258 | windows-1258 | VN8MSWIN1258 | |
| Other | 1200 | UTF-16LE | AL16UTF16LE | available only to managed applications |
| Other | 1201 | UTF-16BE | AL16UTF16 | available only to managed applications |
| Other | 65001 | UTF-8 | AL32UTF8 | Oracle character set UTF8 (which is available for backward compatibility with Oracle 8) can also be used with characters U+0000 - U+FFFF |
A full list of MS code pages is available at this
link. In a Command Prompt window the current OEM code page can be viewed or
set using the CHCP command.
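When files produced under these code pages are processed outside Windows, it can help to translate a code page number into a codec name. The Python sketch below is illustrative only: it assumes the codec names that ship with CPython's standard library, covers just a handful of rows from the table above, and the names CODE_PAGE_TO_CODEC and codec_for_code_page are hypothetical helpers introduced for this example.

```python
import codecs

# Hypothetical lookup table: Windows code page number -> Python codec name.
# Only a few rows from the code page table are included.
CODE_PAGE_TO_CODEC = {
    437: "cp437",
    850: "cp850",
    1252: "cp1252",
    1200: "utf-16-le",
    1201: "utf-16-be",
    65001: "utf-8",
}


def codec_for_code_page(code_page: int) -> str:
    """Return the Python codec name for a code page.

    Raises LookupError (KeyError is a subclass) for unknown code pages
    or for codecs missing from this Python installation.
    """
    name = CODE_PAGE_TO_CODEC[code_page]
    codecs.lookup(name)  # confirms the codec is actually available
    return name


if __name__ == "__main__":
    for code_page in sorted(CODE_PAGE_TO_CODEC):
        print(code_page, codec_for_code_page(code_page))
```

With such a mapping, a file written under CHCP 850 could be read with open(path, encoding=codec_for_code_page(850)).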
Command Prompt
Displaying File Contents
To view the largest number of characters properly in a Command Prompt window
(aka "DOS Window" or "cmd.exe") its font should be set to "Lucida Console". To
do this right-click the title bar and choose Properties / Font and then
highlight "Lucida Console". While "Lucida Console" does include more characters
than the default Raster fonts, it still lacks a large number of Unicode
characters (notably Asian characters). Characters that have no glyph in the
Lucida Console font will appear as box characters, e.g.
⌷⌷⌷
Installing a font with more characters, like "Arial Unicode MS", will not
resolve this issue because such fonts are not compatible with cmd.exe (see
necessary criteria for fonts to
be available in a command window). Users who absolutely need to see a wide
variety of characters in a character mode window have two options:
- copy an existing font and customize it to make it compatible with cmd.exe
(requires digital font creation skills)
- use Windows Powershell ISE (see the next section)
When displaying file contents using the TYPE command, characters may not
display properly if the session's code page does not match the file's
encoding.
D:\Work\Unicode>chcp 850
Active code page: 850
D:\Work\Unicode>type cp850.txt
abc-àèìòù© <-- correct
D:\Work\Unicode>type cp1252-ansi.txt
abc-ÓÞý‗¨® <-- incorrect since file's encoding differs from session's
D:\Work\Unicode>chcp 1252
Active code page: 1252
D:\Work\Unicode>type cp1252-ansi.txt
abc-àèìòù© <-- now it displays properly
D:\Work\Unicode>chcp 65001
Active code page: 65001
D:\Work\Unicode>type utf-8-no-bom.txt
abc-àèìòù©-⌷⌷⌷ <-- Lucida Console does not have glyphs
for the last three Korean characters;
however they will cut/paste into GUI
applications correctly
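The garbling in the session above can be reproduced away from the console by encoding text with one code page and decoding the bytes with another. This is a minimal Python sketch, assuming Python's cp850 and cp1252 codecs match the Windows code pages of the same numbers; view_as is a hypothetical helper name.

```python
TEXT = "abc-àèìòù©"


def view_as(text: str, file_codec: str, session_codec: str) -> str:
    """Encode text as the file would store it, then decode the bytes
    the way a session using session_codec would interpret them."""
    raw = text.encode(file_codec)
    # errors="replace" yields U+FFFD for bytes the codec leaves undefined
    return raw.decode(session_codec, errors="replace")


if __name__ == "__main__":
    print(view_as(TEXT, "cp850", "cp850"))    # code pages match: abc-àèìòù©
    print(view_as(TEXT, "cp1252", "cp850"))   # ANSI file, OEM session: abc-ÓÞý‗¨®
    print(view_as(TEXT, "cp1252", "cp1252"))  # code pages match again: abc-àèìòù©
```

The bytes in the file never change; only the table used to interpret them does, which is why switching CHCP is enough to correct the display.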
The following chart lists test results observed by TYPE'ing files with
various encodings.
| CHCP Setting | Encoding | Byte Order Mark | Test Data | Result | Notes |
|---|---|---|---|---|---|
| 850 | cp850 | | abc-àèìòù© | Pass | |
| 1252 | windows-1252 | | abc-àèìòù© | Pass | |
| 65001 | UTF-8 | | abc-àèìòù©-뮻뮼뮽 | Pass | replacement characters (boxes) appeared for Korean characters |
| (any) | UTF-8 | UTF-8 | abc-àèìòù©-뮻뮼뮽 | Fail | BOM character was incorrectly displayed |
| (any) | UTF-16LE | | abc-àèìòù©-뮻뮼뮽 | Fail | |
| (any) | UTF-16 | LE | abc-àèìòù©-뮻뮼뮽 | Pass | replacement characters (boxes) appeared for Korean characters |
| (any) | UTF-16BE | | abc-àèìòù©-뮻뮼뮽 | Fail | |
| (any) | UTF-16 | BE | abc-àèìòù©-뮻뮼뮽 | Fail | |
Saving Files
By default files created from a command line in Windows XP are saved using
the current OEM code page. The current code page can be viewed and set using
the CHCP command.
D:\Work\Unicode>chcp
Active code page: 850
D:\Work\Unicode>echo abc-àèìòù© > cp850-encoded.txt
D:\Work\Unicode>chcp 65001
D:\Work\Unicode>echo abc-àèìòù©⌷⌷⌷ > utf-8-no-bom-encoded.txt
In the example above the echo'd text includes three Korean characters at the
end of the string. Since Lucida Console has no glyphs for these characters they
appear as boxes on the screen. However, when utf-8-no-bom-encoded.txt is opened
in a GUI application that supports font substitution and UTF-8, the characters
are typically displayed properly.
When cmd.exe is started with the /U switch, files are created with UTF-16LE
encoding (without a Byte Order Mark (BOM)).
cmd /u /c echo abc-àèìòù© > utf-16le-no-bom.txt
Note that UTF-16 code pages are not supported in Command Prompt windows.
D:\Work\Unicode>chcp 1200
Invalid code page
D:\Work\Unicode>chcp 1201
Invalid code page
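To see what these settings mean on disk, the following Python sketch prints the bytes that two of the test characters would occupy under each encoding discussed in this section. It is an illustration under assumptions: Python's cp850, utf-8, and utf-16-le codecs stand in for CHCP 850, CHCP 65001, and cmd /u respectively, the line terminator that echo appends is ignored, and stored_bytes is a hypothetical helper name.

```python
def stored_bytes(text: str, codec: str) -> str:
    """Return the bytes text occupies in the given codec, as spaced hex."""
    return " ".join(f"{byte:02x}" for byte in text.encode(codec))


if __name__ == "__main__":
    sample = "\u00e0\u00a9"  # the characters à and ©
    for label, codec in [
        ("chcp 850", "cp850"),
        ("chcp 65001", "utf-8"),
        ("cmd /u", "utf-16-le"),
    ]:
        print(f"{label:<11}{stored_bytes(sample, codec)}")
    # chcp 850   85 b8
    # chcp 65001 c3 a0 c2 a9
    # cmd /u     e0 00 a9 00
```

None of the three outputs starts with a BOM, which matches the "no BOM" rows of the charts in this document.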
Cutting and Pasting Unicode Characters
Even though font limitations prevent us from seeing all Unicode characters
in a cmd.exe window, we can still cut and paste characters into Unicode aware
GUI clients like Notepad or Excel 2007. Unlike cmd.exe, which uses a single
font to display all characters, when applications like Notepad encounter
characters that have no glyph in the current font they scan other fonts for
the missing glyphs and display those when found (this is known as font
substitution). Thus any characters that appeared missing, garbled, or were
represented by boxes in cmd.exe can appear properly in Notepad, even when both
applications are set to the same font.
Windows Powershell ISE
A GUI alternative to the character based Command Prompt window is available
for Windows XP (and other versions of Windows). It is called Powershell ISE and
is available as a component of Powershell v2 inside the Windows Management Framework
Core. The Windows Powershell ISE tool displays three panes: Script,
Command, and Output. Here is a screen shot of these panes.
Unlike Command Prompt windows, Windows Powershell ISE's Output Pane will
correctly display most Unicode characters, even those that have no glyph in the
default font.
Displaying File Contents
In Powershell the TYPE command is actually an alias for a cmdlet called
Get-Content. Other aliases for this cmdlet are "cat" and "gc". It is important
to note that Powershell's TYPE command behaves differently from cmd.exe's TYPE
command when it comes to encodings. Here are some examples of what you can
expect to see in the Output Pane.
PS D:\work\unicode> chcp
Active code page: 850
PS D:\work\unicode> type cp850.txt
abc-…Š�•—¸ <-- fails in ISE, works in cmd.exe
PS D:\work\unicode> chcp 1252
Active code page: 1252
PS D:\work\unicode> type cp1252-ansi.txt
abc-àèìòù© <-- works in both
PS D:\work\unicode> chcp 65001
Active code page: 65001
PS D:\work\unicode> type utf-8-no-bom.txt
abc-à èìòù©-뮻뮼뮽 <-- fails in ISE, works in cmd.exe
The following chart lists test results observed by TYPE'ing files with
various encodings.
| CHCP Setting | Encoding | Byte Order Mark | Test Data | Result |
|---|---|---|---|---|
| 850 | cp850 | | abc-àèìòù© | Fail |
| 1252 | windows-1252 | | abc-àèìòù© | Pass |
| 65001 | UTF-8 | | abc-àèìòù©-뮻뮼뮽 | Fail |
| (any) | UTF-8 | UTF-8 | abc-àèìòù©-뮻뮼뮽 | Pass |
| (any) | UTF-16LE | | abc-àèìòù©-뮻뮼뮽 | Fail |
| (any) | UTF-16 | LE | abc-àèìòù©-뮻뮼뮽 | Pass |
| (any) | UTF-16BE | | abc-àèìòù©-뮻뮼뮽 | Fail |
| (any) | UTF-16 | BE | abc-àèìòù©-뮻뮼뮽 | Pass |
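The pass/fail pattern in this chart is what one would expect from a reader that honors a Byte Order Mark when one is present and otherwise assumes the ANSI code page, ignoring the CHCP setting. That is an inference from the test results, not a description of how Get-Content is implemented. The Python sketch below models the inferred rule; decode_with_bom_or_fallback is a hypothetical name and cp1252 is assumed as the ANSI code page.

```python
import codecs

# BOMs covered by the tests in this section, paired with the codec used
# to decode the bytes that follow the BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def decode_with_bom_or_fallback(data: bytes, fallback: str = "cp1252") -> str:
    """Decode using the BOM if present, else the assumed ANSI code page."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(codec)
    # No BOM: bytes are interpreted as ANSI, whatever they really are.
    return data.decode(fallback, errors="replace")


if __name__ == "__main__":
    text = "abc-àèìòù©"
    with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
    print(decode_with_bom_or_fallback(with_bom) == text)              # True
    print(decode_with_bom_or_fallback(text.encode("cp1252")) == text)  # True
    print(decode_with_bom_or_fallback(text.encode("utf-8")) == text)   # False
```

Under this model a UTF-8 file without a BOM is read as windows-1252, so each two-byte sequence shows up as two separate characters, much like the failing ISE output shown earlier.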
Creating Files
When files are created using the ">" redirection operator in Powershell
ISE, they are created using UTF-16 encoding (with an LE BOM) regardless of the
current code page setting. To create a file with a different encoding, data can
be piped to the "Out-File" cmdlet, whose syntax is
Out-File [-FilePath] <string> [[-Encoding] <string>]
Here are some examples.
PS D:\work> echo abc-àèìòù© | Out-File a-cp850-file.txt -Encoding OEM
PS D:\work> echo abc-àèìòù© | Out-File a-cp1252-file.txt -Encoding Default
PS D:\work> echo abc-àèìòù©-뮻뮼뮽 | Out-File a-utf-8-bom-file.txt -Encoding UTF8
PS D:\work> echo abc | Out-File a-ascii-file.txt -Encoding ASCII
Valid values for Out-File's -Encoding parameter are:
| -Encoding <string> | Encoding | Default |
|---|---|---|
| Default | the system's current ANSI code page | |
| OEM | the system's current OEM (aka MS-DOS) code page | |
| ASCII | US-ASCII | |
| UTF7 | UTF-7 | |
| UTF8 | UTF-8 (BOM) | |
| Unicode | UTF-16 (LE BOM) | Y |
| BigEndianUnicode | UTF-16 (BE BOM) | |
| UTF32 | UTF-32 (LE BOM) | |
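The byte-level effect of each -Encoding value can be approximated in Python. This is a rough sketch rather than a reimplementation of Out-File: the "Default" and "OEM" entries assume the cp1252/cp850 system used in this document's examples, UTF7 is omitted, the line terminator Out-File appends is not modeled, and OUT_FILE_ENCODINGS and encode_like_out_file are hypothetical names.

```python
import codecs

# -Encoding value -> (BOM prefix, Python codec), following the table above.
# "Default" and "OEM" are system dependent; cp1252 and cp850 are assumed.
OUT_FILE_ENCODINGS = {
    "Default": (b"", "cp1252"),
    "OEM": (b"", "cp850"),
    "ASCII": (b"", "ascii"),
    "UTF8": (codecs.BOM_UTF8, "utf-8"),
    "Unicode": (codecs.BOM_UTF16_LE, "utf-16-le"),
    "BigEndianUnicode": (codecs.BOM_UTF16_BE, "utf-16-be"),
    "UTF32": (codecs.BOM_UTF32_LE, "utf-32-le"),
}


def encode_like_out_file(text: str, encoding: str = "Unicode") -> bytes:
    """Return the BOM (if any) followed by the encoded text."""
    bom, codec = OUT_FILE_ENCODINGS[encoding]
    # Characters the codec cannot represent become "?" in this sketch.
    return bom + text.encode(codec, errors="replace")


if __name__ == "__main__":
    for name in OUT_FILE_ENCODINGS:
        print(f"{name:<17}{encode_like_out_file('abc', name).hex()}")
```

Comparing the hex prefixes printed for UTF8, Unicode, BigEndianUnicode, and UTF32 shows the BOM each of those settings places at the start of the file.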
See Also