Note: If you are unfamiliar with Unicode, see the brief primer at this link before continuing.
Information in this document was gathered from a Red Hat Enterprise Linux AS release 3 system. The behaviour of other systems may vary.
Locale
A UNIX session's character encoding is controlled via its locale plus a set of Internationalization Variables that identify a user's language, location, character encoding, and local preferences. The locale command displays a session's current locale settings. Sample output for a typical session follows.
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
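The locale output shown here is a list of NAME=value pairs, so a script can read it easily. The following is a minimal Python sketch, not a definitive implementation. The function name parse_locale_output is hypothetical, and the sample text is a shortened copy of the output in this section.

```python
def parse_locale_output(text):
    """Turn the NAME=value pairs printed by `locale` into a dict."""
    settings = {}
    # split() with no arguments breaks on any whitespace, so this works
    # whether the pairs are separated by newlines or by spaces.
    for token in text.split():
        if "=" not in token:
            continue  # skip anything that is not a NAME=value pair
        name, _, value = token.partition("=")
        # Some values are printed inside double quotes; drop the quotes.
        settings[name] = value.strip('"')
    return settings


sample = "\n".join([
    "LANG=en_US.UTF-8",
    'LC_CTYPE="en_US.UTF-8"',
    "LC_ALL=",
])
settings = parse_locale_output(sample)
print(settings["LC_CTYPE"])
```

An unset variable such as LC_ALL comes back as an empty string rather than being dropped, so callers can tell "present but empty" from "missing".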
Values for locale environment variables are in the following format.
language[_territory][.codeset]
language is an ISO 639-1 code like "en" for English or "it" for Italian.
territory is an ISO 3166-1 country code like "US" for United States or "IN" for India.
codeset is a character encoding or character set name such as "UTF-8" or "ISO-8859-1". Many UNIX/Linux systems use UTF-8 as the default codeset for most locales.
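The language[_territory][.codeset] format can be split into its parts with a regular expression. This is a rough Python sketch under stated assumptions: the name split_locale is hypothetical, the pattern accepts only a two-letter lowercase language and a two-letter uppercase territory as described above, and it does not handle special names such as "C" or "POSIX".

```python
import re

# Assumed pattern for language[_territory][.codeset]; stricter or looser
# rules may apply on a real system.
LOCALE_RE = re.compile(
    r"(?P<language>[a-z]{2})"              # ISO 639-1 code, e.g. "en"
    r"(?:_(?P<territory>[A-Z]{2}))?"       # optional ISO 3166-1 code, e.g. "US"
    r"(?:\.(?P<codeset>[A-Za-z0-9_-]+))?"  # optional codeset, e.g. "UTF-8"
)


def split_locale(name):
    """Return (language, territory, codeset); missing parts are None."""
    match = LOCALE_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"not in language[_territory][.codeset] form: {name!r}")
    return match.group("language"), match.group("territory"), match.group("codeset")


print(split_locale("en_US.UTF-8"))
print(split_locale("it"))
```

Returning None for an omitted part keeps the optional pieces of the format visible to the caller instead of silently substituting a default.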
Byte Order Mark
By convention, UNIX/Linux Unicode text files do not begin with a Byte Order Mark (BOM) character.
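Files that arrive from other platforms may still carry a BOM, so it can be useful to check for one before processing. The sketch below assumes the data is UTF-8 and uses the BOM_UTF8 constant from Python's standard codecs module. The function name strip_utf8_bom is hypothetical.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(strip_utf8_bom(with_bom))
```

Only a BOM at the very start of the data is removed; the same byte sequence elsewhere in the data is left alone.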