27.8 Recognizing Coding Systems
Emacs tries to recognize which coding system to use for a given text
as an integral part of reading that text. (This applies to files
being read, output from subprocesses, text from X selections, etc.)
Emacs can select the right coding system automatically most of the
time—once you have specified your preferences.
Some coding systems can be recognized or distinguished by which byte
sequences appear in the data. However, there are coding systems that
cannot be distinguished, not even potentially. For example, there is no
way to distinguish between Latin-1 and Latin-2; they use the same byte
values with different meanings.
Emacs handles this situation by means of a priority list of coding
systems. Whenever Emacs reads a file, if you do not specify the coding
system to use, Emacs checks the data against each coding system,
starting with the first in priority and working down the list, until it
finds a coding system that fits the data. Then it converts the file
contents assuming that they are represented in this coding system.
The priority list of coding systems depends on the selected language
environment (see Language Environments). For example, if you use
French, you probably want Emacs to prefer Latin-1 to Latin-2; if you use
Czech, you probably want Latin-2 to be preferred. This is one of the
reasons to specify a language environment.
However, you can alter the coding system priority list in detail
with the command M-x prefer-coding-system. This command reads
the name of a coding system from the minibuffer, and adds it to the
front of the priority list, so that it is preferred to all others. If
you use this command several times, each use adds one element to the
front of the priority list.
If you use a coding system that specifies the end-of-line conversion
type, such as iso-8859-1-dos, what this means is that Emacs should
attempt to recognize iso-8859-1 with priority, and should use DOS
end-of-line conversion when it does recognize iso-8859-1.
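
As a rough illustration of the two ways of shaping the priority list
described above, an init file might contain something like the
following. The language environment and coding systems named here are
only example choices.

```elisp
;; Select a language environment; this sets up a priority list
;; suited to that language (for Czech, Latin-2 ahead of Latin-1).
(set-language-environment "Czech")

;; Then adjust the list in detail.  Each call puts its coding system
;; at the front, so the LAST call names the most-preferred system.
(prefer-coding-system 'iso-8859-2)
(prefer-coding-system 'utf-8)

;; A name with an end-of-line suffix would also request that
;; conversion whenever the base coding system is recognized:
;; (prefer-coding-system 'iso-8859-1-dos)
```
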
Sometimes a file name indicates which coding system to use for the
file. The variable file-coding-system-alist specifies this
correspondence. There is a special function,
modify-coding-system-alist, for adding elements to this list. For
example, to read and write all ‘.txt’ files using the coding system
chinese-iso-8bit, you can execute this Lisp expression:
(modify-coding-system-alist 'file "\\.txt\\'" 'chinese-iso-8bit)
The first argument should be file, the second argument should be a
regular expression that determines which files this applies to, and
the third argument says which coding system to use for these files.
Emacs recognizes which kind of end-of-line conversion to use based on
the contents of the file: if it sees only carriage-returns, or only
carriage-return linefeed sequences, then it chooses the end-of-line
conversion accordingly. You can inhibit the automatic use of
end-of-line conversion by setting the variable inhibit-eol-conversion
to non-nil. If you do that, DOS-style files will be displayed with
the ‘^M’ characters visible in the buffer; some people prefer this to
the more subtle ‘(DOS)’ end-of-line type indication near the left edge
of the mode line (see eol-mnemonic).
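
If you want this only for a single file, one possible approach is to
bind the variable temporarily around the command that visits the file,
rather than setting it globally. This is a minimal sketch; the file
name is a placeholder.

```elisp
;; Visit one file without end-of-line conversion, leaving the global
;; value of inhibit-eol-conversion untouched.
;; "/path/to/dos-file.txt" is a hypothetical path.
(let ((inhibit-eol-conversion t))
  (find-file "/path/to/dos-file.txt"))
```
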
By default, the automatic detection of coding system is sensitive to
escape sequences. If Emacs sees a sequence of characters that begins
with an escape character, and the sequence is valid as an ISO-2022
code, that tells Emacs to use one of the ISO-2022 encodings to decode
the file.
However, there may be cases in which you want to read escape
sequences in a file as is. In such a case, you can set the variable
inhibit-iso-escape-detection to non-nil. Then the code detection
ignores any escape sequences, and never uses an ISO-2022 encoding.
The result is that all escape sequences become visible in the buffer.
The default value of inhibit-iso-escape-detection is nil. We
recommend that you not change it permanently, only for one specific
operation. That's because many Emacs Lisp source files in the Emacs
distribution contain non-ASCII characters encoded in the coding system
iso-2022-7bit, and they won't be decoded correctly when you visit
those files if you suppress the escape sequence detection.
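
One way to follow that recommendation from Lisp is to bind the
variable with let for the duration of a single operation, so that the
default value of nil is restored afterwards. This is only a sketch;
the file name is a placeholder.

```elisp
;; Visit one file with escape-sequence detection suppressed.
;; Outside the let form, inhibit-iso-escape-detection keeps its
;; previous value.  "/path/to/file-with-escapes.txt" is hypothetical.
(let ((inhibit-iso-escape-detection t))
  (find-file "/path/to/file-with-escapes.txt"))
```
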
You can specify the coding system for a particular file using the
‘-*-...-*-’ construct at the beginning of a file, or a local
variables list at the end (see File Variables). You do this by
defining a value for the “variable” named coding. Emacs does not
really have a variable coding; instead of setting a variable, this
uses the specified coding system for the file. For example,
‘-*-mode: C; coding: latin-1;-*-’ specifies use of the Latin-1 coding
system, as well as C mode. When you specify the coding system
explicitly in the file, that overrides file-coding-system-alist.
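
For instance, in an Emacs Lisp source file the two forms might look
roughly like this. The file name and the choice of iso-8859-1 are
made up for the illustration, and a real file needs only one of the
two forms.

```elisp
;;; example.el --- hypothetical library  -*- coding: iso-8859-1 -*-

;; ... the rest of the file ...

;; Alternatively, omit the first-line tag and end the file with a
;; local variables list:

;; Local Variables:
;; coding: iso-8859-1
;; End:
```
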
The variables auto-coding-alist, auto-coding-regexp-alist and
auto-coding-functions are the strongest way to specify the coding
system for certain patterns of file names, or for files containing
certain patterns; these variables even override ‘-*-coding:-*-’ tags
in the file itself. Emacs uses auto-coding-alist for tar and archive
files, to prevent it from being confused by a ‘-*-coding:-*-’ tag in a
member of the archive and thinking it applies to the archive file as a
whole. Likewise, Emacs uses auto-coding-regexp-alist to ensure that
RMAIL files, whose names in general don't match any particular
pattern, are decoded correctly. One of the built-in
auto-coding-functions detects the encoding for XML files.
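
As a sketch of how you might add your own entries, assuming that
elements of both alists have the form (REGEXP . CODING-SYSTEM): the
‘.bin’ extension and the MYFORMAT-LATIN2 signature below are invented
for the example.

```elisp
;; Treat files named like "*.bin" as raw bytes, regardless of any
;; coding tag inside them.  REGEXP is matched against the file name.
(add-to-list 'auto-coding-alist '("\\.bin\\'" . no-conversion))

;; Decode files whose text begins with a (made-up) signature as
;; Latin-2.  Here REGEXP is matched against the start of the contents.
(add-to-list 'auto-coding-regexp-alist
             '("\\`MYFORMAT-LATIN2" . iso-8859-2))
```
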
If Emacs recognizes the encoding of a file incorrectly, you can
reread the file using the correct coding system by typing
C-x <RET> r coding-system <RET>. To see what coding system Emacs
actually used to decode the file, look at the coding system mnemonic
letter near the left edge of the mode line (see Mode Line), or type
C-h C <RET>.
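
The same can be done from Lisp. The following sketch assumes the
current buffer is visiting a file that should have been decoded as
Latin-2; the coding system is only an example, and Emacs may ask for
confirmation before rereading the file.

```elisp
;; Reread the visited file, decoding it as Latin-2 this time
;; (the command run by C-x <RET> r iso-8859-2 <RET>).
(revert-buffer-with-coding-system 'iso-8859-2)

;; Show the coding system now recorded for the buffer.
(message "Coding system: %s" buffer-file-coding-system)
```
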
The command unify-8859-on-decoding-mode enables a mode that “unifies”
the Latin alphabets when decoding text. This works by converting all
non-ASCII Latin-n characters to either Latin-1 or Unicode characters.
This way it is easier to use various Latin-n alphabets together. In a
future Emacs version we hope to move towards full Unicode support and
complete unification of character sets.
Once Emacs has chosen a coding system for a buffer, it stores that
coding system in buffer-file-coding-system and uses that coding
system, by default, for operations that write from this buffer into a
file. This includes the commands save-buffer and write-region. If you
want to write files from this buffer using a different coding system,
you can specify a different coding system for the buffer using
set-buffer-file-coding-system (see Specify Coding).
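
For example, to convert the file visited by the current buffer to a
different encoding, something along these lines should work; utf-8-unix
is just an illustrative choice.

```elisp
;; Change the coding system used for saving this buffer, then save.
;; The change marks the buffer as modified, so save-buffer will
;; rewrite the file in the new encoding.
(set-buffer-file-coding-system 'utf-8-unix)
(save-buffer)
```
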
You can insert any possible character into any Emacs buffer, but
most coding systems can only handle some of the possible characters.
This means that it is possible for you to insert characters that
cannot be encoded with the coding system that will be used to save the
buffer. For example, you could start with an ASCII file and insert a
few Latin-1 characters into it, or you could edit a text file in
Polish encoded in iso-8859-2 and add some Russian words to it. When
you save the buffer, Emacs cannot use the current value of
buffer-file-coding-system, because the characters you added cannot be
encoded by that coding system.
When that happens, Emacs tries the most-preferred coding system (set
by M-x prefer-coding-system or M-x set-language-environment), and if
that coding system can safely encode all of the characters in the
buffer, Emacs uses it, and stores its value in
buffer-file-coding-system. Otherwise, Emacs displays a list of coding
systems suitable for encoding the buffer's contents, and asks you to
choose one of those coding systems.
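
If you want to check this yourself before saving, the functions below
can help, assuming they are available in your version of Emacs. This
is a sketch for interactive exploration, not a description of what
Emacs does internally when saving; iso-8859-2 is an example choice.

```elisp
;; List the coding systems that can safely encode the whole buffer.
(find-coding-systems-region (point-min) (point-max))

;; Position of the first character that iso-8859-2 cannot encode,
;; or nil if every character in the buffer fits.
(unencodable-char-position (point-min) (point-max) 'iso-8859-2)
```
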
If you insert the unsuitable characters in a mail message, Emacs
behaves a bit differently. It additionally checks whether the
most-preferred coding system is recommended for use in MIME messages;
if not, Emacs tells you that the most-preferred coding system is
not recommended and prompts you for another coding system. This is so
you won't inadvertently send a message encoded in a way that your
recipient's mail software will have difficulty decoding. (If you do
want to use the most-preferred coding system, you can still type its
name in response to the question.)
When you send a message with Mail mode (see Sending Mail), Emacs has
four different ways to determine the coding system to use for encoding
the message text. It tries the buffer's own value of
buffer-file-coding-system, if that is non-nil. Otherwise, it uses the
value of sendmail-coding-system, if that is non-nil. The third way is
to use the default coding system for new files, which is controlled by
your choice of language environment, if that is non-nil. If all three
of these values are nil, Emacs encodes outgoing mail using the Latin-1
coding system.
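
The order of precedence just described can be pictured as the
following sketch. The function name is hypothetical, and this only
mirrors the prose above; it is not the code Emacs itself runs.

```elisp
(defun my-mail-coding-system-sketch ()
  "Illustrate the order described in the manual (hypothetical helper)."
  (or buffer-file-coding-system                   ; 1. buffer's own value
      (bound-and-true-p sendmail-coding-system)   ; 2. mail-specific setting
      (default-value 'buffer-file-coding-system)  ; 3. default for new files
      'iso-latin-1))                              ; 4. Latin-1 fallback

;; To influence step 2, you could set, for example:
;; (setq sendmail-coding-system 'iso-8859-2)
```
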
When you get new mail in Rmail, each message is translated
automatically from the coding system it is written in, as if it were a
separate file. This uses the priority list of coding systems that you
have specified. If a MIME message specifies a character set, Rmail
obeys that specification, unless rmail-decode-mime-charset is nil.
For reading and saving Rmail files themselves, Emacs uses the coding
system specified by the variable rmail-file-coding-system. The
default value is nil, which means that Rmail files are not translated
(they are read and written in the Emacs internal character code).