Chapter 1. Overview
The C preprocessor implements the macro language used to transform C,
C++, and Objective-C programs before they are compiled. It can also be
useful on its own.
The C preprocessor, often known as cpp, is a macro processor
that is used automatically by the C compiler to transform your program
before compilation. It is called a macro processor because it allows
you to define macros, which are brief abbreviations for longer
constructs.
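As a minimal sketch of what "abbreviations for longer constructs" means in practice, the fragment below defines one object-like and one function-like macro. The names BUFFER_SIZE, ARRAY_LEN, buffer_length and check_macros are hypothetical and chosen only for illustration; they do not come from this manual.

```c
#include <assert.h>
#include <stddef.h>

/* Object-like macro: a short name standing for a constant. */
#define BUFFER_SIZE 1024

/* Function-like macro: a short name standing for a longer expression.
   The parentheses keep precedence intact once the text is substituted. */
#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

static char buffer[BUFFER_SIZE];

/* After preprocessing, the return statement reads:
   return (sizeof(buffer) / sizeof((buffer)[0])); */
size_t buffer_length(void)
{
    return ARRAY_LEN(buffer);
}

/* Small self-check showing the values the expansions produce. */
void check_macros(void)
{
    assert(BUFFER_SIZE == 1024);
    assert(buffer_length() == BUFFER_SIZE);
}
```

Running the file through the compiler with the -E option shows the text after macro expansion, which is a convenient way to see what the compiler proper receives.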
The C preprocessor is intended to be used only with C, C++, and
Objective-C source code. In the past, it has been abused as a general
text processor. It will choke on input which does not obey C's lexical
rules. For example, apostrophes will be interpreted as the beginning of
character constants, and cause errors. Also, you cannot rely on it
preserving characteristics of the input which are not significant to
C-family languages. If a Makefile is preprocessed, all the hard tabs
will be removed, and the Makefile will not work.
Having said that, you can often get away with using cpp on things which
are not C. Other Algol-ish programming languages are often safe
(Pascal, Ada, etc.). So is assembly, with caution. The -traditional-cpp
mode preserves more white space, and is otherwise more permissive. Many
of the problems can be avoided by writing C or C++ style comments
instead of native language comments, and keeping macros simple.
Wherever possible, you should use a preprocessor geared to the language
you are writing in. Modern versions of the GNU assembler have macro
facilities. Most high level programming languages have their own
conditional compilation and inclusion mechanism. If all else fails,
try a true general text processor, such as GNU M4.
C preprocessors vary in some details. This manual discusses the GNU C
preprocessor, which provides a small superset of the features of ISO
Standard C. In its default mode, the GNU C preprocessor does not do a
few things required by the standard. These are features which are
rarely, if ever, used, and may cause surprising changes to the meaning
of a program which does not expect them. To get strict ISO Standard C,
you should use the -std=c89 or -std=c99 options, depending
on which version of the standard you want. To get all the mandatory
diagnostics, you must also use -pedantic. See Chapter 12 Invocation.
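A program can detect which mode it is being compiled in. The sketch below relies on two predefined macros: __STRICT_ANSI__, which GCC defines when a strict-conformance option such as -std=c89 or -std=c99 is in effect, and __STDC_VERSION__, which the standard defines from the 1994 amendment onward. The function names strict_iso_mode and c_standard_version are hypothetical.

```c
#include <assert.h>

/* Returns 1 when compiled with a strict ISO option such as
   -std=c89 or -std=c99, and 0 in the default (GNU) mode. */
int strict_iso_mode(void)
{
#ifdef __STRICT_ANSI__
    return 1;
#else
    return 0;
#endif
}

/* Returns the value of __STDC_VERSION__ (for example 199901L under
   -std=c99), or 0 when the macro is not defined, as under -std=c89. */
long c_standard_version(void)
{
#ifdef __STDC_VERSION__
    return __STDC_VERSION__;
#else
    return 0L;
#endif
}

/* Sanity check that holds in every mode. */
void check_mode_queries(void)
{
    int strict = strict_iso_mode();
    assert(strict == 0 || strict == 1);
    assert(c_standard_version() >= 0L);
}
```

Compiling the same file once with no -std option and once with -std=c99 -pedantic should make strict_iso_mode return different values, which is a quick way to confirm which mode a build system is actually using.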
This manual describes the behavior of the ISO preprocessor. To
minimize gratuitous differences, where the ISO preprocessor's
behavior does not conflict with traditional semantics, the
traditional preprocessor should behave the same way. The various
differences that do exist are detailed in Chapter 10 Traditional Mode.
For clarity, unless noted otherwise, references to CPP in this
manual refer to GNU CPP.
1.1. Character sets
Source code character set processing in C and related languages is
rather complicated. The C standard discusses two character sets, but
there are really at least four.
The files input to CPP might be in any character set at all. CPP's
very first action, before it even looks for line boundaries, is to
convert the file into the character set it uses for internal
processing. That set is what the C standard calls the source
character set. It must be isomorphic with ISO 10646, also known as
Unicode. CPP uses the UTF-8 encoding of Unicode.
At present, GNU CPP does not implement conversion from arbitrary file
encodings to the source character set. Use of any encoding other than
plain ASCII or UTF-8, except in comments, will cause errors. Use of
encodings that are not strict supersets of ASCII, such as Shift JIS,
may cause errors even if non-ASCII characters appear only in comments.
We plan to fix this in the near future.
All preprocessing work (the subject of the rest of this manual) is
carried out in the source character set. If you request textual
output from the preprocessor with the -E option, it will be
in UTF-8.
After preprocessing is complete, string and character constants are
converted again, into the execution character set. This
character set is under control of the user; the default is UTF-8,
matching the source character set. Wide string and character
constants have their own character set, which is not called out
specifically in the standard. Again, it is under control of the user.
The default is UTF-16 or UTF-32, whichever fits in the target's
wchar_t type, in the target machine's byte
order.[1] Octal and hexadecimal escape sequences do not undergo
conversion; '\x12' has the value 0x12 regardless of the currently
selected execution character set. All other escapes are replaced by
the character in the source character set that they represent, then
converted to the execution character set, just like unescaped
characters.
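The rule for octal and hexadecimal escapes can be checked directly. The following sketch uses only numeric escapes, so its assertions do not depend on which execution character set is selected; the function name check_numeric_escapes is hypothetical.

```c
#include <assert.h>

/* Octal and hexadecimal escapes name a value directly, so they are
   not converted to the execution character set. */
void check_numeric_escapes(void)
{
    /* "\x41\102" holds the bytes 0x41 and 0x42 (octal 102), followed
       by the terminating null. */
    const char s[] = "\x41\102";

    assert('\x12' == 0x12);   /* hexadecimal escape */
    assert('\022' == 0x12);   /* octal 22 is the same value */
    assert(L'\x12' == 0x12);  /* wide constants follow the same rule */

    assert(sizeof s == 3);
    assert(s[0] == 0x41);
    assert(s[1] == 0x42);
}
```

By contrast, an escape such as '\n' or an unescaped letter is first mapped to a source character and then converted, so its numeric value depends on the execution character set in use.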
GCC does not permit the use of characters outside the ASCII range, nor
\u and \U escapes, in identifiers. We hope this will
change eventually, but there are problems with the standard semantics
of such "extended identifiers" which must be resolved through the
ISO C and C++ committees first.