Unix Programming - Internationalization

Internationalization

An in-depth discussion of code internationalization — designing software so the interface readily incorporates multiple languages and the vagaries of different character sets — would be out of scope for this book. However, a few lessons for good practice do stand out from Unix experience.

First, separate the message base from the code. Good Unix practice is to keep the message strings a program uses apart from its code, so that message dictionaries in other languages can be plugged in without modifying the code.

The best-known tool for this job is GNU gettext, which requires that you wrap native-language strings that need to be internationalized in a special macro. The macro uses each string as a key into per-language dictionaries which can be supplied as separate files. If no such dictionaries are available (or if they are but the string lookup does not return a match), the macro simply returns its argument, implicitly falling back on the native language in the code.

While gettext itself is messy and fragile as of mid-2003, its general philosophy is sound. For many projects, it is possible to craft a lighter-weight version of this idea with good results.
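One way such a lighter-weight version might look is sketched below in Python. The catalog contents and the names CATALOGS and make_translator are invented for this example; a real project would more likely load each dictionary from its own file.

```python
# Hypothetical per-language dictionaries, keyed by the native-language string.
CATALOGS = {
    "de": {"File not found": "Datei nicht gefunden"},
    "fr": {"File not found": "Fichier introuvable"},
}


def make_translator(language):
    """Return a lookup function for one language, falling back to the key."""
    catalog = CATALOGS.get(language, {})

    def translate(message):
        # A missing dictionary or a missing entry both fall back to the
        # native-language string found in the code.
        return catalog.get(message, message)

    return translate


_ = make_translator("de")
print(_("File not found"))  # Datei nicht gefunden
print(_("Disk full"))       # Disk full (no entry, so the key comes back)
```

The design keeps the same property as gettext: the code never needs to change when a new language is added, only the dictionaries do.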

Second, there is a clear trend in modern Unixes to scrap all the historical cruft associated with multiple character sets and make applications natively speak UTF-8, the variable-length 8-bit encoding of the Unicode character set (as opposed to, say, making them natively speak 16-bit wide characters). The low 128 characters of UTF-8 are ASCII, byte for byte, and the low 256 Unicode code points are Latin-1 (although UTF-8 encodes those above 127 as two bytes each), which means this choice is backward-compatible with ASCII and maps directly onto Latin-1, the two most widely used character sets. The fact that XML and Java have made this choice helps, but the momentum is present even where XML and Java are not.
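The compatibility claim is easy to check. The short Python sketch below (the sample strings are arbitrary) shows that ASCII text has the same bytes in UTF-8, while a Latin-1 character keeps its code point number but is written as two bytes.

```python
ascii_text = "Unix"
accented = "\u00e9"  # e with acute accent, inside the Latin-1 range

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)  # True

# The code point matches the Latin-1 byte value...
print(ord(accented))               # 233
print(accented.encode("latin-1"))  # b'\xe9'

# ...but UTF-8 spends two bytes on it.
print(accented.encode("utf-8"))    # b'\xc3\xa9'
```

So existing ASCII files are already valid UTF-8, whereas Latin-1 files containing accented characters need a one-time conversion.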

Third, beware of character ranges in regular expressions. The element [a-z] will not necessarily catch all lower-case letters if the script or program it's in is applied to (say) German, where the sharp-s (ß) character is considered lower-case but does not fall in that range; similar problems arise with French accented letters. It's safer to use [[:lower:]] and the other symbolic ranges described in the POSIX standard.
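The pitfall can be reproduced in a few lines of Python. Note one assumption: Python's re module does not implement POSIX bracket classes such as [[:lower:]], so this sketch uses str.islower() as a stand-in for a class-based test; in grep, sed, or awk the POSIX form applies directly.

```python
import re

word = "stra\u00dfe"  # the German word for "street", containing sharp-s

# The literal range silently skips the sharp-s.
ascii_range = re.findall(r"[a-z]", word)

# A character-class style test asks "is this lower-case?" rather than
# "is this between a and z?", so it keeps the sharp-s.
class_based = [ch for ch in word if ch.islower()]

print(len(word), len(ascii_range), len(class_based))  # 6 5 6
```

The range-based match loses one of the six letters without any error or warning, which is why this class of bug tends to surface only after a program meets non-English text.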