GNU C Library (libc) Programming Guide - Other iconv Implementations

6.5.3 Some Details about other `iconv` Implementations

This is not really the place to discuss the iconv implementations of other systems, but it is necessary to know a bit about them to write portable programs. The problems mentioned above with the specification of the iconv functions can lead to portability issues.

The first thing to notice is that, due to the large number of character sets in use, it is certainly not practical to encode the conversions directly in the C library. Therefore, the conversion information must come from files outside the C library. This is usually done in one or both of the following ways:

  1. The C library contains a set of generic conversion functions that can read the needed conversion tables and other information from data files. These files get loaded when necessary.

     This solution is problematic as it requires a great deal of effort to apply to all character sets (potentially an infinite set). The differences in the structure of the different character sets are so large that many different variants of the table-processing functions must be developed. In addition, the generic nature of these functions makes them slower than specifically implemented functions.

  2. The C library only contains a framework that can dynamically load object files and execute the conversion functions contained therein.

     This solution provides much more flexibility. The C library itself contains very little code, which reduces the general memory footprint. Also, with a documented interface between the C library and the loadable modules, it is possible for third parties to extend the set of available conversion modules. A drawback of this solution is that dynamic loading must be available.

Some implementations in commercial Unices implement a mixture of these possibilities; the majority implement only the second solution. Using loadable modules moves the code out of the library itself and keeps the door open for extensions and improvements, but this design is also limiting, since few platforms support dynamic loading in statically linked programs. On platforms without this capability, the interface therefore cannot be used in statically linked programs. The GNU C Library has, on ELF platforms, no problems with dynamic loading in these situations; therefore, this point is moot. The danger is that one gets accustomed to this situation and forgets about the restrictions on other systems.

A second thing to know about other iconv implementations is that the number of available conversions is often very limited. Some implementations provide, in the standard release (not special international or developer releases), at most 100 to 200 conversion possibilities. This does not mean 200 different character sets are supported; for example, conversions from one character set to a set of 10 others might count as 10 conversions. Together with the other direction, this makes 20 conversion possibilities used up by one character set. One can imagine the thin coverage these platforms provide. Some Unix vendors even provide only a handful of conversions, which renders them useless for almost all uses.

This directly leads to a third and probably the most problematic point. Given the way the iconv conversion functions are implemented on all known Unix systems, the availability of the conversions from character set A to B and from B to C does not imply that the conversion from A to C is available.

This might not seem unreasonable or problematic at first, but it is quite a big problem, as one will notice shortly after hitting it. To show the problem, assume we write a program that has to convert from A to C. A call like

     cd = iconv_open ("C", "A");

fails according to the assumption above. But what does the program do now? The conversion is necessary; therefore, simply giving up is not an option.
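The failure itself is at least easy to detect: iconv_open returns (iconv_t) -1, and sets errno to EINVAL when the requested conversion is not supported. The following is a minimal sketch of such a check; the helper name conversion_available is hypothetical and not part of any library.

```c
#include <iconv.h>

/* Hypothetical helper: return 1 if iconv_open can create a descriptor
   for converting FROM to TO, and 0 otherwise.  Any failure counts as
   "not available", including resource errors such as ENOMEM.  */
int
conversion_available (const char *from, const char *to)
{
  /* Note the argument order: the target character set comes first.  */
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return 0;   /* errno is EINVAL if the conversion is unsupported.  */
  iconv_close (cd);
  return 1;
}
```

A call such as conversion_available ("A", "C") then tells the program that the direct route is missing, but nothing about what to try instead.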

This is a nuisance. The iconv function should take care of this. But how should the program proceed from here on? If it tries to convert to character set B first, the two iconv_open calls

     cd1 = iconv_open ("B", "A");

and

     cd2 = iconv_open ("C", "B");

will succeed, but how to find B?

Unfortunately, the answer is: there is no general solution. On some systems guessing might help. On those systems most character sets can convert to and from UTF-8 encoded ISO 10646 or Unicode text. Besides this, only some very system-specific methods can help. Since the conversion functions come from loadable modules and these modules must be stored somewhere in the filesystem, one could try to find them and determine from the available files which conversions are available and whether there is an indirect route from A to C.
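The guessing approach can be sketched as follows: try the direct conversion first and, only if iconv_open rejects it, fall back to two steps through a caller-supplied intermediate character set such as UTF-8. This is an illustration under stated assumptions, not a general solution. The function names run_conversion and convert_with_pivot are hypothetical, the whole input is assumed to fit in memory, and the pivot is assumed to be able to represent every character of the source text.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <stdlib.h>

/* Convert INLEN bytes at IN with the open descriptor CD.  Returns a
   malloc'd buffer and stores its length in *OUTLEN, or returns NULL
   on failure (invalid input or out of memory).  */
char *
run_conversion (iconv_t cd, const char *in, size_t inlen, size_t *outlen)
{
  size_t cap = inlen + 16, used = 0;
  char *out = malloc (cap);
  char *inptr = (char *) in;   /* iconv does not modify the input bytes.  */
  size_t inleft = inlen;
  int flushing = 0;

  while (out != NULL)
    {
      char *outptr = out + used;
      size_t outleft = cap - used;
      size_t r = flushing
                 ? iconv (cd, NULL, NULL, &outptr, &outleft)
                 : iconv (cd, &inptr, &inleft, &outptr, &outleft);
      used = (size_t) (outptr - out);
      if (r != (size_t) -1)
        {
          if (flushing)
            break;             /* Done: input consumed and state flushed.  */
          flushing = 1;        /* Next pass emits any final shift sequence.  */
        }
      else if (errno == E2BIG)
        {
          cap *= 2;            /* Output buffer full: grow and continue.  */
          char *bigger = realloc (out, cap);
          if (bigger == NULL)
            free (out);
          out = bigger;
        }
      else
        {
          free (out);          /* EILSEQ or EINVAL: give up.  */
          out = NULL;
        }
    }
  if (out != NULL)
    *outlen = used;
  return out;
}

/* Convert FROM -> TO directly if possible; otherwise try the indirect
   route FROM -> PIVOT -> TO.  Returns a malloc'd buffer or NULL.  */
char *
convert_with_pivot (const char *to, const char *from, const char *pivot,
                    const char *in, size_t inlen, size_t *outlen)
{
  char *out = NULL;
  iconv_t cd = iconv_open (to, from);
  if (cd != (iconv_t) -1)
    {
      out = run_conversion (cd, in, inlen, outlen);
      iconv_close (cd);
      return out;
    }

  /* No direct route: open both halves of the indirect one.  */
  iconv_t cd1 = iconv_open (pivot, from);
  iconv_t cd2 = iconv_open (to, pivot);
  if (cd1 != (iconv_t) -1 && cd2 != (iconv_t) -1)
    {
      size_t midlen = 0;
      char *mid = run_conversion (cd1, in, inlen, &midlen);
      if (mid != NULL)
        out = run_conversion (cd2, mid, midlen, outlen);
      free (mid);
    }
  if (cd1 != (iconv_t) -1)
    iconv_close (cd1);
  if (cd2 != (iconv_t) -1)
    iconv_close (cd2);
  return out;
}
```

A caller would write something like convert_with_pivot ("C", "A", "UTF-8", text, len, &outlen) and free the result. Whether UTF-8 is actually usable as the pivot remains a guess that holds on some systems and fails on others.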

This example shows one of the design errors of iconv mentioned above. It should at least be possible to determine the list of available conversions programmatically so that if iconv_open says there is no such conversion, one could make sure this is also true for indirect routes.
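Because no such list can be queried through the iconv interface, the closest a portable program can get is to probe a list of candidates it supplies itself. The following sketch does this with iconv_open alone; the names can_open and find_intermediate are hypothetical, and the candidate list is an assumption the caller has to make. A negative result only says that none of the guessed candidates works, not that no indirect route exists.

```c
#include <iconv.h>
#include <stddef.h>

/* Return 1 if iconv_open accepts the conversion FROM -> TO.  */
int
can_open (const char *from, const char *to)
{
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return 0;
  iconv_close (cd);
  return 1;
}

/* Hypothetical helper: search CANDIDATES (N entries) for a character
   set B such that both FROM -> B and B -> TO can be opened.  Returns
   the first match, or NULL if none of the candidates works.  */
const char *
find_intermediate (const char *from, const char *to,
                   const char *const *candidates, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (can_open (from, candidates[i]) && can_open (candidates[i], to))
      return candidates[i];
  return NULL;
}
```

The candidate list would typically start with UTF-8 and other ISO 10646 encodings, since those are the most likely to be connected to everything else.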
