embedaddon/libiconv/NOTES - annotate

Annotation of embedaddon/libiconv/NOTES, revision 1.1.1.1

1.1       misho       1: Q: Why does libiconv support encoding XXX? Why does libiconv not support
                      2:    encoding ZZZ?
                      3: 
                      4: A: libiconv, as an internationalization library, supports those character
                      5:    sets and encodings which are in wide-spread use in at least one territory
                      6:    of the world.
                      7: 
                      8:    Hint1: On http://www.w3c.org/International/O-charset-lang.html you find a
                      9:    page "Languages, countries, and the charsets typically used for them".
                     10:    From this table, we can conclude that the following are in active use:
                     11: 
                     12:      ISO-8859-1, CP1252   Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
                     13:                           English, Faroese, Finnish, French, Galician, German,
                     14:                           Icelandic, Irish, Italian, Norwegian, Portuguese,
                     15:                           Scottish, Spanish, Swedish
                     16:      ISO-8859-2           Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
                     17:                           Slovenian
                     18:      ISO-8859-3           Esperanto, Maltese
                     19:      ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                     20:                           Serbian, Ukrainian
                     21:      ISO-8859-6           Arabic
                     22:      ISO-8859-7           Greek
                     23:      ISO-8859-8           Hebrew
                     24:      ISO-8859-9, CP1254   Turkish
                     25:      ISO-8859-10          Inuit, Lapp
                     26:      ISO-8859-13          Latvian, Lithuanian
                     27:      ISO-8859-15          Estonian
                     28:      KOI8-R               Russian
                     29:      SHIFT_JIS            Japanese
                     30:      ISO-2022-JP          Japanese
                     31:      EUC-JP               Japanese
                     32: 
                     33:    Ordered by frequency on the web (1997):
                     34:      ISO-8859-1, CP1252   96%
                     35:      SHIFT_JIS             1.6%
                     36:      ISO-2022-JP           1.2%
                     37:      EUC-JP                0.4%
                     38:      CP1250                0.3%
                     39:      CP1251                0.2%
                     40:      CP850                 0.1%
                     41:      MACINTOSH             0.1%
                     42:      ISO-8859-5            0.1%
                     43:      ISO-8859-2            0.0%
                     44: 
                     45:    Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.
                     46: 
                     47:      ISO-8859-1           Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
                     48:                           English, Estonian, Faroese, Finnish, French,
                     49:                           Galician, German, Greenlandic, Icelandic,
                     50:                           Indonesian, Irish, Italian, Lithuanian, Norwegian,
                     51:                           Occitan, Portuguese, Scottish, Spanish, Swedish,
                     52:                           Walloon, Welsh
                     53:      ISO-8859-2           Albanian, Croatian, Czech, Hungarian, Polish,
                     54:                           Romanian, Serbian, Slovak, Slovenian
                     55:      ISO-8859-3           Esperanto
                     56:      ISO-8859-4           Estonian, Latvian, Lithuanian
                     57:      ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                     58:                           Serbian, Ukrainian
                     59:      ISO-8859-6           Arabic
                     60:      ISO-8859-7           Greek
                     61:      ISO-8859-8           Hebrew
                     62:      ISO-8859-9           Turkish
                     63:      ISO-8859-14          Breton, Irish, Scottish, Welsh
                     64:      ISO-8859-15          Basque, Breton, Catalan, Danish, Dutch, Estonian,
                     65:                           Faroese, Finnish, French, Galician, German,
                     66:                           Greenlandic, Icelandic, Irish, Italian, Lithuanian,
                     67:                           Norwegian, Occitan, Portuguese, Scottish, Spanish,
                     68:                           Swedish, Walloon, Welsh
                     69:      KOI8-R               Russian
                     70:      KOI8-U               Russian, Ukrainian
                     71:      EUC-JP (alias eucJP)      Japanese
                     72:      ISO-2022-JP (alias JIS7)  Japanese
                     73:      SHIFT_JIS (alias SJIS)    Japanese
                     74:      U90                       Japanese
                     75:      S90                       Japanese
                     76:      EUC-CN (alias eucCN)      Chinese
                     77:      EUC-TW (alias eucTW)      Chinese
                     78:      BIG5                      Chinese
                     79:      EUC-KR (alias eucKR)      Korean
                     80:      ARMSCII-8                 Armenian
                     81:      GEORGIAN-ACADEMY          Georgian
                     82:      GEORGIAN-PS               Georgian
                     83:      TIS-620 (alias TACTIS)    Thai
                     84:      MULELAO-1                 Laotian
                     85:      IBM-CP1133                Laotian
                     86:      VISCII                    Vietnamese
                     87:      TCVN                      Vietnamese
                     88:      NUNACOM-8                 Inuktitut
                     89: 
                     90:    Hint3: The character sets supported by Netscape Communicator 4.
                     91: 
                     92:      Where is this documented? For the complete picture, I had to use
                     93:      "strings netscape" and then a lot of guesswork. For a quick take,
                     94:      look at the "View - Character set" menu of Netscape Communicator 4.6:
                     95: 
                     96:      ISO-8859-{1,2,5,7,9,15}
                     97:      WINDOWS-{1250,1251,1253}
                     98:      KOI8-R               Cyrillic
                     99:      CP866                Cyrillic
                    100:      Autodetect           Japanese  (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
                    101:      EUC-JP               Japanese
                    102:      SHIFT_JIS            Japanese
                    103:      GB2312               Chinese
                    104:      BIG5                 Chinese
                    105:      EUC-TW               Chinese
                    106:      Autodetect           Korean    (EUC-KR, ISO-2022-KR, but not JOHAB)
                    107: 
                    108:      UTF-8
                    109:      UTF-7
                    110: 
                    111:    Hint4: The character sets supported by Microsoft Internet Explorer 4.
                    112: 
                    113:      ISO-8859-{1,2,3,4,5,6,7,8,9}
                    114:      WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
                    115:      KOI8-R               Cyrillic
                    116:      KOI8-RU              Ukrainian
                    117:      ASMO-708             Arabic
                    118:      EUC-JP               Japanese
                    119:      ISO-2022-JP          Japanese
                    120:      SHIFT_JIS            Japanese
                    121:      GB2312               Chinese
                    122:      HZ-GB-2312           Chinese
                    123:      BIG5                 Chinese
                    124:      EUC-KR               Korean
                    125:      ISO-2022-KR          Korean
                    126:      WINDOWS-874          Thai
                    127:      WINDOWS-1258         Vietnamese
                    128: 
                    129:      UTF-8
                    130:      UTF-7
                    131:      UNICODE             actually UNICODE-LITTLE
                    132:      UNICODEFEFF         actually UNICODE-BIG
                    133: 
                    134:      and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.
                    135: 
                    136:    We take the union of all these four sets. The result is:
                    137: 
                    138:    European and Semitic languages
                    139:      * ASCII.
                    140:        We implement this because it is occasionally useful to know or to
                    141:        check whether some text is entirely ASCII (i.e. if the conversion
                    142:        ISO-8859-x -> UTF-8 is trivial).
                    143:      * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
                    144:        We implement these because they are widely used. The exception is
                    145:        ISO-8859-4, which appears to have been superseded by ISO-8859-13 in
                    146:        the Baltic countries, but it is an ISO standard anyway.
                    147:      * ISO-8859-13
                    148:        We implement this because it's a standard in Lithuania and Latvia.
                    149:      * ISO-8859-14
                    150:        We implement this because it's an ISO standard.
                    151:      * ISO-8859-15
                    152:        We implement this because it's increasingly used in Europe, because
                    153:        of the Euro symbol.
                    154:      * ISO-8859-16
                    155:        We implement this because it's an ISO standard.
                    156:      * KOI8-R, KOI8-U
                    157:        We implement this because it appears to be the predominant encoding
                    158:        on Unix in Russia and Ukraine, respectively.
                    159:      * KOI8-RU
                    160:        We implement this because MSIE4 supports it.
                    161:      * KOI8-T
                    162:        We implement this because it is the locale encoding in glibc's Tajik
                    163:        locale.
                    164:      * PT154
                    165:        We implement this because it is the locale encoding in glibc's Kazakh
                    166:        locale.
                    167:      * RK1048
                    168:        We implement this because it's a standard in Kazakhstan.
                    169:      * CP{1250,1251,1252,1253,1254,1255,1256,1257}
                    170:        We implement these because they are the predominant Windows encodings
                    171:        in Europe.
                    172:      * CP850
                    173:        We implement this because it is mentioned as occurring on the web
                    174:        in the aforementioned statistics.
                    175:      * CP862
                    176:        We implement this because Ron Aaron says it is sometimes used in web
                    177:        pages and emails.
                    178:      * CP866
                    179:        We implement this because Netscape Communicator does.
                    180:      * CP1131
                    181:        We implement this because it is the locale encoding of a Belarusian
                    182:        locale in FreeBSD and MacOS X.
                    183:      * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
                    184:        Mac{Hebrew,Arabic}
                    185:        We implement these because the Sun JDK does, and because Mac users
                    186:        don't deserve to be punished.
                    187:      * Macintosh
                    188:        We implement this because it is mentioned as occurring on the web
                    189:        in the aforementioned statistics.
                    190:    Japanese
                    191:      * EUC-JP, SHIFT_JIS, ISO-2022-JP
                    192:        We implement these because they are widely used. EUC-JP and SHIFT_JIS
                    193:        are used more for files, whereas ISO-2022-JP is recommended for email.
                    194:      * CP932
                    195:        We implement this because it is the Microsoft variant of SHIFT_JIS,
                    196:        used on Windows.
                    197:      * ISO-2022-JP-2
                    198:        We implement this because it's the common way to represent mails which
                    199:        make use of JIS X 0212 characters.
                    200:      * ISO-2022-JP-1
                    201:        We implement this because it's in the RFCs, but I don't think it is
                    202:        really used.
                    203:      * U90, S90
                    204:        We DON'T implement these because I have no information about what
                    205:        they are or who uses them.
                    206:    Simplified Chinese
                    207:      * EUC-CN = GB2312
                    208:        We implement this because it is the widely used representation
                    209:        of simplified Chinese.
                    210:      * GBK
                    211:        We implement this because it appears to be used on Solaris and Windows.
                    212:      * GB18030
                    213:        We implement this because it is an official requirement in the
                    214:        People's Republic of China.
                    215:      * ISO-2022-CN
                    216:        We implement this because it is in the RFCs, but I have no idea
                    217:        whether it is really used.
                    218:      * ISO-2022-CN-EXT
                    219:        We implement this because it's in the RFCs, but I don't think it is
                    220:        really used.
                    221:      * HZ = HZ-GB-2312
                    222:        We implement this because the RFCs recommend it for Usenet postings,
                    223:        and because MSIE4 supports it.
                    224:    Traditional Chinese
                    225:      * EUC-TW
                    226:        We implement it because it appears to be used on Unix.
                    227:      * BIG5
                    228:        We implement it because it is the de-facto standard for traditional
                    229:        Chinese.
                    230:      * CP950
                    231:        We implement this because it is the Microsoft variant of BIG5, used
                    232:        on Windows.
                    233:      * BIG5+
                    234:        We DON'T implement this because it doesn't appear to be in wide use.
                    235:        Only the CWEX fonts use this encoding. Furthermore, the conversion
                    236:        tables in the big5p package are not coherent: If you convert directly,
                    237:        you get different results than when you convert via GBK.
                    238:      * BIG5-HKSCS
                    239:        We implement it because it is the de-facto standard for traditional
                    240:        Chinese in Hong Kong.
                    241:    Korean
                    242:      * EUC-KR
                    243:        We implement this because it appears to be the widely used
                    244:        representation for Korean.
                    245:      * CP949
                    246:        We implement this because it is the Microsoft variant of EUC-KR, used
                    247:        on Windows.
                    248:      * ISO-2022-KR
                    249:        We implement it because it is in the RFCs and because MSIE4 supports
                    250:        it, but I have no idea whether it's really used.
                    251:      * JOHAB
                    252:        We implement this because it is apparently used on Windows as a locale
                    253:        encoding (codepage 1361).
                    254:      * ISO-646-KR
                    255:        We DON'T implement this because, although it is an old ASCII
                    256:        variant, its glyph for 0x7E is not clear: RFC 1345 and unicode.org's
                    257:        JOHAB.TXT say it's a tilde, but Ken Lunde's "CJKV Information
                    258:        Processing" says it's an overline. And it is not ISO-IR registered.
                    259:    Armenian
                    260:      * ARMSCII-8
                    261:        We implement it because XFree86 supports it.
                    262:    Georgian
                    263:      * Georgian-Academy, Georgian-PS
                    264:        We implement these because both appear to be used for Georgian;
                    265:        XFree86 supports them.
                    266:    Thai
                    267:      * ISO-8859-11, TIS-620
                    268:        We implement these because they seem to be the standard for Thai.
                    269:      * CP874
                    270:        We implement this because MSIE4 supports it.
                    271:      * MacThai
                    272:        We implement this because the Sun JDK does, and because Mac users
                    273:        don't deserve to be punished.
                    274:    Laotian
                    275:      * MuleLao-1, CP1133
                    276:        We implement these because XFree86 supports them. I have no idea which
                    277:        one is used more widely.
                    278:    Vietnamese
                    279:      * VISCII, TCVN
                    280:        We implement these because XFree86 supports them.
                    281:      * CP1258
                    282:        We implement this because MSIE4 supports it.
                    283:    Other languages
                    284:      * NUNACOM-8 (Inuktitut)
                    285:        We DON'T implement this because it isn't part of Unicode yet, and
                    286:        therefore doesn't convert to anything except itself.
                    287:    Platform specifics
                    288:      * HP-ROMAN8, NEXTSTEP
                    289:        We implement these because they were the native character set on HPs
                    290:        and NeXTs for a long time, and libiconv is intended to be usable on
                    291:        these old machines.
                    292:    Full Unicode
                    293:      * UTF-8, UCS-2, UCS-4
                    294:        We implement these. Obviously.
                    295:      * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
                    296:        We implement these because they are the preferred internal
                    297:        representation of strings in Unicode aware applications. These are
                    298:        non-ambiguous names, known to glibc. (glibc doesn't have
                    299:        UCS-2-INTERNAL and UCS-4-INTERNAL.)
                    300:      * UTF-16, UTF-16BE, UTF-16LE
                    301:        We implement these, because UTF-16 is still the favourite encoding of
                    302:        the president of the Unicode Consortium (for political reasons), and
                    303:        because they appear in RFC 2781.
                    304:      * UTF-32, UTF-32BE, UTF-32LE
                    305:        We implement these because they are part of Unicode 3.1.
                    306:      * UTF-7
                    307:        We implement this because it is essential functionality for mail
                    308:        applications.
                    309:      * C99
                    310:        We implement it because it's used for C and C++ programs and because
                    311:        it's a nice encoding for debugging.
                    312:      * JAVA
                    313:        We implement it because it's used for Java programs and because it's
                    314:        a nice encoding for debugging.
                    315:      * UNICODE (little endian), UNICODEFEFF (big endian)
                    316:        We DON'T implement these because they are stupid and not standardized.
                    317:    Full Unicode, in terms of `uint16_t' or `uint32_t'
                    318:    (with machine dependent endianness and alignment)
                    319:      * UCS-2-INTERNAL, UCS-4-INTERNAL
                    320:        We implement these because they are the preferred internal
                    321:        representation of strings in Unicode aware applications.
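
The ASCII entry in the list above says it is occasionally useful to check
whether some text is entirely ASCII. The following is a minimal sketch of
such a check through the standard iconv(3) interface, not code taken from
libiconv itself. The function name is_entirely_ascii is hypothetical. It
assumes that the converter names "ASCII" and "UTF-8" are accepted by
iconv_open() and that iconv() is declared with a `char **' input argument
(some older installations declare `const char **' instead).

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) is entirely ASCII,
   0 if it contains a non-ASCII byte, -1 if no converter is available. */
int is_entirely_ascii(const char *buf, size_t len)
{
    /* Converting from ASCII fails on any byte >= 0x80. */
    iconv_t cd = iconv_open("UTF-8", "ASCII");
    if (cd == (iconv_t)-1)
        return -1;

    char scratch[256];        /* converted output is thrown away */
    char *in = (char *)buf;   /* iconv() does not modify the input bytes */
    size_t inleft = len;
    int result = 1;

    while (inleft > 0) {
        char *out = scratch;
        size_t outleft = sizeof scratch;
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
            if (errno == E2BIG)
                continue;     /* scratch buffer full: reuse it and go on */
            result = 0;       /* EILSEQ or EINVAL: input is not ASCII */
            break;
        }
    }
    iconv_close(cd);
    return result;
}
```

A plain loop testing each byte against 0x80 gives the same answer with
less machinery; the point here is only to show the ASCII converter doing
the check. On systems where iconv lives in a separate library, link with
-liconv.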
                    322: 
                    323: Q: Does libiconv support the encodings mentioned in RFC 1345?
                    324: A: No, they are not in use any more. Supporting ISO-646 variants is pointless
                    325:    since ISO-8859-* have been adopted.
                    326: 
                    327: Q: Does libiconv support EBCDIC?
                    328: A: No!
                    329: 
                    330: Q: How do I add a new character set?
                    331: A: 1. Explain the "why" in this file, above.
                    332:    2. You need to have a conversion table from/to Unicode. Transform it into
                    333:    the format used by the mapping tables found on ftp.unicode.org: each line
                    334:    contains the character code, in hex, with 0x prefix, then whitespace,
                    335:    then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
                    336:    counts as a comment delimiter until end of line.
                    337:    Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
                    338:    can include it in his collection.
                    339:    3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
                    340:    tools directory to generate the C code for the conversion. You may tweak
                    341:    the resulting C code if you are not satisfied with its quality, but this
                    342:    is rarely needed.
                    343:    If it's a two-dimensional character set (with rows and columns), use the
                    344:    'cjk_tab_to_h' program in the tools directory to generate the C code for
                    345:    the conversion. You will need to modify the main() function to recognize
                    346:    the new character set name, with the proper dimensions, but that shouldn't
                    347:    be too hard. This yields the CCS. The CES you have to write by hand.
                    348:    4. Store the resulting C code file in the lib directory. Add a #include
                    349:    directive to converters.h, and add an entry to the encodings.def file.
                    350:    5. Compile the package, and test your new encoding using a program like
                    351:    iconv(1) or clisp(1).
                    352:    6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
                    353:    encoding, create the complete table as a TXT file. For a stateful encoding,
                    354:    provide a text snippet encoded using your new encoding and its UTF-8
                    355:    equivalent.
                    356:    7. Update the README and man/iconv_open.3, to mention the new encoding.
                    357:    Add a note in the NEWS file.
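
The table format described in step 2 above can be read with a few lines
of C. The following is a minimal sketch under stated assumptions, not the
parser used by the programs in the tools directory. The function names
parse_hex and parse_mapping_line are hypothetical. It accepts any number
of hex digits in both columns, although the format calls for 4 digits in
the Unicode column, and it does not check for overflow.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Parses "0x" followed by hex digits at p. On success stores the value
   and returns a pointer just past the digits, otherwise returns NULL. */
static const char *parse_hex(const char *p, unsigned long *value)
{
    if (p[0] != '0' || (p[1] != 'x' && p[1] != 'X'))
        return NULL;
    p += 2;
    if (!isxdigit((unsigned char)*p))
        return NULL;
    unsigned long v = 0;
    for (; isxdigit((unsigned char)*p); p++) {
        int c = tolower((unsigned char)*p);
        v = v * 16 + (unsigned long)(isdigit(c) ? c - '0' : c - 'a' + 10);
    }
    *value = v;
    return p;
}

/* Returns 1 and fills *code and *ucs if the line holds a mapping,
   0 if it is blank or holds only a comment, -1 if it is malformed. */
int parse_mapping_line(const char *line, unsigned long *code,
                       unsigned long *ucs)
{
    const char *p = line + strspn(line, " \t");
    if (*p == '\0' || *p == '#' || *p == '\r' || *p == '\n')
        return 0;
    p = parse_hex(p, code);            /* character code */
    if (p == NULL)
        return -1;
    size_t gap = strspn(p, " \t");     /* whitespace between the columns */
    if (gap == 0)
        return -1;
    p = parse_hex(p + gap, ucs);       /* Unicode code point */
    if (p == NULL)
        return -1;
    p += strspn(p, " \t");
    /* Only a comment or the end of the line may follow. */
    return (*p == '\0' || *p == '#' || *p == '\r' || *p == '\n') ? 1 : -1;
}
```

For example, the line "0xA4<TAB>0x20AC<TAB># EURO SIGN" yields the pair
(0xA4, 0x20AC), while a line consisting only of a '#' comment yields 0.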
                    358: 
                    359: Q: What about bidirectional text? Should it be tagged or reversed when
                    360:    converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
                    361:    this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
                    362: A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
                    363:    ISO-8859-E remains to be implemented.
                    364:    On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
                    365:    is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
                    366:    the same as ISO-8859-8-I. I'm confused.
                    367: 
                    368: Other character sets not implemented:
                    369: "MNEMONIC" = "csMnemonic"
                    370: "MNEM" = "csMnem"
                    371: "ISO-10646-UCS-Basic" = "csUnicodeASCII"
                    372: "ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
                    373: "ISO-10646-J-1"
                    374: "UNICODE-1-1" = "csUnicode11"
                    375: "csWindows31Latin5"
                    376: 
                    377: Other aliases not implemented (and not implemented in glibc-2.1 either):
                    378:   From MSIE4:
                    379:     ISO-8859-1: alias ISO8859-1
                    380:     ISO-8859-2: alias ISO8859-2
                    381:     KSC_5601: alias KS_C_5601
                    382:     UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8
                    383: 
                    384: 
                    385: Q: How can I integrate libiconv into my package?
                    386: A: Just copy the entire libiconv package into a subdirectory of your package.
                    387:    At configuration time, call libiconv's configure script with the
                    388:    appropriate --srcdir option and maybe --enable-static or --disable-shared.
                    389:    Then "cd libiconv && make && make install-lib libdir=... includedir=...".
                    390:    'install-lib' is a special (not GNU standardized) target which installs
                    391:    only the include file - in $(includedir) - and the library - in $(libdir) -
                    392:    and does not use other directory variables. After "installing" libiconv
                    393:    in your package's build directory, building of your package can proceed.
                    394: 
                    395: Q: Why is the testsuite so big?
                    396: A: Because some of the tests are very comprehensive.
                    397:    If you don't feel like using the testsuite, you can simply remove the
                    398:    tests/ directory.
                    399: