File:  [ELWIX - Embedded LightWeight unIX -] / embedaddon / libiconv / NOTES
Wed Mar 17 13:38:46 2021 UTC (3 years, 3 months ago) by misho
Branches: libiconv, MAIN
CVS tags: v1_16p0, HEAD
libiconv 1.16

Q: Why does libiconv support encoding XXX? Why does libiconv not support
   encoding ZZZ?

A: libiconv, as an internationalization library, supports those character
   sets and encodings which are in widespread use in at least one territory
   of the world.

   Hint1: On http://www.w3c.org/International/O-charset-lang.html you find a
   page "Languages, countries, and the charsets typically used for them".
   From this table, we can conclude that the following are in active use:

     ISO-8859-1, CP1252   Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
                          English, Faroese, Finnish, French, Galician, German,
                          Icelandic, Irish, Italian, Norwegian, Portuguese,
                          Scottish, Spanish, Swedish
     ISO-8859-2           Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
                          Slovenian
     ISO-8859-3           Esperanto, Maltese
     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                          Serbian, Ukrainian
     ISO-8859-6           Arabic
     ISO-8859-7           Greek
     ISO-8859-8           Hebrew
     ISO-8859-9, CP1254   Turkish
     ISO-8859-10          Inuit, Lapp
     ISO-8859-13          Latvian, Lithuanian
     ISO-8859-15          Estonian
     KOI8-R               Russian
     SHIFT_JIS            Japanese
     ISO-2022-JP          Japanese
     EUC-JP               Japanese

   Ordered by frequency on the web (1997):
     ISO-8859-1, CP1252   96%
     SHIFT_JIS             1.6%
     ISO-2022-JP           1.2%
     EUC-JP                0.4%
     CP1250                0.3%
     CP1251                0.2%
     CP850                 0.1%
     MACINTOSH             0.1%
     ISO-8859-5            0.1%
     ISO-8859-2            0.0%

   Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.

     ISO-8859-1           Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
                          English, Estonian, Faroese, Finnish, French,
                          Galician, German, Greenlandic, Icelandic,
                          Indonesian, Irish, Italian, Lithuanian, Norwegian,
                          Occitan, Portuguese, Scottish, Spanish, Swedish,
                          Walloon, Welsh
     ISO-8859-2           Albanian, Croatian, Czech, Hungarian, Polish,
                          Romanian, Serbian, Slovak, Slovenian
     ISO-8859-3           Esperanto
     ISO-8859-4           Estonian, Latvian, Lithuanian
     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                          Serbian, Ukrainian
     ISO-8859-6           Arabic
     ISO-8859-7           Greek
     ISO-8859-8           Hebrew
     ISO-8859-9           Turkish
     ISO-8859-14          Breton, Irish, Scottish, Welsh
     ISO-8859-15          Basque, Breton, Catalan, Danish, Dutch, Estonian,
                          Faroese, Finnish, French, Galician, German,
                          Greenlandic, Icelandic, Irish, Italian, Lithuanian,
                          Norwegian, Occitan, Portuguese, Scottish, Spanish,
                          Swedish, Walloon, Welsh
     KOI8-R               Russian
     KOI8-U               Russian, Ukrainian
     EUC-JP (alias eucJP)      Japanese
     ISO-2022-JP (alias JIS7)  Japanese
     SHIFT_JIS (alias SJIS)    Japanese
     U90                       Japanese
     S90                       Japanese
     EUC-CN (alias eucCN)      Chinese
     EUC-TW (alias eucTW)      Chinese
     BIG5                      Chinese
     EUC-KR (alias eucKR)      Korean
     ARMSCII-8                 Armenian
     GEORGIAN-ACADEMY          Georgian
     GEORGIAN-PS               Georgian
     TIS-620 (alias TACTIS)    Thai
     MULELAO-1                 Laotian
     IBM-CP1133                Laotian
     VISCII                    Vietnamese
     TCVN                      Vietnamese
     NUNACOM-8                 Inuktitut

   Hint3: The character sets supported by Netscape Communicator 4.

     Where is this documented? For the complete picture, I had to use
     "strings netscape" and then a lot of guesswork. For a quick take,
     look at the "View - Character set" menu of Netscape Communicator 4.6:

     ISO-8859-{1,2,5,7,9,15}
     WINDOWS-{1250,1251,1253}
     KOI8-R               Cyrillic
     CP866                Cyrillic
     Autodetect           Japanese  (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
     EUC-JP               Japanese
     SHIFT_JIS            Japanese
     GB2312               Chinese
     BIG5                 Chinese
     EUC-TW               Chinese
     Autodetect           Korean    (EUC-KR, ISO-2022-KR, but not JOHAB)

     UTF-8
     UTF-7

   Hint4: The character sets supported by Microsoft Internet Explorer 4.

     ISO-8859-{1,2,3,4,5,6,7,8,9}
     WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
     KOI8-R               Cyrillic
     KOI8-RU              Ukrainian
     ASMO-708             Arabic
     EUC-JP               Japanese
     ISO-2022-JP          Japanese
     SHIFT_JIS            Japanese
     GB2312               Chinese
     HZ-GB-2312           Chinese
     BIG5                 Chinese
     EUC-KR               Korean
     ISO-2022-KR          Korean
     WINDOWS-874          Thai
     WINDOWS-1258         Vietnamese

     UTF-8
     UTF-7
     UNICODE             actually UNICODE-LITTLE
     UNICODEFEFF         actually UNICODE-BIG

     and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.

   We take the union of all four of these sets. The result is:

   European and Semitic languages
     * ASCII.
       We implement this because it is occasionally useful to know or to
       check whether some text is entirely ASCII (i.e. if the conversion
       ISO-8859-x -> UTF-8 is trivial).
     * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
       We implement these because they are widely used, except ISO-8859-4,
       which appears to have been superseded by ISO-8859-13 in the Baltic
       countries. But it's an ISO standard anyway.
     * ISO-8859-13
       We implement this because it's a standard in Lithuania and Latvia.
     * ISO-8859-14
       We implement this because it's an ISO standard.
     * ISO-8859-15
       We implement this because it's increasingly used in Europe, because
       of the Euro symbol.
     * ISO-8859-16
       We implement this because it's an ISO standard.
     * KOI8-R, KOI8-U
       We implement these because they appear to be the predominant encodings
       on Unix in Russia and Ukraine, respectively.
     * KOI8-RU
       We implement this because MSIE4 supports it.
     * KOI8-T
       We implement this because it is the locale encoding in glibc's Tajik
       locale.
     * PT154
       We implement this because it is the locale encoding in glibc's Kazakh
       locale.
     * RK1048
       We implement this because it's a standard in Kazakhstan.
     * CP{1250,1251,1252,1253,1254,1255,1256,1257}
       We implement these because they are the predominant Windows encodings
       in Europe.
     * CP850
       We implement this because it is mentioned as occurring on the web
       in the aforementioned statistics.
     * CP862
       We implement this because Ron Aaron says it is sometimes used in web
       pages and emails.
     * CP866
       We implement this because Netscape Communicator does.
     * CP1131
       We implement this because it is the locale encoding of a Belarusian
       locale in FreeBSD and MacOS X.
     * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
       Mac{Hebrew,Arabic}
       We implement these because the Sun JDK does, and because Mac users
       don't deserve to be punished.
     * Macintosh
       We implement this because it is mentioned as occurring on the web
       in the aforementioned statistics.
   Japanese
     * EUC-JP, SHIFT_JIS, ISO-2022-JP
       We implement these because they are widely used. EUC-JP and SHIFT_JIS
       are used more for files, whereas ISO-2022-JP is recommended for email.
     * CP932
       We implement this because it is the Microsoft variant of SHIFT_JIS,
       used on Windows.
     * ISO-2022-JP-2
       We implement this because it's the common way to represent emails which
       make use of JIS X 0212 characters.
     * ISO-2022-JP-1
       We implement this because it's in the RFCs, but I don't think it is
       really used.
     * ISO-2022-JP-MS
       We implement this because Microsoft Outlook Express / Microsoft MimeOLE
       sends emails in this encoding.
     * U90, S90
       We DON'T implement these because I have no information about what
       they are or who uses them.
   Simplified Chinese
     * EUC-CN = GB2312
       We implement this because it is the widely used representation
       of simplified Chinese.
     * GBK
       We implement this because it appears to be used on Solaris and Windows.
     * GB18030
       We implement this because it is an official requirement in the
       People's Republic of China.
     * ISO-2022-CN
       We implement this because it is in the RFCs, but I have no idea
       whether it is really used.
     * ISO-2022-CN-EXT
       We implement this because it's in the RFCs, but I don't think it is
       really used.
     * HZ = HZ-GB-2312
       We implement this because the RFCs recommend it for Usenet postings,
       and because MSIE4 supports it.
   Traditional Chinese
     * EUC-TW
       We implement it because it appears to be used on Unix.
     * BIG5
       We implement it because it is the de facto standard for traditional
       Chinese.
     * CP950
       We implement this because it is the Microsoft variant of BIG5, used
       on Windows.
     * BIG5+
       We DON'T implement this because it doesn't appear to be in wide use.
       Only the CWEX fonts use this encoding. Furthermore, the conversion
       tables in the big5p package are not coherent: If you convert directly,
       you get different results than when you convert via GBK.
     * BIG5-HKSCS
       We implement it because it is the de facto standard for traditional
       Chinese in Hong Kong.
   Korean
     * EUC-KR
       We implement this because it appears to be the widely used
       representation for Korean.
     * CP949
       We implement this because it is the Microsoft variant of EUC-KR, used
       on Windows.
     * ISO-2022-KR
       We implement it because it is in the RFCs and because MSIE4 supports
       it, but I have no idea whether it's really used.
     * JOHAB
       We implement this because it is apparently used on Windows as a locale
       encoding (codepage 1361).
     * ISO-646-KR
       We DON'T implement this because, although it is an old ASCII variant,
       its glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
       say it's a tilde, but Ken Lunde's "CJKV Information Processing" says
       it's an overline. And it is not ISO-IR registered.
   Armenian
     * ARMSCII-8
       We implement it because XFree86 supports it.
   Georgian
     * Georgian-Academy, Georgian-PS
       We implement these because both appear to be used for Georgian;
       XFree86 supports them.
   Thai
     * ISO-8859-11, TIS-620
       We implement these because they seem to be the standard for Thai.
     * CP874
       We implement this because MSIE4 supports it.
     * MacThai
       We implement this because the Sun JDK does, and because Mac users
       don't deserve to be punished.
   Laotian
     * MuleLao-1, CP1133
       We implement these because XFree86 supports them. I have no idea which
       one is used more widely.
   Vietnamese
     * VISCII, TCVN
       We implement these because XFree86 supports them.
     * CP1258
       We implement this because MSIE4 supports it.
   Other languages
     * NUNACOM-8 (Inuktitut)
       We DON'T implement this because it isn't part of Unicode yet, and
       therefore doesn't convert to anything except itself.
   Platform specifics
     * HP-ROMAN8, NEXTSTEP
       We implement these because they were the native character set on HPs
       and NeXTs for a long time, and libiconv is intended to be usable on
       these old machines.
   Full Unicode
     * UTF-8, UCS-2, UCS-4
       We implement these. Obviously.
     * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
       We implement these because they are the preferred internal
       representation of strings in Unicode aware applications. These are
       non-ambiguous names, known to glibc. (glibc doesn't have
       UCS-2-INTERNAL and UCS-4-INTERNAL.)
     * UTF-16, UTF-16BE, UTF-16LE
       We implement these, because UTF-16 is still the favourite encoding of
       the president of the Unicode Consortium (for political reasons), and
       because they appear in RFC 2781.
     * UTF-32, UTF-32BE, UTF-32LE
       We implement these because they are part of Unicode 3.1.
     * UTF-7
       We implement this because it is essential functionality for mail
       applications.
     * C99
       We implement it because it's used for C and C++ programs and because
       it's a nice encoding for debugging.
     * JAVA
       We implement it because it's used for Java programs and because it's
       a nice encoding for debugging.
     * UNICODE (little endian), UNICODEFEFF (big endian)
       We DON'T implement these because they are stupid and not standardized.
   Full Unicode, in terms of 'uint16_t' or 'uint32_t'
   (with machine dependent endianness and alignment)
     * UCS-2-INTERNAL, UCS-4-INTERNAL
       We implement these because they are the preferred internal
       representation of strings in Unicode aware applications.

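   As an illustration of the ASCII entry in the list above, here is a minimal
   sketch that uses the standard iconv_open(), iconv() and iconv_close() calls
   to check whether a buffer is entirely ASCII. The function name
   is_entirely_ascii is hypothetical and not part of libiconv. The sketch
   assumes that the "ASCII" converter reports bytes above 0x7F with EILSEQ;
   the converted output is simply discarded.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper, not part of libiconv.
   Returns 1 if the buffer is entirely ASCII, 0 if it contains a byte that
   the "ASCII" converter rejects, and -1 on any other error. */
int is_entirely_ascii(const char *text, size_t len)
{
    iconv_t cd = iconv_open("UTF-8", "ASCII");
    if (cd == (iconv_t)-1)
        return -1;                  /* converter not available */

    char *in = (char *)text;        /* iconv() takes char **, input is only read */
    size_t inleft = len;
    int result = 1;

    while (inleft > 0) {
        char scratch[256];          /* output is thrown away */
        char *out = scratch;
        size_t outleft = sizeof scratch;

        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
            if (errno == E2BIG)
                continue;           /* scratch buffer full: convert the rest */
            /* Assumption: a non-ASCII input byte is reported as EILSEQ. */
            result = (errno == EILSEQ) ? 0 : -1;
            break;
        }
    }

    iconv_close(cd);
    return result;
}
```

   A caller could use such a check to skip an ISO-8859-x -> UTF-8 conversion
   when the input turns out to be plain ASCII.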
Q: Support encodings mentioned in RFC 1345?
A: No, they are not in use any more. Supporting ISO-646 variants is pointless
   since ISO-8859-* have been adopted.

Q: Support EBCDIC?
A: No!

Q: How do I add a new character set?
A: 1. Explain the "why" in this file, above.
   2. You need to have a conversion table from/to Unicode. Transform it into
   the format used by the mapping tables found on ftp.unicode.org: each line
   contains the character code, in hex, with 0x prefix, then whitespace,
   then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
   counts as a comment delimiter until end of line.
   Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
   can include it in his collection.
   3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
   tools directory to generate the C code for the conversion. You may tweak
   the resulting C code if you are not satisfied with its quality, but this
   is rarely needed.
   If it's a two-dimensional character set (with rows and columns), use the
   'cjk_tab_to_h' program in the tools directory to generate the C code for
   the conversion. You will need to modify the main() function to recognize
   the new character set name, with the proper dimensions, but that shouldn't
   be too hard. This yields the CCS. The CES you have to write by hand.
   4. Store the resulting C code file in the lib directory. Add a #include
   directive to converters.h, and add an entry to the encodings.def file.
   5. Compile the package, and test your new encoding using a program like
   iconv(1) or clisp(1).
   6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
   encoding, create the complete table as a TXT file. For a stateful encoding,
   provide a text snippet encoded using your new encoding and its UTF-8
   equivalent.
   7. Update the README and man/iconv_open.3, to mention the new encoding.
   Add a note in the NEWS file.

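   The table format described in step 2 can be sketched as a small line
   parser. This is only an illustration of the format, not code from the
   tools directory; the function name parse_mapping_line and its return
   convention are hypothetical. It treats lines without a Unicode column
   as malformed, which a real tool may prefer to skip instead.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of libiconv.
   Parses one line of the form "0xXX<whitespace>0xXXXX", where '#' starts
   a comment that runs to the end of the line.
   Returns 1 and fills *code and *ucs on success, 0 for blank or
   comment-only lines, and -1 for malformed or overlong lines. */
int parse_mapping_line(const char *line, unsigned int *code, unsigned int *ucs)
{
    char buf[256];
    size_t n = strcspn(line, "#\r\n");    /* cut at comment or end of line */
    if (n >= sizeof buf)
        return -1;
    memcpy(buf, line, n);
    buf[n] = '\0';

    if (buf[strspn(buf, " \t")] == '\0')
        return 0;                          /* nothing but whitespace */

    /* Two hex fields, each with a literal 0x prefix; the trailing %c only
       matches if unexpected text follows the second field. */
    char tail;
    int fields = sscanf(buf, " 0x%x 0x%x %c", code, ucs, &tail);
    return (fields == 2) ? 1 : -1;
}
```

   For instance, the line "0xA4<TAB>0x20AC<TAB># EURO SIGN" would yield the
   character code 0xA4 and the Unicode code point 0x20AC.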
Q: What about bidirectional text? Should it be tagged or reversed when
   converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
   this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
   ISO-8859-E remains to be implemented.
   On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
   is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
   the same as ISO-8859-8-I. I'm confused.

Other character sets not implemented:
"MNEMONIC" = "csMnemonic"
"MNEM" = "csMnem"
"ISO-10646-UCS-Basic" = "csUnicodeASCII"
"ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
"ISO-10646-J-1"
"UNICODE-1-1" = "csUnicode11"
"csWindows31Latin5"

Other aliases not implemented (and not implemented in glibc-2.1 either):
  From MSIE4:
    ISO-8859-1: alias ISO8859-1
    ISO-8859-2: alias ISO8859-2
    KSC_5601: alias KS_C_5601
    UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8

Q: How can I integrate libiconv into my package?
A: Just copy the entire libiconv package into a subdirectory of your package.
   At configuration time, call libiconv's configure script with the
   appropriate --srcdir option and maybe --enable-static or --disable-shared.
   Then "cd libiconv && make && make install-lib libdir=... includedir=...".
   'install-lib' is a special (not GNU standardized) target which installs
   only the include file - in $(includedir) - and the library - in $(libdir) -
   and does not use other directory variables. After "installing" libiconv
   in your package's build directory, building of your package can proceed.

Q: Why is the testsuite so big?
A: Because some of the tests are very comprehensive.
   If you don't feel like using the testsuite, you can simply remove the
   tests/ directory.
