File: [ELWIX - Embedded LightWeight unIX -] / embedaddon / libiconv / NOTES
Revision 1.1.1.2 (vendor branch)
Wed Mar 17 13:38:46 2021 UTC by misho
Branches: libiconv, MAIN
CVS tags: v1_16p0, HEAD

libiconv 1.16

Q: Why does libiconv support encoding XXX? Why does libiconv not support
   encoding ZZZ?

A: libiconv, as an internationalization library, supports those character
   sets and encodings which are in wide-spread use in at least one territory
   of the world.

   Hint1: On http://www.w3c.org/International/O-charset-lang.html you find a
   page "Languages, countries, and the charsets typically used for them".
   From this table, we can conclude that the following are in active use:

     ISO-8859-1, CP1252   Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
                          English, Faroese, Finnish, French, Galician, German,
                          Icelandic, Irish, Italian, Norwegian, Portuguese,
                          Scottish, Spanish, Swedish
     ISO-8859-2           Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
                          Slovenian
     ISO-8859-3           Esperanto, Maltese
     ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
                          Serbian, Ukrainian
     ISO-8859-6           Arabic
     ISO-8859-7           Greek
     ISO-8859-8           Hebrew
     ISO-8859-9, CP1254   Turkish
     ISO-8859-10          Inuit, Lapp
     ISO-8859-13          Latvian, Lithuanian
     ISO-8859-15          Estonian
     KOI8-R               Russian
     SHIFT_JIS            Japanese
     ISO-2022-JP          Japanese
     EUC-JP               Japanese

   Ordered by frequency on the web (1997):
     ISO-8859-1, CP1252   96%
     SHIFT_JIS             1.6%
     ISO-2022-JP           1.2%
     EUC-JP                0.4%
     CP1250                0.3%
     CP1251                0.2%
     CP850                 0.1%
     MACINTOSH             0.1%
     ISO-8859-5            0.1%
     ISO-8859-2            0.0%

   Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.

     ISO-8859-1                Afrikaans, Basque, Breton, Catalan, Danish,
                               Dutch, English, Estonian, Faroese, Finnish,
                               French, Galician, German, Greenlandic,
                               Icelandic, Indonesian, Irish, Italian,
                               Lithuanian, Norwegian, Occitan, Portuguese,
                               Scottish, Spanish, Swedish, Walloon, Welsh
     ISO-8859-2                Albanian, Croatian, Czech, Hungarian, Polish,
                               Romanian, Serbian, Slovak, Slovenian
     ISO-8859-3                Esperanto
     ISO-8859-4                Estonian, Latvian, Lithuanian
     ISO-8859-5                Bulgarian, Byelorussian, Macedonian, Russian,
                               Serbian, Ukrainian
     ISO-8859-6                Arabic
     ISO-8859-7                Greek
     ISO-8859-8                Hebrew
     ISO-8859-9                Turkish
     ISO-8859-14               Breton, Irish, Scottish, Welsh
     ISO-8859-15               Basque, Breton, Catalan, Danish, Dutch,
                               Estonian, Faroese, Finnish, French, Galician,
                               German, Greenlandic, Icelandic, Irish, Italian,
                               Lithuanian, Norwegian, Occitan, Portuguese,
                               Scottish, Spanish, Swedish, Walloon, Welsh
     KOI8-R                    Russian
     KOI8-U                    Russian, Ukrainian
     EUC-JP (alias eucJP)      Japanese
     ISO-2022-JP (alias JIS7)  Japanese
     SHIFT_JIS (alias SJIS)    Japanese
     U90                       Japanese
     S90                       Japanese
     EUC-CN (alias eucCN)      Chinese
     EUC-TW (alias eucTW)      Chinese
     BIG5                      Chinese
     EUC-KR (alias eucKR)      Korean
     ARMSCII-8                 Armenian
     GEORGIAN-ACADEMY          Georgian
     GEORGIAN-PS               Georgian
     TIS-620 (alias TACTIS)    Thai
     MULELAO-1                 Laotian
     IBM-CP1133                Laotian
     VISCII                    Vietnamese
     TCVN                      Vietnamese
     NUNACOM-8                 Inuktitut

   Hint3: The character sets supported by Netscape Communicator 4.

     Where is this documented? For the complete picture, I had to use
     "strings netscape" and then a lot of guesswork. For a quick take,
     look at the "View - Character set" menu of Netscape Communicator 4.6:

     ISO-8859-{1,2,5,7,9,15}
     WINDOWS-{1250,1251,1253}
     KOI8-R               Cyrillic
     CP866                Cyrillic
     Autodetect           Japanese  (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
     EUC-JP               Japanese
     SHIFT_JIS            Japanese
     GB2312               Chinese
     BIG5                 Chinese
     EUC-TW               Chinese
     Autodetect           Korean    (EUC-KR, ISO-2022-KR, but not JOHAB)

     UTF-8
     UTF-7

   Hint4: The character sets supported by Microsoft Internet Explorer 4.

     ISO-8859-{1,2,3,4,5,6,7,8,9}
     WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
     KOI8-R               Cyrillic
     KOI8-RU              Ukrainian
     ASMO-708             Arabic
     EUC-JP               Japanese
     ISO-2022-JP          Japanese
     SHIFT_JIS            Japanese
     GB2312               Chinese
     HZ-GB-2312           Chinese
     BIG5                 Chinese
     EUC-KR               Korean
     ISO-2022-KR          Korean
     WINDOWS-874          Thai
     WINDOWS-1258         Vietnamese

     UTF-8
     UTF-7
     UNICODE              actually UNICODE-LITTLE
     UNICODEFEFF          actually UNICODE-BIG

     and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.

   We take the union of all these four sets. The result is:

   European and Semitic languages
     * ASCII.
       We implement this because it is occasionally useful to know or to
       check whether some text is entirely ASCII (i.e. if the conversion
       ISO-8859-x -> UTF-8 is trivial).
     * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
       We implement these because they are widely used, except ISO-8859-4,
       which appears to have been superseded by ISO-8859-13 in the Baltic
       countries. But it's an ISO standard anyway.
     * ISO-8859-13
       We implement this because it's a standard in Lithuania and Latvia.
     * ISO-8859-14
       We implement this because it's an ISO standard.
     * ISO-8859-15
       We implement this because it's increasingly used in Europe, because
       of the Euro symbol.
     * ISO-8859-16
       We implement this because it's an ISO standard.
     * KOI8-R, KOI8-U
       We implement these because they appear to be the predominant encodings
       on Unix in Russia and Ukraine, respectively.
     * KOI8-RU
       We implement this because MSIE4 supports it.
     * KOI8-T
       We implement this because it is the locale encoding in glibc's Tajik
       locale.
     * PT154
       We implement this because it is the locale encoding in glibc's Kazakh
       locale.
     * RK1048
       We implement this because it's a standard in Kazakhstan.
     * CP{1250,1251,1252,1253,1254,1255,1256,1257}
       We implement these because they are the predominant Windows encodings
       in Europe.
     * CP850
       We implement this because it is mentioned as occurring on the web
       in the aforementioned statistics.
     * CP862
       We implement this because Ron Aaron says it is sometimes used in web
       pages and emails.
     * CP866
       We implement this because Netscape Communicator does.
     * CP1131
       We implement this because it is the locale encoding of a Belarusian
       locale in FreeBSD and MacOS X.
     * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
       Mac{Hebrew,Arabic}
       We implement these because the Sun JDK does, and because Mac users
       don't deserve to be punished.
     * Macintosh
       We implement this because it is mentioned as occurring on the web
       in the aforementioned statistics.
   Japanese
     * EUC-JP, SHIFT_JIS, ISO-2022-JP
       We implement these because they are widely used. EUC-JP and SHIFT_JIS
       are used more for files, whereas ISO-2022-JP is recommended for email.
     * CP932
       We implement this because it is the Microsoft variant of SHIFT_JIS,
       used on Windows.
     * ISO-2022-JP-2
       We implement this because it's the common way to represent mails which
       make use of JIS X 0212 characters.
     * ISO-2022-JP-1
       We implement this because it's in the RFCs, but I don't think it is
       really used.
     * ISO-2022-JP-MS
       We implement this because Microsoft Outlook Express / Microsoft MimeOLE
       sends emails in this encoding.
     * U90, S90
       We DON'T implement these because I have no information about what they
       are or who uses them.
   Simplified Chinese
     * EUC-CN = GB2312
       We implement this because it is the widely used representation
       of simplified Chinese.
     * GBK
       We implement this because it appears to be used on Solaris and Windows.
     * GB18030
       We implement this because it is an official requirement in the
       People's Republic of China.
     * ISO-2022-CN
       We implement this because it is in the RFCs, but I have no idea
       whether it is really used.
     * ISO-2022-CN-EXT
       We implement this because it's in the RFCs, but I don't think it is
       really used.
     * HZ = HZ-GB-2312
       We implement this because the RFCs recommend it for Usenet postings,
       and because MSIE4 supports it.
   Traditional Chinese
     * EUC-TW
       We implement it because it appears to be used on Unix.
     * BIG5
       We implement it because it is the de-facto standard for traditional
       Chinese.
     * CP950
       We implement this because it is the Microsoft variant of BIG5, used
       on Windows.
     * BIG5+
       We DON'T implement this because it doesn't appear to be in wide use.
       Only the CWEX fonts use this encoding. Furthermore, the conversion
       tables in the big5p package are not coherent: If you convert directly,
       you get different results than when you convert via GBK.
     * BIG5-HKSCS
       We implement it because it is the de-facto standard for traditional
       Chinese in Hong Kong.
   Korean
     * EUC-KR
       We implement this because it appears to be the widely used
       representation for Korean.
     * CP949
       We implement this because it is the Microsoft variant of EUC-KR, used
       on Windows.
     * ISO-2022-KR
       We implement it because it is in the RFCs and because MSIE4 supports
       it, but I have no idea whether it's really used.
     * JOHAB
       We implement this because it is apparently used on Windows as a locale
       encoding (codepage 1361).
     * ISO-646-KR
       We DON'T implement this because, although it is an old ASCII variant,
       its glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
       say it's a tilde, but Ken Lunde's "CJKV Information Processing" says
       it's an overline. And it is not ISO-IR registered.
   Armenian
     * ARMSCII-8
       We implement it because XFree86 supports it.
   Georgian
     * Georgian-Academy, Georgian-PS
       We implement these because both appear to be used for Georgian;
       XFree86 supports them.
   Thai
     * ISO-8859-11, TIS-620
       We implement these because they seem to be the standard for Thai.
     * CP874
       We implement this because MSIE4 supports it.
     * MacThai
       We implement this because the Sun JDK does, and because Mac users
       don't deserve to be punished.
   Laotian
     * MuleLao-1, CP1133
       We implement these because XFree86 supports them. I have no idea which
       one is used more widely.
   Vietnamese
     * VISCII, TCVN
       We implement these because XFree86 supports them.
     * CP1258
       We implement this because MSIE4 supports it.
   Other languages
     * NUNACOM-8 (Inuktitut)
       We DON'T implement this because it isn't part of Unicode yet, and
       therefore doesn't convert to anything except itself.
   Platform specifics
     * HP-ROMAN8, NEXTSTEP
       We implement these because they were the native character set on HPs
       and NeXTs for a long time, and libiconv is intended to be usable on
       these old machines.
   Full Unicode
     * UTF-8, UCS-2, UCS-4
       We implement these. Obviously.
     * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
       We implement these because they are the preferred internal
       representation of strings in Unicode aware applications. These are
       non-ambiguous names, known to glibc. (glibc doesn't have
       UCS-2-INTERNAL and UCS-4-INTERNAL.)
     * UTF-16, UTF-16BE, UTF-16LE
       We implement these, because UTF-16 is still the favourite encoding of
       the president of the Unicode Consortium (for political reasons), and
       because they appear in RFC 2781.
     * UTF-32, UTF-32BE, UTF-32LE
       We implement these because they are part of Unicode 3.1.
     * UTF-7
       We implement this because it is essential functionality for mail
       applications.
     * C99
       We implement it because it's used for C and C++ programs and because
       it's a nice encoding for debugging.
     * JAVA
       We implement it because it's used for Java programs and because it's
       a nice encoding for debugging.
     * UNICODE (little endian), UNICODEFEFF (big endian)
       We DON'T implement these because they are stupid and not standardized.
   Full Unicode, in terms of 'uint16_t' or 'uint32_t'
   (with machine dependent endianness and alignment)
     * UCS-2-INTERNAL, UCS-4-INTERNAL
       We implement these because they are the preferred internal
       representation of strings in Unicode aware applications.

Q: Support encodings mentioned in RFC 1345?
A: No, they are not in use any more. Supporting ISO-646 variants is pointless
   since ISO-8859-* have been adopted.

Q: Support EBCDIC?
A: No!

Q: How do I add a new character set?
A: 1. Explain the "why" in this file, above.
   2. You need to have a conversion table from/to Unicode. Transform it into
      the format used by the mapping tables found on ftp.unicode.org: each
      line contains the character code, in hex, with 0x prefix, then
      whitespace, then the Unicode code point, in hex, 4 hex digits, with 0x
      prefix. '#' counts as a comment delimiter until end of line.
      Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so
      he can include it in his collection.
   3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
      tools directory to generate the C code for the conversion. You may tweak
      the resulting C code if you are not satisfied with its quality, but this
      is rarely needed.
      If it's a two-dimensional character set (with rows and columns), use the
      'cjk_tab_to_h' program in the tools directory to generate the C code for
      the conversion. You will need to modify the main() function to recognize
      the new character set name, with the proper dimensions, but that
      shouldn't be too hard. This yields the CCS (coded character set). The
      CES (character encoding scheme) you have to write by hand.
   4. Store the resulting C code file in the lib directory. Add a #include
      directive to converters.h, and add an entry to the encodings.def file.
   5. Compile the package, and test your new encoding using a program like
      iconv(1) or clisp(1).
   6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
      encoding, create the complete table as a TXT file. For a stateful
      encoding, provide a text snippet encoded using your new encoding and its
      UTF-8 equivalent.
   7. Update the README and man/iconv_open.3, to mention the new encoding.
      Add a note in the NEWS file.

Q: What about bidirectional text? Should it be tagged or reversed when
   converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
   this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
   ISO-8859-E remains to be implemented.
   On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
   is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
   the same as ISO-8859-8-I. I'm confused.
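
   Illustration for step 2 of "How do I add a new character set?": the
   mapping-table format described there (character code in hex with 0x prefix,
   whitespace, Unicode code point in hex with 0x prefix, '#' starting a
   comment) can be read with a few lines of code. This is a minimal Python
   sketch and not part of libiconv. The function name parse_mapping_table and
   the sample table are made up for illustration, and skipping lines that list
   no Unicode value is an assumption of the sketch.

```python
def parse_mapping_table(text):
    """Parse unicode.org-style mapping lines into {char_code: code_point}."""
    table = {}
    for line in text.splitlines():
        # '#' counts as a comment delimiter until end of line.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            # No Unicode value listed for this code: skip it
            # (an assumption of this sketch).
            continue
        # int(..., 16) accepts the leading "0x" prefix.
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


# Hypothetical excerpt of an 8-bit table, for illustration only.
SAMPLE = """\
# made-up sample table
0x41\t0x0041\t# LATIN CAPITAL LETTER A
0xA4    0x20AC  # EURO SIGN
0x81            # undefined in this made-up table
"""

table = parse_mapping_table(SAMPLE)
```

   Here the resulting dictionary maps 0x41 to 0x0041 and 0xA4 to 0x20AC, and
   has no entry for 0x81.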

Other character sets not implemented:
"MNEMONIC" = "csMnemonic"
"MNEM" = "csMnem"
"ISO-10646-UCS-Basic" = "csUnicodeASCII"
"ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
"ISO-10646-J-1"
"UNICODE-1-1" = "csUnicode11"
"csWindows31Latin5"

Other aliases not implemented (and not implemented in glibc-2.1 either):
  From MSIE4:
    ISO-8859-1: alias ISO8859-1
    ISO-8859-2: alias ISO8859-2
    KSC_5601: alias KS_C_5601
    UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8


Q: How can I integrate libiconv into my package?
A: Just copy the entire libiconv package into a subdirectory of your package.
   At configuration time, call libiconv's configure script with the
   appropriate --srcdir option and maybe --enable-static or --disable-shared.
   Then "cd libiconv && make && make install-lib libdir=... includedir=...".
   'install-lib' is a special (not GNU standardized) target which installs
   only the include file - in $(includedir) - and the library - in $(libdir) -
   and does not use other directory variables. After "installing" libiconv
   in your package's build directory, building of your package can proceed.

Q: Why is the testsuite so big?
A: Because some of the tests are very comprehensive.
   If you don't feel like using the testsuite, you can simply remove the
   tests/ directory.
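
   To give a feel for what a comprehensive test of a stateless encoding
   involves (see step 6 of "How do I add a new character set?", which asks
   for the complete table as a TXT file), here is a rough Python sketch that
   lists every byte of an 8-bit encoding with its Unicode code point. It uses
   Python's own codecs rather than libiconv. The function name full_table is
   hypothetical, and the "0xXX<TAB>0xXXXX" line layout is assumed from the
   mapping-table format described in step 2, not taken from the files in
   tests/.

```python
def full_table(codec):
    """Return one '0xXX<TAB>0xXXXX' line per byte the codec can decode."""
    lines = []
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            # Byte is not assigned in this encoding: leave it out.
            continue
        lines.append("0x%02X\t0x%04X" % (byte, ord(ch)))
    return lines


# Python's codec name for ISO-8859-2; every byte value is assigned there.
lines = full_table("iso8859_2")
```

   Writing "\n".join(lines) to a file gives a table of the kind step 6
   describes; comparing such a table against a converter's actual output for
   every byte is what makes this sort of test exhaustive.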