Annotation of embedaddon/libiconv/NOTES, revision 1.1.1.2
1: Q: Why does libiconv support encoding XXX? Why does libiconv not support
2: encoding ZZZ?
3:
4: A: libiconv, as an internationalization library, supports those character
5: sets and encodings which are in wide-spread use in at least one territory
6: of the world.
7:
8: Hint1: On http://www.w3c.org/International/O-charset-lang.html you find a
9: page "Languages, countries, and the charsets typically used for them".
10: From this table, we can conclude that the following are in active use:
11:
12: ISO-8859-1, CP1252   Afrikaans, Albanian, Basque, Catalan, Danish, Dutch,
13:                      English, Faroese, Finnish, French, Galician, German,
14:                      Icelandic, Irish, Italian, Norwegian, Portuguese,
15:                      Scottish, Spanish, Swedish
16: ISO-8859-2           Croatian, Czech, Hungarian, Polish, Romanian, Slovak,
17:                      Slovenian
18: ISO-8859-3           Esperanto, Maltese
19: ISO-8859-5           Bulgarian, Byelorussian, Macedonian, Russian,
20:                      Serbian, Ukrainian
21: ISO-8859-6           Arabic
22: ISO-8859-7           Greek
23: ISO-8859-8           Hebrew
24: ISO-8859-9, CP1254   Turkish
25: ISO-8859-10          Inuit, Lapp
26: ISO-8859-13          Latvian, Lithuanian
27: ISO-8859-15          Estonian
28: KOI8-R               Russian
29: SHIFT_JIS            Japanese
30: ISO-2022-JP          Japanese
31: EUC-JP               Japanese
32:
33: Ordered by frequency on the web (1997):
34: ISO-8859-1, CP1252   96%
35: SHIFT_JIS            1.6%
36: ISO-2022-JP          1.2%
37: EUC-JP               0.4%
38: CP1250               0.3%
39: CP1251               0.2%
40: CP850                0.1%
41: MACINTOSH            0.1%
42: ISO-8859-5           0.1%
43: ISO-8859-2           0.0%
44:
45: Hint2: The character sets mentioned in the XFree86 4.0 locale.alias file.
46:
47: ISO-8859-1                 Afrikaans, Basque, Breton, Catalan, Danish, Dutch,
48:                            English, Estonian, Faroese, Finnish, French,
49:                            Galician, German, Greenlandic, Icelandic,
50:                            Indonesian, Irish, Italian, Lithuanian, Norwegian,
51:                            Occitan, Portuguese, Scottish, Spanish, Swedish,
52:                            Walloon, Welsh
53: ISO-8859-2                 Albanian, Croatian, Czech, Hungarian, Polish,
54:                            Romanian, Serbian, Slovak, Slovenian
55: ISO-8859-3                 Esperanto
56: ISO-8859-4                 Estonian, Latvian, Lithuanian
57: ISO-8859-5                 Bulgarian, Byelorussian, Macedonian, Russian,
58:                            Serbian, Ukrainian
59: ISO-8859-6                 Arabic
60: ISO-8859-7                 Greek
61: ISO-8859-8                 Hebrew
62: ISO-8859-9                 Turkish
63: ISO-8859-14                Breton, Irish, Scottish, Welsh
64: ISO-8859-15                Basque, Breton, Catalan, Danish, Dutch, Estonian,
65:                            Faroese, Finnish, French, Galician, German,
66:                            Greenlandic, Icelandic, Irish, Italian, Lithuanian,
67:                            Norwegian, Occitan, Portuguese, Scottish, Spanish,
68:                            Swedish, Walloon, Welsh
69: KOI8-R                     Russian
70: KOI8-U                     Russian, Ukrainian
71: EUC-JP (alias eucJP)       Japanese
72: ISO-2022-JP (alias JIS7)   Japanese
73: SHIFT_JIS (alias SJIS)     Japanese
74: U90                        Japanese
75: S90                        Japanese
76: EUC-CN (alias eucCN)       Chinese
77: EUC-TW (alias eucTW)       Chinese
78: BIG5                       Chinese
79: EUC-KR (alias eucKR)       Korean
80: ARMSCII-8                  Armenian
81: GEORGIAN-ACADEMY           Georgian
82: GEORGIAN-PS                Georgian
83: TIS-620 (alias TACTIS)     Thai
84: MULELAO-1                  Laothian
85: IBM-CP1133                 Laothian
86: VISCII                     Vietnamese
87: TCVN                       Vietnamese
88: NUNACOM-8                  Inuktitut
89:
90: Hint3: The character sets supported by Netscape Communicator 4.
91:
92: Where is this documented? For the complete picture, I had to use
93: "strings netscape" and then a lot of guesswork. For a quick take,
94: look at the "View - Character set" menu of Netscape Communicator 4.6:
95:
96: ISO-8859-{1,2,5,7,9,15}
97: WINDOWS-{1250,1251,1253}
98: KOI8-R        Cyrillic
99: CP866         Cyrillic
100: Autodetect   Japanese (EUC-JP, ISO-2022-JP, ISO-2022-JP-2, SJIS)
101: EUC-JP       Japanese
102: SHIFT_JIS    Japanese
103: GB2312       Chinese
104: BIG5         Chinese
105: EUC-TW       Chinese
106: Autodetect   Korean (EUC-KR, ISO-2022-KR, but not JOHAB)
107:
108: UTF-8
109: UTF-7
110:
111: Hint4: The character sets supported by Microsoft Internet Explorer 4.
112:
113: ISO-8859-{1,2,3,4,5,6,7,8,9}
114: WINDOWS-{1250,1251,1252,1253,1254,1255,1256,1257}
115: KOI8-R         Cyrillic
116: KOI8-RU        Ukrainian
117: ASMO-708       Arabic
118: EUC-JP         Japanese
119: ISO-2022-JP    Japanese
120: SHIFT_JIS      Japanese
121: GB2312         Chinese
122: HZ-GB-2312     Chinese
123: BIG5           Chinese
124: EUC-KR         Korean
125: ISO-2022-KR    Korean
126: WINDOWS-874    Thai
127: WINDOWS-1258   Vietnamese
128:
129: UTF-8
130: UTF-7
131: UNICODE        actually UNICODE-LITTLE
132: UNICODEFEFF    actually UNICODE-BIG
133:
134: and various DOS character sets: DOS-720, DOS-862, IBM852, CP866.
135:
136: We take the union of all these four sets. The result is:
137:
138: European and Semitic languages
139: * ASCII.
140: We implement this because it is occasionally useful to know or to
141: check whether some text is entirely ASCII (i.e. if the conversion
142: ISO-8859-x -> UTF-8 is trivial).
143: * ISO-8859-{1,2,3,4,5,6,7,8,9,10}
144: We implement these because they are widely used, except ISO-8859-4,
145: which appears to have been superseded by ISO-8859-13 in the Baltic
146: countries. But it's an ISO standard anyway.
147: * ISO-8859-13
148: We implement this because it's a standard in Lithuania and Latvia.
149: * ISO-8859-14
150: We implement this because it's an ISO standard.
151: * ISO-8859-15
152: We implement this because it's increasingly used in Europe, because
153: of the Euro symbol.
154: * ISO-8859-16
155: We implement this because it's an ISO standard.
156: * KOI8-R, KOI8-U
157: We implement this because it appears to be the predominant encoding
158: on Unix in Russia and Ukraine, respectively.
159: * KOI8-RU
160: We implement this because MSIE4 supports it.
161: * KOI8-T
162: We implement this because it is the locale encoding in glibc's Tajik
163: locale.
164: * PT154
165: We implement this because it is the locale encoding in glibc's Kazakh
166: locale.
167: * RK1048
168: We implement this because it's a standard in Kazakhstan.
169: * CP{1250,1251,1252,1253,1254,1255,1256,1257}
170: We implement these because they are the predominant Windows encodings
171: in Europe.
172: * CP850
173: We implement this because it is mentioned as occurring in the web
174: in the aforementioned statistics.
175: * CP862
176: We implement this because Ron Aaron says it is sometimes used in web
177: pages and emails.
178: * CP866
179: We implement this because Netscape Communicator does.
180: * CP1131
181: We implement this because it is the locale encoding of a Belarusian
182: locale in FreeBSD and MacOS X.
183: * Mac{Roman,CentralEurope,Croatian,Romania,Cyrillic,Greek,Turkish} and
184: Mac{Hebrew,Arabic}
185: We implement these because the Sun JDK does, and because Mac users
186: don't deserve to be punished.
187: * Macintosh
188: We implement this because it is mentioned as occurring in the web
189: in the aforementioned statistics.
190: Japanese
191: * EUC-JP, SHIFT_JIS, ISO-2022-JP
192: We implement these because they are widely used. EUC-JP and SHIFT_JIS
193: are more used for files, whereas ISO-2022-JP is recommended for email.
194: * CP932
195: We implement this because it is the Microsoft variant of SHIFT_JIS,
196: used on Windows.
197: * ISO-2022-JP-2
198: We implement this because it's the common way to represent mails which
199: make use of JIS X 0212 characters.
200: * ISO-2022-JP-1
201: We implement this because it's in the RFCs, but I don't think it is
202: really used.
203: * ISO-2022-JP-MS
204: We implement this because Microsoft Outlook Express / Microsoft MimeOLE
205: sends emails in this encoding.
206: * U90, S90
207: We DON'T implement these because I have no information about what they
208: are or who uses them.
209: Simplified Chinese
210: * EUC-CN = GB2312
211: We implement this because it is the widely used representation
212: of simplified Chinese.
213: * GBK
214: We implement this because it appears to be used on Solaris and Windows.
215: * GB18030
216: We implement this because it is an official requirement in the
217: People's Republic of China.
218: * ISO-2022-CN
219: We implement this because it is in the RFCs, but I have no idea
220: whether it is really used.
221: * ISO-2022-CN-EXT
222: We implement this because it's in the RFCs, but I don't think it is
223: really used.
224: * HZ = HZ-GB-2312
225: We implement this because the RFCs recommend it for Usenet postings,
226: and because MSIE4 supports it.
227: Traditional Chinese
228: * EUC-TW
229: We implement it because it appears to be used on Unix.
230: * BIG5
231: We implement it because it is the de-facto standard for traditional
232: Chinese.
233: * CP950
234: We implement this because it is the Microsoft variant of BIG5, used
235: on Windows.
236: * BIG5+
237: We DON'T implement this because it doesn't appear to be in wide use.
238: Only the CWEX fonts use this encoding. Furthermore, the conversion
239: tables in the big5p package are not coherent: If you convert directly,
240: you get different results than when you convert via GBK.
241: * BIG5-HKSCS
242: We implement it because it is the de-facto standard for traditional
243: Chinese in Hong Kong.
244: Korean
245: * EUC-KR
246: We implement this because it appears to be the widely used
247: representation for Korean.
248: * CP949
249: We implement this because it is the Microsoft variant of EUC-KR, used
250: on Windows.
251: * ISO-2022-KR
252: We implement it because it is in the RFCs and because MSIE4 supports
253: it, but I have no idea whether it's really used.
254: * JOHAB
255: We implement this because it is apparently used on Windows as a locale
256: encoding (codepage 1361).
257: * ISO-646-KR
258: We DON'T implement this because, although it is an old ASCII variant,
259: its glyph for 0x7E is not clear: RFC 1345 and unicode.org's JOHAB.TXT
260: say it's a tilde, but Ken Lunde's "CJKV information processing" says
261: it's an overline. And it is not ISO-IR registered.
262: Armenian
263: * ARMSCII-8
264: We implement it because XFree86 supports it.
265: Georgian
266: * Georgian-Academy, Georgian-PS
267: We implement these because both appear to be used for Georgian;
268: XFree86 supports them.
269: Thai
270: * ISO-8859-11, TIS-620
271: We implement these because they seem to be the standard for Thai.
272: * CP874
273: We implement this because MSIE4 supports it.
274: * MacThai
275: We implement this because the Sun JDK does, and because Mac users
276: don't deserve to be punished.
277: Laotian
278: * MuleLao-1, CP1133
279: We implement these because XFree86 supports them. I have no idea which
280: one is used more widely.
281: Vietnamese
282: * VISCII, TCVN
283: We implement these because XFree86 supports them.
284: * CP1258
285: We implement this because MSIE4 supports it.
286: Other languages
287: * NUNACOM-8 (Inuktitut)
288: We DON'T implement this because it isn't part of Unicode yet, and
289: therefore doesn't convert to anything except itself.
290: Platform specifics
291: * HP-ROMAN8, NEXTSTEP
292: We implement these because they were the native character set on HPs
293: and NeXTs for a long time, and libiconv is intended to be usable on
294: these old machines.
295: Full Unicode
296: * UTF-8, UCS-2, UCS-4
297: We implement these. Obviously.
298: * UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE
299: We implement these because they are the preferred internal
300: representation of strings in Unicode aware applications. These are
301: non-ambiguous names, known to glibc. (glibc doesn't have
302: UCS-2-INTERNAL and UCS-4-INTERNAL.)
303: * UTF-16, UTF-16BE, UTF-16LE
304: We implement these, because UTF-16 is still the favourite encoding of
305: the president of the Unicode Consortium (for political reasons), and
306: because they appear in RFC 2781.
307: * UTF-32, UTF-32BE, UTF-32LE
308: We implement these because they are part of Unicode 3.1.
309: * UTF-7
310: We implement this because it is essential functionality for mail
311: applications.
312: * C99
313: We implement it because it's used for C and C++ programs and because
314: it's a nice encoding for debugging.
315: * JAVA
316: We implement it because it's used for Java programs and because it's
317: a nice encoding for debugging.
318: * UNICODE (little endian), UNICODEFEFF (big endian)
319: We DON'T implement these because they are stupid and not standardized.
320: Full Unicode, in terms of 'uint16_t' or 'uint32_t'
321: (with machine dependent endianness and alignment)
322: * UCS-2-INTERNAL, UCS-4-INTERNAL
323: We implement these because they are the preferred internal
324: representation of strings in Unicode aware applications.
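
The difference between the explicit-endianness names, the generic name, and the
"-INTERNAL" variants in the list above can be illustrated with a small sketch.
It uses Python's built-in codecs as a stand-in, not libiconv itself, so the
codec names ("utf-16-be", "utf-16-le", "utf-16") are Python's, and the
"native" value is only an approximation of what UCS-2-INTERNAL means for
characters inside the Basic Multilingual Plane.

```python
import sys

text = "A\u20ac"  # 'A' and the Euro sign, both inside the BMP

# Explicit byte order: two bytes per character, no byte order mark.
big_endian = text.encode("utf-16-be")
little_endian = text.encode("utf-16-le")

# Generic name: Python prepends a byte order mark and uses the
# byte order of the machine it runs on.
generic = text.encode("utf-16")

# Rough stand-in for an "internal" form: native byte order, no mark.
native = little_endian if sys.byteorder == "little" else big_endian

print(big_endian.hex())     # 004120ac
print(little_endian.hex())  # 4100ac20
print(generic[:2].hex())    # the byte order mark, machine dependent
```

The explicit names always yield the same bytes on every machine, which is why
they are unambiguous; the generic and internal forms depend on the host.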
325:
326: Q: Support encodings mentioned in RFC 1345 ?
327: A: No, they are not in use any more. Supporting ISO-646 variants is pointless
328: since ISO-8859-* have been adopted.
329:
330: Q: Support EBCDIC ?
331: A: No!
332:
333: Q: How do I add a new character set?
334: A: 1. Explain the "why" in this file, above.
335: 2. You need to have a conversion table from/to Unicode. Transform it into
336: the format used by the mapping tables found on ftp.unicode.org: each line
337: contains the character code, in hex, with 0x prefix, then whitespace,
338: then the Unicode code point, in hex, 4 hex digits, with 0x prefix. '#'
339: counts as a comment delimiter until end of line.
340: Please also send your table to Mark Leisher <mleisher@crl.nmsu.edu> so he
341: can include it in his collection.
342: 3. If it's an 8-bit character set, use the '8bit_tab_to_h' program in the
343: tools directory to generate the C code for the conversion. You may tweak
344: the resulting C code if you are not satisfied with its quality, but this
345: is rarely needed.
346: If it's a two-dimensional character set (with rows and columns), use the
347: 'cjk_tab_to_h' program in the tools directory to generate the C code for
348: the conversion. You will need to modify the main() function to recognize
349: the new character set name, with the proper dimensions, but that shouldn't
350: be too hard. This yields the CCS. The CES you have to write by hand.
351: 4. Store the resulting C code file in the lib directory. Add a #include
352: directive to converters.h, and add an entry to the encodings.def file.
353: 5. Compile the package, and test your new encoding using a program like
354: iconv(1) or clisp(1).
355: 6. Augment the testsuite: Add a line to tests/Makefile.in. For a stateless
356: encoding, create the complete table as a TXT file. For a stateful encoding,
357: provide a text snippet encoded using your new encoding and its UTF-8
358: equivalent.
359: 7. Update the README and man/iconv_open.3, to mention the new encoding.
360: Add a note in the NEWS file.
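
The table format described in step 2, and the complete table asked for in
step 6 for a stateless encoding, can be sketched as follows. This is only an
illustration: the function names are hypothetical, the tools in the tools
directory are not involved, and the table is generated from Python's own
"iso8859_2" codec rather than from a libiconv converter.

```python
def parse_mapping_table(text):
    """Parse '0xXX <whitespace> 0xYYYY' lines into {char_code: code_point}."""
    table = {}
    for raw_line in text.splitlines():
        # '#' starts a comment that runs to the end of the line.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # a code with no mapping: treated as undefined here
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def dump_8bit_table(codec_name):
    """Emit a complete table for a stateless 8-bit codec known to Python."""
    lines = []
    for byte in range(256):
        try:
            char = bytes([byte]).decode(codec_name)
        except UnicodeDecodeError:
            continue  # this byte is unassigned in the character set
        lines.append("0x%02X\t0x%04X" % (byte, ord(char)))
    return "\n".join(lines) + "\n"


sample = (
    "# a comment line\n"
    "0x41\t0x0041\n"
    "0xA1\t0x0104\t# LATIN CAPITAL LETTER A WITH OGONEK\n"
)
print(parse_mapping_table(sample))
```

Parsing the output of dump_8bit_table again is a cheap consistency check
before feeding a hand-made table to the generator programs.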
361:
362: Q: What about bidirectional text? Should it be tagged or reversed when
363: converting from ISO-8859-8 or ISO-8859-6 to Unicode? Qt appears to do
364: this, see qt-2.0.1/src/tools/qrtlcodec.cpp.
365: A: After reading RFC 1556: I don't think so. Support for ISO-8859-8-I and
366: ISO-8859-E remains to be implemented.
367: On the other hand, a page on www.w3c.org says that ISO-8859-8 in *email*
368: is visually encoded, ISO-8859-8 in *HTML* is logically encoded, i.e.
369: the same as ISO-8859-8-I. I'm confused.
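
To make the question concrete, here is a small sketch of what "not reversed"
means. It uses Python's "iso8859_8" codec as a stand-in for the converter, and
the helper naive_visual_to_logical is purely hypothetical: it handles only a
run consisting entirely of right-to-left letters and is not a bidirectional
algorithm.

```python
raw = b"\xe0\xe1\xe2"  # alef, bet, gimel as ISO-8859-8 bytes

# A plain charset conversion maps each byte to one code point and keeps
# the stored order; it neither reverses nor tags the text.
converted = raw.decode("iso8859_8")


def naive_visual_to_logical(run):
    """Hypothetical helper: reverse a run made up only of RTL letters."""
    return run[::-1]


print([hex(ord(c)) for c in converted])
print([hex(ord(c)) for c in naive_visual_to_logical(converted)])
```

Whether the second form is ever the right one depends on whether the input
was stored in visual or logical order, which the charset name alone does not
settle.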
370:
371: Other character sets not implemented:
372: "MNEMONIC" = "csMnemonic"
373: "MNEM" = "csMnem"
374: "ISO-10646-UCS-Basic" = "csUnicodeASCII"
375: "ISO-10646-Unicode-Latin1" = "csUnicodeLatin1" = "ISO-10646"
376: "ISO-10646-J-1"
377: "UNICODE-1-1" = "csUnicode11"
378: "csWindows31Latin5"
379:
380: Other aliases not implemented (and not implemented in glibc-2.1 either):
381: From MSIE4:
382: ISO-8859-1: alias ISO8859-1
383: ISO-8859-2: alias ISO8859-2
384: KSC_5601: alias KS_C_5601
385: UTF-8: aliases UNICODE-1-1-UTF-8 UNICODE-2-0-UTF-8
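
A caller that receives such names can map them to the listed canonical names
before asking for a conversion. The following is a minimal sketch under that
assumption; the table contains only the aliases listed above, and the names
MSIE4_ALIASES and canonical_charset_name are hypothetical.

```python
# Aliases from the list above, mapped to the names they stand for.
MSIE4_ALIASES = {
    "ISO8859-1": "ISO-8859-1",
    "ISO8859-2": "ISO-8859-2",
    "KS_C_5601": "KSC_5601",
    "UNICODE-1-1-UTF-8": "UTF-8",
    "UNICODE-2-0-UTF-8": "UTF-8",
}


def canonical_charset_name(name):
    """Return the canonical name for a known alias, else the name uppercased."""
    key = name.strip().upper()
    return MSIE4_ALIASES.get(key, key)


print(canonical_charset_name("iso8859-1"))
print(canonical_charset_name("unicode-2-0-utf-8"))
```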
386:
387:
388: Q: How can I integrate libiconv into my package?
389: A: Just copy the entire libiconv package into a subdirectory of your package.
390: At configuration time, call libiconv's configure script with the
391: appropriate --srcdir option and maybe --enable-static or --disable-shared.
392: Then "cd libiconv && make && make install-lib libdir=... includedir=...".
393: 'install-lib' is a special (not GNU standardized) target which installs
394: only the include file - in $(includedir) - and the library - in $(libdir) -
395: and does not use other directory variables. After "installing" libiconv
396: in your package's build directory, building of your package can proceed.
397:
398: Q: Why is the testsuite so big?
399: A: Because some of the tests are very comprehensive.
400: If you don't feel like using the testsuite, you can simply remove the
401: tests/ directory.
402: