==========================================
README for I18N Package
==========================================

o Name and location of package

Name:     php-3.0.18-i18n-ja-2
Location: http://www.happysize.co.jp/techie/php-ja-jp/
          ftp://ftp.happysize.co.jp/php-ja-jp/
          http://php.vdomains.org/
          ftp://ftp.vdomains.org/pub/php-ja-jp/
          http://php.jpnnet.com/

Currently, this I18N version of PHP only adds Japanese support to base
PHP. It allows you to use Japanese in scripts and to convert between
the various Japanese encodings. Plain ASCII continues to work with the
i18n option enabled. (Note: the executable is a bit larger because of
the Unicode table.) The basic design approach is to allow other
languages to be added in the future. Developers are encouraged to join
us!

For more information on Japanese encodings, please refer to the
section "Additional Notes."


o What is this package?

This package allows you to handle multiple Japanese encodings (SJIS, EUC,
UTF-8, JIS) in PHP. If you find any bugs in this package, please report
them to the appropriate mailing list. For now, the PHP-jp mailing list
is the best place for this.

PHP-jp ML  mailto:PHP-jp@sidecar.ics.es.osaka-u.ac.jp
           http://sidecar.ics.es.osaka-u.ac.jp/php-jp/
           (discussions are in Japanese)


o Who should use this

Because documentation is lacking, this package is not intended for
beginners. If something goes wrong, be prepared to fix it on your own.


o Warranty and Copyright

There is no warranty with this package. Use it at your own risk.

Please refer to the source code for the copyrights. In general, each
program's copyright is owned by its programmer. You may not use the
package in any form unless you comply with the copyright holders'
restrictions.


o Redistribution

As described in the source code, this package and its components may
be redistributed with certain restrictions.

Because this package is still in beta, please redistribute it as an
entire package rather than as a patch. We would prefer to have it
distributed as one single package (not a patch of a patch of a patch),
so please avoid releasing patches against it.


o Who made this

A team of volunteers, PHP3 Internationalization, has been contributing
their free time to produce it. Although we are not affiliated with the
core PHP programmers, we hope to have our modifications merged into the
core distribution in the near future. That is why we did not call this
a "Japanese Patch" (or distribution). Our final goal is a truly
internationalized PHP!

If you are interested in this project, please drop us a line.

Contact Address:
    phpj-dev@kage.net
    (Discussions are in Japanese, but feel free to write to us in English)

Webpage (English and Japanese):
    http://php.jpnnet.com/

Project Outline (Japanese):
    http://www.happysize.co.jp/techie/php-ja-jp/spec.htm

Developers:
    Hironori Sato <satoh@jpnnet.com>
    Shigeru Kanemoto <sgk@happysize.co.jp>
    Tsukada Takuya <tsukada@fminn.nagano.nagano.jp>
    U. Kenkichi <kenkichi@axes.co.jp>
    Tateyama <tateyan@amy.hi-ho.ne.jp>
    Other gracious contributors


o Future plans

- fulfill what is written in the outline
- support languages other than Japanese
- make the character conversion available as a library (?)
- more testing


o Special Thanks to

PHP Japanese webpage maintainer, Hirokawa-san
    http://www.cityfujisawa.ne.jp/%7Elouis/apps/phpfi/
PHP-JP ML's Yamamoto-san
    http://sidecar.ics.es.osaka-u.ac.jp/php-jp/
Previous jp-patch developers



==========================================
Advantages of using I18N package
==========================================

- allows you to use various character encodings for script files and
  http output
- distinguishes the character encoding of POST/GET/COOKIE data
- proper mail output, using JIS for the body and MIME/Base64/JIS for
  the subject
- sets the proper charset when the http output's Content-Type is
  text/html
- stable character encoding conversion
- multibyte regex



==========================================
Installation
==========================================

o Summary

Add the --enable-i18n option when running configure. Add any other
options appropriate for your own setup as well.

Don't forget to copy php3.ini-dist to the desired location
(e.g. /usr/local/lib/php3.ini).

If you have already installed PHP3, copy all the entries in php3.ini-dist
whose names start with "i18n." to your php3.ini.


o configure option

--enable-i18n
    include i18n features

--enable-mbregex
    include multibyte regex library
    (the mbregex functions do not work unless i18n is also enabled)


o creating cgi version

    % tar xvzf php-3.0.18-i18n-ja-2.tar.gz
    % cd php-3.0.18-i18n-ja-2
    % ./configure --enable-i18n --enable-mbregex
    % make


o creating Apache version (regular module)

    % tar xvzf php-3.0.18-i18n-ja-2.tar.gz
    % tar xvzf apache_1.3.x.tar.gz
    % cd apache_1.3.x
    % ./configure
    % cd ../php-3.0.18-i18n-ja-2
    % ./configure --with-apache=../apache_1.3.x --enable-i18n --enable-mbregex
    % make
    % make install
    % cd ../apache_1.3.x
    % ./configure --activate-module=src/modules/php3/libphp3.a
    % make
    % make install


o creating Apache DSO version

create a DSO-capable Apache first

    % tar xvzf apache_1.3.x.tar.gz
    % cd apache_1.3.x
    % ./configure --enable-shared=max
    % make
    % make install

now create php3

    % cd ../php-3.0.18-i18n-ja-2
    % ./configure --with-apxs=/usr/local/apache/bin/apxs --enable-i18n \
                  --enable-mbregex
    % make
    % make install


==========================================
Additional Notes
==========================================

o Multibyte regex library

Starting with beta4, we include the multibyte (mb) regex library that
comes with Ruby. With this addition, you can use regex with the EUC,
SJIS and UTF-8 encodings. To avoid conflicts with the HSREGEX library
included with Apache, every function name has been changed. The mb
regex functions therefore have different names from the original ereg
functions in PHP. The character encoding used by mb regex is the one
configured in i18n.internal_encoding.


o Binary Output

If the http output encoding is set to anything other than 'pass', output
is converted automatically from the internal encoding to the http output
encoding. Raw binary data sent through this conversion may therefore be
corrupted. To output binary data, set http_output to 'pass' first.

ex.
    <?
    i18n_http_output("pass");
    ...
    echo $the_binary_data_string;
    ?>


o Content-Type

Depending on the setting of http_output, PHP outputs the proper charset.
ex. Content-Type: text/html; charset="..."

Be aware of the following:

- If you set the Content-Type header with the header() function, that
  overrides the automatic addition of the charset.
- Be careful about where you call i18n_http_output(): if any output has
  been made before the call, the header may already have been sent to
  the client.


o In the event of trouble

If you find any bugs or run into trouble, please contact us at the above
address. Sending us the script as well may help us track down the problem.

If you encounter a memory-related error such as a segmentation violation,
add --enable-debug when you run configure. This gives you more detailed
information on where the error occurred. The error is written to the
server log, or to the regular http output in CGI mode.


o About Japanese encodings

For historical reasons, several character encodings are in use for
Japanese. The most common are SJIS, EUC, JIS, and UTF-8. Here are (very)
brief descriptions of them:

EUC
    commonly used in UNIX environments
    8bit-8bit combo
    always >=0x80

SJIS
    commonly used on Macs and PCs
    similar to EUC
    mostly 8bit-8bit (some 8bit-7bit)
    mostly >=0x80
    includes some halfwidth (ASCII-sized) kana characters

JIS
    commonly used in 7bit environments (nntp and smtp)
    multibyte text starts with the escape character, \033, followed by
    a few more characters

UTF-8
    variable-length encoding of Unicode, one or more 8bit bytes per
    character (ASCII stays single-byte, all other bytes are >=0x80)
    Unicode covers most of the languages in use in the world
    see http://www.unicode.org/ for more detail

Because all of these encodings are in use, PHP needs to translate
between them on the fly. In addition, the mb regex library lets you
handle mb strings without fear of an mb character getting chopped in
half.

Since Japanese is not the only language with multiple encodings, we
encourage other developers to adapt our code to their needs. We
definitely need people to work on Korean, Chinese (both traditional
and simplified), and Russian. Let us know if you are interested in
this project!
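The byte-level properties listed for each encoding can be checked
directly. The following is a minimal sketch in Python rather than PHP
(the i18n_* functions exist only in this patched PHP3 build). It assumes
that Python's euc_jp, shift_jis, iso2022_jp and utf-8 codecs correspond
to the EUC, SJIS, JIS and UTF-8 encodings described here; encode_all is
a name made up for this illustration.

```python
# Encode the same text in the four encodings and inspect the raw bytes.
TEXT = "日本語"

# README name -> Python codec name (assumed equivalents).
CODECS = {
    "EUC": "euc_jp",
    "SJIS": "shift_jis",
    "JIS": "iso2022_jp",
    "UTF-8": "utf-8",
}


def encode_all(text):
    """Return {README encoding name: encoded bytes} for the given text."""
    return {name: text.encode(codec) for name, codec in CODECS.items()}


encoded = encode_all(TEXT)
for name, data in encoded.items():
    # Print each byte as two hex digits so the ranges are easy to compare.
    print(name.ljust(5), " ".join(f"{b:02x}" for b in data))
```

For this sample, every EUC byte is >=0x80, the SJIS form contains one
trail byte below 0x80, and the JIS form is pure 7bit text opened by an
escape sequence.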



==========================================
php3.ini setting
==========================================

The following ini options allow you to change the default settings.
Define them in the global section of php3.ini.

All keywords are case-insensitive.

o Encoding naming

Each encoding has three names: standardized, alias, and MIME.

- UTF-8
    standard: UTF-8
    alias:    N/A
    mime:     UTF-8

- ASCII
    standard: ASCII
    alias:    N/A
    mime:     US-ASCII

- Japanese EUC
    standard: EUC-JP
    alias:    EUC, EUC_JP, eucJP, x-euc-jp
    mime:     EUC-JP

- Shift JIS
    standard: SJIS
    alias:    x-sjis, MS_Kanji
    mime:     Shift_JIS

- JIS
    standard: JIS
    alias:    N/A
    mime:     ISO-2022-JP

- Quoted-Printable
    standard: Quoted-Printable
    alias:    qprint
    mime:     N/A

- BASE64
    standard: BASE64
    alias:    N/A
    mime:     N/A

- no conversion
    standard: pass
    alias:    none
    mime:     N/A

- auto encoding detection
    standard: auto
    alias:    unknown
    mime:     N/A

* N/A - Not Applicable
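The naming table can be read as a case-insensitive lookup from any of
the three names to the standard name. The sketch below (in Python, for
illustration only) builds such a lookup from the entries listed above.
The helper names build_lookup and resolve are hypothetical, and the
package's own lookup may well be implemented differently.

```python
# Names copied from the table above: standard name -> (aliases, MIME name).
ENCODINGS = {
    "UTF-8": ([], "UTF-8"),
    "ASCII": ([], "US-ASCII"),
    "EUC-JP": (["EUC", "EUC_JP", "eucJP", "x-euc-jp"], "EUC-JP"),
    "SJIS": (["x-sjis", "MS_Kanji"], "Shift_JIS"),
    "JIS": ([], "ISO-2022-JP"),
    "Quoted-Printable": (["qprint"], None),
    "BASE64": ([], None),
    "pass": (["none"], None),
    "auto": (["unknown"], None),
}


def build_lookup(table):
    """Map every lower-cased standard, alias and MIME name to its standard name."""
    lookup = {}
    for standard, (aliases, mime) in table.items():
        for name in [standard, *aliases, mime]:
            if name is not None:
                lookup[name.lower()] = standard
    return lookup


LOOKUP = build_lookup(ENCODINGS)


def resolve(name):
    """Return the standard name for any known name, or None if unknown."""
    return LOOKUP.get(name.lower())


print(resolve("shift_jis"))    # MIME name of Shift JIS
print(resolve("EUCJP"))        # alias eucJP, matched case-insensitively
```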

o i18n.http_output - default http output encoding

i18n.http_output = EUC-JP|SJIS|JIS|UTF-8|pass
    EUC-JP: EUC
    SJIS:   SJIS
    JIS:    JIS
    UTF-8:  UTF-8
    pass:   no conversion

The default is pass (output is sent in the internal encoding).
It can be changed at run time with i18n_http_output().


o i18n.internal_encoding - internal encoding

i18n.internal_encoding = EUC-JP|SJIS|UTF-8
    EUC-JP: EUC
    SJIS:   SJIS
    UTF-8:  UTF-8

The default is EUC-JP.

The PHP parser was designed around ISO-8859-1. Any other encoding has
to satisfy the following conditions before it can be used:
- it is a byte-oriented encoding
- single-byte characters lie in the range 00h-7fh and are compatible
  with ASCII
- multibyte characters contain no bytes in the range 00h-7fh
For Japanese, EUC-JP and UTF-8 are the only encodings that meet these
criteria.

If i18n.internal_encoding and i18n.http_output differ, conversion
takes place at the time of output. If a PHP script encodes data as
URL encoding, BASE64 or Quoted-Printable, the data stays in the
encoding defined by i18n.internal_encoding. If you want the encoded
data to match i18n.http_output, you need to convert it manually first.

ex. $str = urlencode( i18n_convert($str, i18n_http_output()) );
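Why the manual conversion matters: URL encoding works on bytes, so the
same text produces a different result in each encoding. A minimal
Python sketch, assuming an EUC-JP internal encoding and SJIS http
output (urlencode_as is a hypothetical helper, standing in for
urlencode combined with i18n_convert):

```python
from urllib.parse import quote


def urlencode_as(text, codec):
    """Percent-encode text after converting it to the given encoding."""
    return quote(text.encode(codec), safe="")


TEXT = "日本"
as_internal = urlencode_as(TEXT, "euc_jp")    # bytes of the internal encoding
as_output = urlencode_as(TEXT, "shift_jis")   # bytes of the http output encoding

print(as_internal)
print(as_output)
```

A receiver that expects SJIS can only decode the second form, which is
why the conversion has to happen before the URL encoding step.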

Encodings that use escape sequences, such as ISO-2022-** and HZ,
cannot be used as the internal encoding. If they are used, the
following problems occur:
- the parser fails with confusing errors
- magic_quotes_*** breaks the encoding (SJIS may have a similar problem)
- string manipulation and regex malfunction
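The condition "multibyte without 00h-7fh" can be tested on sample text.
The Python sketch below is an illustration only: it assumes Python's
codecs match the encodings named here, and passing for one sample does
not prove the property for a whole encoding (a single failure, however,
is enough to rule an encoding out). multibyte_avoids_ascii is a
hypothetical helper name.

```python
def multibyte_avoids_ascii(text, codec):
    """Return True if no multibyte character in text contains a 00h-7Fh byte."""
    for ch in text:
        encoded = ch.encode(codec)
        # Only multibyte characters are restricted; single bytes may be ASCII.
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


SAMPLE = "表示"
results = {
    codec: multibyte_avoids_ascii(SAMPLE, codec)
    for codec in ("euc_jp", "utf-8", "shift_jis", "iso2022_jp")
}
for codec, ok in results.items():
    print(codec, "ok" if ok else "contains 00h-7Fh bytes")
```

The SJIS form of the first sample character ends in 5Ch, the ASCII
backslash, which is the kind of byte that confuses the parser and
magic_quotes.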


o i18n.script_encoding - script encoding

i18n.script_encoding = auto|EUC-JP|SJIS|JIS|UTF-8
    auto:   automatic
    EUC-JP: EUC
    SJIS:   SJIS
    JIS:    JIS
    UTF-8:  UTF-8

The default is auto.
The script is converted from its own encoding to i18n.internal_encoding
before it enters the script parser.

Be aware that auto detection may fail under some conditions.
For best auto detection, put a multibyte character at the beginning
of the script.


o i18n.http_input - handling of http input (GET/POST/COOKIE)

i18n.http_input = pass|auto
    auto: auto conversion
    pass: no conversion

The default is auto.
If set to pass, no conversion takes place.
If set to auto, the encoding is detected automatically. If detection
succeeds, the input is converted to the internal encoding. If not,
the input is assumed to be in the encoding defined by
i18n.http_input_default.

o i18n.http_input_default - default http input encoding

i18n.http_input_default = pass|EUC-JP|SJIS|JIS|UTF-8
    pass:   no conversion
    EUC-JP: EUC
    SJIS:   SJIS
    JIS:    JIS
    UTF-8:  UTF-8

The default is pass.
This option takes effect only when i18n.http_input is set to auto.
If auto detection fails, the http input is assumed to be in this
encoding and is converted to the internal encoding.
If set to pass, no conversion takes place.
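The interaction between i18n.http_input and i18n.http_input_default can
be sketched as follows. This is a Python illustration of the decision
flow only. The detection step (trial decoding, plus treating an escape
byte as JIS) is an assumption made for this example and is not the
package's actual detection algorithm; detect and convert_input are
hypothetical names.

```python
# README name -> Python codec name (assumed equivalents).
TRIAL_CODECS = {"EUC-JP": "euc_jp", "SJIS": "shift_jis", "UTF-8": "utf-8"}
ALL_CODECS = dict(TRIAL_CODECS, JIS="iso2022_jp")


def detect(raw):
    """Return an encoding name, or None if detection is ambiguous or fails."""
    if b"\x1b" in raw:                 # assumption: an escape byte means JIS
        return "JIS"
    hits = []
    for name, codec in TRIAL_CODECS.items():
        try:
            raw.decode(codec)
            hits.append(name)
        except UnicodeDecodeError:
            pass
    # Exactly one candidate counts as a successful detection.
    return hits[0] if len(hits) == 1 else None


def convert_input(raw, http_input="auto", http_input_default="pass",
                  internal="EUC-JP"):
    """Convert raw http input to the internal encoding, following the rules above."""
    if http_input == "pass":
        return raw                     # no conversion at all
    source = detect(raw) or http_input_default
    if source == "pass":
        return raw                     # detection failed and no default is set
    return raw.decode(ALL_CODECS[source]).encode(ALL_CODECS[internal])


# The two bytes of EUC-JP hiragana "あ" are also valid SJIS halfwidth kana,
# so detection is ambiguous and the default decides what happens.
ambiguous = "あ".encode("euc_jp")
print(detect(ambiguous))
print(convert_input(ambiguous, http_input_default="EUC-JP", internal="UTF-8"))
```

Short inputs like this one are where detection tends to fail, so a
sensible http_input_default matters most for small form fields.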

o sample settings

1) For the most flexibility, we recommend the following settings:

    i18n.http_output = SJIS
    i18n.internal_encoding = EUC-JP
    i18n.script_encoding = auto
    i18n.http_input = auto
    i18n.http_input_default = SJIS

2) To avoid unexpected encoding problems, try these:

    i18n.http_output = pass
    i18n.internal_encoding = EUC-JP
    i18n.script_encoding = pass
    i18n.http_input = pass
    i18n.http_input_default = pass



==========================================
PHP functions
==========================================

The following describes the additional PHP functions.

All keywords are case-insensitive.

o i18n_http_output(encoding)
o encoding = i18n_http_output()

Sets the http output encoding. All output that follows the call is
converted to this encoding. If no argument is given, the current http
output encoding is returned.

encodings
    EUC-JP: EUC
    SJIS:   SJIS
    JIS:    JIS
    UTF-8:  UTF-8
    pass:   no conversion

    NONE is not allowed


o encoding = i18n_internal_encoding()

Returns the current internal encoding as a string.

internal encoding
    EUC-JP: EUC
    SJIS:   SJIS
    UTF-8:  UTF-8


o encoding = i18n_http_input()

Returns the http input encoding.

encodings
    EUC-JP: EUC
    SJIS:   SJIS
    JIS:    JIS
    UTF-8:  UTF-8
    pass:   no conversion (only if i18n.http_input is set to pass)


o string = i18n_convert(string, encoding)
o string = i18n_convert(string, encoding, pre-conversion-encoding)

Returns the string converted to the desired encoding. If
pre-conversion-encoding is not given, the string is assumed to be
in the internal encoding.

encoding
    EUC-JP: EUC
    SJIS:   SJIS
    JIS:    JIS
    UTF-8:  UTF-8
    pass:   no conversion

pre-conversion-encoding
    EUC-JP: EUC
    SJIS:   SJIS
    JIS:    JIS
    UTF-8:  UTF-8
    pass:   no conversion
    auto:   auto detection


o encoding = i18n_discover_encoding(string)

Returns the encoding of the given string (as a string).

encoding
    EUC-JP:  EUC
    SJIS:    SJIS
    JIS:     JIS
    UTF-8:   UTF-8
    ASCII:   ASCII (only 09h, 0Ah, 0Dh, 20h-7Eh)
    pass:    unable to determine (text is too short to determine)
    unknown: unknown or possible error


o int = mbstrlen(string)
o int = mbstrlen(string, encoding)

Returns the length of the given string in characters. If no encoding
is given, the string is assumed to be in the internal encoding.

encoding
    EUC-JP: EUC
    SJIS:   SJIS
    JIS:    JIS
    UTF-8:  UTF-8
    auto:   automatic


o int = mbstrpos(string1, string2)
o int = mbstrpos(string1, string2, start)
o int = mbstrpos(string1, string2, start, encoding)

Same as strpos. If no encoding is given, the strings are assumed to
be in the internal encoding.

encoding
    EUC-JP: EUC
    SJIS:   SJIS
    JIS:    JIS
    UTF-8:  UTF-8


o int = mbstrrpos(string1, string2)
o int = mbstrrpos(string1, string2, encoding)

Same as strrpos. If no encoding is given, the strings are assumed to
be in the internal encoding.

encoding
    EUC-JP: EUC
    SJIS:   SJIS
    JIS:    JIS
    UTF-8:  UTF-8


o string = mbsubstr(string, position)
o string = mbsubstr(string, position, length)
o string = mbsubstr(string, position, length, encoding)

Same as substr. If no encoding is given, the string is assumed to be
in the internal encoding.

encoding
    EUC-JP: EUC
    SJIS:   SJIS
    JIS:    JIS
    UTF-8:  UTF-8


o string = mbstrcut(string, position)
o string = mbstrcut(string, position, length)
o string = mbstrcut(string, position, length, encoding)

Works like substr, except that position and length are counted in
bytes and a mb character is never cut in the middle. If position falls
on the 2nd byte of a mb character, the cut starts from the first byte
of that character. In other words, with 2-byte characters a length of
5 returns only two mb characters. If no encoding is given, the string
is assumed to be in the internal encoding.

encoding
    EUC-JP: EUC
    SJIS:   SJIS
    JIS:    JIS
    UTF-8:  UTF-8
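The difference between counting characters (mbstrlen, mbsubstr) and
cutting at byte offsets without splitting a character (mbstrcut) can be
sketched in Python. mb_len and mb_cut are hypothetical helpers written
for this illustration. The handling of length relative to the adjusted
start is an assumption, and the per-character re-encoding only works
for stateless encodings such as EUC-JP, SJIS and UTF-8 (not JIS).

```python
def mb_len(data, codec):
    """Number of characters in encoded bytes (not the number of bytes)."""
    return len(data.decode(codec))


def mb_cut(data, start, length, codec):
    """Cut by byte offsets, but never through the middle of a character."""
    out = bytearray()
    offset = 0          # byte offset of the current character
    begin = None        # byte offset where the cut actually starts
    for ch in data.decode(codec):
        encoded = ch.encode(codec)
        end = offset + len(encoded)
        if begin is None and end > start:
            begin = offset              # snap back to the character's 1st byte
        if begin is not None:
            if end - begin > length:
                break                   # the next character would not fit whole
            out += encoded
        offset = end
    return bytes(out)


euc = "日本語テキスト".encode("euc_jp")
print(len(euc), mb_len(euc, "euc_jp"))              # bytes vs. characters
print(mb_cut(euc, 0, 5, "euc_jp").decode("euc_jp"))  # length 5 -> two characters
print(mb_cut(euc, 3, 4, "euc_jp").decode("euc_jp"))  # start inside a character
```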


o string = i18n_mime_header_encode(string)

MIME-encodes the string in the format =?ISO-2022-JP?B?[string]?=.


o string = i18n_mime_header_decode(string)

MIME-decodes the string.
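The =?ISO-2022-JP?B?...?= format is the text converted to JIS and then
Base64 encoded. A minimal Python sketch of that round trip follows. The
two function names are hypothetical stand-ins, not the PHP functions
themselves, and the sketch does not split long headers into several
encoded words the way a complete mail implementation has to.

```python
import base64
from email.header import decode_header


def mime_header_encode(text):
    """Encode text as a single ISO-2022-JP, Base64 ("B") encoded word."""
    payload = base64.b64encode(text.encode("iso2022_jp")).decode("ascii")
    return "=?ISO-2022-JP?B?" + payload + "?="


def mime_header_decode(header):
    """Decode a header that may contain MIME encoded words."""
    parts = []
    for data, charset in decode_header(header):
        if isinstance(data, bytes):
            parts.append(data.decode(charset or "ascii"))
        else:
            parts.append(data)
    return "".join(parts)


subject = mime_header_encode("日本語")
print(subject)
print(mime_header_decode(subject))
```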


o string = i18n_ja_jp_hantozen(string)
o string = i18n_ja_jp_hantozen(string, option)
o string = i18n_ja_jp_hantozen(string, option, encoding)

Converts between fullwidth and halfwidth characters.

option
    The following options are allowed. The default is "KV".
    Acronym: FW = fullwidth, HW = halfwidth

    "r" : FW alphabet -> HW alphabet
    "R" : HW alphabet -> FW alphabet
    "n" : FW number -> HW number
    "N" : HW number -> FW number
    "a" : FW alpha numeric (21h-7Eh) -> HW alpha numeric
    "A" : HW alpha numeric (21h-7Eh) -> FW alpha numeric
    "k" : FW katakana -> HW katakana
    "K" : HW katakana -> FW katakana
    "h" : FW hiragana -> HW katakana
    "H" : HW katakana -> FW hiragana
    "c" : FW katakana -> FW hiragana
    "C" : FW hiragana -> FW katakana
    "V" : merge a voiced sound mark (dakuon) into the preceding
          character; only works together with the "K" and "H" options

encoding
    If no encoding is given, the string is assumed to be in the
    internal encoding.
    EUC-JP: EUC
    SJIS:   SJIS
    JIS:    JIS
    UTF-8:  UTF-8
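Two of these conversions are easy to illustrate on Unicode text. The
Python sketch below approximates option "a" with a fixed code point
offset and option "KV" with NFKC normalization restricted to halfwidth
katakana. Both are stand-ins chosen for this example; the package's own
conversion tables may differ in details, and the helper names are
hypothetical.

```python
import re
import unicodedata

# "a": FW forms U+FF01-U+FF5E sit at a fixed offset from ASCII 21h-7Eh.
FW_TO_HW_ALNUM = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}

# Runs of halfwidth katakana, including the halfwidth voiced sound marks.
HW_KATAKANA_RUN = re.compile("[\uff61-\uff9f]+")


def fullwidth_alnum_to_halfwidth(text):
    """Approximate option "a": FW alpha numeric -> HW alpha numeric."""
    return text.translate(FW_TO_HW_ALNUM)


def halfwidth_katakana_to_fullwidth(text):
    """Approximate option "KV": HW katakana -> FW katakana, merging dakuon."""
    return HW_KATAKANA_RUN.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )


print(fullwidth_alnum_to_halfwidth("ＰＨＰ３"))
print(halfwidth_katakana_to_fullwidth("ｶﾞｲﾄﾞ PHP"))
```

Restricting the normalization to halfwidth katakana keeps the rest of
the string untouched, which mirrors how each option only affects its
own character class.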


o int = mbereg(regex_pattern, string, string)
o int = mberegi(regex_pattern, string, string)

mb version of ereg() and eregi()


o string = mbereg_replace(regex_pattern, string, string)
o string = mberegi_replace(regex_pattern, string, string)

mb version of ereg_replace() and eregi_replace()


o string_array = mbsplit(regex, string, limit)

mb version of split()



==========================================
FAQ
==========================================

Here we have gathered some questions commonly asked on the PHP-jp
mailing list.

o To use Japanese in GET method

If you need to pass Japanese text as an argument with the GET method,
as in xxxx.php?data=<Japanese text>, encode it with PHP's urlencode
function. Otherwise the text may not be passed to the receiving PHP
script properly.

ex: <a href="hoge.php?data=<? echo urlencode($data) ?>">Link</a>


o When passing data via GET/POST/COOKIE, \ character sneaks in

When SJIS is the internal encoding, or when the passed data contains
', " or \, PHP automatically inserts the escape character \. Set
magic_quotes_gpc in php3.ini from On to Off. An alternative workaround
is to use StripSlashes().

If $quote_str is in SJIS and you would like to recover the Japanese
text, use ereg_replace as follows:

    ereg_replace(sprintf("([%c-%c%c-%c]\\\\)\\\\",0x81,0x9f,0xe0,0xfc),
                 "\\1",$quote_str);

This removes the extra \ that was inserted after each SJIS character
whose second byte is 5Ch, which restores the Japanese text in
$quote_str.
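The effect of that pattern can be reproduced on raw bytes. In the Python
sketch below, add_slashes is a simplified stand-in for magic_quotes_gpc
that only escapes the backslash byte, and fix_sjis applies the same
pattern as the ereg_replace call above. Both names are hypothetical, and
the real quoting in PHP also escapes quote characters.

```python
import re


def add_slashes(data):
    """Byte-wise escaping of the backslash only (stand-in for magic quotes)."""
    return data.replace(b"\\", b"\\\\")


# An SJIS lead byte (81h-9Fh or E0h-FCh) followed by 5Ch, then the extra 5Ch.
SJIS_EXTRA_SLASH = re.compile(rb"([\x81-\x9f\xe0-\xfc]\\)\\")


def fix_sjis(data):
    """Drop the backslash inserted after a character whose 2nd byte is 5Ch."""
    return SJIS_EXTRA_SLASH.sub(rb"\1", data)


original = "表示".encode("shift_jis")   # the first character ends in 5Ch
quoted = add_slashes(original)          # what arrives with magic quotes on
restored = fix_sjis(quoted)

print(original.hex(), quoted.hex(), restored.hex())
```

Because the pattern works on bytes, it is only appropriate for data
known to be SJIS. In EUC-JP or UTF-8 data no multibyte character
contains 5Ch, so the problem does not arise there.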


o Sometimes, encoding detection fails

If i18n_http_input() returns 'pass', PHP most likely failed to detect
whether the input is SJIS or EUC. In that case, add
<input type=hidden value="some Japanese text"> to the form so that the
encoding of the incoming text can be detected properly.



==========================================
Japanese Manual
==========================================
A translated manual by the "PHP Japanese Manual Project" is available at:

http://www.php.net/manual/ja/manual.php

Starting with 3.0.18-i18n-ja, doc-jp is no longer included in the
tarball package.


==========================================
Change Logs
==========================================

o 2000-10-28, Rui Hirokawa <hirokawa@php.net>

This patch is derived from php-3.0.15-i18n-ja and from the php-3.0.16
version by Kuwamura, applied to the original php-3.0.18. It also
includes the following fixes:

1) allows you to set the charset in mail()
2) fixed the mbregex definitions to avoid conflicts with the system regex
3) php3.ini-dist now uses PASS for http_output instead of SJIS

o 2000-11-24, Hironori Sato <satoh@yyplanet.com>

Applied the above patch and added detection of gdImageStringTTF to
configure. The following setups are known to work:

gd-1.3-6, gd-devel-1.3-6, freetype-1.3.1-5, freetype-devel-1.3.1-5
    ImageTTFText($im,$size,$angle,$x1,$y1,$color,"/path/to/font.ttf",
                 i18n_convert("日本語", "UTF-8"));
    ImageGif($im);

gd-1.7.3-1k1, gd-devel-1.7.3-1k1, freetype-1.3.1-5, freetype-devel-1.3.1-5
    ImageTTFText($im,$size,$angle,$x1,$y1,$color,"/path/to/font.ttf","日本語");
    ImagePng($im);
    * i18n_internal_encoding = EUC or SJIS

For gd libraries older than 1.6.2, you need to use i18n_convert. If you
have gd-1.5.2/3, upgrade to 1.7 or later to use ImageTTFText without
i18n_convert. As long as internal_encoding is set to EUC or SJIS,
ImageTTFText should work without mojibake (garbled characters). Again,
make sure you call i18n_http_output("pass") before calling ImageGif,
ImagePng or ImageJpeg!

o 2000-12-09, Rui Hirokawa <hirokawa@php.net>

Fixed mail(), which caused a segmentation fault when the header was null.