|
version 1.1, 2012/02/21 23:05:52
|
version 1.1.1.2, 2012/02/21 23:50:25
|
|
Line 1
|
Line 1
|
| .TH PCREUNICODE 3 |
.TH PCREUNICODE 3 |
| .SH NAME |
.SH NAME |
| PCRE - Perl-compatible regular expressions |
PCRE - Perl-compatible regular expressions |
| .SH "UTF-8 AND UNICODE PROPERTY SUPPORT" | .SH "UTF-8, UTF-16, AND UNICODE PROPERTY SUPPORT" |
| .rs |
.rs |
| .sp |
.sp |
| In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in | From Release 8.30, in addition to its previous UTF-8 support, PCRE also |
| the code, and, in addition, you must call | supports UTF-16 by means of a separate 16-bit library. This can be built as |
| | well as, or instead of, the 8-bit library. |
| | . |
| | . |
| | .SH "UTF-8 SUPPORT" |
| | .rs |
| | .sp |
| | In order to process UTF-8 strings, you must build PCRE's 8-bit library with UTF |
| | support, and, in addition, you must call |
| .\" HREF |
.\" HREF |
| \fBpcre_compile()\fP |
\fBpcre_compile()\fP |
| .\" |
.\" |
| with the PCRE_UTF8 option flag, or the pattern must start with the sequence |
with the PCRE_UTF8 option flag, or the pattern must start with the sequence |
| (*UTF8). When either of these is the case, both the pattern and any subject |
(*UTF8). When either of these is the case, both the pattern and any subject |
| strings that are matched against it are treated as UTF-8 strings instead of |
strings that are matched against it are treated as UTF-8 strings instead of |
| strings of 1-byte characters. PCRE does not support any other formats (in | strings of 1-byte characters. |
| particular, it does not support UTF-16). | . |
| .P | . |
| If you compile PCRE with UTF-8 support, but do not use it at run time, the | .SH "UTF-16 SUPPORT" |
| | .rs |
| | .sp |
| | In order to process UTF-16 strings, you must build PCRE's 16-bit library with UTF |
| | support, and, in addition, you must call |
| | .\" HTML <a href="pcre_compile.html"> |
| | .\" </a> |
| | \fBpcre16_compile()\fP |
| | .\" |
| | with the PCRE_UTF16 option flag, or the pattern must start with the sequence |
| | (*UTF16). When either of these is the case, both the pattern and any subject |
| | strings that are matched against it are treated as UTF-16 strings instead of |
| | strings of 16-bit characters. |
| | . |
| | . |
| | .SH "UTF SUPPORT OVERHEAD" |
| | .rs |
| | .sp |
| | If you compile PCRE with UTF support, but do not use it at run time, the |
| library will be a bit bigger, but the additional run time overhead is limited |
library will be a bit bigger, but the additional run time overhead is limited |
| to testing the PCRE_UTF8 flag occasionally, so should not be very big. | to testing the PCRE_UTF8/16 flag occasionally, so should not be very big. |
| .P | . |
| If PCRE is built with Unicode character property support (which implies UTF-8 | . |
| support), the escape sequences \ep{..}, \eP{..}, and \eX are supported. | .SH "UNICODE PROPERTY SUPPORT" |
| | .rs |
| | .sp |
| | If PCRE is built with Unicode character property support (which implies UTF |
| | support), the escape sequences \ep{..}, \eP{..}, and \eX can be used. |
| The available properties that can be tested are limited to the general |
The available properties that can be tested are limited to the general |
| category properties such as Lu for an upper case letter or Nd for a decimal |
category properties such as Lu for an upper case letter or Nd for a decimal |
| number, the Unicode script names such as Arabic or Han, and the derived |
number, the Unicode script names such as Arabic or Han, and the derived |
|
Line 38 compatibility with Perl 5.6. PCRE does not support thi
|
Line 68 compatibility with Perl 5.6. PCRE does not support thi
|
| .SS "Validity of UTF-8 strings" |
.SS "Validity of UTF-8 strings" |
| .rs |
.rs |
| .sp |
.sp |
| When you set the PCRE_UTF8 flag, the strings passed as patterns and subjects | When you set the PCRE_UTF8 flag, the byte strings passed as patterns and |
| are (by default) checked for validity on entry to the relevant functions. From | subjects are (by default) checked for validity on entry to the relevant |
| release 7.3 of PCRE, the check is according to the rules of RFC 3629, which are | functions. From release 7.3 of PCRE, the check is according to the rules of RFC |
| themselves derived from the Unicode specification. Earlier releases of PCRE | 3629, which are themselves derived from the Unicode specification. Earlier |
| followed the rules of RFC 2279, which allows the full range of 31-bit values (0 | releases of PCRE followed the rules of RFC 2279, which allows the full range of |
| to 0x7FFFFFFF). The current check allows only values in the range U+0 to | 31-bit values (0 to 0x7FFFFFFF). The current check allows only values in the |
| U+10FFFF, excluding U+D800 to U+DFFF. | range U+0 to U+10FFFF, excluding U+D800 to U+DFFF. |
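The range rule stated above can be sketched in C as follows. This is only an illustration of which code point values the RFC 3629 rules accept; the function name is hypothetical and is not part of the PCRE API, and PCRE's own check works on the encoded bytes rather than on decoded values.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE function: returns true if cp lies in the
   range U+0 to U+10FFFF and is not in the surrogate range U+D800 to U+DFFF. */
bool is_valid_utf8_code_point(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;   /* allowed by RFC 2279, rejected by RFC 3629 */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;   /* surrogate range, reserved for UTF-16 */
    return true;
}
```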
| .P |
.P |
| The excluded code points are the "Low Surrogate Area" of Unicode, of which the | The excluded code points are the "Surrogate Area" of Unicode. They are reserved |
| Unicode Standard says this: "The Low Surrogate Area does not contain any | for use by UTF-16, where they are used in pairs to encode codepoints with |
| character assignments, consequently no character code charts or namelists are | values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs |
| provided for this area. Surrogates are reserved for use with UTF-16 and then | are available independently in the UTF-8 encoding. (In other words, the whole |
| must be used in pairs." The code points that are encoded by UTF-16 pairs are | surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) |
| available as independent code points in the UTF-8 encoding. (In other words, | |
| the whole surrogate thing is a fudge for UTF-16 which unfortunately messes up | |
| UTF-8.) | |
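As an illustration of how UTF-16 uses the surrogate range, the following C sketch splits a code point greater than 0xFFFF into its pair of 16-bit data units. It follows the standard UTF-16 encoding arithmetic; the function name is hypothetical and nothing here is taken from the PCRE sources.

```c
#include <stdint.h>

/* Hypothetical helper, not a PCRE function: encodes cp, which is assumed to
   be in the range 0x10000 to 0x10FFFF, as a UTF-16 surrogate pair. */
void to_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v = cp - 0x10000;                 /* 20-bit offset */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}
```

For example, U+1F600 becomes the pair 0xD83D, 0xDE00, whereas in UTF-8 the same code point is encoded directly as a single four-byte character.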
| .P |
.P |
| If an invalid UTF-8 string is passed to PCRE, an error return is given. At |
If an invalid UTF-8 string is passed to PCRE, an error return is given. At |
| compile time, the only additional information is the offset to the first byte |
compile time, the only additional information is the offset to the first byte |
|
Line 85 situation, you will have to apply your own validity ch
|
Line 112 situation, you will have to apply your own validity ch
|
| JIT optimization. |
JIT optimization. |
| . |
. |
| . |
. |
| .SS "General comments about UTF-8 mode" | .\" HTML <a name="utf16strings"></a> |
| | .SS "Validity of UTF-16 strings" |
| .rs |
.rs |
| .sp |
.sp |
| 1. An unbraced hexadecimal escape sequence (such as \exb3) matches a two-byte | When you set the PCRE_UTF16 flag, the strings of 16-bit data units that are |
| UTF-8 character if the value is greater than 127. | passed as patterns and subjects are (by default) checked for validity on entry |
| | to the relevant functions. Values other than those in the surrogate range |
| | U+D800 to U+DFFF are independent code points. Values in the surrogate range |
| | must be used in pairs in the correct manner. |
| .P |
.P |
| 2. Octal numbers up to \e777 are recognized, and match two-byte UTF-8 | If an invalid UTF-16 string is passed to PCRE, an error return is given. At |
| characters for values greater than \e177. | compile time, the only additional information is the offset to the first data |
| | unit of the failing character. The runtime functions \fBpcre16_exec()\fP and |
| | \fBpcre16_dfa_exec()\fP also pass back this information, as well as a more |
| | detailed reason code if the caller has provided memory in which to do this. |
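For callers who need to apply their own check, the pairing rule and the reported offset can be sketched in C as below. This is a minimal illustration under the assumption that a valid pair is a unit in 0xD800 to 0xDBFF followed by a unit in 0xDC00 to 0xDFFF; the function name and return convention are hypothetical and do not reproduce PCRE's internal checker or its reason codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE function. Scans len 16-bit data units and
   returns -1 if they form valid UTF-16, otherwise the offset (in data units)
   of the first data unit of the failing character. */
long first_invalid_utf16(const uint16_t *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint16_t u = s[i];
        if (u < 0xD800 || u > 0xDFFF) {
            i++;             /* independent code point */
        } else if (u <= 0xDBFF && i + 1 < len &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            i += 2;          /* high surrogate followed by low surrogate */
        } else {
            return (long)i;  /* isolated or misordered surrogate */
        }
    }
    return -1;
}
```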
| .P |
.P |
| 3. Repeat quantifiers apply to complete UTF-8 characters, not to individual | In some situations, you may already know that your strings are valid, and |
| bytes, for example: \ex{100}{3}. | therefore want to skip these checks in order to improve performance. If you set |
| | the PCRE_NO_UTF16_CHECK flag at compile time or at run time, PCRE assumes that |
| | the pattern or subject it is given (respectively) contains only valid UTF-16 |
| | sequences. In this case, it does not diagnose an invalid UTF-16 string. |
| | . |
| | . |
| | .SS "General comments about UTF modes" |
| | .rs |
| | .sp |
| | 1. Codepoints less than 256 can be specified by either braced or unbraced |
| | hexadecimal escape sequences (for example, \ex{b3} or \exb3). Larger values |
| | have to use braced sequences. |
| .P |
.P |
| 4. The dot metacharacter matches one UTF-8 character instead of a single byte. | 2. Octal numbers up to \e777 are recognized, and in UTF-8 mode, they match |
| | two-byte characters for values greater than \e177. |
| .P |
.P |
| 5. The escape sequence \eC can be used to match a single byte in UTF-8 mode, | 3. Repeat quantifiers apply to complete UTF characters, not to individual |
| but its use can lead to some strange effects because it breaks up multibyte | data units, for example: \ex{100}{3}. |
| characters (see the description of \eC in the | .P |
| | 4. The dot metacharacter matches one UTF character instead of a single data |
| | unit. |
| | .P |
| | 5. The escape sequence \eC can be used to match a single byte in UTF-8 mode, or |
| | a single 16-bit data unit in UTF-16 mode, but its use can lead to some strange |
| | effects because it breaks up multi-unit characters (see the description of \eC |
| | in the |
| .\" HREF |
.\" HREF |
| \fBpcrepattern\fP |
\fBpcrepattern\fP |
| .\" |
.\" |
| documentation). The use of \eC is not supported in the alternative matching |
documentation). The use of \eC is not supported in the alternative matching |
| function \fBpcre_dfa_exec()\fP, nor is it supported in UTF-8 mode by the JIT | function \fBpcre[16]_dfa_exec()\fP, nor is it supported in UTF mode by the JIT |
| optimization of \fBpcre_exec()\fP. If JIT optimization is requested for a UTF-8 | optimization of \fBpcre[16]_exec()\fP. If JIT optimization is requested for a |
| pattern that contains \eC, it will not succeed, and so the matching will be | UTF pattern that contains \eC, it will not succeed, and so the matching will |
| carried out by the normal interpretive function. | be carried out by the normal interpretive function. |
| .P |
.P |
| 6. The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly |
6. The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly |
| test characters of any code value, but, by default, the characters that PCRE |
test characters of any code value, but, by default, the characters that PCRE |
| recognizes as digits, spaces, or word characters remain the same set as before, | recognizes as digits, spaces, or word characters remain the same set as in |
| all with values less than 256. This remains true even when PCRE is built to | non-UTF mode, all with values less than 256. This remains true even when PCRE |
| include Unicode property support, because to do otherwise would slow down PCRE | is built to include Unicode property support, because to do otherwise would |
| in many common cases. Note in particular that this applies to \eb and \eB, | slow down PCRE in many common cases. Note in particular that this applies to |
| because they are defined in terms of \ew and \eW. If you really want to test | \eb and \eB, because they are defined in terms of \ew and \eW. If you really |
| for a wider sense of, say, "digit", you can use explicit Unicode property tests | want to test for a wider sense of, say, "digit", you can use explicit Unicode |
| such as \ep{Nd}. Alternatively, if you set the PCRE_UCP option, the way that | property tests such as \ep{Nd}. Alternatively, if you set the PCRE_UCP option, |
| the character escapes work is changed so that Unicode properties are used to | the way that the character escapes work is changed so that Unicode properties |
| determine which characters match. There are more details in the section on | are used to determine which characters match. There are more details in the |
| | section on |
| .\" HTML <a href="pcrepattern.html#genericchartypes"> |
.\" HTML <a href="pcrepattern.html#genericchartypes"> |
| .\" </a> |
.\" </a> |
| generic character types |
generic character types |
|
Line 163 Cambridge CB2 3QH, England.
|
Line 217 Cambridge CB2 3QH, England.
|
| .rs |
.rs |
| .sp |
.sp |
| .nf |
.nf |
| Last updated: 19 October 2011 | Last updated: 13 January 2012 |
| Copyright (c) 1997-2011 University of Cambridge. | Copyright (c) 1997-2012 University of Cambridge. |
| .fi |
.fi |