version 1.1, 2012/02/21 23:05:52
|
version 1.1.1.2, 2012/02/21 23:50:25
|
Line 1
|
Line 1
|
.TH PCREUNICODE 3 |
.TH PCREUNICODE 3 |
.SH NAME |
.SH NAME |
PCRE - Perl-compatible regular expressions |
PCRE - Perl-compatible regular expressions |
.SH "UTF-8 AND UNICODE PROPERTY SUPPORT" | .SH "UTF-8, UTF-16, AND UNICODE PROPERTY SUPPORT" |
.rs |
.rs |
.sp |
.sp |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in | From Release 8.30, in addition to its previous UTF-8 support, PCRE also |
the code, and, in addition, you must call | supports UTF-16 by means of a separate 16-bit library. This can be built as |
| well as, or instead of, the 8-bit library. |
| . |
| . |
| .SH "UTF-8 SUPPORT" |
| .rs |
| .sp |
| In order to process UTF-8 strings, you must build PCRE's 8-bit library with UTF |
| support, and, in addition, you must call |
.\" HREF |
.\" HREF |
\fBpcre_compile()\fP |
\fBpcre_compile()\fP |
.\" |
.\" |
with the PCRE_UTF8 option flag, or the pattern must start with the sequence |
with the PCRE_UTF8 option flag, or the pattern must start with the sequence |
(*UTF8). When either of these is the case, both the pattern and any subject |
(*UTF8). When either of these is the case, both the pattern and any subject |
strings that are matched against it are treated as UTF-8 strings instead of |
strings that are matched against it are treated as UTF-8 strings instead of |
strings of 1-byte characters. PCRE does not support any other formats (in | strings of 1-byte characters. |
particular, it does not support UTF-16). | . |
.P | . |
If you compile PCRE with UTF-8 support, but do not use it at run time, the | .SH "UTF-16 SUPPORT" |
| .rs |
| .sp |
| In order to process UTF-16 strings, you must build PCRE's 16-bit library with UTF |
| support, and, in addition, you must call |
| .\" HTML <a href="pcre_compile.html"> |
| .\" </a> |
| \fBpcre16_compile()\fP |
| .\" |
| with the PCRE_UTF16 option flag, or the pattern must start with the sequence |
| (*UTF16). When either of these is the case, both the pattern and any subject |
| strings that are matched against it are treated as UTF-16 strings instead of |
| strings of 16-bit characters. |
| . |
| . |
| .SH "UTF SUPPORT OVERHEAD" |
| .rs |
| .sp |
| If you compile PCRE with UTF support, but do not use it at run time, the |
library will be a bit bigger, but the additional run time overhead is limited |
library will be a bit bigger, but the additional run time overhead is limited |
to testing the PCRE_UTF8 flag occasionally, so should not be very big. | to testing the PCRE_UTF8/16 flag occasionally, so should not be very big. |
.P | . |
If PCRE is built with Unicode character property support (which implies UTF-8 | . |
support), the escape sequences \ep{..}, \eP{..}, and \eX are supported. | .SH "UNICODE PROPERTY SUPPORT" |
| .rs |
| .sp |
| If PCRE is built with Unicode character property support (which implies UTF |
| support), the escape sequences \ep{..}, \eP{..}, and \eX can be used. |
The available properties that can be tested are limited to the general |
The available properties that can be tested are limited to the general |
category properties such as Lu for an upper case letter or Nd for a decimal |
category properties such as Lu for an upper case letter or Nd for a decimal |
number, the Unicode script names such as Arabic or Han, and the derived |
number, the Unicode script names such as Arabic or Han, and the derived |
Line 38 compatibility with Perl 5.6. PCRE does not support thi
|
Line 68 compatibility with Perl 5.6. PCRE does not support thi
|
.SS "Validity of UTF-8 strings" |
.SS "Validity of UTF-8 strings" |
.rs |
.rs |
.sp |
.sp |
When you set the PCRE_UTF8 flag, the strings passed as patterns and subjects | When you set the PCRE_UTF8 flag, the byte strings passed as patterns and |
are (by default) checked for validity on entry to the relevant functions. From | subjects are (by default) checked for validity on entry to the relevant |
release 7.3 of PCRE, the check is according to the rules of RFC 3629, which are | functions. From release 7.3 of PCRE, the check is according to the rules of RFC |
themselves derived from the Unicode specification. Earlier releases of PCRE | 3629, which are themselves derived from the Unicode specification. Earlier |
followed the rules of RFC 2279, which allows the full range of 31-bit values (0 | releases of PCRE followed the rules of RFC 2279, which allows the full range of |
to 0x7FFFFFFF). The current check allows only values in the range U+0 to | 31-bit values (0 to 0x7FFFFFFF). The current check allows only values in the |
U+10FFFF, excluding U+D800 to U+DFFF. | range U+0 to U+10FFFF, excluding U+D800 to U+DFFF. |
.P |
.P |
The excluded code points are the "Low Surrogate Area" of Unicode, of which the | The excluded code points are the "Surrogate Area" of Unicode. They are reserved |
Unicode Standard says this: "The Low Surrogate Area does not contain any | for use by UTF-16, where they are used in pairs to encode codepoints with |
character assignments, consequently no character code charts or namelists are | values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs |
provided for this area. Surrogates are reserved for use with UTF-16 and then | are available independently in the UTF-8 encoding. (In other words, the whole |
must be used in pairs." The code points that are encoded by UTF-16 pairs are | surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) |
available as independent code points in the UTF-8 encoding. (In other words, | |
the whole surrogate thing is a fudge for UTF-16 which unfortunately messes up | |
UTF-8.) | |
.P |
.P |
If an invalid UTF-8 string is passed to PCRE, an error return is given. At |
If an invalid UTF-8 string is passed to PCRE, an error return is given. At |
compile time, the only additional information is the offset to the first byte |
compile time, the only additional information is the offset to the first byte |
Line 85 situation, you will have to apply your own validity ch
|
Line 112 situation, you will have to apply your own validity ch
|
JIT optimization. |
JIT optimization. |
. |
. |
. |
. |
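The UTF-8 validity rules described in this section (code points U+0 to U+10FFFF, excluding the surrogates U+D800 to U+DFFF) can be illustrated with a small standalone checker. This is only a sketch under stated assumptions, not PCRE's own code: the function name utf8_first_invalid() is hypothetical, and it reports failures as the offset of the first byte of the failing character, which mirrors the compile-time information mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns -1 if s[0..len-1] is valid UTF-8 under the RFC 3629 rules,
   otherwise the offset of the first byte of the first invalid character. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) {            /* single-byte (ASCII) character */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            /* Stray continuation byte, or a 5/6-byte lead byte that was
               legal under RFC 2279 but is rejected by RFC 3629. */
            return (long)i;
        }

        if (len - i <= extra)      /* character is truncated */
            return (long)i;

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)   /* not a continuation byte */
                return (long)i;
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min)                        /* overlong encoding */
            return (long)i;
        if (cp > 0x10FFFF)                   /* beyond the Unicode range */
            return (long)i;
        if (cp >= 0xD800 && cp <= 0xDFFF)    /* surrogate code point */
            return (long)i;

        i += extra + 1;
    }
    return -1;
}
```

A caller that applies its own check in this way could then reasonably pass PCRE_NO_UTF8_CHECK for strings that returned -1, on the assumption that the strings are not modified afterwards.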
.SS "General comments about UTF-8 mode" | .\" HTML <a name="utf16strings"></a> |
| .SS "Validity of UTF-16 strings" |
.rs |
.rs |
.sp |
.sp |
1. An unbraced hexadecimal escape sequence (such as \exb3) matches a two-byte | When you set the PCRE_UTF16 flag, the strings of 16-bit data units that are |
UTF-8 character if the value is greater than 127. | passed as patterns and subjects are (by default) checked for validity on entry |
| to the relevant functions. Values other than those in the surrogate range |
| U+D800 to U+DFFF are independent code points. Values in the surrogate range |
| must be used in pairs in the correct manner. |
.P |
.P |
2. Octal numbers up to \e777 are recognized, and match two-byte UTF-8 | If an invalid UTF-16 string is passed to PCRE, an error return is given. At |
characters for values greater than \e177. | compile time, the only additional information is the offset to the first data |
| unit of the failing character. The runtime functions \fBpcre16_exec()\fP and |
| \fBpcre16_dfa_exec()\fP also pass back this information, as well as a more |
| detailed reason code if the caller has provided memory in which to do this. |
.P |
.P |
3. Repeat quantifiers apply to complete UTF-8 characters, not to individual | In some situations, you may already know that your strings are valid, and |
bytes, for example: \ex{100}{3}. | therefore want to skip these checks in order to improve performance. If you set |
| the PCRE_NO_UTF16_CHECK flag at compile time or at run time, PCRE assumes that |
| the pattern or subject it is given (respectively) contains only valid UTF-16 |
| sequences. In this case, it does not diagnose an invalid UTF-16 string. |
| . |
| . |
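The pairing rule for UTF-16 surrogates can be sketched in the same way. The code below is an illustration only, not PCRE's implementation: the names utf16_first_invalid() and utf16_decode_pair() are hypothetical. It treats values outside U+D800 to U+DFFF as independent code points, requires a high surrogate (0xD800 to 0xDBFF) to be followed immediately by a low surrogate (0xDC00 to 0xDFFF), and reports the offset of the first data unit of the failing character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns -1 if s[0..len-1] is valid UTF-16, otherwise the offset (in
   16-bit data units) of the first unit of the first invalid character. */
long utf16_first_invalid(const uint16_t *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        uint16_t u = s[i];

        if (u < 0xD800 || u > 0xDFFF) {   /* independent code point */
            i++;
            continue;
        }
        if (u >= 0xDC00)                  /* low surrogate with no high before it */
            return (long)i;
        if (i + 1 >= len)                 /* high surrogate at end of string */
            return (long)i;
        if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
            return (long)i;               /* high surrogate not followed by low */

        i += 2;                           /* a correctly formed pair */
    }
    return -1;
}

/* Combines a valid surrogate pair into the code point it encodes,
   which is always greater than 0xFFFF. */
uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000u + (((uint32_t)hi - 0xD800u) << 10) + ((uint32_t)lo - 0xDC00u);
}
```

For example, the pair 0xD83D 0xDE00 decodes to U+1F600, a code point that UTF-8 encodes directly as a four-byte character without any use of surrogates.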
| .SS "General comments about UTF modes" |
| .rs |
| .sp |
| 1. Codepoints less than 256 can be specified by either braced or unbraced |
| hexadecimal escape sequences (for example, \ex{b3} or \exb3). Larger values |
| have to use braced sequences. |
.P |
.P |
4. The dot metacharacter matches one UTF-8 character instead of a single byte. | 2. Octal numbers up to \e777 are recognized, and in UTF-8 mode, they match |
| two-byte characters for values greater than \e177. |
.P |
.P |
5. The escape sequence \eC can be used to match a single byte in UTF-8 mode, | 3. Repeat quantifiers apply to complete UTF characters, not to individual |
but its use can lead to some strange effects because it breaks up multibyte | data units, for example: \ex{100}{3}. |
characters (see the description of \eC in the | .P |
| 4. The dot metacharacter matches one UTF character instead of a single data |
| unit. |
| .P |
| 5. The escape sequence \eC can be used to match a single byte in UTF-8 mode, or |
| a single 16-bit data unit in UTF-16 mode, but its use can lead to some strange |
| effects because it breaks up multi-unit characters (see the description of \eC |
| in the |
.\" HREF |
.\" HREF |
\fBpcrepattern\fP |
\fBpcrepattern\fP |
.\" |
.\" |
documentation). The use of \eC is not supported in the alternative matching |
documentation). The use of \eC is not supported in the alternative matching |
function \fBpcre_dfa_exec()\fP, nor is it supported in UTF-8 mode by the JIT | function \fBpcre[16]_dfa_exec()\fP, nor is it supported in UTF mode by the JIT |
optimization of \fBpcre_exec()\fP. If JIT optimization is requested for a UTF-8 | optimization of \fBpcre[16]_exec()\fP. If JIT optimization is requested for a |
pattern that contains \eC, it will not succeed, and so the matching will be | UTF pattern that contains \eC, it will not succeed, and so the matching will |
carried out by the normal interpretive function. | be carried out by the normal interpretive function. |
.P |
.P |
6. The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly |
6. The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly |
test characters of any code value, but, by default, the characters that PCRE |
test characters of any code value, but, by default, the characters that PCRE |
recognizes as digits, spaces, or word characters remain the same set as before, | recognizes as digits, spaces, or word characters remain the same set as in |
all with values less than 256. This remains true even when PCRE is built to | non-UTF mode, all with values less than 256. This remains true even when PCRE |
include Unicode property support, because to do otherwise would slow down PCRE | is built to include Unicode property support, because to do otherwise would |
in many common cases. Note in particular that this applies to \eb and \eB, | slow down PCRE in many common cases. Note in particular that this applies to |
because they are defined in terms of \ew and \eW. If you really want to test | \eb and \eB, because they are defined in terms of \ew and \eW. If you really |
for a wider sense of, say, "digit", you can use explicit Unicode property tests | want to test for a wider sense of, say, "digit", you can use explicit Unicode |
such as \ep{Nd}. Alternatively, if you set the PCRE_UCP option, the way that | property tests such as \ep{Nd}. Alternatively, if you set the PCRE_UCP option, |
the character escapes work is changed so that Unicode properties are used to | the way that the character escapes work is changed so that Unicode properties |
determine which characters match. There are more details in the section on | are used to determine which characters match. There are more details in the |
| section on |
.\" HTML <a href="pcrepattern.html#genericchartypes"> |
.\" HTML <a href="pcrepattern.html#genericchartypes"> |
.\" </a> |
.\" </a> |
generic character types |
generic character types |
Line 163 Cambridge CB2 3QH, England.
|
Line 217 Cambridge CB2 3QH, England.
|
.rs |
.rs |
.sp |
.sp |
.nf |
.nf |
Last updated: 19 October 2011 | Last updated: 13 January 2012 |
Copyright (c) 1997-2011 University of Cambridge. | Copyright (c) 1997-2012 University of Cambridge. |
.fi |
.fi |