version 1.1.1.1, 2012/02/21 23:05:52
|
version 1.1.1.3, 2012/10/09 09:19:17
|
Line 13 from the original man page. If there is any nonsense i
|
Line 13 from the original man page. If there is any nonsense i
|
man page, in case the conversion went wrong. |
man page, in case the conversion went wrong. |
<br> |
<br> |
<br><b> |
<br><b> |
UTF-8 AND UNICODE PROPERTY SUPPORT | UTF-8, UTF-16, AND UNICODE PROPERTY SUPPORT |
</b><br> |
</b><br> |
<P> |
<P> |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in | From Release 8.30, in addition to its previous UTF-8 support, PCRE also |
the code, and, in addition, you must call | supports UTF-16 by means of a separate 16-bit library. This can be built as |
| well as, or instead of, the 8-bit library. |
| </P> |
| <br><b> |
| UTF-8 SUPPORT |
| </b><br> |
| <P> |
| In order to process UTF-8 strings, you must build PCRE's 8-bit library with UTF |
| support, and, in addition, you must call |
<a href="pcre_compile.html"><b>pcre_compile()</b></a> |
<a href="pcre_compile.html"><b>pcre_compile()</b></a> |
with the PCRE_UTF8 option flag, or the pattern must start with the sequence |
with the PCRE_UTF8 option flag, or the pattern must start with the sequence |
(*UTF8). When either of these is the case, both the pattern and any subject |
(*UTF8). When either of these is the case, both the pattern and any subject |
strings that are matched against it are treated as UTF-8 strings instead of |
strings that are matched against it are treated as UTF-8 strings instead of |
strings of 1-byte characters. PCRE does not support any other formats (in | strings of 1-byte characters. |
particular, it does not support UTF-16). | |
</P> |
</P> |
|
<br><b> |
|
UTF-16 SUPPORT |
|
</b><br> |
<P> |
<P> |
If you compile PCRE with UTF-8 support, but do not use it at run time, the | In order to process UTF-16 strings, you must build PCRE's 16-bit library with UTF |
| support, and, in addition, you must call |
| <a href="pcre_compile.html"><b>pcre16_compile()</b></a> |
| with the PCRE_UTF16 option flag, or the pattern must start with the sequence |
| (*UTF16). When either of these is the case, both the pattern and any subject |
| strings that are matched against it are treated as UTF-16 strings instead of |
| strings of 16-bit characters. |
| </P> |
| <br><b> |
| UTF SUPPORT OVERHEAD |
| </b><br> |
| <P> |
| If you compile PCRE with UTF support, but do not use it at run time, the |
library will be a bit bigger, but the additional run time overhead is limited |
library will be a bit bigger, but the additional run time overhead is limited |
to testing the PCRE_UTF8 flag occasionally, so should not be very big. | to testing the PCRE_UTF8/16 flag occasionally, so should not be very big. |
</P> |
</P> |
|
<br><b> |
|
UNICODE PROPERTY SUPPORT |
|
</b><br> |
<P> |
<P> |
If PCRE is built with Unicode character property support (which implies UTF-8 | If PCRE is built with Unicode character property support (which implies UTF |
support), the escape sequences \p{..}, \P{..}, and \X are supported. | support), the escape sequences \p{..}, \P{..}, and \X can be used. |
The available properties that can be tested are limited to the general |
The available properties that can be tested are limited to the general |
category properties such as Lu for an upper case letter or Nd for a decimal |
category properties such as Lu for an upper case letter or Nd for a decimal |
number, the Unicode script names such as Arabic or Han, and the derived |
number, the Unicode script names such as Arabic or Han, and the derived |
Line 47 compatibility with Perl 5.6. PCRE does not support thi
|
Line 72 compatibility with Perl 5.6. PCRE does not support thi
|
Validity of UTF-8 strings |
Validity of UTF-8 strings |
</b><br> |
</b><br> |
<P> |
<P> |
When you set the PCRE_UTF8 flag, the strings passed as patterns and subjects | When you set the PCRE_UTF8 flag, the byte strings passed as patterns and |
are (by default) checked for validity on entry to the relevant functions. From | subjects are (by default) checked for validity on entry to the relevant |
release 7.3 of PCRE, the check is according to the rules of RFC 3629, which are | functions. The entire string is checked before any other processing takes |
themselves derived from the Unicode specification. Earlier releases of PCRE | place. From release 7.3 of PCRE, the check is according to the rules of RFC 3629, |
followed the rules of RFC 2279, which allows the full range of 31-bit values (0 | which are themselves derived from the Unicode specification. Earlier releases |
to 0x7FFFFFFF). The current check allows only values in the range U+0 to | of PCRE followed the rules of RFC 2279, which allows the full range of 31-bit |
U+10FFFF, excluding U+D800 to U+DFFF. | values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0 |
| to U+10FFFF, excluding U+D800 to U+DFFF. |
</P> |
</P> |
<P> |
<P> |
The excluded code points are the "Low Surrogate Area" of Unicode, of which the | The excluded code points are the "Surrogate Area" of Unicode. They are reserved |
Unicode Standard says this: "The Low Surrogate Area does not contain any | for use by UTF-16, where they are used in pairs to encode codepoints with |
character assignments, consequently no character code charts or namelists are | values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs |
provided for this area. Surrogates are reserved for use with UTF-16 and then | are available independently in the UTF-8 encoding. (In other words, the whole |
must be used in pairs." The code points that are encoded by UTF-16 pairs are | surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) |
available as independent code points in the UTF-8 encoding. (In other words, | |
the whole surrogate thing is a fudge for UTF-16 which unfortunately messes up | |
UTF-8.) | |
</P> |
</P> |
<P> |
<P> |
If an invalid UTF-8 string is passed to PCRE, an error return is given. At |
If an invalid UTF-8 string is passed to PCRE, an error return is given. At |
compile time, the only additional information is the offset to the first byte |
compile time, the only additional information is the offset to the first byte |
of the failing character. The runtime functions <b>pcre_exec()</b> and | of the failing character. The run-time functions <b>pcre_exec()</b> and |
<b>pcre_dfa_exec()</b> also pass back this information, as well as a more |
<b>pcre_dfa_exec()</b> also pass back this information, as well as a more |
detailed reason code if the caller has provided memory in which to do this. |
detailed reason code if the caller has provided memory in which to do this. |
</P> |
</P> |
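The validity rules described above (values limited to U+0 to U+10FFFF, surrogates U+D800 to U+DFFF excluded, and the offset of the first byte of the failing character reported) can be illustrated with a small stand-alone checker. This is only a sketch of an RFC 3629 style check, not PCRE's own code: the function name utf8_first_invalid() and its return convention are hypothetical, and the rejection of overlong encodings is taken from RFC 3629 rather than from the text above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
 * Returns -1 if s[0..len) is valid UTF-8 in the RFC 3629 sense,
 * otherwise the offset of the first byte of the failing character. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) {                     /* one-byte (ASCII) character */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {    /* lead byte of a 2-byte character */
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {    /* lead byte of a 3-byte character */
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {    /* lead byte of a 4-byte character */
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            /* Isolated continuation byte, or a 5/6-byte lead byte that
             * RFC 2279 allowed but RFC 3629 does not. */
            return (long)i;
        }

        if (len - i <= extra)               /* character is truncated */
            return (long)i;

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)  /* not a continuation byte */
                return (long)i;
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min)                       /* overlong encoding */
            return (long)i;
        if (cp > 0x10FFFF)                  /* beyond the Unicode range */
            return (long)i;
        if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogate area */
            return (long)i;

        i += extra + 1;
    }
    return -1;
}
```

For instance, the three bytes ED A0 80 would encode U+D800, so the sketch reports offset 0 for them, while the two bytes C2 B3 (U+00B3) are accepted.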
<P> |
<P> |
In some situations, you may already know that your strings are valid, and |
In some situations, you may already know that your strings are valid, and |
therefore want to skip these checks in order to improve performance. If you set | therefore want to skip these checks in order to improve performance, for |
the PCRE_NO_UTF8_CHECK flag at compile time or at run time, PCRE assumes that | example in the case of a long subject string that is being scanned repeatedly |
the pattern or subject it is given (respectively) contains only valid UTF-8 | with different patterns. If you set the PCRE_NO_UTF8_CHECK flag at compile time |
codes. In this case, it does not diagnose an invalid UTF-8 string. | or at run time, PCRE assumes that the pattern or subject it is given |
| (respectively) contains only valid UTF-8 codes. In this case, it does not |
| diagnose an invalid UTF-8 string. |
</P> |
</P> |
<P> |
<P> |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, what |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, what |
Line 97 encoded in a UTF-8-like manner as per the old RFC, you
|
Line 122 encoded in a UTF-8-like manner as per the old RFC, you
|
PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in this |
PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in this |
situation, you will have to apply your own validity check, and avoid the use of |
situation, you will have to apply your own validity check, and avoid the use of |
JIT optimization. |
JIT optimization. |
|
<a name="utf16strings"></a></P> |
|
<br><b> |
|
Validity of UTF-16 strings |
|
</b><br> |
|
<P> |
|
When you set the PCRE_UTF16 flag, the strings of 16-bit data units that are |
|
passed as patterns and subjects are (by default) checked for validity on entry |
|
to the relevant functions. Values other than those in the surrogate range |
|
U+D800 to U+DFFF are independent code points. Values in the surrogate range |
|
must be used in pairs in the correct manner. |
</P> |
</P> |
|
<P> |
|
If an invalid UTF-16 string is passed to PCRE, an error return is given. At |
|
compile time, the only additional information is the offset to the first data |
|
unit of the failing character. The run-time functions <b>pcre16_exec()</b> and |
|
<b>pcre16_dfa_exec()</b> also pass back this information, as well as a more |
|
detailed reason code if the caller has provided memory in which to do this. |
|
</P> |
|
<P> |
|
In some situations, you may already know that your strings are valid, and |
|
therefore want to skip these checks in order to improve performance. If you set |
|
the PCRE_NO_UTF16_CHECK flag at compile time or at run time, PCRE assumes that |
|
the pattern or subject it is given (respectively) contains only valid UTF-16 |
|
sequences. In this case, it does not diagnose an invalid UTF-16 string. |
|
</P> |
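The pairing rule for UTF-16 described above can be sketched in the same way. The code below is an illustration only, not PCRE's implementation: the names utf16_first_invalid() and utf16_decode_pair() are hypothetical, and unsigned short is assumed to hold one 16-bit data unit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
 * Returns -1 if s[0..len) is a valid UTF-16 string, otherwise the offset
 * (in 16-bit data units) of the first unit of the failing character. */
long utf16_first_invalid(const unsigned short *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned int u = s[i];

        if (u >= 0xD800 && u <= 0xDBFF) {
            /* A high surrogate must be followed by a low surrogate. */
            if (i + 1 >= len)
                return (long)i;
            if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return (long)i;
            i += 2;
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            /* A low surrogate that does not follow a high surrogate. */
            return (long)i;
        } else {
            /* Any other value is an independent code point. */
            i++;
        }
    }
    return -1;
}

/* Combines a valid surrogate pair into the code point it encodes,
 * which is always greater than 0xFFFF. */
unsigned long utf16_decode_pair(unsigned short hi, unsigned short lo)
{
    assert(hi >= 0xD800 && hi <= 0xDBFF);
    assert(lo >= 0xDC00 && lo <= 0xDFFF);
    return 0x10000UL + (((unsigned long)hi - 0xD800UL) << 10)
                     + ((unsigned long)lo - 0xDC00UL);
}
```

With these definitions the pair D83D DE00 decodes to U+1F600, whereas a string that ends in D83D, or that contains DE00 on its own, is reported as invalid at the offset of that data unit.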
<br><b> |
<br><b> |
General comments about UTF-8 mode | General comments about UTF modes |
</b><br> |
</b><br> |
<P> |
<P> |
1. An unbraced hexadecimal escape sequence (such as \xb3) matches a two-byte | 1. Codepoints less than 256 can be specified by either braced or unbraced |
UTF-8 character if the value is greater than 127. | hexadecimal escape sequences (for example, \x{b3} or \xb3). Larger values |
| have to use braced sequences. |
</P> |
</P> |
<P> |
<P> |
2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 | 2. Octal numbers up to \777 are recognized, and in UTF-8 mode, they match |
characters for values greater than \177. | two-byte characters for values greater than \177. |
</P> |
</P> |
<P> |
<P> |
3. Repeat quantifiers apply to complete UTF-8 characters, not to individual | 3. Repeat quantifiers apply to complete UTF characters, not to individual |
bytes, for example: \x{100}{3}. | data units, for example: \x{100}{3}. |
</P> |
</P> |
<P> |
<P> |
4. The dot metacharacter matches one UTF-8 character instead of a single byte. | 4. The dot metacharacter matches one UTF character instead of a single data |
| unit. |
</P> |
</P> |
<P> |
<P> |
5. The escape sequence \C can be used to match a single byte in UTF-8 mode, | 5. The escape sequence \C can be used to match a single byte in UTF-8 mode, or |
but its use can lead to some strange effects because it breaks up multibyte | a single 16-bit data unit in UTF-16 mode, but its use can lead to some strange |
characters (see the description of \C in the | effects because it breaks up multi-unit characters (see the description of \C |
| in the |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
documentation). The use of \C is not supported in the alternative matching |
documentation). The use of \C is not supported in the alternative matching |
function <b>pcre_dfa_exec()</b>, nor is it supported in UTF-8 mode by the JIT | function <b>pcre[16]_dfa_exec()</b>, nor is it supported in UTF mode by the JIT |
optimization of <b>pcre_exec()</b>. If JIT optimization is requested for a UTF-8 | optimization of <b>pcre[16]_exec()</b>. If JIT optimization is requested for a |
pattern that contains \C, it will not succeed, and so the matching will be | UTF pattern that contains \C, it will not succeed, and so the matching will |
carried out by the normal interpretive function. | be carried out by the normal interpretive function. |
</P> |
</P> |
<P> |
<P> |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
test characters of any code value, but, by default, the characters that PCRE |
test characters of any code value, but, by default, the characters that PCRE |
recognizes as digits, spaces, or word characters remain the same set as before, | recognizes as digits, spaces, or word characters remain the same set as in |
all with values less than 256. This remains true even when PCRE is built to | non-UTF mode, all with values less than 256. This remains true even when PCRE |
include Unicode property support, because to do otherwise would slow down PCRE | is built to include Unicode property support, because to do otherwise would |
in many common cases. Note in particular that this applies to \b and \B, | slow down PCRE in many common cases. Note in particular that this applies to |
because they are defined in terms of \w and \W. If you really want to test | \b and \B, because they are defined in terms of \w and \W. If you really |
for a wider sense of, say, "digit", you can use explicit Unicode property tests | want to test for a wider sense of, say, "digit", you can use explicit Unicode |
such as \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that | property tests such as \p{Nd}. Alternatively, if you set the PCRE_UCP option, |
the character escapes work is changed so that Unicode properties are used to | the way that the character escapes work is changed so that Unicode properties |
determine which characters match. There are more details in the section on | are used to determine which characters match. There are more details in the |
| section on |
<a href="pcrepattern.html#genericchartypes">generic character types</a> |
<a href="pcrepattern.html#genericchartypes">generic character types</a> |
in the |
in the |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
Line 149 documentation.
|
Line 202 documentation.
|
low-valued characters, unless the PCRE_UCP option is set. |
low-valued characters, unless the PCRE_UCP option is set. |
</P> |
</P> |
<P> |
<P> |
8. However, the horizontal and vertical whitespace matching escapes (\h, \H, | 8. However, the horizontal and vertical white space matching escapes (\h, \H, |
\v, and \V) do match all the appropriate Unicode characters, whether or not |
\v, and \V) do match all the appropriate Unicode characters, whether or not |
PCRE_UCP is set. |
PCRE_UCP is set. |
</P> |
</P> |
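Items 1 and 2 above say that, in UTF-8 mode, values greater than \177 (127) such as \xb3 or \777 match two-byte characters. The following sketch shows why, by encoding a code point as UTF-8. It is an illustration under stated assumptions rather than PCRE code: utf8_encode() is a hypothetical helper name, and it follows the RFC 3629 limits quoted earlier in this page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
 * Writes the UTF-8 encoding of cp to out and returns the number of bytes
 * used (1 to 4), or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                        /* up to \177: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 0x80 to 0x7FF: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)       /* surrogate area is excluded */
        return 0;
    if (cp < 0x10000) {                     /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Under this encoding \xb3 becomes the two bytes C2 B3 and \777 (0x1FF) becomes C7 BF, while \177 remains the single byte 7F.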
Line 178 Cambridge CB2 3QH, England.
|
Line 231 Cambridge CB2 3QH, England.
|
REVISION |
REVISION |
</b><br> |
</b><br> |
<P> |
<P> |
Last updated: 19 October 2011 | Last updated: 14 April 2012 |
<br> |
<br> |
Copyright © 1997-2011 University of Cambridge. | Copyright © 1997-2012 University of Cambridge. |
<br> |
<br> |
<p> |
<p> |
Return to the <a href="index.html">PCRE index page</a>. |
Return to the <a href="index.html">PCRE index page</a>. |