|
version 1.1.1.1, 2012/02/21 23:05:52
|
version 1.1.1.3, 2012/10/09 09:19:17
|
|
Line 13 from the original man page. If there is any nonsense i
|
Line 13 from the original man page. If there is any nonsense i
|
| man page, in case the conversion went wrong. |
man page, in case the conversion went wrong. |
| <br> |
<br> |
| <br><b> |
<br><b> |
| UTF-8 AND UNICODE PROPERTY SUPPORT | UTF-8, UTF-16, AND UNICODE PROPERTY SUPPORT |
| </b><br> |
</b><br> |
| <P> |
<P> |
| In order to process UTF-8 strings, you must build PCRE to include UTF-8 support in | From Release 8.30, in addition to its previous UTF-8 support, PCRE also |
| the code, and, in addition, you must call | supports UTF-16 by means of a separate 16-bit library. This can be built as |
| | well as, or instead of, the 8-bit library. |
| | </P> |
| | <br><b> |
| | UTF-8 SUPPORT |
| | </b><br> |
| | <P> |
| | In order to process UTF-8 strings, you must build PCRE's 8-bit library with UTF
| | support, and, in addition, you must call |
| <a href="pcre_compile.html"><b>pcre_compile()</b></a> |
<a href="pcre_compile.html"><b>pcre_compile()</b></a> |
| with the PCRE_UTF8 option flag, or the pattern must start with the sequence |
with the PCRE_UTF8 option flag, or the pattern must start with the sequence |
| (*UTF8). When either of these is the case, both the pattern and any subject |
(*UTF8). When either of these is the case, both the pattern and any subject |
| strings that are matched against it are treated as UTF-8 strings instead of |
strings that are matched against it are treated as UTF-8 strings instead of |
| strings of 1-byte characters. PCRE does not support any other formats (in | strings of 1-byte characters. |
| particular, it does not support UTF-16). | |
| </P> |
</P> |
| |
<br><b> |
| |
UTF-16 SUPPORT |
| |
</b><br> |
| <P> |
<P> |
| If you compile PCRE with UTF-8 support, but do not use it at run time, the | In order to process UTF-16 strings, you must build PCRE's 16-bit library with UTF
| | support, and, in addition, you must call |
| | <a href="pcre_compile.html"><b>pcre16_compile()</b></a> |
| | with the PCRE_UTF16 option flag, or the pattern must start with the sequence |
| | (*UTF16). When either of these is the case, both the pattern and any subject |
| | strings that are matched against it are treated as UTF-16 strings instead of |
| | strings of 16-bit characters. |
| | </P> |
| | <br><b> |
| | UTF SUPPORT OVERHEAD |
| | </b><br> |
| | <P> |
| | If you compile PCRE with UTF support, but do not use it at run time, the |
| library will be a bit bigger, but the additional run time overhead is limited |
library will be a bit bigger, but the additional run time overhead is limited |
| to testing the PCRE_UTF8 flag occasionally, so should not be very big. | to testing the PCRE_UTF8/16 flag occasionally, so should not be very big. |
| </P> |
</P> |
| |
<br><b> |
| |
UNICODE PROPERTY SUPPORT |
| |
</b><br> |
| <P> |
<P> |
| If PCRE is built with Unicode character property support (which implies UTF-8 | If PCRE is built with Unicode character property support (which implies UTF |
| support), the escape sequences \p{..}, \P{..}, and \X are supported. | support), the escape sequences \p{..}, \P{..}, and \X can be used. |
| The available properties that can be tested are limited to the general |
The available properties that can be tested are limited to the general |
| category properties such as Lu for an upper case letter or Nd for a decimal |
category properties such as Lu for an upper case letter or Nd for a decimal |
| number, the Unicode script names such as Arabic or Han, and the derived |
number, the Unicode script names such as Arabic or Han, and the derived |
|
Line 47 compatibility with Perl 5.6. PCRE does not support thi
|
Line 72 compatibility with Perl 5.6. PCRE does not support thi
|
| Validity of UTF-8 strings |
Validity of UTF-8 strings |
| </b><br> |
</b><br> |
| <P> |
<P> |
| When you set the PCRE_UTF8 flag, the strings passed as patterns and subjects | When you set the PCRE_UTF8 flag, the byte strings passed as patterns and |
| are (by default) checked for validity on entry to the relevant functions. From | subjects are (by default) checked for validity on entry to the relevant |
| release 7.3 of PCRE, the check is according to the rules of RFC 3629, which are | functions. The entire string is checked before any other processing takes
| themselves derived from the Unicode specification. Earlier releases of PCRE | place. From release 7.3 of PCRE, the check is according to the rules of RFC 3629,
| followed the rules of RFC 2279, which allows the full range of 31-bit values (0 | which are themselves derived from the Unicode specification. Earlier releases |
| to 0x7FFFFFFF). The current check allows only values in the range U+0 to | of PCRE followed the rules of RFC 2279, which allows the full range of 31-bit |
| U+10FFFF, excluding U+D800 to U+DFFF. | values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0 |
| | to U+10FFFF, excluding U+D800 to U+DFFF. |
| </P> |
</P> |
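The RFC 3629 rule described above can be sketched in C as follows. This is only an illustration of the rule, not PCRE's own checking code: the function name `utf8_first_invalid` is hypothetical, and the choice to report the offset of the first byte of the failing character simply mirrors what the text says PCRE returns.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of the PCRE API.
 * Returns -1 if s[0..len) is valid UTF-8 under RFC 3629, otherwise the
 * offset of the first byte of the first failing character. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) {                 /* 1-byte (ASCII) character */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            /* Stray continuation byte, or a 5/6-byte lead byte that only
             * the old RFC 2279 rules allowed. */
            return (long)i;
        }

        if (len - i <= extra)           /* character is truncated */
            return (long)i;
        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return (long)i;         /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min)                   /* overlong encoding */
            return (long)i;
        if (cp > 0x10FFFF)              /* beyond the Unicode range */
            return (long)i;
        if (cp >= 0xD800 && cp <= 0xDFFF) /* surrogate area */
            return (long)i;
        i += extra + 1;
    }
    return -1;
}
```

For example, the four bytes `a`, 0xED, 0xA0, 0x80 (an encoded U+D800 after an ASCII letter) give offset 1, because the surrogate starts at the second byte.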
| <P> |
<P> |
| The excluded code points are the "Low Surrogate Area" of Unicode, of which the | The excluded code points are the "Surrogate Area" of Unicode. They are reserved |
| Unicode Standard says this: "The Low Surrogate Area does not contain any | for use by UTF-16, where they are used in pairs to encode codepoints with |
| character assignments, consequently no character code charts or namelists are | values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs |
| provided for this area. Surrogates are reserved for use with UTF-16 and then | are available independently in the UTF-8 encoding. (In other words, the whole |
| must be used in pairs." The code points that are encoded by UTF-16 pairs are | surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) |
| available as independent code points in the UTF-8 encoding. (In other words, | |
| the whole surrogate thing is a fudge for UTF-16 which unfortunately messes up | |
| UTF-8.) | |
| </P> |
</P> |
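To make the role of the surrogate area concrete, here is a minimal C sketch of how a code point above 0xFFFF is split into a UTF-16 pair. The function name `utf16_encode` is hypothetical and is not a PCRE function; the arithmetic is the standard UTF-16 encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT part of the PCRE API.
 * Encodes one code point as UTF-16. Writes 1 or 2 data units to out and
 * returns how many were written, or 0 if cp is a surrogate value or lies
 * beyond U+10FFFF and so cannot be encoded. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* one independent data unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

With this sketch, U+1F600 becomes the pair 0xD83D 0xDE00, whereas in UTF-8 the same code point is encoded directly as a four-byte character.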
| <P> |
<P> |
| If an invalid UTF-8 string is passed to PCRE, an error return is given. At |
If an invalid UTF-8 string is passed to PCRE, an error return is given. At |
| compile time, the only additional information is the offset to the first byte |
compile time, the only additional information is the offset to the first byte |
| of the failing character. The runtime functions <b>pcre_exec()</b> and | of the failing character. The run-time functions <b>pcre_exec()</b> and |
| <b>pcre_dfa_exec()</b> also pass back this information, as well as a more |
<b>pcre_dfa_exec()</b> also pass back this information, as well as a more |
| detailed reason code if the caller has provided memory in which to do this. |
detailed reason code if the caller has provided memory in which to do this. |
| </P> |
</P> |
| <P> |
<P> |
| In some situations, you may already know that your strings are valid, and |
In some situations, you may already know that your strings are valid, and |
| therefore want to skip these checks in order to improve performance. If you set | therefore want to skip these checks in order to improve performance, for |
| the PCRE_NO_UTF8_CHECK flag at compile time or at run time, PCRE assumes that | example in the case of a long subject string that is being scanned repeatedly |
| the pattern or subject it is given (respectively) contains only valid UTF-8 | with different patterns. If you set the PCRE_NO_UTF8_CHECK flag at compile time |
| codes. In this case, it does not diagnose an invalid UTF-8 string. | or at run time, PCRE assumes that the pattern or subject it is given |
| | (respectively) contains only valid UTF-8 codes. In this case, it does not |
| | diagnose an invalid UTF-8 string. |
| </P> |
</P> |
| <P> |
<P> |
| If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, what |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, what |
|
Line 97 encoded in a UTF-8-like manner as per the old RFC, you
|
Line 122 encoded in a UTF-8-like manner as per the old RFC, you
|
| PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in this |
PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in this |
| situation, you will have to apply your own validity check, and avoid the use of |
situation, you will have to apply your own validity check, and avoid the use of |
| JIT optimization. |
JIT optimization. |
| |
<a name="utf16strings"></a></P> |
| |
<br><b> |
| |
Validity of UTF-16 strings |
| |
</b><br> |
| |
<P> |
| |
When you set the PCRE_UTF16 flag, the strings of 16-bit data units that are |
| |
passed as patterns and subjects are (by default) checked for validity on entry |
| |
to the relevant functions. Values other than those in the surrogate range |
| |
U+D800 to U+DFFF are independent code points. Values in the surrogate range |
| |
must be used in pairs in the correct manner. |
| </P> |
</P> |
| |
<P> |
| |
If an invalid UTF-16 string is passed to PCRE, an error return is given. At |
| |
compile time, the only additional information is the offset to the first data |
| |
unit of the failing character. The run-time functions <b>pcre16_exec()</b> and |
| |
<b>pcre16_dfa_exec()</b> also pass back this information, as well as a more |
| |
detailed reason code if the caller has provided memory in which to do this. |
| |
</P> |
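The pairing rule for UTF-16 can be sketched in C along these lines. The function name `utf16_first_invalid` is an assumption made for illustration and is not part of the 16-bit library's API; reporting the offset of the first data unit of the failing character follows the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT part of the PCRE API.
 * Returns -1 if s[0..len) is valid UTF-16, otherwise the offset (in 16-bit
 * data units) of the first unit of the first failing character. */
long utf16_first_invalid(const uint16_t *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        uint16_t u = s[i];

        if (u < 0xD800 || u > 0xDFFF) { /* independent code point */
            i++;
            continue;
        }
        if (u >= 0xDC00)                /* low surrogate with no high before it */
            return (long)i;
        if (i + 1 >= len)               /* high surrogate at end of string */
            return (long)i;
        if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
            return (long)i;             /* high surrogate not followed by low */
        i += 2;                         /* well-formed pair */
    }
    return -1;
}
```

A caller that has already run a check of this kind on its data is in the situation where skipping PCRE's own check may be reasonable.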
| |
<P> |
| |
In some situations, you may already know that your strings are valid, and |
| |
therefore want to skip these checks in order to improve performance. If you set |
| |
the PCRE_NO_UTF16_CHECK flag at compile time or at run time, PCRE assumes that |
| |
the pattern or subject it is given (respectively) contains only valid UTF-16 |
| |
sequences. In this case, it does not diagnose an invalid UTF-16 string. |
| |
</P> |
| <br><b> |
<br><b> |
| General comments about UTF-8 mode | General comments about UTF modes |
| </b><br> |
</b><br> |
| <P> |
<P> |
| 1. An unbraced hexadecimal escape sequence (such as \xb3) matches a two-byte | 1. Codepoints less than 256 can be specified by either braced or unbraced |
| UTF-8 character if the value is greater than 127. | hexadecimal escape sequences (for example, \x{b3} or \xb3). Larger values |
| | have to use braced sequences. |
| </P> |
</P> |
| <P> |
<P> |
| 2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 | 2. Octal numbers up to \777 are recognized, and in UTF-8 mode, they match |
| characters for values greater than \177. | two-byte characters for values greater than \177. |
| </P> |
</P> |
| <P> |
<P> |
| 3. Repeat quantifiers apply to complete UTF-8 characters, not to individual | 3. Repeat quantifiers apply to complete UTF characters, not to individual |
| bytes, for example: \x{100}{3}. | data units, for example: \x{100}{3}. |
| </P> |
</P> |
| <P> |
<P> |
| 4. The dot metacharacter matches one UTF-8 character instead of a single byte. | 4. The dot metacharacter matches one UTF character instead of a single data |
| | unit. |
| </P> |
</P> |
| <P> |
<P> |
| 5. The escape sequence \C can be used to match a single byte in UTF-8 mode, | 5. The escape sequence \C can be used to match a single byte in UTF-8 mode, or |
| but its use can lead to some strange effects because it breaks up multibyte | a single 16-bit data unit in UTF-16 mode, but its use can lead to some strange |
| characters (see the description of \C in the | effects because it breaks up multi-unit characters (see the description of \C |
| | in the |
| <a href="pcrepattern.html"><b>pcrepattern</b></a> |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
| documentation). The use of \C is not supported in the alternative matching |
documentation). The use of \C is not supported in the alternative matching |
| function <b>pcre_dfa_exec()</b>, nor is it supported in UTF-8 mode by the JIT | function <b>pcre[16]_dfa_exec()</b>, nor is it supported in UTF mode by the JIT |
| optimization of <b>pcre_exec()</b>. If JIT optimization is requested for a UTF-8 | optimization of <b>pcre[16]_exec()</b>. If JIT optimization is requested for a |
| pattern that contains \C, it will not succeed, and so the matching will be | UTF pattern that contains \C, it will not succeed, and so the matching will |
| carried out by the normal interpretive function. | be carried out by the normal interpretive function. |
| </P> |
</P> |
| <P> |
<P> |
| 6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
| test characters of any code value, but, by default, the characters that PCRE |
test characters of any code value, but, by default, the characters that PCRE |
| recognizes as digits, spaces, or word characters remain the same set as before, | recognizes as digits, spaces, or word characters remain the same set as in |
| all with values less than 256. This remains true even when PCRE is built to | non-UTF mode, all with values less than 256. This remains true even when PCRE |
| include Unicode property support, because to do otherwise would slow down PCRE | is built to include Unicode property support, because to do otherwise would |
| in many common cases. Note in particular that this applies to \b and \B, | slow down PCRE in many common cases. Note in particular that this applies to |
| because they are defined in terms of \w and \W. If you really want to test | \b and \B, because they are defined in terms of \w and \W. If you really |
| for a wider sense of, say, "digit", you can use explicit Unicode property tests | want to test for a wider sense of, say, "digit", you can use explicit Unicode |
| such as \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that | property tests such as \p{Nd}. Alternatively, if you set the PCRE_UCP option, |
| the character escapes work is changed so that Unicode properties are used to | the way that the character escapes work is changed so that Unicode properties |
| determine which characters match. There are more details in the section on | are used to determine which characters match. There are more details in the |
| | section on |
| <a href="pcrepattern.html#genericchartypes">generic character types</a> |
<a href="pcrepattern.html#genericchartypes">generic character types</a> |
| in the |
in the |
| <a href="pcrepattern.html"><b>pcrepattern</b></a> |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
|
Line 149 documentation.
|
Line 202 documentation.
|
| low-valued characters, unless the PCRE_UCP option is set. |
low-valued characters, unless the PCRE_UCP option is set. |
| </P> |
</P> |
| <P> |
<P> |
| 8. However, the horizontal and vertical whitespace matching escapes (\h, \H, | 8. However, the horizontal and vertical white space matching escapes (\h, \H, |
| \v, and \V) do match all the appropriate Unicode characters, whether or not |
\v, and \V) do match all the appropriate Unicode characters, whether or not |
| PCRE_UCP is set. |
PCRE_UCP is set. |
| </P> |
</P> |
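Several of the comments above refer to values greater than 127 (\177) becoming two-byte characters in UTF-8 mode. The following C sketch shows where that comes from. The function name `utf8_encode` is hypothetical, not a PCRE function, and the code is the standard UTF-8 encoding rather than anything taken from the PCRE source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT part of the PCRE API.
 * Encodes one code point as UTF-8. Writes 1 to 4 bytes to out and returns
 * how many were written, or 0 if cp is a surrogate value or lies beyond
 * U+10FFFF. */
int utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x80) {                    /* up to \177: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* includes \xb3 and \777: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Under this encoding, \xb3 is the byte pair 0xC2 0xB3, and \x{100} is 0xC4 0x80, which is why a quantifier such as \x{100}{3} has to apply to whole characters rather than to individual bytes.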
|
Line 178 Cambridge CB2 3QH, England.
|
Line 231 Cambridge CB2 3QH, England.
|
| REVISION |
REVISION |
| </b><br> |
</b><br> |
| <P> |
<P> |
| Last updated: 19 October 2011 | Last updated: 14 April 2012 |
| <br> |
<br> |
| Copyright © 1997-2011 University of Cambridge. | Copyright © 1997-2012 University of Cambridge. |
| <br> |
<br> |
| <p> |
<p> |
| Return to the <a href="index.html">PCRE index page</a>. |
Return to the <a href="index.html">PCRE index page</a>. |