version 1.1.1.2, 2012/02/21 23:50:25
|
version 1.1.1.3, 2012/10/09 09:19:17
|
Line 1
|
Line 1
|
.TH PCREUNICODE 3 | .TH PCREUNICODE 3 "14 April 2012" "PCRE 8.30" |
.SH NAME |
.SH NAME |
PCRE - Perl-compatible regular expressions |
PCRE - Perl-compatible regular expressions |
.SH "UTF-8, UTF-16, AND UNICODE PROPERTY SUPPORT" |
.SH "UTF-8, UTF-16, AND UNICODE PROPERTY SUPPORT" |
Line 70 compatibility with Perl 5.6. PCRE does not support thi
|
Line 70 compatibility with Perl 5.6. PCRE does not support thi
|
.sp |
.sp |
When you set the PCRE_UTF8 flag, the byte strings passed as patterns and |
When you set the PCRE_UTF8 flag, the byte strings passed as patterns and |
subjects are (by default) checked for validity on entry to the relevant |
subjects are (by default) checked for validity on entry to the relevant |
functions. From release 7.3 of PCRE, the check is according the rules of RFC | functions. The entire string is checked before any other processing takes |
3629, which are themselves derived from the Unicode specification. Earlier | place. From release 7.3 of PCRE, the check is according the rules of RFC 3629, |
releases of PCRE followed the rules of RFC 2279, which allows the full range of | which are themselves derived from the Unicode specification. Earlier releases |
31-bit values (0 to 0x7FFFFFFF). The current check allows only values in the | of PCRE followed the rules of RFC 2279, which allows the full range of 31-bit |
range U+0 to U+10FFFF, excluding U+D800 to U+DFFF. | values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0 |
| to U+10FFFF, excluding U+D800 to U+DFFF. |
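The range rule in this paragraph can be sketched in C as follows. This is a minimal illustration, not PCRE's own checking code: the function name utf8_first_invalid and its return convention are assumptions made for the example. It accepts only well-formed encodings of U+0 to U+10FFFF, rejects U+D800 to U+DFFF, and reports the offset of the first byte of the failing character.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns -1 if s[0..len) is valid UTF-8 under the RFC 3629 rules,
   otherwise the offset of the first byte of the first failing character. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) {                       /* one-byte (ASCII) character */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {      /* two-byte sequence */
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {      /* three-byte sequence */
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {      /* four-byte sequence */
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return (long)i;   /* stray continuation byte, or a 5/6-byte
                                 lead byte that only RFC 2279 allowed */
        }

        if (extra > len - i - 1)
            return (long)i;                   /* truncated character */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return (long)i;               /* not a continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min)
            return (long)i;                   /* overlong encoding */
        if (cp > 0x10FFFF)
            return (long)i;                   /* above the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return (long)i;                   /* surrogate code point */

        i += extra + 1;
    }
    return -1;
}
```

A caller that validates a subject once with a routine like this could then reasonably skip repeated checks on the same data.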
.P |
.P |
The excluded code points are the "Surrogate Area" of Unicode. They are reserved |
The excluded code points are the "Surrogate Area" of Unicode. They are reserved |
for use by UTF-16, where they are used in pairs to encode codepoints with |
for use by UTF-16, where they are used in pairs to encode codepoints with |
Line 84 surrogate thing is a fudge for UTF-16 which unfortunat
|
Line 85 surrogate thing is a fudge for UTF-16 which unfortunat
|
.P |
.P |
If an invalid UTF-8 string is passed to PCRE, an error return is given. At |
If an invalid UTF-8 string is passed to PCRE, an error return is given. At |
compile time, the only additional information is the offset to the first byte |
compile time, the only additional information is the offset to the first byte |
of the failing character. The runtime functions \fBpcre_exec()\fP and | of the failing character. The run-time functions \fBpcre_exec()\fP and |
\fBpcre_dfa_exec()\fP also pass back this information, as well as a more |
\fBpcre_dfa_exec()\fP also pass back this information, as well as a more |
detailed reason code if the caller has provided memory in which to do this. |
detailed reason code if the caller has provided memory in which to do this. |
.P |
.P |
In some situations, you may already know that your strings are valid, and |
In some situations, you may already know that your strings are valid, and |
therefore want to skip these checks in order to improve performance. If you set | therefore want to skip these checks in order to improve performance, for |
the PCRE_NO_UTF8_CHECK flag at compile time or at run time, PCRE assumes that | example in the case of a long subject string that is being scanned repeatedly |
the pattern or subject it is given (respectively) contains only valid UTF-8 | with different patterns. If you set the PCRE_NO_UTF8_CHECK flag at compile time |
codes. In this case, it does not diagnose an invalid UTF-8 string. | or at run time, PCRE assumes that the pattern or subject it is given |
| (respectively) contains only valid UTF-8 codes. In this case, it does not |
| diagnose an invalid UTF-8 string. |
.P |
.P |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, what |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, what |
happens depends on why the string is invalid. If the string conforms to the |
happens depends on why the string is invalid. If the string conforms to the |
Line 124 must be used in pairs in the correct manner.
|
Line 127 must be used in pairs in the correct manner.
|
.P |
.P |
If an invalid UTF-16 string is passed to PCRE, an error return is given. At |
If an invalid UTF-16 string is passed to PCRE, an error return is given. At |
compile time, the only additional information is the offset to the first data |
compile time, the only additional information is the offset to the first data |
unit of the failing character. The runtime functions \fBpcre16_exec()\fP and | unit of the failing character. The run-time functions \fBpcre16_exec()\fP and |
\fBpcre16_dfa_exec()\fP also pass back this information, as well as a more |
\fBpcre16_dfa_exec()\fP also pass back this information, as well as a more |
detailed reason code if the caller has provided memory in which to do this. |
detailed reason code if the caller has provided memory in which to do this. |
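The idea of reporting the first data unit of a failing UTF-16 character can be sketched in C as follows. This is an illustration under stated assumptions, not the library's implementation: the name utf16_first_invalid is hypothetical, and the only rule checked is that surrogates occur as a high unit (0xD800 to 0xDBFF) immediately followed by a low unit (0xDC00 to 0xDFFF).

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns -1 if s[0..len) is well-formed UTF-16, otherwise the offset
   (in 16-bit data units) of the first unit of the failing character. */
long utf16_first_invalid(const unsigned short *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned int u = s[i];

        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: a low surrogate must follow. */
            unsigned int v;
            if (i + 1 >= len)
                return (long)i;               /* string ends mid-pair */
            v = s[i + 1];
            if (v < 0xDC00 || v > 0xDFFF)
                return (long)i;               /* missing low surrogate */
            i += 2;
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return (long)i;                   /* isolated low surrogate */
        } else {
            i++;                              /* ordinary BMP character */
        }
    }
    return -1;
}
```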
.P |
.P |
Line 189 documentation.
|
Line 192 documentation.
|
7. Similarly, characters that match the POSIX named character classes are all |
7. Similarly, characters that match the POSIX named character classes are all |
low-valued characters, unless the PCRE_UCP option is set. |
low-valued characters, unless the PCRE_UCP option is set. |
.P |
.P |
8. However, the horizontal and vertical whitespace matching escapes (\eh, \eH, | 8. However, the horizontal and vertical white space matching escapes (\eh, \eH, |
\ev, and \eV) do match all the appropriate Unicode characters, whether or not |
\ev, and \eV) do match all the appropriate Unicode characters, whether or not |
PCRE_UCP is set. |
PCRE_UCP is set. |
.P |
.P |
Line 217 Cambridge CB2 3QH, England.
|
Line 220 Cambridge CB2 3QH, England.
|
.rs |
.rs |
.sp |
.sp |
.nf |
.nf |
Last updated: 13 January 2012 | Last updated: 14 April 2012 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
.fi |
.fi |