version 1.1.1.2, 2012/02/21 23:50:25
|
version 1.1.1.3, 2012/10/09 09:19:18
|
Line 227 backslash. All other characters (in particular, those
|
Line 227 backslash. All other characters (in particular, those
|
greater than 127) are treated as literals. |
greater than 127) are treated as literals. |
</P> |
</P> |
<P> |
<P> |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the | If a pattern is compiled with the PCRE_EXTENDED option, white space in the |
pattern (other than in a character class) and characters between a # outside |
pattern (other than in a character class) and characters between a # outside |
a character class and the next newline are ignored. An escaping backslash can |
a character class and the next newline are ignored. An escaping backslash can |
be used to include a whitespace or # character as part of the pattern. | be used to include a white space or # character as part of the pattern. |
</P> |
</P> |
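As an illustration of the extended-mode rules above, here is a minimal sketch that uses Python's `re.VERBOSE` flag as a stand-in for PCRE_EXTENDED. This is an assumption made for the sake of a runnable example: Python's `re` module is not PCRE, and the helper name `matches` is hypothetical. The behaviour shown (unescaped white space and `#` comments ignored, backslash-escaped white space and `#` kept as literals) is the part the two engines have in common.

```python
import re

# re.VERBOSE is used here in place of PCRE_EXTENDED (an assumption for
# illustration; the two engines are similar but not identical).
pattern = re.compile(r"""
    \d+     # one or more digits; this comment and the indentation are ignored
    \       # a backslash-escaped space is a literal space
    \#      # a backslash-escaped hash is a literal '#'
    \d+     # one or more digits
""", re.VERBOSE)


def matches(text):
    """Return True if the whole of text matches the extended pattern."""
    return pattern.fullmatch(text) is not None
```

With this pattern, `matches("12 #34")` is true, while `matches("12#34")` is false because the escaped space is part of the pattern.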
<P> |
<P> |
If you want to remove the special meaning from a sequence of characters, you |
If you want to remove the special meaning from a sequence of characters, you |
Line 264 one of the following escape sequences than the binary
|
Line 264 one of the following escape sequences than the binary
|
\a alarm, that is, the BEL character (hex 07) |
\a alarm, that is, the BEL character (hex 07) |
\cx "control-x", where x is any ASCII character |
\cx "control-x", where x is any ASCII character |
\e escape (hex 1B) |
\e escape (hex 1B) |
\f formfeed (hex 0C) | \f form feed (hex 0C) |
\n linefeed (hex 0A) |
\n linefeed (hex 0A) |
\r carriage return (hex 0D) |
\r carriage return (hex 0D) |
\t tab (hex 09) |
\t tab (hex 09) |
Line 307 as just described only when it is followed by two hexa
|
Line 307 as just described only when it is followed by two hexa
|
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
code points greater than 256 is provided by \u, which must be followed by |
code points greater than 256 is provided by \u, which must be followed by |
four hexadecimal digits; otherwise it matches a literal "u" character. |
four hexadecimal digits; otherwise it matches a literal "u" character. |
|
Character codes specified by \u in JavaScript mode are constrained in the same |
|
way as those specified by \x in non-JavaScript mode. |
</P> |
</P> |
<P> |
<P> |
Characters whose value is less than 256 can be defined by either of the two |
Characters whose value is less than 256 can be defined by either of the two |
Line 406 Another use of backslash is for specifying generic cha
|
Line 408 Another use of backslash is for specifying generic cha
|
<pre> |
<pre> |
\d any decimal digit |
\d any decimal digit |
\D any character that is not a decimal digit |
\D any character that is not a decimal digit |
\h any horizontal whitespace character | \h any horizontal white space character |
\H any character that is not a horizontal whitespace character | \H any character that is not a horizontal white space character |
\s any whitespace character | \s any white space character |
\S any character that is not a whitespace character | \S any character that is not a white space character |
\v any vertical whitespace character | \v any vertical white space character |
\V any character that is not a vertical whitespace character | \V any character that is not a vertical white space character |
\w any "word" character |
\w any "word" character |
\W any "non-word" character |
\W any "non-word" character |
</pre> |
</pre> |
Line 497 The vertical space characters are:
|
Line 499 The vertical space characters are:
|
<pre> |
<pre> |
U+000A Linefeed |
U+000A Linefeed |
U+000B Vertical tab |
U+000B Vertical tab |
U+000C Formfeed | U+000C Form feed |
U+000D Carriage return |
U+000D Carriage return |
U+0085 Next line |
U+0085 Next line |
U+2028 Line separator |
U+2028 Line separator |
Line 520 This is an example of an "atomic group", details of wh
|
Line 522 This is an example of an "atomic group", details of wh
|
<a href="#atomicgroup">below.</a> |
<a href="#atomicgroup">below.</a> |
This particular group matches either the two-character sequence CR followed by |
This particular group matches either the two-character sequence CR followed by |
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, |
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, |
U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next | U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next |
line, U+0085). The two-character sequence is treated as a single unit that |
line, U+0085). The two-character sequence is treated as a single unit that |
cannot be split. |
cannot be split. |
</P> |
</P> |
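The set of newline sequences listed above can be approximated in an engine that lacks \R. The sketch below uses Python's `re` module, which is an assumption for illustration only and not PCRE itself; the names `NEWLINE_SEQUENCE` and `split_lines` are hypothetical. It uses a plain alternation rather than an atomic group, so it does not reproduce the "cannot be split" guarantee: inside a larger pattern, backtracking could still match CR alone.

```python
import re

# CR LF is listed first so that it is consumed as one unit when both are
# present; the character class covers LF, VT, FF, CR and NEL.
NEWLINE_SEQUENCE = re.compile(r"\r\n|[\n\x0b\x0c\r\x85]")


def split_lines(text):
    """Split text at each newline sequence described above."""
    return NEWLINE_SEQUENCE.split(text)
```

For example, `split_lines("x\r\ny")` yields two items with no empty string between them, because the CR LF pair is matched as a single separator.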
Line 596 Armenian,
|
Line 598 Armenian,
|
Avestan, |
Avestan, |
Balinese, |
Balinese, |
Bamum, |
Bamum, |
|
Batak, |
Bengali, |
Bengali, |
Bopomofo, |
Bopomofo, |
|
Brahmi, |
Braille, |
Braille, |
Buginese, |
Buginese, |
Buhid, |
Buhid, |
Canadian_Aboriginal, |
Canadian_Aboriginal, |
Carian, |
Carian, |
|
Chakma, |
Cham, |
Cham, |
Cherokee, |
Cherokee, |
Common, |
Common, |
Line 645 Lisu,
|
Line 650 Lisu,
|
Lycian, |
Lycian, |
Lydian, |
Lydian, |
Malayalam, |
Malayalam, |
|
Mandaic, |
Meetei_Mayek, |
Meetei_Mayek, |
|
Meroitic_Cursive, |
|
Meroitic_Hieroglyphs, |
|
Miao, |
Mongolian, |
Mongolian, |
Myanmar, |
Myanmar, |
New_Tai_Lue, |
New_Tai_Lue, |
Line 664 Rejang,
|
Line 673 Rejang,
|
Runic, |
Runic, |
Samaritan, |
Samaritan, |
Saurashtra, |
Saurashtra, |
|
Sharada, |
Shavian, |
Shavian, |
Sinhala, |
Sinhala, |
|
Sora_Sompeng, |
Sundanese, |
Sundanese, |
Syloti_Nagri, |
Syloti_Nagri, |
Syriac, |
Syriac, |
Line 674 Tagbanwa,
|
Line 685 Tagbanwa,
|
Tai_Le, |
Tai_Le, |
Tai_Tham, |
Tai_Tham, |
Tai_Viet, |
Tai_Viet, |
|
Takri, |
Tamil, |
Tamil, |
Telugu, |
Telugu, |
Thaana, |
Thaana, |
Line 812 PCRE_UCP is set. They are:
|
Line 824 PCRE_UCP is set. They are:
|
Xwd Any Perl "word" character |
Xwd Any Perl "word" character |
</pre> |
</pre> |
Xan matches characters that have either the L (letter) or the N (number) |
Xan matches characters that have either the L (letter) or the N (number) |
property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or | property. Xps matches the characters tab, linefeed, vertical tab, form feed, or |
carriage return, and any other character that has the Z (separator) property. |
carriage return, and any other character that has the Z (separator) property. |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
same characters as Xan, plus underscore. |
same characters as Xan, plus underscore. |
Line 1008 used. Because \C breaks up characters into individual
|
Line 1020 used. Because \C breaks up characters into individual
|
unit with \C in a UTF mode means that the rest of the string may start with a |
unit with \C in a UTF mode means that the rest of the string may start with a |
malformed UTF character. This has undefined results, because PCRE assumes that |
malformed UTF character. This has undefined results, because PCRE assumes that |
it is dealing with valid UTF strings (and by default it checks this at the |
it is dealing with valid UTF strings (and by default it checks this at the |
start of processing unless the PCRE_NO_UTF8_CHECK option is used). | start of processing unless the PCRE_NO_UTF8_CHECK or PCRE_NO_UTF16_CHECK option |
| is used). |
</P> |
</P> |
<P> |
<P> |
PCRE does not allow \C to appear in lookbehind assertions |
PCRE does not allow \C to appear in lookbehind assertions |
Line 1818 Because there may be many capturing parentheses in a p
|
Line 1831 Because there may be many capturing parentheses in a p
|
following a backslash are taken as part of a potential back reference number. |
following a backslash are taken as part of a potential back reference number. |
If the pattern continues with a digit character, some delimiter must be used to |
If the pattern continues with a digit character, some delimiter must be used to |
terminate the back reference. If the PCRE_EXTENDED option is set, this can be |
terminate the back reference. If the PCRE_EXTENDED option is set, this can be |
whitespace. Otherwise, the \g{ syntax or an empty comment (see | white space. Otherwise, the \g{ syntax or an empty comment (see |
<a href="#comments">"Comments"</a> |
<a href="#comments">"Comments"</a> |
below) can be used. |
below) can be used. |
</P> |
</P> |
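Two of the delimiters mentioned above, white space in extended mode and an empty comment, can be tried out in Python's `re` module. This is a sketch under the assumption that Python is an acceptable stand-in; it is not PCRE, and it does not support the \g{ syntax inside patterns. The variable names are hypothetical. Both patterns mean "group 1, a back reference to group 1, then a literal digit 1".

```python
import re

# In verbose (extended) mode the space ends the back reference \1,
# so the following "1" is an ordinary literal digit.
with_space = re.compile(r"(a)\1 1", re.VERBOSE)

# Without extended mode, an empty comment (?#) serves the same purpose.
with_comment = re.compile(r"(a)\1(?#)1")
```

Both compiled patterns match the subject "aa1" in full. Writing the digits together as `\11` would instead be read as a reference to a different group number.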
Line 2160 point in the pattern; the idea of DEFINE is that it ca
|
Line 2173 point in the pattern; the idea of DEFINE is that it ca
|
subroutines that can be referenced from elsewhere. (The use of |
subroutines that can be referenced from elsewhere. (The use of |
<a href="#subpatternsassubroutines">subroutines</a> |
<a href="#subpatternsassubroutines">subroutines</a> |
is described below.) For example, a pattern to match an IPv4 address such as |
is described below.) For example, a pattern to match an IPv4 address such as |
"192.168.23.245" could be written like this (ignore whitespace and line | "192.168.23.245" could be written like this (ignore white space and line |
breaks): |
breaks): |
<pre> |
<pre> |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
Line 2554 exception: the name from a (*MARK), (*PRUNE), or (*THE
|
Line 2567 exception: the name from a (*MARK), (*PRUNE), or (*THE
|
a successful positive assertion <i>is</i> passed back when a match succeeds |
a successful positive assertion <i>is</i> passed back when a match succeeds |
(compare capturing parentheses in assertions). Note that such subpatterns are |
(compare capturing parentheses in assertions). Note that such subpatterns are |
processed as anchored at the point where they are tested. Note also that Perl's |
processed as anchored at the point where they are tested. Note also that Perl's |
treatment of subroutines is different in some cases. | treatment of subroutines and assertions is different in some cases. |
</P> |
</P> |
<P> |
<P> |
The new verbs make use of what was previously invalid syntax: an opening |
The new verbs make use of what was previously invalid syntax: an opening |
parenthesis followed by an asterisk. They are generally of the form |
parenthesis followed by an asterisk. They are generally of the form |
(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, |
(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, |
depending on whether or not an argument is present. A name is any sequence of |
depending on whether or not an argument is present. A name is any sequence of |
characters that does not include a closing parenthesis. If the name is empty, | characters that does not include a closing parenthesis. The maximum length of |
that is, if the closing parenthesis immediately follows the colon, the effect | name is 255 in the 8-bit library and 65535 in the 16-bit library. If the name |
is as if the colon were not there. Any number of these verbs may occur in a | is empty, that is, if the closing parenthesis immediately follows the colon, |
pattern. | the effect is as if the colon were not there. Any number of these verbs may |
</P> | occur in a pattern. |
| <a name="nooptimize"></a></P> |
| <br><b> |
| Optimizations that affect backtracking verbs |
| </b><br> |
<P> |
<P> |
PCRE contains some optimizations that are used to speed up matching by running |
PCRE contains some optimizations that are used to speed up matching by running |
some checks at the start of each match attempt. For example, it may know the |
some checks at the start of each match attempt. For example, it may know the |
Line 2574 present. When one of these optimizations suppresses th
|
Line 2591 present. When one of these optimizations suppresses th
|
included backtracking verbs will not, of course, be processed. You can suppress |
included backtracking verbs will not, of course, be processed. You can suppress |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the |
when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the |
pattern with (*NO_START_OPT). | pattern with (*NO_START_OPT). There is more discussion of this option in the |
| section entitled |
| <a href="pcreapi.html#execoptions">"Option bits for <b>pcre_exec()</b>"</a> |
| in the |
| <a href="pcreapi.html"><b>pcreapi</b></a> |
| documentation. |
</P> |
</P> |
<P> |
<P> |
Experiments with Perl suggest that it too has similar optimizations, sometimes |
Experiments with Perl suggest that it too has similar optimizations, sometimes |
Line 2662 After a partial match or a failed match, the name of t
|
Line 2684 After a partial match or a failed match, the name of t
|
No match, mark = B |
No match, mark = B |
</pre> |
</pre> |
Note that in this unanchored example the mark is retained from the match |
Note that in this unanchored example the mark is retained from the match |
attempt that started at the letter "X". Subsequent match attempts starting at | attempt that started at the letter "X" in the subject. Subsequent match |
"P" and then with an empty string do not get as far as the (*MARK) item, but | attempts starting at "P" and then with an empty string do not get as far as the |
nevertheless do not reset it. | (*MARK) item, but nevertheless do not reset it. |
</P> |
</P> |
|
<P> |
|
If you are interested in (*MARK) values after failed matches, you should |
|
probably set the PCRE_NO_START_OPTIMIZE option |
|
<a href="#nooptimize">(see above)</a> |
|
to ensure that the match is always attempted. |
|
</P> |
<br><b> |
<br><b> |
Verbs that act after backtracking |
Verbs that act after backtracking |
</b><br> |
</b><br> |
Line 2843 Cambridge CB2 3QH, England.
|
Line 2871 Cambridge CB2 3QH, England.
|
</P> |
</P> |
<br><a name="SEC28" href="#TOC1">REVISION</a><br> |
<br><a name="SEC28" href="#TOC1">REVISION</a><br> |
<P> |
<P> |
Last updated: 09 January 2012 | Last updated: 17 June 2012 |
<br> |
<br> |
Copyright © 1997-2012 University of Cambridge. |
Copyright © 1997-2012 University of Cambridge. |
<br> |
<br> |