version 1.1.1.2, 2012/02/21 23:50:25
|
version 1.1.1.3, 2012/10/09 09:19:17
|
Line 1
|
Line 1
|
.TH PCREPATTERN 3 | .TH PCREPATTERN 3 "04 May 2012" "PCRE 8.31" |
.SH NAME |
.SH NAME |
PCRE - Perl-compatible regular expressions |
PCRE - Perl-compatible regular expressions |
.SH "PCRE REGULAR EXPRESSION DETAILS" |
.SH "PCRE REGULAR EXPRESSION DETAILS" |
Line 198 In a UTF mode, only ASCII numbers and letters have any
|
Line 198 In a UTF mode, only ASCII numbers and letters have any
|
backslash. All other characters (in particular, those whose codepoints are |
backslash. All other characters (in particular, those whose codepoints are |
greater than 127) are treated as literals. |
greater than 127) are treated as literals. |
.P |
.P |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the | If a pattern is compiled with the PCRE_EXTENDED option, white space in the |
pattern (other than in a character class) and characters between a # outside |
pattern (other than in a character class) and characters between a # outside |
a character class and the next newline are ignored. An escaping backslash can |
a character class and the next newline are ignored. An escaping backslash can |
be used to include a whitespace or # character as part of the pattern. | be used to include a white space or # character as part of the pattern. |
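As a rough illustration of the extended-syntax rules just described, the sketch below uses Python's re.VERBOSE flag. It is assumed here to be a close analogue of PCRE_EXTENDED; it is not PCRE itself. The pattern and the names ORDER_REF and matches_order_ref are invented for the example.

```python
import re

# Python's re.VERBOSE stands in for PCRE_EXTENDED here (an analogue, not PCRE).
# ORDER_REF is a hypothetical pattern invented for this example.
ORDER_REF = re.compile(
    r"""
    \d+     # digits; the white space and this comment are ignored
    \ \#    # backslash-escaped space and '#' are matched literally
    \d+     # more digits
    [ #]?   # inside a character class, space and '#' are ordinary characters
    """,
    re.VERBOSE,
)


def matches_order_ref(text):
    """Return True if the whole of text matches ORDER_REF."""
    return ORDER_REF.fullmatch(text) is not None
```

With this sketch, matches_order_ref("12 #34") should succeed, while "12#34" should fail because the escaped space is a required part of the pattern.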
.P |
.P |
If you want to remove the special meaning from a sequence of characters, you |
If you want to remove the special meaning from a sequence of characters, you |
can do so by putting them between \eQ and \eE. This is different from Perl in |
can do so by putting them between \eQ and \eE. This is different from Perl in |
Line 237 one of the following escape sequences than the binary
|
Line 237 one of the following escape sequences than the binary
|
\ea alarm, that is, the BEL character (hex 07) |
\ea alarm, that is, the BEL character (hex 07) |
\ecx "control-x", where x is any ASCII character |
\ecx "control-x", where x is any ASCII character |
\ee escape (hex 1B) |
\ee escape (hex 1B) |
\ef formfeed (hex 0C) | \ef form feed (hex 0C) |
\en linefeed (hex 0A) |
\en linefeed (hex 0A) |
\er carriage return (hex 0D) |
\er carriage return (hex 0D) |
\et tab (hex 09) |
\et tab (hex 09) |
Line 277 as just described only when it is followed by two hexa
|
Line 277 as just described only when it is followed by two hexa
|
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
code points greater than 256 is provided by \eu, which must be followed by |
code points greater than 256 is provided by \eu, which must be followed by |
four hexadecimal digits; otherwise it matches a literal "u" character. |
four hexadecimal digits; otherwise it matches a literal "u" character. |
|
Character codes specified by \eu in JavaScript mode are constrained in the same |
|
way as those specified by \ex in non-JavaScript mode.
.P |
.P |
Characters whose value is less than 256 can be defined by either of the two |
Characters whose value is less than 256 can be defined by either of the two |
syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the |
syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the |
Line 399 Another use of backslash is for specifying generic cha
|
Line 401 Another use of backslash is for specifying generic cha
|
.sp |
.sp |
\ed any decimal digit |
\ed any decimal digit |
\eD any character that is not a decimal digit |
\eD any character that is not a decimal digit |
\eh any horizontal whitespace character | \eh any horizontal white space character |
\eH any character that is not a horizontal whitespace character | \eH any character that is not a horizontal white space character |
\es any whitespace character | \es any white space character |
\eS any character that is not a whitespace character | \eS any character that is not a white space character |
\ev any vertical whitespace character | \ev any vertical white space character |
\eV any character that is not a vertical whitespace character | \eV any character that is not a vertical white space character |
\ew any "word" character |
\ew any "word" character |
\eW any "non-word" character |
\eW any "non-word" character |
.sp |
.sp |
Line 493 The vertical space characters are:
|
Line 495 The vertical space characters are:
|
.sp |
.sp |
U+000A Linefeed |
U+000A Linefeed |
U+000B Vertical tab |
U+000B Vertical tab |
U+000C Formfeed | U+000C Form feed |
U+000D Carriage return |
U+000D Carriage return |
U+0085 Next line |
U+0085 Next line |
U+2028 Line separator |
U+2028 Line separator |
Line 520 below.
|
Line 522 below.
|
.\" |
.\" |
This particular group matches either the two-character sequence CR followed by |
This particular group matches either the two-character sequence CR followed by |
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, |
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, |
U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next | U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next |
line, U+0085). The two-character sequence is treated as a single unit that |
line, U+0085). The two-character sequence is treated as a single unit that |
cannot be split. |
cannot be split. |
.P |
.P |
Line 596 Armenian,
|
Line 598 Armenian,
|
Avestan, |
Avestan, |
Balinese, |
Balinese, |
Bamum, |
Bamum, |
|
Batak, |
Bengali, |
Bengali, |
Bopomofo, |
Bopomofo, |
|
Brahmi, |
Braille, |
Braille, |
Buginese, |
Buginese, |
Buhid, |
Buhid, |
Canadian_Aboriginal, |
Canadian_Aboriginal, |
Carian, |
Carian, |
|
Chakma, |
Cham, |
Cham, |
Cherokee, |
Cherokee, |
Common, |
Common, |
Line 645 Lisu,
|
Line 650 Lisu,
|
Lycian, |
Lycian, |
Lydian, |
Lydian, |
Malayalam, |
Malayalam, |
|
Mandaic, |
Meetei_Mayek, |
Meetei_Mayek, |
|
Meroitic_Cursive, |
|
Meroitic_Hieroglyphs, |
|
Miao, |
Mongolian, |
Mongolian, |
Myanmar, |
Myanmar, |
New_Tai_Lue, |
New_Tai_Lue, |
Line 664 Rejang,
|
Line 673 Rejang,
|
Runic, |
Runic, |
Samaritan, |
Samaritan, |
Saurashtra, |
Saurashtra, |
|
Sharada, |
Shavian, |
Shavian, |
Sinhala, |
Sinhala, |
|
Sora_Sompeng, |
Sundanese, |
Sundanese, |
Syloti_Nagri, |
Syloti_Nagri, |
Syriac, |
Syriac, |
Line 674 Tagbanwa,
|
Line 685 Tagbanwa,
|
Tai_Le, |
Tai_Le, |
Tai_Tham, |
Tai_Tham, |
Tai_Viet, |
Tai_Viet, |
|
Takri, |
Tamil, |
Tamil, |
Telugu, |
Telugu, |
Thaana, |
Thaana, |
Line 809 PCRE_UCP is set. They are:
|
Line 821 PCRE_UCP is set. They are:
|
Xwd Any Perl "word" character |
Xwd Any Perl "word" character |
.sp |
.sp |
Xan matches characters that have either the L (letter) or the N (number) |
Xan matches characters that have either the L (letter) or the N (number) |
property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or | property. Xps matches the characters tab, linefeed, vertical tab, form feed, or |
carriage return, and any other character that has the Z (separator) property. |
carriage return, and any other character that has the Z (separator) property. |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
same characters as Xan, plus underscore. |
same characters as Xan, plus underscore. |
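The four definitions above can be approximated outside PCRE with Unicode general categories. The following is a minimal sketch using Python's unicodedata module; the function names are hypothetical, each function expects a single character, and the results depend on the Unicode version shipped with the Python interpreter, which may differ from the tables built into PCRE.

```python
import unicodedata


def is_xan(ch):
    """Xan: the character has the L (letter) or N (number) property."""
    return unicodedata.category(ch)[0] in ("L", "N")


def is_xwd(ch):
    """Xwd: the same characters as Xan, plus underscore."""
    return ch == "_" or is_xan(ch)


def is_xps(ch):
    """Xps: tab, linefeed, vertical tab, form feed, carriage return,
    or any character with the Z (separator) property."""
    return ch in "\t\n\x0b\x0c\r" or unicodedata.category(ch)[0] == "Z"


def is_xsp(ch):
    """Xsp: the same as Xps, except that vertical tab is excluded."""
    return ch != "\x0b" and is_xps(ch)
```

For example, under these definitions a vertical tab satisfies is_xps but not is_xsp, and an underscore satisfies is_xwd but not is_xan.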
Line 1010 used. Because \eC breaks up characters into individual
|
Line 1022 used. Because \eC breaks up characters into individual
|
unit with \eC in a UTF mode means that the rest of the string may start with a |
unit with \eC in a UTF mode means that the rest of the string may start with a |
malformed UTF character. This has undefined results, because PCRE assumes that |
malformed UTF character. This has undefined results, because PCRE assumes that |
it is dealing with valid UTF strings (and by default it checks this at the |
it is dealing with valid UTF strings (and by default it checks this at the |
start of processing unless the PCRE_NO_UTF8_CHECK option is used). | start of processing unless the PCRE_NO_UTF8_CHECK or PCRE_NO_UTF16_CHECK option |
| is used). |
.P |
.P |
PCRE does not allow \eC to appear in lookbehind assertions |
PCRE does not allow \eC to appear in lookbehind assertions |
.\" HTML <a href="#lookbehind"> |
.\" HTML <a href="#lookbehind"> |
Line 1832 Because there may be many capturing parentheses in a p
|
Line 1845 Because there may be many capturing parentheses in a p
|
following a backslash are taken as part of a potential back reference number. |
following a backslash are taken as part of a potential back reference number. |
If the pattern continues with a digit character, some delimiter must be used to |
If the pattern continues with a digit character, some delimiter must be used to |
terminate the back reference. If the PCRE_EXTENDED option is set, this can be |
terminate the back reference. If the PCRE_EXTENDED option is set, this can be |
whitespace. Otherwise, the \eg{ syntax or an empty comment (see | white space. Otherwise, the \eg{ syntax or an empty comment (see |
.\" HTML <a href="#comments"> |
.\" HTML <a href="#comments"> |
.\" </a> |
.\" </a> |
"Comments" |
"Comments" |
Line 2189 subroutines that can be referenced from elsewhere. (Th
|
Line 2202 subroutines that can be referenced from elsewhere. (Th
|
subroutines |
subroutines |
.\" |
.\" |
is described below.) For example, a pattern to match an IPv4 address such as |
is described below.) For example, a pattern to match an IPv4 address such as |
"192.168.23.245" could be written like this (ignore whitespace and line | "192.168.23.245" could be written like this (ignore white space and line |
breaks): |
breaks): |
.sp |
.sp |
(?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) ) |
(?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) ) |
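The same idea can be sketched in an engine that lacks (?(DEFINE)...) groups. Python's re module has no named subroutines, so the example below substitutes plain string composition for the subroutine call: the "byte" alternatives are copied from the definition above, while the surrounding pattern (four bytes separated by dots, between word boundaries) and the names BYTE, IPV4 and is_ipv4 are assumptions made for illustration.

```python
import re

# The "byte" alternatives are taken from the DEFINE group shown above.
# Python's re has no (?(DEFINE)...) or subroutine calls, so the sub-pattern
# is reused by string composition instead (an emulation, not PCRE behaviour).
BYTE = r"(?: 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d )"

# Assumed overall shape: four bytes separated by dots, between word
# boundaries. re.VERBOSE lets the white space in the pattern be ignored.
IPV4 = re.compile(rf"\b {BYTE} (?: \. {BYTE} ){{3}} \b", re.VERBOSE | re.ASCII)


def is_ipv4(text):
    """Return True if the whole of text looks like a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None
```

Under these assumptions is_ipv4("192.168.23.245") should be true, and "256.1.1.1" should be rejected because no alternative of the byte sub-pattern accepts 256.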
Line 2588 exception: the name from a (*MARK), (*PRUNE), or (*THE
|
Line 2601 exception: the name from a (*MARK), (*PRUNE), or (*THE
|
a successful positive assertion \fIis\fP passed back when a match succeeds |
a successful positive assertion \fIis\fP passed back when a match succeeds |
(compare capturing parentheses in assertions). Note that such subpatterns are |
(compare capturing parentheses in assertions). Note that such subpatterns are |
processed as anchored at the point where they are tested. Note also that Perl's |
processed as anchored at the point where they are tested. Note also that Perl's |
treatment of subroutines is different in some cases. | treatment of subroutines and assertions is different in some cases. |
.P |
.P |
The new verbs make use of what was previously invalid syntax: an opening |
The new verbs make use of what was previously invalid syntax: an opening |
parenthesis followed by an asterisk. They are generally of the form |
parenthesis followed by an asterisk. They are generally of the form |
(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, |
(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, |
depending on whether or not an argument is present. A name is any sequence of |
depending on whether or not an argument is present. A name is any sequence of |
characters that does not include a closing parenthesis. If the name is empty, | characters that does not include a closing parenthesis. The maximum length of |
that is, if the closing parenthesis immediately follows the colon, the effect | name is 255 in the 8-bit library and 65535 in the 16-bit library. If the name |
is as if the colon were not there. Any number of these verbs may occur in a | is empty, that is, if the closing parenthesis immediately follows the colon, |
pattern. | the effect is as if the colon were not there. Any number of these verbs may |
.P | occur in a pattern. |
| . |
| . |
| .\" HTML <a name="nooptimize"></a> |
| .SS "Optimizations that affect backtracking verbs" |
| .rs |
| .sp |
PCRE contains some optimizations that are used to speed up matching by running |
PCRE contains some optimizations that are used to speed up matching by running |
some checks at the start of each match attempt. For example, it may know the |
some checks at the start of each match attempt. For example, it may know the |
minimum length of matching subject, or that a particular character must be |
minimum length of matching subject, or that a particular character must be |
Line 2606 present. When one of these optimizations suppresses th
|
Line 2625 present. When one of these optimizations suppresses th
|
included backtracking verbs will not, of course, be processed. You can suppress |
included backtracking verbs will not, of course, be processed. You can suppress |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the |
when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the |
pattern with (*NO_START_OPT). | pattern with (*NO_START_OPT). There is more discussion of this option in the |
| section entitled |
| .\" HTML <a href="pcreapi.html#execoptions"> |
| .\" </a> |
| "Option bits for \fBpcre_exec()\fP" |
| .\" |
| in the |
| .\" HREF |
| \fBpcreapi\fP |
| .\" |
| documentation. |
.P |
.P |
Experiments with Perl suggest that it too has similar optimizations, sometimes |
Experiments with Perl suggest that it too has similar optimizations, sometimes |
leading to anomalous results. |
leading to anomalous results. |
Line 2695 After a partial match or a failed match, the name of t
|
Line 2724 After a partial match or a failed match, the name of t
|
No match, mark = B |
No match, mark = B |
.sp |
.sp |
Note that in this unanchored example the mark is retained from the match |
Note that in this unanchored example the mark is retained from the match |
attempt that started at the letter "X". Subsequent match attempts starting at | attempt that started at the letter "X" in the subject. Subsequent match |
"P" and then with an empty string do not get as far as the (*MARK) item, but | attempts starting at "P" and then with an empty string do not get as far as the |
nevertheless do not reset it. | (*MARK) item, but nevertheless do not reset it. |
| .P |
| If you are interested in (*MARK) values after failed matches, you should |
| probably set the PCRE_NO_START_OPTIMIZE option |
| .\" HTML <a href="#nooptimize"> |
| .\" </a> |
| (see above) |
| .\" |
| to ensure that the match is always attempted. |
. |
. |
. |
. |
.SS "Verbs that act after backtracking" |
.SS "Verbs that act after backtracking" |
Line 2876 Cambridge CB2 3QH, England.
|
Line 2913 Cambridge CB2 3QH, England.
|
.rs |
.rs |
.sp |
.sp |
.nf |
.nf |
Last updated: 09 January 2012 | Last updated: 17 June 2012 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
.fi |
.fi |