|
version 1.1.1.2, 2012/02/21 23:50:25
|
version 1.1.1.3, 2012/10/09 09:19:17
|
|
Line 1
|
Line 1
|
| .TH PCREPATTERN 3 | .TH PCREPATTERN 3 "04 May 2012" "PCRE 8.31" |
| .SH NAME |
.SH NAME |
| PCRE - Perl-compatible regular expressions |
PCRE - Perl-compatible regular expressions |
| .SH "PCRE REGULAR EXPRESSION DETAILS" |
.SH "PCRE REGULAR EXPRESSION DETAILS" |
|
Line 198 In a UTF mode, only ASCII numbers and letters have any
|
Line 198 In a UTF mode, only ASCII numbers and letters have any
|
| backslash. All other characters (in particular, those whose codepoints are |
backslash. All other characters (in particular, those whose codepoints are |
| greater than 127) are treated as literals. |
greater than 127) are treated as literals. |
| .P |
.P |
| If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the | If a pattern is compiled with the PCRE_EXTENDED option, white space in the |
| pattern (other than in a character class) and characters between a # outside |
pattern (other than in a character class) and characters between a # outside |
| a character class and the next newline are ignored. An escaping backslash can |
a character class and the next newline are ignored. An escaping backslash can |
| be used to include a whitespace or # character as part of the pattern. | be used to include a white space or # character as part of the pattern. |
| .P |
.P |
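A minimal sketch of the extended-mode rules described above. It uses Python's `re.VERBOSE` flag as a stand-in for PCRE_EXTENDED, on the assumption that the two behave alike for these particular rules; it does not call the PCRE library. The pattern and subject string are made up for illustration.

```python
import re

# Assumption: re.VERBOSE mirrors the PCRE_EXTENDED rules shown here
# (white space ignored, '#' starts a comment, backslash makes either literal).
pattern = re.compile(r"""
    (\d+)     # one or more digits; this comment is ignored
    \ \#\     # escaped space, escaped hash, escaped space: all literal
    ([a b]+)  # inside a character class the space is literal
""", re.VERBOSE)

m = pattern.fullmatch("42 # a b")
print(m.groups() if m else None)
```

The unescaped spaces and the comments take no part in the match; only the escaped space and # characters, and the space inside the character class, are matched literally.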
| If you want to remove the special meaning from a sequence of characters, you |
If you want to remove the special meaning from a sequence of characters, you |
| can do so by putting them between \eQ and \eE. This is different from Perl in |
can do so by putting them between \eQ and \eE. This is different from Perl in |
|
Line 237 one of the following escape sequences than the binary
|
Line 237 one of the following escape sequences than the binary
|
| \ea alarm, that is, the BEL character (hex 07) |
\ea alarm, that is, the BEL character (hex 07) |
| \ecx "control-x", where x is any ASCII character |
\ecx "control-x", where x is any ASCII character |
| \ee escape (hex 1B) |
\ee escape (hex 1B) |
| \ef formfeed (hex 0C) | \ef form feed (hex 0C) |
| \en linefeed (hex 0A) |
\en linefeed (hex 0A) |
| \er carriage return (hex 0D) |
\er carriage return (hex 0D) |
| \et tab (hex 09) |
\et tab (hex 09) |
|
Line 277 as just described only when it is followed by two hexa
|
Line 277 as just described only when it is followed by two hexa
|
| Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
| code points greater than 256 is provided by \eu, which must be followed by |
code points greater than 256 is provided by \eu, which must be followed by |
| four hexadecimal digits; otherwise it matches a literal "u" character. |
four hexadecimal digits; otherwise it matches a literal "u" character. |
| |
Character codes specified by \eu in JavaScript mode are constrained in the same |
| |
way as those specified by \ex in non-JavaScript mode.
| .P |
.P |
| Characters whose value is less than 256 can be defined by either of the two |
Characters whose value is less than 256 can be defined by either of the two |
| syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the |
syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the |
|
Line 399 Another use of backslash is for specifying generic cha
|
Line 401 Another use of backslash is for specifying generic cha
|
| .sp |
.sp |
| \ed any decimal digit |
\ed any decimal digit |
| \eD any character that is not a decimal digit |
\eD any character that is not a decimal digit |
| \eh any horizontal whitespace character | \eh any horizontal white space character |
| \eH any character that is not a horizontal whitespace character | \eH any character that is not a horizontal white space character |
| \es any whitespace character | \es any white space character |
| \eS any character that is not a whitespace character | \eS any character that is not a white space character |
| \ev any vertical whitespace character | \ev any vertical white space character |
| \eV any character that is not a vertical whitespace character | \eV any character that is not a vertical white space character |
| \ew any "word" character |
\ew any "word" character |
| \eW any "non-word" character |
\eW any "non-word" character |
| .sp |
.sp |
|
Line 493 The vertical space characters are:
|
Line 495 The vertical space characters are:
|
| .sp |
.sp |
| U+000A Linefeed |
U+000A Linefeed |
| U+000B Vertical tab |
U+000B Vertical tab |
| U+000C Formfeed | U+000C Form feed |
| U+000D Carriage return |
U+000D Carriage return |
| U+0085 Next line |
U+0085 Next line |
| U+2028 Line separator |
U+2028 Line separator |
|
Line 520 below.
|
Line 522 below.
|
| .\" |
.\" |
| This particular group matches either the two-character sequence CR followed by |
This particular group matches either the two-character sequence CR followed by |
| LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, |
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, |
| U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next | U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next |
| line, U+0085). The two-character sequence is treated as a single unit that |
line, U+0085). The two-character sequence is treated as a single unit that |
| cannot be split. |
cannot be split. |
| .P |
.P |
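The newline group described above can be approximated with an ordinary alternation. This sketch uses Python's `re` module rather than PCRE, and the names `NEWLINE_SEQ` and `split_lines` are hypothetical. Listing CR LF first makes the pair be consumed as one unit when splitting, though a plain alternation is not atomic, so it is not a full equivalent of the group.

```python
import re

# Assumption: an approximation of the newline group, not PCRE's own code.
# CR LF comes first so the two-character sequence is taken as a single unit.
NEWLINE_SEQ = re.compile("\r\n|[\n\x0b\x0c\r\x85]")

def split_lines(text):
    """Split text at CR LF, LF, VT, FF, CR, or NEL."""
    return NEWLINE_SEQ.split(text)

parts = split_lines("a\r\nb\nc\x0bd\x0ce\rf\x85g")
print(parts)
```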
|
Line 596 Armenian,
|
Line 598 Armenian,
|
| Avestan, |
Avestan, |
| Balinese, |
Balinese, |
| Bamum, |
Bamum, |
| |
Batak, |
| Bengali, |
Bengali, |
| Bopomofo, |
Bopomofo, |
| |
Brahmi, |
| Braille, |
Braille, |
| Buginese, |
Buginese, |
| Buhid, |
Buhid, |
| Canadian_Aboriginal, |
Canadian_Aboriginal, |
| Carian, |
Carian, |
| |
Chakma, |
| Cham, |
Cham, |
| Cherokee, |
Cherokee, |
| Common, |
Common, |
|
Line 645 Lisu,
|
Line 650 Lisu,
|
| Lycian, |
Lycian, |
| Lydian, |
Lydian, |
| Malayalam, |
Malayalam, |
| |
Mandaic, |
| Meetei_Mayek, |
Meetei_Mayek, |
| |
Meroitic_Cursive, |
| |
Meroitic_Hieroglyphs, |
| |
Miao, |
| Mongolian, |
Mongolian, |
| Myanmar, |
Myanmar, |
| New_Tai_Lue, |
New_Tai_Lue, |
|
Line 664 Rejang,
|
Line 673 Rejang,
|
| Runic, |
Runic, |
| Samaritan, |
Samaritan, |
| Saurashtra, |
Saurashtra, |
| |
Sharada, |
| Shavian, |
Shavian, |
| Sinhala, |
Sinhala, |
| |
Sora_Sompeng, |
| Sundanese, |
Sundanese, |
| Syloti_Nagri, |
Syloti_Nagri, |
| Syriac, |
Syriac, |
|
Line 674 Tagbanwa,
|
Line 685 Tagbanwa,
|
| Tai_Le, |
Tai_Le, |
| Tai_Tham, |
Tai_Tham, |
| Tai_Viet, |
Tai_Viet, |
| |
Takri, |
| Tamil, |
Tamil, |
| Telugu, |
Telugu, |
| Thaana, |
Thaana, |
|
Line 809 PCRE_UCP is set. They are:
|
Line 821 PCRE_UCP is set. They are:
|
| Xwd Any Perl "word" character |
Xwd Any Perl "word" character |
| .sp |
.sp |
| Xan matches characters that have either the L (letter) or the N (number) |
Xan matches characters that have either the L (letter) or the N (number) |
| property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or | property. Xps matches the characters tab, linefeed, vertical tab, form feed, or |
| carriage return, and any other character that has the Z (separator) property. |
carriage return, and any other character that has the Z (separator) property. |
| Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
| same characters as Xan, plus underscore. |
same characters as Xan, plus underscore. |
|
Line 1010 used. Because \eC breaks up characters into individual
|
Line 1022 used. Because \eC breaks up characters into individual
|
| unit with \eC in a UTF mode means that the rest of the string may start with a |
unit with \eC in a UTF mode means that the rest of the string may start with a |
| malformed UTF character. This has undefined results, because PCRE assumes that |
malformed UTF character. This has undefined results, because PCRE assumes that |
| it is dealing with valid UTF strings (and by default it checks this at the |
it is dealing with valid UTF strings (and by default it checks this at the |
| start of processing unless the PCRE_NO_UTF8_CHECK option is used). | start of processing unless the PCRE_NO_UTF8_CHECK or PCRE_NO_UTF16_CHECK option |
| | is used). |
| .P |
.P |
| PCRE does not allow \eC to appear in lookbehind assertions |
PCRE does not allow \eC to appear in lookbehind assertions |
| .\" HTML <a href="#lookbehind"> |
.\" HTML <a href="#lookbehind"> |
|
Line 1832 Because there may be many capturing parentheses in a p
|
Line 1845 Because there may be many capturing parentheses in a p
|
| following a backslash are taken as part of a potential back reference number. |
following a backslash are taken as part of a potential back reference number. |
| If the pattern continues with a digit character, some delimiter must be used to |
If the pattern continues with a digit character, some delimiter must be used to |
| terminate the back reference. If the PCRE_EXTENDED option is set, this can be |
terminate the back reference. If the PCRE_EXTENDED option is set, this can be |
| whitespace. Otherwise, the \eg{ syntax or an empty comment (see | white space. Otherwise, the \eg{ syntax or an empty comment (see |
| .\" HTML <a href="#comments"> |
.\" HTML <a href="#comments"> |
| .\" </a> |
.\" </a> |
| "Comments" |
"Comments" |
|
Line 2189 subroutines that can be referenced from elsewhere. (Th
|
Line 2202 subroutines that can be referenced from elsewhere. (Th
|
| subroutines |
subroutines |
| .\" |
.\" |
| is described below.) For example, a pattern to match an IPv4 address such as |
is described below.) For example, a pattern to match an IPv4 address such as |
| "192.168.23.245" could be written like this (ignore whitespace and line | "192.168.23.245" could be written like this (ignore white space and line |
| breaks): |
breaks): |
| .sp |
.sp |
| (?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) ) |
(?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) ) |
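A rough sketch of the same idea in Python, whose `re` module has no (?(DEFINE)...) groups. It reuses the "byte" subpattern by string interpolation instead of a subroutine call. The names `BYTE`, `IPV4`, and `is_ipv4` are hypothetical, and the assumption is that the address consists of four such bytes separated by dots.

```python
import re

# The "byte" alternatives from the pattern above, as a reusable fragment.
BYTE = r"(?: 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d )"

# Assumption: an address is one byte followed by three ".byte" groups.
# re.VERBOSE lets the white space in the pattern be ignored.
IPV4 = re.compile(rf"\b {BYTE} (?: \. {BYTE} ){{3}} \b", re.VERBOSE)

def is_ipv4(s):
    """Return True if s is, in its entirety, a dotted IPv4 address."""
    return IPV4.fullmatch(s) is not None

print(is_ipv4("192.168.23.245"))
print(is_ipv4("256.1.1.1"))
```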
|
Line 2588 exception: the name from a (*MARK), (*PRUNE), or (*THE

Line 2601 exception: the name from a (*MARK), (*PRUNE), or (*THE
|
| a successful positive assertion \fIis\fP passed back when a match succeeds |
a successful positive assertion \fIis\fP passed back when a match succeeds |
| (compare capturing parentheses in assertions). Note that such subpatterns are |
(compare capturing parentheses in assertions). Note that such subpatterns are |
| processed as anchored at the point where they are tested. Note also that Perl's |
processed as anchored at the point where they are tested. Note also that Perl's |
| treatment of subroutines is different in some cases. | treatment of subroutines and assertions is different in some cases. |
| .P |
.P |
| The new verbs make use of what was previously invalid syntax: an opening |
The new verbs make use of what was previously invalid syntax: an opening |
| parenthesis followed by an asterisk. They are generally of the form |
parenthesis followed by an asterisk. They are generally of the form |
| (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, |
(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, |
| depending on whether or not an argument is present. A name is any sequence of |
depending on whether or not an argument is present. A name is any sequence of |
| characters that does not include a closing parenthesis. If the name is empty, | characters that does not include a closing parenthesis. The maximum length of |
| that is, if the closing parenthesis immediately follows the colon, the effect | name is 255 in the 8-bit library and 65535 in the 16-bit library. If the name |
| is as if the colon were not there. Any number of these verbs may occur in a | is empty, that is, if the closing parenthesis immediately follows the colon, |
| pattern. | the effect is as if the colon were not there. Any number of these verbs may |
| .P | occur in a pattern. |
| | . |
| | . |
| | .\" HTML <a name="nooptimize"></a> |
| | .SS "Optimizations that affect backtracking verbs" |
| | .rs |
| | .sp |
| PCRE contains some optimizations that are used to speed up matching by running |
PCRE contains some optimizations that are used to speed up matching by running |
| some checks at the start of each match attempt. For example, it may know the |
some checks at the start of each match attempt. For example, it may know the |
| minimum length of matching subject, or that a particular character must be |
minimum length of matching subject, or that a particular character must be |
|
Line 2606 present. When one of these optimizations suppresses th
|
Line 2625 present. When one of these optimizations suppresses th
|
| included backtracking verbs will not, of course, be processed. You can suppress |
included backtracking verbs will not, of course, be processed. You can suppress |
| the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
| when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the |
when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the |
| pattern with (*NO_START_OPT). | pattern with (*NO_START_OPT). There is more discussion of this option in the |
| | section entitled |
| | .\" HTML <a href="pcreapi.html#execoptions"> |
| | .\" </a> |
| | "Option bits for \fBpcre_exec()\fP" |
| | .\" |
| | in the |
| | .\" HREF |
| | \fBpcreapi\fP |
| | .\" |
| | documentation. |
| .P |
.P |
| Experiments with Perl suggest that it too has similar optimizations, sometimes |
Experiments with Perl suggest that it too has similar optimizations, sometimes |
| leading to anomalous results. |
leading to anomalous results. |
|
Line 2695 After a partial match or a failed match, the name of t
|
Line 2724 After a partial match or a failed match, the name of t
|
| No match, mark = B |
No match, mark = B |
| .sp |
.sp |
| Note that in this unanchored example the mark is retained from the match |
Note that in this unanchored example the mark is retained from the match |
| attempt that started at the letter "X". Subsequent match attempts starting at | attempt that started at the letter "X" in the subject. Subsequent match |
| "P" and then with an empty string do not get as far as the (*MARK) item, but | attempts starting at "P" and then with an empty string do not get as far as the |
| nevertheless do not reset it. | (*MARK) item, but nevertheless do not reset it. |
| | .P |
| | If you are interested in (*MARK) values after failed matches, you should |
| | probably set the PCRE_NO_START_OPTIMIZE option |
| | .\" HTML <a href="#nooptimize"> |
| | .\" </a> |
| | (see above) |
| | .\" |
| | to ensure that the match is always attempted. |
| . |
. |
| . |
. |
| .SS "Verbs that act after backtracking" |
.SS "Verbs that act after backtracking" |
|
Line 2876 Cambridge CB2 3QH, England.
|
Line 2913 Cambridge CB2 3QH, England.
|
| .rs |
.rs |
| .sp |
.sp |
| .nf |
.nf |
| Last updated: 09 January 2012 | Last updated: 17 June 2012 |
| Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
| .fi |
.fi |