--- embedaddon/pcre/doc/pcrepattern.3 2012/02/21 23:50:25 1.1.1.2
+++ embedaddon/pcre/doc/pcrepattern.3 2012/10/09 09:19:17 1.1.1.3
@@ -1,4 +1,4 @@
-.TH PCREPATTERN 3
+.TH PCREPATTERN 3 "04 May 2012" "PCRE 8.31"
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "PCRE REGULAR EXPRESSION DETAILS"
@@ -198,10 +198,10 @@ In a UTF mode, only ASCII numbers and letters have any
 backslash. All other characters (in particular, those whose codepoints are
 greater than 127) are treated as literals.
 .P
-If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
+If a pattern is compiled with the PCRE_EXTENDED option, white space in the
 pattern (other than in a character class) and characters between a # outside
 a character class and the next newline are ignored. An escaping backslash can
-be used to include a whitespace or # character as part of the pattern.
+be used to include a white space or # character as part of the pattern.
 .P
 If you want to remove the special meaning from a sequence of characters, you
 can do so by putting them between \eQ and \eE. This is different from Perl in
@@ -237,7 +237,7 @@ one of the following escape sequences than the binary
   \ea        alarm, that is, the BEL character (hex 07)
   \ecx       "control-x", where x is any ASCII character
   \ee        escape (hex 1B)
-  \ef        formfeed (hex 0C)
+  \ef        form feed (hex 0C)
   \en        linefeed (hex 0A)
   \er        carriage return (hex 0D)
   \et        tab (hex 09)
@@ -277,6 +277,8 @@ as just described only when it is followed by two hexa
 Otherwise, it matches a literal "x" character. In JavaScript mode, support for
 code points greater than 256 is provided by \eu, which must be followed by
 four hexadecimal digits; otherwise it matches a literal "u" character.
+Character codes specified by \eu in JavaScript mode are constrained in the same
+way as those specified by \ex in non-JavaScript mode.
 .P
 Characters whose value is less than 256 can be defined by either of the two
 syntaxes for \ex (or by \eu in JavaScript mode).
There is no difference in the
@@ -399,12 +401,12 @@ Another use of backslash is for specifying generic cha
 .sp
   \ed     any decimal digit
   \eD     any character that is not a decimal digit
-  \eh     any horizontal whitespace character
-  \eH     any character that is not a horizontal whitespace character
-  \es     any whitespace character
-  \eS     any character that is not a whitespace character
-  \ev     any vertical whitespace character
-  \eV     any character that is not a vertical whitespace character
+  \eh     any horizontal white space character
+  \eH     any character that is not a horizontal white space character
+  \es     any white space character
+  \eS     any character that is not a white space character
+  \ev     any vertical white space character
+  \eV     any character that is not a vertical white space character
   \ew     any "word" character
   \eW     any "non-word" character
 .sp
@@ -493,7 +495,7 @@ The vertical space characters are:
 .sp
   U+000A     Linefeed
   U+000B     Vertical tab
-  U+000C     Formfeed
+  U+000C     Form feed
   U+000D     Carriage return
   U+0085     Next line
   U+2028     Line separator
@@ -520,7 +522,7 @@ below.
 .\"
 This particular group matches either the two-character sequence CR followed by
 LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
-U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next
+U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
 line, U+0085). The two-character sequence is treated as a single unit that
 cannot be split.
 .P
@@ -596,13 +598,16 @@ Armenian,
 Avestan,
 Balinese,
 Bamum,
+Batak,
 Bengali,
 Bopomofo,
+Brahmi,
 Braille,
 Buginese,
 Buhid,
 Canadian_Aboriginal,
 Carian,
+Chakma,
 Cham,
 Cherokee,
 Common,
@@ -645,7 +650,11 @@ Lisu,
 Lycian,
 Lydian,
 Malayalam,
+Mandaic,
 Meetei_Mayek,
+Meroitic_Cursive,
+Meroitic_Hieroglyphs,
+Miao,
 Mongolian,
 Myanmar,
 New_Tai_Lue,
@@ -664,8 +673,10 @@ Rejang,
 Runic,
 Samaritan,
 Saurashtra,
+Sharada,
 Shavian,
 Sinhala,
+Sora_Sompeng,
 Sundanese,
 Syloti_Nagri,
 Syriac,
@@ -674,6 +685,7 @@ Tagbanwa,
 Tai_Le,
 Tai_Tham,
 Tai_Viet,
+Takri,
 Tamil,
 Telugu,
 Thaana,
@@ -809,7 +821,7 @@ PCRE_UCP is set. They are:
   Xwd   Any Perl "word" character
 .sp
 Xan matches characters that have either the L (letter) or the N (number)
-property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or
+property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
 carriage return, and any other character that has the Z (separator) property.
 Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
 same characters as Xan, plus underscore.
@@ -1010,7 +1022,8 @@ used. Because \eC breaks up characters into individual
 unit with \eC in a UTF mode means that the rest of the string may start with a
 malformed UTF character. This has undefined results, because PCRE assumes that
 it is dealing with valid UTF strings (and by default it checks this at the
-start of processing unless the PCRE_NO_UTF8_CHECK option is used).
+start of processing unless the PCRE_NO_UTF8_CHECK or PCRE_NO_UTF16_CHECK option
+is used).
 .P
 PCRE does not allow \eC to appear in lookbehind assertions
 .\" HTML
@@ -1832,7 +1845,7 @@ Because there may be many capturing parentheses in a p
 following a backslash are taken as part of a potential back reference number.
 If the pattern continues with a digit character, some delimiter must be used to
 terminate the back reference. If the PCRE_EXTENDED option is set, this can be
-whitespace.
Otherwise, the \eg{ syntax or an empty comment (see
+white space. Otherwise, the \eg{ syntax or an empty comment (see
 .\" HTML
 .\"
 "Comments"
 .\"
@@ -2189,7 +2202,7 @@ subroutines that can be referenced from elsewhere. (Th
 subroutines
 .\"
 is described below.) For example, a pattern to match an IPv4 address such as
-"192.168.23.245" could be written like this (ignore whitespace and line
+"192.168.23.245" could be written like this (ignore white space and line
 breaks):
 .sp
   (?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) )
@@ -2588,17 +2601,23 @@ exception: the name from a *(MARK), (*PRUNE), or (*THE
 a successful positive assertion \fIis\fP passed back when a match succeeds
 (compare capturing parentheses in assertions). Note that such subpatterns are
 processed as anchored at the point where they are tested. Note also that Perl's
-treatment of subroutines is different in some cases.
+treatment of subroutines and assertions is different in some cases.
 .P
 The new verbs make use of what was previously invalid syntax: an opening
 parenthesis followed by an asterisk. They are generally of the form
 (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
 depending on whether or not an argument is present. A name is any sequence of
-characters that does not include a closing parenthesis. If the name is empty,
-that is, if the closing parenthesis immediately follows the colon, the effect
-is as if the colon were not there. Any number of these verbs may occur in a
-pattern.
-.P
+characters that does not include a closing parenthesis. The maximum length of
+name is 255 in the 8-bit library and 65535 in the 16-bit library. If the name
+is empty, that is, if the closing parenthesis immediately follows the colon,
+the effect is as if the colon were not there. Any number of these verbs may
+occur in a pattern.
+.
+.
+.\" HTML
+.SS "Optimizations that affect backtracking verbs"
+.rs
+.sp
 PCRE contains some optimizations that are used to speed up matching by running
 some checks at the start of each match attempt. For example, it may know the
 minimum length of matching subject, or that a particular character must be
@@ -2606,7 +2625,17 @@ present. When one of these optimizations suppresses th
 included backtracking verbs will not, of course, be processed. You can suppress
 the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option
 when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the
-pattern with (*NO_START_OPT).
+pattern with (*NO_START_OPT). There is more discussion of this option in the
+section entitled
+.\" HTML
+.\"
+"Option bits for \fBpcre_exec()\fP"
+.\"
+in the
+.\" HREF
+\fBpcreapi\fP
+.\"
+documentation.
 .P
 Experiments with Perl suggest that it too has similar optimizations, sometimes
 leading to anomalous results.
@@ -2695,9 +2724,17 @@ After a partial match or a failed match, the name of t
   No match, mark = B
 .sp
 Note that in this unanchored example the mark is retained from the match
-attempt that started at the letter "X". Subsequent match attempts starting at
-"P" and then with an empty string do not get as far as the (*MARK) item, but
-nevertheless do not reset it.
+attempt that started at the letter "X" in the subject. Subsequent match
+attempts starting at "P" and then with an empty string do not get as far as the
+(*MARK) item, but nevertheless do not reset it.
+.P
+If you are interested in (*MARK) values after failed matches, you should
+probably set the PCRE_NO_START_OPTIMIZE option
+.\" HTML
+.\"
+(see above)
+.\"
+to ensure that the match is always attempted.
 .
 .
 .SS "Verbs that act after backtracking"
@@ -2876,6 +2913,6 @@ Cambridge CB2 3QH, England.
 .rs
 .sp
 .nf
-Last updated: 09 January 2012
+Last updated: 17 June 2012
 Copyright (c) 1997-2012 University of Cambridge.
 .fi