| version 1.1.1.2, 2012/02/21 23:50:25 | version 1.1.1.3, 2012/10/09 09:19:18 | 
| Line 227  backslash. All other characters (in particular, those | Line 227  backslash. All other characters (in particular, those | 
 | greater than 127) are treated as literals. | greater than 127) are treated as literals. | 
 | </P> | </P> | 
 | <P> | <P> | 
| If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the | If a pattern is compiled with the PCRE_EXTENDED option, white space in the | 
 | pattern (other than in a character class) and characters between a # outside | pattern (other than in a character class) and characters between a # outside | 
 | a character class and the next newline are ignored. An escaping backslash can | a character class and the next newline are ignored. An escaping backslash can | 
| be used to include a whitespace or # character as part of the pattern. | be used to include a white space or # character as part of the pattern. | 
 | </P> | </P> | 
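The extended-mode rules in the paragraph above can be tried out with a short sketch. It assumes Python's `re.VERBOSE` flag as a stand-in for PCRE_EXTENDED; the two are similar in this respect but are not the same implementation, and the pattern and subject strings are made up for illustration.

```python
import re

# Assumption: re.VERBOSE plays the role of PCRE_EXTENDED here.
pattern = re.compile(r"""
    \d+        # one or more digits
    \ \#\      # escaped space, literal hash, escaped space
    [ #]       # inside a class, space and hash are ordinary characters
    \w+        # trailing word characters
""", re.VERBOSE)

# Unescaped white space and the comments above are ignored by the compiler.
accepted = pattern.fullmatch("42 # #abc")
rejected = pattern.fullmatch("42 # abc")
```

In this sketch `accepted` should be a match object and `rejected` should be `None`, because the character class still requires a literal space or `#` before the word characters.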
 | <P> | <P> | 
 | If you want to remove the special meaning from a sequence of characters, you | If you want to remove the special meaning from a sequence of characters, you | 
| Line 264  one of the following escape sequences than the binary | Line 264  one of the following escape sequences than the binary | 
 | \a        alarm, that is, the BEL character (hex 07) | \a        alarm, that is, the BEL character (hex 07) | 
 | \cx       "control-x", where x is any ASCII character | \cx       "control-x", where x is any ASCII character | 
 | \e        escape (hex 1B) | \e        escape (hex 1B) | 
| \f        formfeed (hex 0C) | \f        form feed (hex 0C) | 
 | \n        linefeed (hex 0A) | \n        linefeed (hex 0A) | 
 | \r        carriage return (hex 0D) | \r        carriage return (hex 0D) | 
 | \t        tab (hex 09) | \t        tab (hex 09) | 
| Line 307  as just described only when it is followed by two hexa | Line 307  as just described only when it is followed by two hexa | 
 | Otherwise, it matches a literal "x" character. In JavaScript mode, support for | Otherwise, it matches a literal "x" character. In JavaScript mode, support for | 
 | code points greater than 256 is provided by \u, which must be followed by | code points greater than 256 is provided by \u, which must be followed by | 
 | four hexadecimal digits; otherwise it matches a literal "u" character. | four hexadecimal digits; otherwise it matches a literal "u" character. | 
 |  | Character codes specified by \u in JavaScript mode are constrained in the same | 
 |  | way as those specified by \x in non-JavaScript mode. | 
 | </P> | </P> | 
 | <P> | <P> | 
 | Characters whose value is less than 256 can be defined by either of the two | Characters whose value is less than 256 can be defined by either of the two | 
| Line 406  Another use of backslash is for specifying generic cha | Line 408  Another use of backslash is for specifying generic cha | 
 | <pre> | <pre> | 
 | \d     any decimal digit | \d     any decimal digit | 
 | \D     any character that is not a decimal digit | \D     any character that is not a decimal digit | 
| \h     any horizontal whitespace character | \h     any horizontal white space character | 
| \H     any character that is not a horizontal whitespace character | \H     any character that is not a horizontal white space character | 
| \s     any whitespace character | \s     any white space character | 
| \S     any character that is not a whitespace character | \S     any character that is not a white space character | 
| \v     any vertical whitespace character | \v     any vertical white space character | 
| \V     any character that is not a vertical whitespace character | \V     any character that is not a vertical white space character | 
 | \w     any "word" character | \w     any "word" character | 
 | \W     any "non-word" character | \W     any "non-word" character | 
 | </pre> | </pre> | 
| Line 497  The vertical space characters are: | Line 499  The vertical space characters are: | 
 | <pre> | <pre> | 
 | U+000A     Linefeed | U+000A     Linefeed | 
 | U+000B     Vertical tab | U+000B     Vertical tab | 
| U+000C     Formfeed | U+000C     Form feed | 
 | U+000D     Carriage return | U+000D     Carriage return | 
 | U+0085     Next line | U+0085     Next line | 
 | U+2028     Line separator | U+2028     Line separator | 
| Line 520  This is an example of an "atomic group", details of wh | Line 522  This is an example of an "atomic group", details of wh | 
 | <a href="#atomicgroup">below.</a> | <a href="#atomicgroup">below.</a> | 
 | This particular group matches either the two-character sequence CR followed by | This particular group matches either the two-character sequence CR followed by | 
 | LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, | LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, | 
| U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next | U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next | 
 | line, U+0085). The two-character sequence is treated as a single unit that | line, U+0085). The two-character sequence is treated as a single unit that | 
 | cannot be split. | cannot be split. | 
 | </P> | </P> | 
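The effect of treating CR followed by LF as a unit that cannot be split can be illustrated with a rough sketch. It is written against Python's `re` module, not PCRE, and it emulates the atomic group with a lookahead plus a back reference; that emulation is an assumption made for portability, not a description of how PCRE implements the group.

```python
import re

# The alternatives named in the text: CRLF first, then the single characters
# LF, VT, FF, CR, and NEL.
NEWLINE = r"\r\n|[\n\x0b\f\r\x85]"

# Plain (non-atomic) group: the matcher may back off from CRLF to a lone CR.
plain = re.compile(r"(?:" + NEWLINE + r")\n")

# Emulated atomic group (assumption): the lookahead commits to its first
# match, and the back reference then consumes exactly that text.
atomic = re.compile(r"(?=(" + NEWLINE + r"))\1\n")

subject = "\r\n"
plain_result = plain.fullmatch(subject)
atomic_result = atomic.fullmatch(subject)
longer_result = atomic.fullmatch("\r\n\n")
```

With the plain group, the subject CR LF can be matched as a lone CR followed by LF, so `plain_result` should be a match. The emulated atomic version keeps CR LF together, so `atomic_result` should be `None`, while the three-character subject still matches.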
| Line 596  Armenian, | Line 598  Armenian, | 
 | Avestan, | Avestan, | 
 | Balinese, | Balinese, | 
 | Bamum, | Bamum, | 
 |  | Batak, | 
 | Bengali, | Bengali, | 
 | Bopomofo, | Bopomofo, | 
 |  | Brahmi, | 
 | Braille, | Braille, | 
 | Buginese, | Buginese, | 
 | Buhid, | Buhid, | 
 | Canadian_Aboriginal, | Canadian_Aboriginal, | 
 | Carian, | Carian, | 
 |  | Chakma, | 
 | Cham, | Cham, | 
 | Cherokee, | Cherokee, | 
 | Common, | Common, | 
| Line 645  Lisu, | Line 650  Lisu, | 
 | Lycian, | Lycian, | 
 | Lydian, | Lydian, | 
 | Malayalam, | Malayalam, | 
 |  | Mandaic, | 
 | Meetei_Mayek, | Meetei_Mayek, | 
 |  | Meroitic_Cursive, | 
 |  | Meroitic_Hieroglyphs, | 
 |  | Miao, | 
 | Mongolian, | Mongolian, | 
 | Myanmar, | Myanmar, | 
 | New_Tai_Lue, | New_Tai_Lue, | 
| Line 664  Rejang, | Line 673  Rejang, | 
 | Runic, | Runic, | 
 | Samaritan, | Samaritan, | 
 | Saurashtra, | Saurashtra, | 
 |  | Sharada, | 
 | Shavian, | Shavian, | 
 | Sinhala, | Sinhala, | 
 |  | Sora_Sompeng, | 
 | Sundanese, | Sundanese, | 
 | Syloti_Nagri, | Syloti_Nagri, | 
 | Syriac, | Syriac, | 
| Line 674  Tagbanwa, | Line 685  Tagbanwa, | 
 | Tai_Le, | Tai_Le, | 
 | Tai_Tham, | Tai_Tham, | 
 | Tai_Viet, | Tai_Viet, | 
 |  | Takri, | 
 | Tamil, | Tamil, | 
 | Telugu, | Telugu, | 
 | Thaana, | Thaana, | 
| Line 812  PCRE_UCP is set. They are: | Line 824  PCRE_UCP is set. They are: | 
 | Xwd   Any Perl "word" character | Xwd   Any Perl "word" character | 
 | </pre> | </pre> | 
 | Xan matches characters that have either the L (letter) or the N (number) | Xan matches characters that have either the L (letter) or the N (number) | 
| property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or | property. Xps matches the characters tab, linefeed, vertical tab, form feed, or | 
 | carriage return, and any other character that has the Z (separator) property. | carriage return, and any other character that has the Z (separator) property. | 
 | Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the | Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the | 
 | same characters as Xan, plus underscore. | same characters as Xan, plus underscore. | 
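The definitions of Xan, Xps, Xsp, and Xwd above can be restated as predicates over Unicode general categories. The following is a minimal sketch using Python's `unicodedata` module; the function names are hypothetical, each function expects a single character, and the results depend on the Unicode tables shipped with the Python build rather than on PCRE's own tables.

```python
import unicodedata

def is_xan(ch):
    # L (letter) or N (number) property.
    return unicodedata.category(ch)[0] in ("L", "N")

def is_xps(ch):
    # Tab, linefeed, vertical tab, form feed, carriage return,
    # or any character with the Z (separator) property.
    return ch in "\t\n\x0b\x0c\r" or unicodedata.category(ch)[0] == "Z"

def is_xsp(ch):
    # Same as Xps, except that vertical tab is excluded.
    return ch != "\x0b" and is_xps(ch)

def is_xwd(ch):
    # Same characters as Xan, plus underscore.
    return ch == "_" or is_xan(ch)
```

For example, `is_xps("\x0b")` holds while `is_xsp("\x0b")` does not, which is the only difference between the two properties described above.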
| Line 1008  used. Because \C breaks up characters into individual | Line 1020  used. Because \C breaks up characters into individual | 
 | unit with \C in a UTF mode means that the rest of the string may start with a | unit with \C in a UTF mode means that the rest of the string may start with a | 
 | malformed UTF character. This has undefined results, because PCRE assumes that | malformed UTF character. This has undefined results, because PCRE assumes that | 
 | it is dealing with valid UTF strings (and by default it checks this at the | it is dealing with valid UTF strings (and by default it checks this at the | 
| start of processing unless the PCRE_NO_UTF8_CHECK option is used). | start of processing unless the PCRE_NO_UTF8_CHECK or PCRE_NO_UTF16_CHECK option | 
|  | is used). | 
 | </P> | </P> | 
 | <P> | <P> | 
 | PCRE does not allow \C to appear in lookbehind assertions | PCRE does not allow \C to appear in lookbehind assertions | 
| Line 1818  Because there may be many capturing parentheses in a p | Line 1831  Because there may be many capturing parentheses in a p | 
 | following a backslash are taken as part of a potential back reference number. | following a backslash are taken as part of a potential back reference number. | 
 | If the pattern continues with a digit character, some delimiter must be used to | If the pattern continues with a digit character, some delimiter must be used to | 
 | terminate the back reference. If the PCRE_EXTENDED option is set, this can be | terminate the back reference. If the PCRE_EXTENDED option is set, this can be | 
| whitespace. Otherwise, the \g{ syntax or an empty comment (see | white space. Otherwise, the \g{ syntax or an empty comment (see | 
 | <a href="#comments">"Comments"</a> | <a href="#comments">"Comments"</a> | 
 | below) can be used. | below) can be used. | 
 | </P> | </P> | 
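Two of the delimiters mentioned above, white space in extended mode and an empty comment, can be sketched as follows. The example assumes Python's `re` module as a stand-in: it accepts `(?#)` comments and `re.VERBOSE`, but it does not accept the \g{ syntax inside patterns, so that form is not shown.

```python
import re

# Goal: back reference 1 followed by the literal digit "1".

# Empty comment as the delimiter between \1 and the digit.
with_comment = re.compile(r"(a)\1(?#)1")

# Extended mode (assumption: re.VERBOSE stands in for PCRE_EXTENDED):
# the ignored white space ends the back reference number.
with_space = re.compile(r"(a)\1 1", re.VERBOSE)

comment_result = with_comment.fullmatch("aa1")
space_result = with_space.fullmatch("aa1")
```

Both patterns should match the subject "aa1". Without a delimiter, the digits after the backslash would be read together as a single reference number.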
| Line 2160  point in the pattern; the idea of DEFINE is that it ca | Line 2173  point in the pattern; the idea of DEFINE is that it ca | 
 | subroutines that can be referenced from elsewhere. (The use of | subroutines that can be referenced from elsewhere. (The use of | 
 | <a href="#subpatternsassubroutines">subroutines</a> | <a href="#subpatternsassubroutines">subroutines</a> | 
 | is described below.) For example, a pattern to match an IPv4 address such as | is described below.) For example, a pattern to match an IPv4 address such as | 
| "192.168.23.245" could be written like this (ignore whitespace and line | "192.168.23.245" could be written like this (ignore white space and line | 
 | breaks): | breaks): | 
 | <pre> | <pre> | 
 | (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) | (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) | 
| Line 2554  exception: the name from a (*MARK), (*PRUNE), or (*THE | Line 2567  exception: the name from a (*MARK), (*PRUNE), or (*THE | 
 | a successful positive assertion <i>is</i> passed back when a match succeeds | a successful positive assertion <i>is</i> passed back when a match succeeds | 
 | (compare capturing parentheses in assertions). Note that such subpatterns are | (compare capturing parentheses in assertions). Note that such subpatterns are | 
 | processed as anchored at the point where they are tested. Note also that Perl's | processed as anchored at the point where they are tested. Note also that Perl's | 
| treatment of subroutines is different in some cases. | treatment of subroutines and assertions is different in some cases. | 
 | </P> | </P> | 
 | <P> | <P> | 
 | The new verbs make use of what was previously invalid syntax: an opening | The new verbs make use of what was previously invalid syntax: an opening | 
 | parenthesis followed by an asterisk. They are generally of the form | parenthesis followed by an asterisk. They are generally of the form | 
 | (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, | (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, | 
 | depending on whether or not an argument is present. A name is any sequence of | depending on whether or not an argument is present. A name is any sequence of | 
| characters that does not include a closing parenthesis. If the name is empty, | characters that does not include a closing parenthesis. The maximum length of | 
| that is, if the closing parenthesis immediately follows the colon, the effect | a name is 255 in the 8-bit library and 65535 in the 16-bit library. If the name | 
| is as if the colon were not there. Any number of these verbs may occur in a | is empty, that is, if the closing parenthesis immediately follows the colon, | 
| pattern. | the effect is as if the colon were not there. Any number of these verbs may | 
| </P> | occur in a pattern. | 
|  | <a name="nooptimize"></a></P> | 
|  | <br><b> | 
|  | Optimizations that affect backtracking verbs | 
|  | </b><br> | 
 | <P> | <P> | 
 | PCRE contains some optimizations that are used to speed up matching by running | PCRE contains some optimizations that are used to speed up matching by running | 
 | some checks at the start of each match attempt. For example, it may know the | some checks at the start of each match attempt. For example, it may know the | 
| Line 2574  present. When one of these optimizations suppresses th | Line 2591  present. When one of these optimizations suppresses th | 
 | included backtracking verbs will not, of course, be processed. You can suppress | included backtracking verbs will not, of course, be processed. You can suppress | 
 | the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option | the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option | 
 | when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the | when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the | 
| pattern with (*NO_START_OPT). | pattern with (*NO_START_OPT). There is more discussion of this option in the | 
|  | section entitled | 
|  | <a href="pcreapi.html#execoptions">"Option bits for <b>pcre_exec()</b>"</a> | 
|  | in the | 
|  | <a href="pcreapi.html"><b>pcreapi</b></a> | 
|  | documentation. | 
 | </P> | </P> | 
 | <P> | <P> | 
 | Experiments with Perl suggest that it too has similar optimizations, sometimes | Experiments with Perl suggest that it too has similar optimizations, sometimes | 
| Line 2662  After a partial match or a failed match, the name of t | Line 2684  After a partial match or a failed match, the name of t | 
 | No match, mark = B | No match, mark = B | 
 | </pre> | </pre> | 
 | Note that in this unanchored example the mark is retained from the match | Note that in this unanchored example the mark is retained from the match | 
| attempt that started at the letter "X". Subsequent match attempts starting at | attempt that started at the letter "X" in the subject. Subsequent match | 
| "P" and then with an empty string do not get as far as the (*MARK) item, but | attempts starting at "P" and then with an empty string do not get as far as the | 
| nevertheless do not reset it. | (*MARK) item, but nevertheless do not reset it. | 
 | </P> | </P> | 
 |  | <P> | 
 |  | If you are interested in (*MARK) values after failed matches, you should | 
 |  | probably set the PCRE_NO_START_OPTIMIZE option | 
 |  | <a href="#nooptimize">(see above)</a> | 
 |  | to ensure that the match is always attempted. | 
 |  | </P> | 
 | <br><b> | <br><b> | 
 | Verbs that act after backtracking | Verbs that act after backtracking | 
 | </b><br> | </b><br> | 
| Line 2843  Cambridge CB2 3QH, England. | Line 2871  Cambridge CB2 3QH, England. | 
 | </P> | </P> | 
 | <br><a name="SEC28" href="#TOC1">REVISION</a><br> | <br><a name="SEC28" href="#TOC1">REVISION</a><br> | 
 | <P> | <P> | 
| Last updated: 09 January 2012 | Last updated: 17 June 2012 | 
 | <br> | <br> | 
 | Copyright © 1997-2012 University of Cambridge. | Copyright © 1997-2012 University of Cambridge. | 
 | <br> | <br> |