| 
version 1.1, 2012/02/21 23:05:52
 | 
version 1.1.1.5, 2014/06/15 19:46:05
 | 
| 
 Line 14  man page, in case the conversion went wrong.
 | 
 Line 14  man page, in case the conversion went wrong.
 | 
 |  <br> | 
  <br> | 
 |  <ul> | 
  <ul> | 
 |  <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a> | 
  <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a> | 
| <li><a name="TOC2" href="#SEC2">NEWLINE CONVENTIONS</a> | <li><a name="TOC2" href="#SEC2">SPECIAL START-OF-PATTERN ITEMS</a> | 
| <li><a name="TOC3" href="#SEC3">CHARACTERS AND METACHARACTERS</a> | <li><a name="TOC3" href="#SEC3">EBCDIC CHARACTER CODES</a> | 
| <li><a name="TOC4" href="#SEC4">BACKSLASH</a> | <li><a name="TOC4" href="#SEC4">CHARACTERS AND METACHARACTERS</a> | 
| <li><a name="TOC5" href="#SEC5">CIRCUMFLEX AND DOLLAR</a> | <li><a name="TOC5" href="#SEC5">BACKSLASH</a> | 
| <li><a name="TOC6" href="#SEC6">FULL STOP (PERIOD, DOT) AND \N</a> | <li><a name="TOC6" href="#SEC6">CIRCUMFLEX AND DOLLAR</a> | 
| <li><a name="TOC7" href="#SEC7">MATCHING A SINGLE BYTE</a> | <li><a name="TOC7" href="#SEC7">FULL STOP (PERIOD, DOT) AND \N</a> | 
| <li><a name="TOC8" href="#SEC8">SQUARE BRACKETS AND CHARACTER CLASSES</a> | <li><a name="TOC8" href="#SEC8">MATCHING A SINGLE DATA UNIT</a> | 
| <li><a name="TOC9" href="#SEC9">POSIX CHARACTER CLASSES</a> | <li><a name="TOC9" href="#SEC9">SQUARE BRACKETS AND CHARACTER CLASSES</a> | 
| <li><a name="TOC10" href="#SEC10">VERTICAL BAR</a> | <li><a name="TOC10" href="#SEC10">POSIX CHARACTER CLASSES</a> | 
| <li><a name="TOC11" href="#SEC11">INTERNAL OPTION SETTING</a> | <li><a name="TOC11" href="#SEC11">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a> | 
| <li><a name="TOC12" href="#SEC12">SUBPATTERNS</a> | <li><a name="TOC12" href="#SEC12">VERTICAL BAR</a> | 
| <li><a name="TOC13" href="#SEC13">DUPLICATE SUBPATTERN NUMBERS</a> | <li><a name="TOC13" href="#SEC13">INTERNAL OPTION SETTING</a> | 
| <li><a name="TOC14" href="#SEC14">NAMED SUBPATTERNS</a> | <li><a name="TOC14" href="#SEC14">SUBPATTERNS</a> | 
| <li><a name="TOC15" href="#SEC15">REPETITION</a> | <li><a name="TOC15" href="#SEC15">DUPLICATE SUBPATTERN NUMBERS</a> | 
| <li><a name="TOC16" href="#SEC16">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a> | <li><a name="TOC16" href="#SEC16">NAMED SUBPATTERNS</a> | 
| <li><a name="TOC17" href="#SEC17">BACK REFERENCES</a> | <li><a name="TOC17" href="#SEC17">REPETITION</a> | 
| <li><a name="TOC18" href="#SEC18">ASSERTIONS</a> | <li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a> | 
| <li><a name="TOC19" href="#SEC19">CONDITIONAL SUBPATTERNS</a> | <li><a name="TOC19" href="#SEC19">BACK REFERENCES</a> | 
| <li><a name="TOC20" href="#SEC20">COMMENTS</a> | <li><a name="TOC20" href="#SEC20">ASSERTIONS</a> | 
| <li><a name="TOC21" href="#SEC21">RECURSIVE PATTERNS</a> | <li><a name="TOC21" href="#SEC21">CONDITIONAL SUBPATTERNS</a> | 
| <li><a name="TOC22" href="#SEC22">SUBPATTERNS AS SUBROUTINES</a> | <li><a name="TOC22" href="#SEC22">COMMENTS</a> | 
| <li><a name="TOC23" href="#SEC23">ONIGURUMA SUBROUTINE SYNTAX</a> | <li><a name="TOC23" href="#SEC23">RECURSIVE PATTERNS</a> | 
| <li><a name="TOC24" href="#SEC24">CALLOUTS</a> | <li><a name="TOC24" href="#SEC24">SUBPATTERNS AS SUBROUTINES</a> | 
| <li><a name="TOC25" href="#SEC25">BACKTRACKING CONTROL</a> | <li><a name="TOC25" href="#SEC25">ONIGURUMA SUBROUTINE SYNTAX</a> | 
| <li><a name="TOC26" href="#SEC26">SEE ALSO</a> | <li><a name="TOC26" href="#SEC26">CALLOUTS</a> | 
| <li><a name="TOC27" href="#SEC27">AUTHOR</a> | <li><a name="TOC27" href="#SEC27">BACKTRACKING CONTROL</a> | 
| <li><a name="TOC28" href="#SEC28">REVISION</a> | <li><a name="TOC28" href="#SEC28">SEE ALSO</a> | 
|   | <li><a name="TOC29" href="#SEC29">AUTHOR</a> | 
|   | <li><a name="TOC30" href="#SEC30">REVISION</a> | 
 |  </ul> | 
  </ul> | 
 |  <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br> | 
  <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br> | 
 |  <P> | 
  <P> | 
| 
 Line 60  published by O'Reilly, covers regular expressions in g
 | 
 Line 62  published by O'Reilly, covers regular expressions in g
 | 
 |  description of PCRE's regular expressions is intended as reference material. | 
  description of PCRE's regular expressions is intended as reference material. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |   | 
  This document discusses the patterns that are supported by PCRE when one of its | 
 |   | 
  main matching functions, <b>pcre_exec()</b> (8-bit) or <b>pcre[16|32]_exec()</b> | 
 |   | 
  (16- or 32-bit), is used. PCRE also has alternative matching functions, | 
 |   | 
  <b>pcre_dfa_exec()</b> and <b>pcre[16|32]_dfa_exec()</b>, which match using a | 
 |   | 
  different algorithm that is not Perl-compatible. Some of the features discussed | 
 |   | 
  below are not available when DFA matching is used. The advantages and | 
 |   | 
  disadvantages of the alternative functions, and how they differ from the normal | 
 |   | 
  functions, are discussed in the | 
 |   | 
  <a href="pcrematching.html"><b>pcrematching</b></a> | 
 |   | 
  page. | 
 |   | 
  </P> | 
 |   | 
  <br><a name="SEC2" href="#TOC1">SPECIAL START-OF-PATTERN ITEMS</a><br> | 
 |   | 
  <P> | 
 |   | 
  A number of options that can be passed to <b>pcre_compile()</b> can also be set | 
 |   | 
  by special items at the start of a pattern. These are not Perl-compatible, but | 
 |   | 
  are provided to make these options accessible to pattern writers who are not | 
 |   | 
  able to change the program that processes the pattern. Any number of these | 
 |   | 
  items may appear, but they must all be together right at the start of the | 
 |   | 
  pattern string, and the letters must be in upper case. | 
 |   | 
  </P> | 
 |   | 
  <br><b> | 
 |   | 
  UTF support | 
 |   | 
  </b><br> | 
 |   | 
  <P> | 
 |  The original operation of PCRE was on strings of one-byte characters. However, | 
  The original operation of PCRE was on strings of one-byte characters. However, | 
| there is now also support for UTF-8 character strings. To use this, | there is now also support for UTF-8 strings in the original library, an | 
| PCRE must be built to include UTF-8 support, and you must call | extra library that supports 16-bit and UTF-16 character strings, and a | 
| <b>pcre_compile()</b> or <b>pcre_compile2()</b> with the PCRE_UTF8 option. There | third library that supports 32-bit and UTF-32 character strings. To use these | 
| is also a special sequence that can be given at the start of a pattern: | features, PCRE must be built to include appropriate support. When using UTF | 
|   | strings you must either call the compiling function with the PCRE_UTF8, | 
|   | PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of | 
|   | these special sequences: | 
 |  <pre> | 
  <pre> | 
 |    (*UTF8) | 
    (*UTF8) | 
 |   | 
    (*UTF16) | 
 |   | 
    (*UTF32) | 
 |   | 
    (*UTF) | 
 |  </pre> | 
  </pre> | 
| Starting a pattern with this sequence is equivalent to setting the PCRE_UTF8 | (*UTF) is a generic sequence that can be used with any of the libraries. | 
| option. This feature is not Perl-compatible. How setting UTF-8 mode affects | Starting a pattern with such a sequence is equivalent to setting the relevant | 
| pattern matching is mentioned in several places below. There is also a summary | option. How setting a UTF mode affects pattern matching is mentioned in several | 
| of UTF-8 features in the | places below. There is also a summary of features in the | 
 |  <a href="pcreunicode.html"><b>pcreunicode</b></a> | 
  <a href="pcreunicode.html"><b>pcreunicode</b></a> | 
 |  page. | 
  page. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| Another special sequence that may appear at the start of a pattern or in | Some applications that allow their users to supply patterns may wish to | 
| combination with (*UTF8) is: | restrict them to non-UTF data for security reasons. If the PCRE_NEVER_UTF | 
| <pre> | option is set at compile time, (*UTF) etc. are not allowed, and their | 
|   (*UCP) | appearance causes an error. | 
| </pre> | </P> | 
|   | <br><b> | 
|   | Unicode property support | 
|   | </b><br> | 
|   | <P> | 
|   | Another special sequence that may appear at the start of a pattern is (*UCP). | 
 |  This has the same effect as setting the PCRE_UCP option: it causes sequences | 
  This has the same effect as setting the PCRE_UCP option: it causes sequences | 
 |  such as \d and \w to use Unicode properties to determine character types, | 
  such as \d and \w to use Unicode properties to determine character types, | 
 |  instead of recognizing only characters with codes less than 128 via a lookup | 
  instead of recognizing only characters with codes less than 128 via a lookup | 
 |  table. | 
  table. | 
 |  </P> | 
  </P> | 
 |   | 
  <br><b> | 
 |   | 
  Disabling auto-possessification | 
 |   | 
  </b><br> | 
 |  <P> | 
  <P> | 
| If a pattern starts with (*NO_START_OPT), it has the same effect as setting the | If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting | 
| PCRE_NO_START_OPTIMIZE option either at compile or matching time. There are | the PCRE_NO_AUTO_POSSESS option at compile time. This stops PCRE from making | 
| also some more of these special sequences that are concerned with the handling | quantifiers possessive when what follows cannot match the repeated item. For | 
| of newlines; they are described below. | example, by default a+b is treated as a++b. For more details, see the | 
|   | <a href="pcreapi.html"><b>pcreapi</b></a> | 
|   | documentation. | 
 |  </P> | 
  </P> | 
 |   | 
  <br><b> | 
 |   | 
  Disabling start-up optimizations | 
 |   | 
  </b><br> | 
 |  <P> | 
  <P> | 
| The remainder of this document discusses the patterns that are supported by | If a pattern starts with (*NO_START_OPT), it has the same effect as setting the | 
| PCRE when its main matching function, <b>pcre_exec()</b>, is used. | PCRE_NO_START_OPTIMIZE option either at compile or matching time. This disables | 
| From release 6.0, PCRE offers a second matching function, | several optimizations for quickly reaching "no match" results. For more | 
| <b>pcre_dfa_exec()</b>, which matches using a different algorithm that is not | details, see the | 
| Perl-compatible. Some of the features discussed below are not available when | <a href="pcreapi.html"><b>pcreapi</b></a> | 
| <b>pcre_dfa_exec()</b> is used. The advantages and disadvantages of the | documentation. | 
| alternative function, and how it differs from the normal function, are |   | 
| discussed in the |   | 
| <a href="pcrematching.html"><b>pcrematching</b></a> |   | 
| page. |   | 
 |  <a name="newlines"></a></P> | 
  <a name="newlines"></a></P> | 
| <br><a name="SEC2" href="#TOC1">NEWLINE CONVENTIONS</a><br> | <br><b> | 
|   | Newline conventions | 
|   | </b><br> | 
 |  <P> | 
  <P> | 
 |  PCRE supports five different conventions for indicating line breaks in | 
  PCRE supports five different conventions for indicating line breaks in | 
 |  strings: a single CR (carriage return) character, a single LF (linefeed) | 
  strings: a single CR (carriage return) character, a single LF (linefeed) | 
| 
 Line 126  string with one of the following five sequences:
 | 
 Line 169  string with one of the following five sequences:
 | 
 |    (*ANYCRLF)   any of the three above | 
    (*ANYCRLF)   any of the three above | 
 |    (*ANY)       all Unicode newline sequences | 
    (*ANY)       all Unicode newline sequences | 
 |  </pre> | 
  </pre> | 
| These override the default and the options given to <b>pcre_compile()</b> or | These override the default and the options given to the compiling function. For | 
| <b>pcre_compile2()</b>. For example, on a Unix system where LF is the default | example, on a Unix system where LF is the default newline sequence, the pattern | 
| newline sequence, the pattern |   | 
 |  <pre> | 
  <pre> | 
 |    (*CR)a.b | 
    (*CR)a.b | 
 |  </pre> | 
  </pre> | 
 |  changes the convention to CR. That pattern matches "a\nb" because LF is no | 
  changes the convention to CR. That pattern matches "a\nb" because LF is no | 
| longer a newline. Note that these special settings, which are not | longer a newline. If more than one of these settings is present, the last one | 
| Perl-compatible, are recognized only at the very start of a pattern, and that |   | 
| they must be in upper case. If more than one of them is present, the last one |   | 
 |  is used. | 
  is used. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| The newline convention affects the interpretation of the dot metacharacter when | The newline convention affects where the circumflex and dollar assertions are | 
| PCRE_DOTALL is not set, and also the behaviour of \N. However, it does not | true. It also affects the interpretation of the dot metacharacter when | 
| affect what the \R escape sequence matches. By default, this is any Unicode | PCRE_DOTALL is not set, and the behaviour of \N. However, it does not affect | 
| newline sequence, for Perl compatibility. However, this can be changed; see the | what the \R escape sequence matches. By default, this is any Unicode newline | 
|   | sequence, for Perl compatibility. However, this can be changed; see the | 
 |  description of \R in the section entitled | 
  description of \R in the section entitled | 
 |  <a href="#newlineseq">"Newline sequences"</a> | 
  <a href="#newlineseq">"Newline sequences"</a> | 
 |  below. A change of \R setting can be combined with a change of newline | 
  below. A change of \R setting can be combined with a change of newline | 
 |  convention. | 
  convention. | 
 |  </P> | 
  </P> | 
| <br><a name="SEC3" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br> | <br><b> | 
|   | Setting match and recursion limits | 
|   | </b><br> | 
 |  <P> | 
  <P> | 
 |   | 
  The caller of <b>pcre_exec()</b> can set a limit on the number of times the | 
 |   | 
  internal <b>match()</b> function is called and on the maximum depth of | 
 |   | 
  recursive calls. These facilities are provided to catch runaway matches that | 
 |   | 
  are provoked by patterns with huge matching trees (a typical example is a | 
 |   | 
  pattern with nested unlimited repeats) and to avoid running out of system stack | 
 |   | 
  by too much recursion. When one of these limits is reached, <b>pcre_exec()</b> | 
 |   | 
  gives an error return. The limits can also be set by items at the start of the | 
 |   | 
  pattern of the form | 
 |   | 
  <pre> | 
 |   | 
    (*LIMIT_MATCH=d) | 
 |   | 
    (*LIMIT_RECURSION=d) | 
 |   | 
  </pre> | 
 |   | 
  where d is any number of decimal digits. However, the value of the setting must | 
 |   | 
  be less than the value set (or defaulted) by the caller of <b>pcre_exec()</b> | 
 |   | 
  for it to have any effect. In other words, the pattern writer can lower the | 
 |   | 
  limits set by the programmer, but not raise them. If there is more than one | 
 |   | 
  setting of one of these limits, the lower value is used. | 
 |   | 
  </P> | 
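The rule that a pattern can lower these limits but never raise them can be sketched as follows. This is an illustrative model only, not PCRE's implementation: the function name effective_limits and its parameters are hypothetical, and for simplicity it looks only at limit items that appear consecutively at the very start of the pattern, ignoring any other start-of-pattern items.

```python
import re

# Matches one (*LIMIT_MATCH=d) or (*LIMIT_RECURSION=d) item.
_LIMIT_ITEM = re.compile(r"\(\*LIMIT_(MATCH|RECURSION)=(\d+)\)")


def effective_limits(pattern, caller_match_limit, caller_recursion_limit):
    """Return (match_limit, recursion_limit) after applying leading limit items.

    Hypothetical helper: models only the "lower, never raise" rule described
    in the text. A pattern setting takes effect only if it is smaller than
    the current value, so repeated settings resolve to the lowest one.
    """
    limits = {"MATCH": caller_match_limit, "RECURSION": caller_recursion_limit}
    pos = 0
    while True:
        item = _LIMIT_ITEM.match(pattern, pos)
        if item is None:
            break
        kind, value = item.group(1), int(item.group(2))
        limits[kind] = min(limits[kind], value)
        pos = item.end()
    return limits["MATCH"], limits["RECURSION"]
```

For instance, with caller limits of 1000 for both settings, a pattern starting (*LIMIT_MATCH=500)(*LIMIT_MATCH=200) would end up with a match limit of 200 under this model, while (*LIMIT_RECURSION=5000) would leave the recursion limit at 1000.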
 |   | 
  <br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br> | 
 |   | 
  <P> | 
 |   | 
  PCRE can be compiled to run in an environment that uses EBCDIC as its character | 
 |   | 
  code rather than ASCII or Unicode (typically a mainframe system). In the | 
 |   | 
  sections below, character code values are ASCII or Unicode; in an EBCDIC | 
 |   | 
  environment these characters may have different code values, and there are no | 
 |   | 
  code points greater than 255. | 
 |   | 
  </P> | 
 |   | 
  <br><a name="SEC4" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br> | 
 |   | 
  <P> | 
 |  A regular expression is a pattern that is matched against a subject string from | 
  A regular expression is a pattern that is matched against a subject string from | 
 |  left to right. Most characters stand for themselves in a pattern, and match the | 
  left to right. Most characters stand for themselves in a pattern, and match the | 
 |  corresponding characters in the subject. As a trivial example, the pattern | 
  corresponding characters in the subject. As a trivial example, the pattern | 
| 
 Line 158  corresponding characters in the subject. As a trivial 
 | 
 Line 229  corresponding characters in the subject. As a trivial 
 | 
 |  </pre> | 
  </pre> | 
 |  matches a portion of a subject string that is identical to itself. When | 
  matches a portion of a subject string that is identical to itself. When | 
 |  caseless matching is specified (the PCRE_CASELESS option), letters are matched | 
  caseless matching is specified (the PCRE_CASELESS option), letters are matched | 
| independently of case. In UTF-8 mode, PCRE always understands the concept of | independently of case. In a UTF mode, PCRE always understands the concept of | 
 |  case for characters whose values are less than 128, so caseless matching is | 
  case for characters whose values are less than 128, so caseless matching is | 
 |  always possible. For characters with higher values, the concept of case is | 
  always possible. For characters with higher values, the concept of case is | 
 |  supported if PCRE is compiled with Unicode property support, but not otherwise. | 
  supported if PCRE is compiled with Unicode property support, but not otherwise. | 
 |  If you want to use caseless matching for characters 128 and above, you must | 
  If you want to use caseless matching for characters 128 and above, you must | 
 |  ensure that PCRE is compiled with Unicode property support as well as with | 
  ensure that PCRE is compiled with Unicode property support as well as with | 
| UTF-8 support. | UTF support. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  The power of regular expressions comes from the ability to include alternatives | 
  The power of regular expressions comes from the ability to include alternatives | 
| 
 Line 205  a character class the only metacharacters are:
 | 
 Line 276  a character class the only metacharacters are:
 | 
 |  </pre> | 
  </pre> | 
 |  The following sections describe the use of each of the metacharacters. | 
  The following sections describe the use of each of the metacharacters. | 
 |  </P> | 
  </P> | 
| <br><a name="SEC4" href="#TOC1">BACKSLASH</a><br> | <br><a name="SEC5" href="#TOC1">BACKSLASH</a><br> | 
 |  <P> | 
  <P> | 
 |  The backslash character has several uses. Firstly, if it is followed by a | 
  The backslash character has several uses. Firstly, if it is followed by a | 
 |  character that is not a number or a letter, it takes away any special meaning | 
  character that is not a number or a letter, it takes away any special meaning | 
| 
 Line 220  non-alphanumeric with backslash to specify that it sta
 | 
 Line 291  non-alphanumeric with backslash to specify that it sta
 | 
 |  particular, if you want to match a backslash, you write \\. | 
  particular, if you want to match a backslash, you write \\. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| In UTF-8 mode, only ASCII numbers and letters have any special meaning after a | In a UTF mode, only ASCII numbers and letters have any special meaning after a | 
 |  backslash. All other characters (in particular, those whose codepoints are | 
  backslash. All other characters (in particular, those whose codepoints are | 
 |  greater than 127) are treated as literals. | 
  greater than 127) are treated as literals. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the | If a pattern is compiled with the PCRE_EXTENDED option, most white space in the | 
| pattern (other than in a character class) and characters between a # outside | pattern (other than in a character class), and characters between a # outside a | 
| a character class and the next newline are ignored. An escaping backslash can | character class and the next newline, inclusive, are ignored. An escaping | 
| be used to include a whitespace or # character as part of the pattern. | backslash can be used to include a white space or # character as part of the | 
|   | pattern. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  If you want to remove the special meaning from a sequence of characters, you | 
  If you want to remove the special meaning from a sequence of characters, you | 
| 
 Line 262  one of the following escape sequences than the binary 
 | 
 Line 334  one of the following escape sequences than the binary 
 | 
 |    \a        alarm, that is, the BEL character (hex 07) | 
    \a        alarm, that is, the BEL character (hex 07) | 
 |    \cx       "control-x", where x is any ASCII character | 
    \cx       "control-x", where x is any ASCII character | 
 |    \e        escape (hex 1B) | 
    \e        escape (hex 1B) | 
|   \f        formfeed (hex 0C) |   \f        form feed (hex 0C) | 
 |    \n        linefeed (hex 0A) | 
    \n        linefeed (hex 0A) | 
 |    \r        carriage return (hex 0D) | 
    \r        carriage return (hex 0D) | 
 |    \t        tab (hex 09) | 
    \t        tab (hex 09) | 
 |   | 
    \0dd      character with octal code 0dd | 
 |    \ddd      character with octal code ddd, or back reference | 
    \ddd      character with octal code ddd, or back reference | 
 |   | 
    \o{ddd..} character with octal code ddd.. | 
 |    \xhh      character with hex code hh | 
    \xhh      character with hex code hh | 
 |    \x{hhh..} character with hex code hhh.. (non-JavaScript mode) | 
    \x{hhh..} character with hex code hhh.. (non-JavaScript mode) | 
 |    \uhhhh    character with hex code hhhh (JavaScript mode only) | 
    \uhhhh    character with hex code hhhh (JavaScript mode only) | 
 |  </pre> | 
  </pre> | 
| The precise effect of \cx is as follows: if x is a lower case letter, it | The precise effect of \cx on ASCII characters is as follows: if x is a lower | 
| is converted to upper case. Then bit 6 of the character (hex 40) is inverted. | case letter, it is converted to upper case. Then bit 6 of the character (hex | 
| Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({ is 7B), while | 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A), | 
| \c; becomes hex 7B (; is 3B). If the byte following \c has a value greater | but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the | 
| than 127, a compile-time error occurs. This locks out non-ASCII characters in | data item (byte or 16-bit value) following \c has a value greater than 127, a | 
| both byte mode and UTF-8 mode. (When PCRE is compiled in EBCDIC mode, all byte | compile-time error occurs. This locks out non-ASCII characters in all modes. | 
| values are valid. A lower case letter is converted to upper case, and then the |   | 
| 0xc0 bits are flipped.) |   | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| By default, after \x, from zero to two hexadecimal digits are read (letters | The \c facility was designed for use with ASCII characters, but with the | 
| can be in upper or lower case). Any number of hexadecimal digits may appear | extension to Unicode it is even less useful than it once was. It is, however, | 
| between \x{ and }, but the value of the character code must be less than 256 | recognized when PCRE is compiled in EBCDIC mode, where data items are always | 
| in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum | bytes. In this mode, all values are valid after \c. If the next character is a | 
| value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest | lower case letter, it is converted to upper case. Then the 0xc0 bits of the | 
| Unicode code point, which is 10FFFF. | byte are inverted. Thus \cA becomes hex 01, as in ASCII (A is C1), but because | 
|   | the EBCDIC letters are disjoint, \cZ becomes hex 29 (Z is E9), and other | 
|   | characters also generate different values. | 
 |  </P> | 
  </P> | 
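The arithmetic behind \cx can be reproduced in a few lines. The sketch below is a model of the rule as stated above, not code taken from PCRE; the function names control_x_ascii and control_x_ebcdic are hypothetical. The EBCDIC variant takes the byte value of an upper case character directly, so the case conversion step is left out.

```python
def control_x_ascii(x):
    """Return the character code produced by \\cx in an ASCII/Unicode build.

    x is the single character following \\c. Values above 127 are rejected,
    mirroring the compile-time error described in the text.
    """
    code = ord(x)
    if code > 127:
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= x <= "z":
        code = ord(x.upper())  # lower case letters are converted to upper case
    return code ^ 0x40  # invert bit 6 (hex 40)


def control_x_ebcdic(byte_value):
    """Return the code produced by \\c in an EBCDIC build for an upper case byte.

    All byte values are valid here; the 0xc0 bits are inverted.
    """
    if not 0 <= byte_value <= 0xFF:
        raise ValueError("EBCDIC data items are single bytes")
    return byte_value ^ 0xC0
```

Under this model control_x_ascii("z") gives hex 1A and control_x_ascii(";") gives hex 7B, while control_x_ebcdic(0xE9), the EBCDIC code for Z, gives hex 29, which agrees with the values quoted in the text.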
 |  <P> | 
  <P> | 
 |  If characters other than hexadecimal digits appear between \x{ and }, or if | 
   | 
 |  there is no terminating }, this form of escape is not recognized. Instead, the | 
   | 
 |  initial \x will be interpreted as a basic hexadecimal escape, with no | 
   | 
 |  following digits, giving a character whose value is zero. | 
   | 
 |  </P> | 
   | 
 |  <P> | 
   | 
 |  If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is | 
   | 
 |  as just described only when it is followed by two hexadecimal digits. | 
   | 
 |  Otherwise, it matches a literal "x" character. In JavaScript mode, support for | 
   | 
 |  code points greater than 256 is provided by \u, which must be followed by | 
   | 
 |  four hexadecimal digits; otherwise it matches a literal "u" character. | 
   | 
 |  </P> | 
   | 
 |  <P> | 
   | 
 |  Characters whose value is less than 256 can be defined by either of the two | 
   | 
 |  syntaxes for \x (or by \u in JavaScript mode). There is no difference in the | 
   | 
 |  way they are handled. For example, \xdc is exactly the same as \x{dc} (or | 
   | 
 |  \u00dc in JavaScript mode). | 
   | 
 |  </P> | 
   | 
 |  <P> | 
   | 
 |  After \0 up to two further octal digits are read. If there are fewer than two | 
  After \0 up to two further octal digits are read. If there are fewer than two | 
 |  digits, just those that are present are used. Thus the sequence \0\x\07 | 
  digits, just those that are present are used. Thus the sequence \0\x\07 | 
 |  specifies two binary zeros followed by a BEL character (code value 7). Make | 
  specifies two binary zeros followed by a BEL character (code value 7). Make | 
| 
 Line 315  sure you supply two digits after the initial zero if t
 | 
 Line 370  sure you supply two digits after the initial zero if t
 | 
 |  follows is itself an octal digit. | 
  follows is itself an octal digit. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| The handling of a backslash followed by a digit other than 0 is complicated. | The escape \o must be followed by a sequence of octal digits, enclosed in | 
| Outside a character class, PCRE reads it and any following digits as a decimal | braces. An error occurs if this is not the case. This escape is a recent | 
| number. If the number is less than 10, or if there have been at least that many | addition to Perl; it provides a way of specifying character code points as octal | 
|   | numbers greater than 0777, and it also allows octal numbers and back references | 
|   | to be unambiguously specified. | 
|   | </P> | 
|   | <P> | 
|   | For greater clarity and unambiguity, it is best to avoid following \ by a | 
|   | digit greater than zero. Instead, use \o{} or \x{} to specify character | 
|   | numbers, and \g{} to specify back references. The following paragraphs | 
|   | describe the old, ambiguous syntax. | 
|   | </P> | 
|   | <P> | 
|   | The handling of a backslash followed by a digit other than 0 is complicated, | 
|   | and Perl has changed in recent releases, causing PCRE also to change. Outside a | 
|   | character class, PCRE reads the digit and any following digits as a decimal | 
|   | number. If the number is less than 8, or if there have been at least that many | 
 |  previous capturing left parentheses in the expression, the entire sequence is | 
  previous capturing left parentheses in the expression, the entire sequence is | 
 |  taken as a <i>back reference</i>. A description of how this works is given | 
  taken as a <i>back reference</i>. A description of how this works is given | 
 |  <a href="#backreferences">later,</a> | 
  <a href="#backreferences">later,</a> | 
| 
 Line 325  following the discussion of
 | 
 Line 394  following the discussion of
 | 
 |  <a href="#subpattern">parenthesized subpatterns.</a> | 
  <a href="#subpattern">parenthesized subpatterns.</a> | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| Inside a character class, or if the decimal number is greater than 9 and there | Inside a character class, or if the decimal number following \ is greater than | 
| have not been that many capturing subpatterns, PCRE re-reads up to three octal | 7 and there have not been that many capturing subpatterns, PCRE handles \8 and | 
| digits following the backslash, and uses them to generate a data character. Any | \9 as the literal characters "8" and "9", and otherwise re-reads up to three | 
| subsequent digits stand for themselves. In non-UTF-8 mode, the value of a | octal digits following the backslash, using them to generate a data character. | 
| character specified in octal must be less than \400. In UTF-8 mode, values up | Any subsequent digits stand for themselves. For example: | 
| to \777 are permitted. For example: |   | 
 |  <pre> | 
  <pre> | 
|   \040   is another way of writing a space |   \040   is another way of writing an ASCII space | 
 |    \40    is the same, provided there are fewer than 40 previous capturing subpatterns | 
    \40    is the same, provided there are fewer than 40 previous capturing subpatterns | 
 |    \7     is always a back reference | 
    \7     is always a back reference | 
 |    \11    might be a back reference, or another way of writing a tab | 
    \11    might be a back reference, or another way of writing a tab | 
 |    \011   is always a tab | 
    \011   is always a tab | 
 |    \0113  is a tab followed by the character "3" | 
    \0113  is a tab followed by the character "3" | 
 |    \113   might be a back reference, otherwise the character with octal code 113 | 
    \113   might be a back reference, otherwise the character with octal code 113 | 
|   \377   might be a back reference, otherwise the byte consisting entirely of 1 bits |   \377   might be a back reference, otherwise the value 255 (decimal) | 
|   \81    is either a back reference, or a binary zero followed by the two characters "8" and "1" |   \81    is either a back reference, or the two characters "8" and "1" | 
 |  </pre> | 
  </pre> | 
| Note that octal values of 100 or greater must not be introduced by a leading | Note that octal values of 100 or greater that are specified using this syntax | 
| zero, because no more than three octal digits are ever read. | must not be introduced by a leading zero, because no more than three octal | 
|   | digits are ever read. | 
 |  </P> | 
  </P> | 
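The decision procedure for a backslash followed by a non-zero digit, as described in the newer revision, can be modelled roughly as below. This is a simplified sketch and not PCRE's parser: the function name interpret_backslash_digits and its return format are hypothetical, and the upper limits on character values in the various modes are not checked.

```python
def interpret_backslash_digits(digits, groups_so_far, in_class=False):
    """Classify the decimal digits that follow a backslash.

    digits        -- the run of digits after the backslash, starting with 1-9
    groups_so_far -- number of capturing left parentheses seen so far
    in_class      -- True if the escape is inside a character class

    Returns one of (hypothetical format):
      ("backref", n)          a back reference to group n
      ("literal", text)       the digits taken as literal characters
      ("octal", code, rest)   an octal character code plus trailing literal digits
    """
    if not digits or digits[0] not in "123456789":
        raise ValueError("expected digits starting with 1-9")
    if any(ch not in "0123456789" for ch in digits):
        raise ValueError("expected decimal digits only")

    number = int(digits)
    # Outside a class: a number below 8, or one no larger than the count of
    # previous capturing groups, is taken as a back reference.
    if not in_class and (number < 8 or number <= groups_so_far):
        return ("backref", number)

    # \8 and \9 are literal characters; following digits stand for themselves.
    if digits[0] in "89":
        return ("literal", digits)

    # Otherwise re-read up to three octal digits as a character code.
    length = 0
    while length < 3 and length < len(digits) and digits[length] in "01234567":
        length += 1
    return ("octal", int(digits[:length], 8), digits[length:])
```

With no capturing groups seen, this model treats \7 as a back reference, \11 as a tab (octal 11), and \81 as the two characters "8" and "1", in line with the examples above. With at least 11 capturing groups, \11 becomes a back reference instead.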
 |  <P> | 
  <P> | 
 |   | 
  By default, after \x that is not followed by {, from zero to two hexadecimal | 
 |   | 
  digits are read (letters can be in upper or lower case). Any number of | 
 |   | 
  hexadecimal digits may appear between \x{ and }. If a character other than | 
 |   | 
  a hexadecimal digit appears between \x{ and }, or if there is no terminating | 
 |   | 
  }, an error occurs. | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |   | 
  If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is | 
 |   | 
  as just described only when it is followed by two hexadecimal digits. | 
 |   | 
  Otherwise, it matches a literal "x" character. In JavaScript mode, support for | 
 |   | 
  code points greater than 256 is provided by \u, which must be followed by | 
 |   | 
  four hexadecimal digits; otherwise it matches a literal "u" character. | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |   | 
  Characters whose value is less than 256 can be defined by either of the two | 
 |   | 
  syntaxes for \x (or by \u in JavaScript mode). There is no difference in the | 
 |   | 
  way they are handled. For example, \xdc is exactly the same as \x{dc} (or | 
 |   | 
  \u00dc in JavaScript mode). | 
 |   | 
  </P> | 
 |   | 
  <br><b> | 
 |   | 
  Constraints on character values | 
 |   | 
  </b><br> | 
 |   | 
  <P> | 
 |   | 
  Characters that are specified using octal or hexadecimal numbers are | 
 |   | 
  limited to certain values, as follows: | 
 |   | 
  <pre> | 
 |   | 
    8-bit non-UTF mode    less than 0x100 | 
 |   | 
    8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint | 
 |   | 
    16-bit non-UTF mode   less than 0x10000 | 
 |   | 
    16-bit UTF-16 mode    no greater than 0x10ffff and a valid codepoint | 
 |   | 
    32-bit non-UTF mode   less than 0x100000000 | 
 |   | 
    32-bit UTF-32 mode    no greater than 0x10ffff and a valid codepoint | 
 |   | 
  </pre> | 
 |   | 
  Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called | 
 |   | 
  "surrogate" codepoints), and 0xffef. | 
 |   | 
  </P> | 
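The limits in the table above can be sketched as a small check. This is an illustrative Python sketch, not PCRE code: the function name and parameters are hypothetical, U+10FFFF itself is assumed to be allowed in the UTF modes, and only the surrogate range is treated as invalid here.

```python
# Illustrative only: limits follow the table above; all names are hypothetical.
SURROGATES = range(0xD800, 0xE000)  # U+D800 to U+DFFF inclusive


def is_allowed_value(value, width, utf):
    """Return True if an octal or hexadecimal escape value fits the mode.

    width is the data unit size in bits (8, 16 or 32); utf selects a UTF mode.
    """
    if utf:
        # UTF modes: inside the Unicode range and not a surrogate.
        return 0 <= value <= 0x10FFFF and value not in SURROGATES
    # Non-UTF modes: the value must fit in a single data unit.
    return 0 <= value < (1 << width)
```

For example, is_allowed_value(0x100, 8, False) is False, while the same value is accepted in any 16-bit or 32-bit mode.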
 |   | 
  <br><b> | 
 |   | 
  Escape sequences in character classes | 
 |   | 
  </b><br> | 
 |   | 
  <P> | 
 |  All the sequences that define a single character value can be used both inside | 
  All the sequences that define a single character value can be used both inside | 
 |  and outside character classes. In addition, inside a character class, \b is | 
  and outside character classes. In addition, inside a character class, \b is | 
 |  interpreted as the backspace character (hex 08). | 
  interpreted as the backspace character (hex 08). | 
| 
 Line 399  Another use of backslash is for specifying generic cha
 | 
 Line 508  Another use of backslash is for specifying generic cha
 | 
 |  <pre> | 
  <pre> | 
 |    \d     any decimal digit | 
    \d     any decimal digit | 
 |    \D     any character that is not a decimal digit | 
    \D     any character that is not a decimal digit | 
|   \h     any horizontal whitespace character |   \h     any horizontal white space character | 
|   \H     any character that is not a horizontal whitespace character |   \H     any character that is not a horizontal white space character | 
|   \s     any whitespace character |   \s     any white space character | 
|   \S     any character that is not a whitespace character |   \S     any character that is not a white space character | 
|   \v     any vertical whitespace character |   \v     any vertical white space character | 
|   \V     any character that is not a vertical whitespace character |   \V     any character that is not a vertical white space character | 
 |    \w     any "word" character | 
    \w     any "word" character | 
 |    \W     any "non-word" character | 
    \W     any "non-word" character | 
 |  </pre> | 
  </pre> | 
| 
 Line 423  matching point is at the end of the subject string, al
 | 
 Line 532  matching point is at the end of the subject string, al
 | 
 |  there is no character to match. | 
  there is no character to match. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| For compatibility with Perl, \s does not match the VT character (code 11). | For compatibility with Perl, \s did not formerly match the VT character (code | 
| This makes it different from the POSIX "space" class. The \s characters | 11), which made it different from the POSIX "space" class. However, Perl | 
| are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is | added VT at release 5.18, and PCRE followed suit at release 8.34. The default | 
| included in a Perl script, \s may match the VT character. In PCRE, it never | \s characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space | 
| does. | (32), which are defined as white space in the "C" locale. This list may vary if | 
|   | locale-specific matching is taking place. For example, in some locales the | 
|   | "non-breaking space" character (\xA0) is recognized as white space, and in | 
|   | others the VT character is not. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  A "word" character is an underscore or any character that is a letter or digit. | 
  A "word" character is an underscore or any character that is a letter or digit. | 
| 
 Line 438  place (see
 | 
 Line 550  place (see
 | 
 |  in the | 
  in the | 
 |  <a href="pcreapi.html"><b>pcreapi</b></a> | 
  <a href="pcreapi.html"><b>pcreapi</b></a> | 
 |  page). For example, in a French locale such as "fr_FR" in Unix-like systems, | 
  page). For example, in a French locale such as "fr_FR" in Unix-like systems, | 
| or "french" in Windows, some character codes greater than 128 are used for | or "french" in Windows, some character codes greater than 127 are used for | 
 |  accented letters, and these are then matched by \w. The use of locales with | 
  accented letters, and these are then matched by \w. The use of locales with | 
 |  Unicode is discouraged. | 
  Unicode is discouraged. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| By default, in UTF-8 mode, characters with values greater than 128 never match | By default, characters whose code points are greater than 127 never match \d, | 
| \d, \s, or \w, and always match \D, \S, and \W. These sequences retain | \s, or \w, and always match \D, \S, and \W, although this may vary for | 
| their original meanings from before UTF-8 support was available, mainly for | characters in the range 128-255 when locale-specific matching is happening. | 
| efficiency reasons. However, if PCRE is compiled with Unicode property support, | These escape sequences retain their original meanings from before Unicode | 
| and the PCRE_UCP option is set, the behaviour is changed so that Unicode | support was available, mainly for efficiency reasons. If PCRE is compiled with | 
| properties are used to determine character types, as follows: | Unicode property support, and the PCRE_UCP option is set, the behaviour is | 
|   | changed so that Unicode properties are used to determine character types, as | 
|   | follows: | 
 |  <pre> | 
  <pre> | 
|   \d  any character that \p{Nd} matches (decimal digit) |   \d  any character that matches \p{Nd} (decimal digit) | 
|   \s  any character that \p{Z} matches, plus HT, LF, FF, CR |   \s  any character that matches \p{Z} or \h or \v | 
|   \w  any character that \p{L} or \p{N} matches, plus underscore |   \w  any character that matches \p{L} or \p{N}, plus underscore | 
 |  </pre> | 
  </pre> | 
 |  The upper case escapes match the inverse sets of characters. Note that \d | 
  The upper case escapes match the inverse sets of characters. Note that \d | 
 |  matches only decimal digits, whereas \w matches any Unicode digit, as well as | 
  matches only decimal digits, whereas \w matches any Unicode digit, as well as | 
| 
 Line 463  is noticeably slower when PCRE_UCP is set.
 | 
 Line 577  is noticeably slower when PCRE_UCP is set.
 | 
 |  <P> | 
  <P> | 
 |  The sequences \h, \H, \v, and \V are features that were added to Perl at | 
  The sequences \h, \H, \v, and \V are features that were added to Perl at | 
 |  release 5.10. In contrast to the other sequences, which match only ASCII | 
  release 5.10. In contrast to the other sequences, which match only ASCII | 
| characters by default, these always match certain high-valued codepoints in | characters by default, these always match certain high-valued code points, | 
| UTF-8 mode, whether or not PCRE_UCP is set. The horizontal space characters | whether or not PCRE_UCP is set. The horizontal space characters are: | 
| are: |   | 
 |  <pre> | 
  <pre> | 
|   U+0009     Horizontal tab |   U+0009     Horizontal tab (HT) | 
 |    U+0020     Space | 
    U+0020     Space | 
 |    U+00A0     Non-break space | 
    U+00A0     Non-break space | 
 |    U+1680     Ogham space mark | 
    U+1680     Ogham space mark | 
| 
 Line 489  are:
 | 
 Line 602  are:
 | 
 |  </pre> | 
  </pre> | 
 |  The vertical space characters are: | 
  The vertical space characters are: | 
 |  <pre> | 
  <pre> | 
|   U+000A     Linefeed |   U+000A     Linefeed (LF) | 
|   U+000B     Vertical tab |   U+000B     Vertical tab (VT) | 
|   U+000C     Formfeed |   U+000C     Form feed (FF) | 
|   U+000D     Carriage return |   U+000D     Carriage return (CR) | 
|   U+0085     Next line |   U+0085     Next line (NEL) | 
 |    U+2028     Line separator | 
    U+2028     Line separator | 
 |    U+2029     Paragraph separator | 
    U+2029     Paragraph separator | 
| <a name="newlineseq"></a></PRE> | </pre> | 
| </P> | In 8-bit, non-UTF-8 mode, only the characters with codepoints less than 256 are | 
|   | relevant. | 
|   | <a name="newlineseq"></a></P> | 
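The vertical space list above can be expressed as a simple membership test. This is a Python sketch with hypothetical names, not PCRE's implementation; the eight_bit_non_utf flag models the note that only code points below 256 are relevant in 8-bit non-UTF-8 mode.

```python
# Code points matched by \v, as listed above.
VERTICAL_SPACE = frozenset(
    {0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029}
)


def is_vertical_space(ch, eight_bit_non_utf=False):
    """Return True if the single character ch is in the \\v set."""
    cp = ord(ch)
    if eight_bit_non_utf and cp > 0xFF:
        # Characters above 255 cannot occur in 8-bit non-UTF subjects.
        return False
    return cp in VERTICAL_SPACE
```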
 |  <br><b> | 
  <br><b> | 
 |  Newline sequences | 
  Newline sequences | 
 |  </b><br> | 
  </b><br> | 
 |  <P> | 
  <P> | 
 |  Outside a character class, by default, the escape sequence \R matches any | 
  Outside a character class, by default, the escape sequence \R matches any | 
| Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the following: | Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the | 
|   | following: | 
 |  <pre> | 
  <pre> | 
 |    (?>\r\n|\n|\x0b|\f|\r|\x85) | 
    (?>\r\n|\n|\x0b|\f|\r|\x85) | 
 |  </pre> | 
  </pre> | 
| 
 Line 511  This is an example of an "atomic group", details of wh
 | 
 Line 627  This is an example of an "atomic group", details of wh
 | 
 |  <a href="#atomicgroup">below.</a> | 
  <a href="#atomicgroup">below.</a> | 
 |  This particular group matches either the two-character sequence CR followed by | 
  This particular group matches either the two-character sequence CR followed by | 
 |  LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, | 
  LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, | 
| U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next | U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next | 
 |  line, U+0085). The two-character sequence is treated as a single unit that | 
  line, U+0085). The two-character sequence is treated as a single unit that | 
 |  cannot be split. | 
  cannot be split. | 
 |  </P> | 
  </P> | 
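As a rough illustration of which sequences the 8-bit non-UTF form of \R recognizes, the alternation can be approximated with Python's re module. This sketch is an assumption-laden stand-in, not PCRE itself: it uses a plain (non-atomic) alternation, so it shows the set of newline sequences and the preference for CRLF, but not the "cannot be split" behaviour of the atomic group.

```python
import re

# Non-atomic approximation of (?>\r\n|\n|\x0b|\f|\r|\x85).
# CRLF is listed first so that it is preferred over a lone CR.
NEWLINE_SEQ = re.compile(r"\r\n|\n|\x0b|\f|\r|\x85")


def find_newline_sequences(text):
    """Return every newline sequence found in text, in order."""
    return NEWLINE_SEQ.findall(text)
```

Because CRLF comes first in the alternation, a CR immediately followed by LF is reported as one two-character sequence rather than as two separate newlines.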
 |  <P> | 
  <P> | 
| In UTF-8 mode, two additional characters whose codepoints are greater than 255 | In other modes, two additional characters whose codepoints are greater than 255 | 
 |  are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). | 
  are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). | 
 |  Unicode character property support is not needed for these characters to be | 
  Unicode character property support is not needed for these characters to be | 
 |  recognized. | 
  recognized. | 
| 
 Line 533  one of the following sequences:
 | 
 Line 649  one of the following sequences:
 | 
 |    (*BSR_ANYCRLF)   CR, LF, or CRLF only | 
    (*BSR_ANYCRLF)   CR, LF, or CRLF only | 
 |    (*BSR_UNICODE)   any Unicode newline sequence | 
    (*BSR_UNICODE)   any Unicode newline sequence | 
 |  </pre> | 
  </pre> | 
| These override the default and the options given to <b>pcre_compile()</b> or | These override the default and the options given to the compiling function, but | 
| <b>pcre_compile2()</b>, but they can be overridden by options given to | they can themselves be overridden by options given to a matching function. Note | 
| <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. Note that these special settings, | that these special settings, which are not Perl-compatible, are recognized only | 
| which are not Perl-compatible, are recognized only at the very start of a | at the very start of a pattern, and that they must be in upper case. If more | 
| pattern, and that they must be in upper case. If more than one of them is | than one of them is present, the last one is used. They can be combined with a | 
| present, the last one is used. They can be combined with a change of newline | change of newline convention; for example, a pattern can start with: | 
| convention; for example, a pattern can start with: |   | 
 |  <pre> | 
  <pre> | 
 |    (*ANY)(*BSR_ANYCRLF) | 
    (*ANY)(*BSR_ANYCRLF) | 
 |  </pre> | 
  </pre> | 
| They can also be combined with the (*UTF8) or (*UCP) special sequences. Inside | They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) or | 
| a character class, \R is treated as an unrecognized escape sequence, and so | (*UCP) special sequences. Inside a character class, \R is treated as an | 
| matches the letter "R" by default, but causes an error if PCRE_EXTRA is set. | unrecognized escape sequence, and so matches the letter "R" by default, but | 
|   | causes an error if PCRE_EXTRA is set. | 
 |  <a name="uniextseq"></a></P> | 
  <a name="uniextseq"></a></P> | 
 |  <br><b> | 
  <br><b> | 
 |  Unicode character properties | 
  Unicode character properties | 
| 
 Line 553  Unicode character properties
 | 
 Line 669  Unicode character properties
 | 
 |  <P> | 
  <P> | 
 |  When PCRE is built with Unicode character property support, three additional | 
  When PCRE is built with Unicode character property support, three additional | 
 |  escape sequences that match characters with specific properties are available. | 
  escape sequences that match characters with specific properties are available. | 
| When not in UTF-8 mode, these sequences are of course limited to testing | When in 8-bit non-UTF-8 mode, these sequences are of course limited to testing | 
 |  characters whose codepoints are less than 256, but they do work in this mode. | 
  characters whose codepoints are less than 256, but they do work in this mode. | 
 |  The extra escape sequences are: | 
  The extra escape sequences are: | 
 |  <pre> | 
  <pre> | 
 |    \p{<i>xx</i>}   a character with the <i>xx</i> property | 
    \p{<i>xx</i>}   a character with the <i>xx</i> property | 
 |    \P{<i>xx</i>}   a character without the <i>xx</i> property | 
    \P{<i>xx</i>}   a character without the <i>xx</i> property | 
|   \X       an extended Unicode sequence |   \X       a Unicode extended grapheme cluster | 
 |  </pre> | 
  </pre> | 
 |  The property names represented by <i>xx</i> above are limited to the Unicode | 
  The property names represented by <i>xx</i> above are limited to the Unicode | 
 |  script names, the general category properties, "Any", which matches any | 
  script names, the general category properties, "Any", which matches any | 
| 
 Line 587  Armenian,
 | 
 Line 703  Armenian,
 | 
 |  Avestan, | 
  Avestan, | 
 |  Balinese, | 
  Balinese, | 
 |  Bamum, | 
  Bamum, | 
 |   | 
  Batak, | 
 |  Bengali, | 
  Bengali, | 
 |  Bopomofo, | 
  Bopomofo, | 
 |   | 
  Brahmi, | 
 |  Braille, | 
  Braille, | 
 |  Buginese, | 
  Buginese, | 
 |  Buhid, | 
  Buhid, | 
 |  Canadian_Aboriginal, | 
  Canadian_Aboriginal, | 
 |  Carian, | 
  Carian, | 
 |   | 
  Chakma, | 
 |  Cham, | 
  Cham, | 
 |  Cherokee, | 
  Cherokee, | 
 |  Common, | 
  Common, | 
| 
 Line 636  Lisu,
 | 
 Line 755  Lisu,
 | 
 |  Lycian, | 
  Lycian, | 
 |  Lydian, | 
  Lydian, | 
 |  Malayalam, | 
  Malayalam, | 
 |   | 
  Mandaic, | 
 |  Meetei_Mayek, | 
  Meetei_Mayek, | 
 |   | 
  Meroitic_Cursive, | 
 |   | 
  Meroitic_Hieroglyphs, | 
 |   | 
  Miao, | 
 |  Mongolian, | 
  Mongolian, | 
 |  Myanmar, | 
  Myanmar, | 
 |  New_Tai_Lue, | 
  New_Tai_Lue, | 
| 
 Line 655  Rejang,
 | 
 Line 778  Rejang,
 | 
 |  Runic, | 
  Runic, | 
 |  Samaritan, | 
  Samaritan, | 
 |  Saurashtra, | 
  Saurashtra, | 
 |   | 
  Sharada, | 
 |  Shavian, | 
  Shavian, | 
 |  Sinhala, | 
  Sinhala, | 
 |   | 
  Sora_Sompeng, | 
 |  Sundanese, | 
  Sundanese, | 
 |  Syloti_Nagri, | 
  Syloti_Nagri, | 
 |  Syriac, | 
  Syriac, | 
| 
 Line 665  Tagbanwa,
 | 
 Line 790  Tagbanwa,
 | 
 |  Tai_Le, | 
  Tai_Le, | 
 |  Tai_Tham, | 
  Tai_Tham, | 
 |  Tai_Viet, | 
  Tai_Viet, | 
 |   | 
  Takri, | 
 |  Tamil, | 
  Tamil, | 
 |  Telugu, | 
  Telugu, | 
 |  Thaana, | 
  Thaana, | 
| 
 Line 742  a modifier or "other".
 | 
 Line 868  a modifier or "other".
 | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  The Cs (Surrogate) property applies only to characters in the range U+D800 to | 
  The Cs (Surrogate) property applies only to characters in the range U+D800 to | 
| U+DFFF. Such characters are not valid in UTF-8 strings (see RFC 3629) and so | U+DFFF. Such characters are not valid in Unicode strings and so | 
| cannot be tested by PCRE, unless UTF-8 validity checking has been turned off | cannot be tested by PCRE, unless UTF validity checking has been turned off | 
| (see the discussion of PCRE_NO_UTF8_CHECK in the | (see the discussion of PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK and | 
|   | PCRE_NO_UTF32_CHECK in the | 
 |  <a href="pcreapi.html"><b>pcreapi</b></a> | 
  <a href="pcreapi.html"><b>pcreapi</b></a> | 
 |  page). Perl does not support the Cs property. | 
  page). Perl does not support the Cs property. | 
 |  </P> | 
  </P> | 
| 
 Line 760  Unicode table.
 | 
 Line 887  Unicode table.
 | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  Specifying caseless matching does not affect these escape sequences. For | 
  Specifying caseless matching does not affect these escape sequences. For | 
| example, \p{Lu} always matches only upper case letters. | example, \p{Lu} always matches only upper case letters. This is different from | 
|   | the behaviour of current versions of Perl. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| The \X escape matches any number of Unicode characters that form an extended | Matching characters by Unicode property is not fast, because PCRE has to do a | 
| Unicode sequence. \X is equivalent to | multistage table lookup in order to find a character's property. That is why | 
|   | the traditional escape sequences such as \d and \w do not use Unicode | 
|   | properties in PCRE by default, though you can make them do so by setting the | 
|   | PCRE_UCP option or by starting the pattern with (*UCP). | 
|   | </P> | 
|   | <br><b> | 
|   | Extended grapheme clusters | 
|   | </b><br> | 
|   | <P> | 
|   | The \X escape matches any number of Unicode characters that form an "extended | 
|   | grapheme cluster", and treats the sequence as an atomic group | 
|   | <a href="#atomicgroup">(see below).</a> | 
|   | Up to and including release 8.31, PCRE matched an earlier, simpler definition | 
|   | that was equivalent to | 
 |  <pre> | 
  <pre> | 
 |    (?>\PM\pM*) | 
    (?>\PM\pM*) | 
 |  </pre> | 
  </pre> | 
| That is, it matches a character without the "mark" property, followed by zero | That is, it matched a character without the "mark" property, followed by zero | 
| or more characters with the "mark" property, and treats the sequence as an | or more characters with the "mark" property. Characters with the "mark" | 
| atomic group | property are typically non-spacing accents that affect the preceding character. | 
| <a href="#atomicgroup">(see below).</a> |   | 
| Characters with the "mark" property are typically accents that affect the |   | 
| preceding character. None of them have codepoints less than 256, so in |   | 
| non-UTF-8 mode \X matches any one character. |   | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| Note that recent versions of Perl have changed \X to match what Unicode calls | This simple definition was extended in Unicode to include more complicated | 
| an "extended grapheme cluster", which has a more complicated definition. | kinds of composite character by giving each character a grapheme breaking | 
|   | property, and creating rules that use these properties to define the boundaries | 
|   | of extended grapheme clusters. In releases of PCRE later than 8.31, \X matches | 
|   | one of these clusters. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| Matching characters by Unicode property is not fast, because PCRE has to search | \X always matches at least one character. Then it decides whether to add | 
| a structure that contains data for over fifteen thousand characters. That is | additional characters according to the following rules for ending a cluster: | 
| why the traditional escape sequences such as \d and \w do not use Unicode | </P> | 
| properties in PCRE by default, though you can make them do so by setting the | <P> | 
| PCRE_UCP option for <b>pcre_compile()</b> or by starting the pattern with | 1. End at the end of the subject string. | 
| (*UCP). | </P> | 
|   | <P> | 
|   | 2. Do not end between CR and LF; otherwise end after any control character. | 
|   | </P> | 
|   | <P> | 
|   | 3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters | 
|   | are of five types: L, V, T, LV, and LVT. An L character may be followed by an | 
|   | L, V, LV, or LVT character; an LV or V character may be followed by a V or T | 
|   | character; an LVT or T character may be followed only by a T character. | 
|   | </P> | 
|   | <P> | 
|   | 4. Do not end before extending characters or spacing marks. Characters with | 
|   | the "mark" property always have the "extend" grapheme breaking property. | 
|   | </P> | 
|   | <P> | 
|   | 5. Do not end after prepend characters. | 
|   | </P> | 
|   | <P> | 
|   | 6. Otherwise, end the cluster. | 
 |  <a name="extraprops"></a></P> | 
  <a name="extraprops"></a></P> | 
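Rule 3 above (Hangul syllable sequences) amounts to a small table of permitted followers. The Python sketch below encodes only that rule, with hypothetical names; it is not PCRE's grapheme-break implementation, and it assumes the caller already knows each character's Hangul type.

```python
# For each Hangul type, the types that may follow it without ending the cluster.
HANGUL_FOLLOWERS = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}


def hangul_keeps_cluster(prev_type, next_type):
    """Return True if next_type may follow prev_type inside one cluster."""
    return next_type in HANGUL_FOLLOWERS.get(prev_type, set())
```

For instance, an L character followed by an LVT character stays in the same cluster, whereas a T character followed by an L character does not.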
 |  <br><b> | 
  <br><b> | 
 |  PCRE's additional properties | 
  PCRE's additional properties | 
 |  </b><br> | 
  </b><br> | 
 |  <P> | 
  <P> | 
| As well as the standard Unicode properties described in the previous | As well as the standard Unicode properties described above, PCRE supports four | 
| section, PCRE supports four more that make it possible to convert traditional | more that make it possible to convert traditional escape sequences such as \w | 
| escape sequences such as \w and \s and POSIX character classes to use Unicode | and \s to use Unicode properties. PCRE uses these non-standard, non-Perl | 
| properties. PCRE uses these non-standard, non-Perl properties internally when | properties internally when PCRE_UCP is set. However, they may also be used | 
| PCRE_UCP is set. They are: | explicitly. These properties are: | 
 |  <pre> | 
  <pre> | 
 |    Xan   Any alphanumeric character | 
    Xan   Any alphanumeric character | 
 |    Xps   Any POSIX space character | 
    Xps   Any POSIX space character | 
| 
 Line 804  PCRE_UCP is set. They are:
 | 
 Line 962  PCRE_UCP is set. They are:
 | 
 |    Xwd   Any Perl "word" character | 
    Xwd   Any Perl "word" character | 
 |  </pre> | 
  </pre> | 
 |  Xan matches characters that have either the L (letter) or the N (number) | 
  Xan matches characters that have either the L (letter) or the N (number) | 
| property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or | property. Xps matches the characters tab, linefeed, vertical tab, form feed, or | 
 |  carriage return, and any other character that has the Z (separator) property. | 
  carriage return, and any other character that has the Z (separator) property. | 
| Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the | Xsp is the same as Xps; it used to exclude vertical tab, for Perl | 
| same characters as Xan, plus underscore. | compatibility, but Perl changed, and so PCRE followed at release 8.34. Xwd | 
|   | matches the same characters as Xan, plus underscore. | 
|   | </P> | 
|   | <P> | 
|   | There is another non-standard property, Xuc, which matches any character that | 
|   | can be represented by a Universal Character Name in C++ and other programming | 
|   | languages. These are the characters $, @, ` (grave accent), and all characters | 
|   | with Unicode code points greater than or equal to U+00A0, except for the | 
|   | surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are | 
|   | excluded. (Universal Character Names are of the form \uHHHH or \UHHHHHHHH | 
|   | where H is a hexadecimal digit. Note that the Xuc property does not match these | 
|   | sequences but the characters that they represent.) | 
 |  <a name="resetmatchstart"></a></P> | 
  <a name="resetmatchstart"></a></P> | 
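The description of Xuc above translates directly into a predicate on code points. This is a Python sketch with a hypothetical function name, written from the prose definition only and not from PCRE's property tables.

```python
def matches_xuc(ch):
    """Return True if the single character ch would have the Xuc property
    as described above."""
    if ch in "$@`":
        # The three ASCII characters that are explicitly included.
        return True
    cp = ord(ch)
    # Everything from U+00A0 upwards, except the surrogates.
    return cp >= 0xA0 and not (0xD800 <= cp <= 0xDFFF)
```

Note that the predicate tests characters themselves, not escape text such as \u00E9.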
 |  <br><b> | 
  <br><b> | 
 |  Resetting the match start | 
  Resetting the match start | 
| 
 Line 865  escape sequence" error is generated instead.
 | 
 Line 1034  escape sequence" error is generated instead.
 | 
 |  A word boundary is a position in the subject string where the current character | 
  A word boundary is a position in the subject string where the current character | 
 |  and the previous character do not both match \w or \W (i.e. one matches | 
  and the previous character do not both match \w or \W (i.e. one matches | 
 |  \w and the other matches \W), or the start or end of the string if the | 
  \w and the other matches \W), or the start or end of the string if the | 
| first or last character matches \w, respectively. In UTF-8 mode, the meanings | first or last character matches \w, respectively. In a UTF mode, the meanings | 
 |  of \w and \W can be changed by setting the PCRE_UCP option. When this is | 
  of \w and \W can be changed by setting the PCRE_UCP option. When this is | 
 |  done, it also affects \b and \B. Neither PCRE nor Perl has a separate "start | 
  done, it also affects \b and \B. Neither PCRE nor Perl has a separate "start | 
 |  of word" or "end of word" metasequence. However, whatever follows \b normally | 
  of word" or "end of word" metasequence. However, whatever follows \b normally | 
| 
 Line 904  If all the alternatives of a pattern begin with \G, th
 | 
 Line 1073  If all the alternatives of a pattern begin with \G, th
 | 
 |  to the starting match position, and the "anchored" flag is set in the compiled | 
  to the starting match position, and the "anchored" flag is set in the compiled | 
 |  regular expression. | 
  regular expression. | 
 |  </P> | 
  </P> | 
| <br><a name="SEC5" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br> | <br><a name="SEC6" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br> | 
 |  <P> | 
  <P> | 
 |   | 
  The circumflex and dollar metacharacters are zero-width assertions. That is, | 
 |   | 
  they test for a particular condition being true without consuming any | 
 |   | 
  characters from the subject string. | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |  Outside a character class, in the default matching mode, the circumflex | 
  Outside a character class, in the default matching mode, the circumflex | 
| character is an assertion that is true only if the current matching point is | character is an assertion that is true only if the current matching point is at | 
| at the start of the subject string. If the <i>startoffset</i> argument of | the start of the subject string. If the <i>startoffset</i> argument of | 
 |  <b>pcre_exec()</b> is non-zero, circumflex can never match if the PCRE_MULTILINE | 
  <b>pcre_exec()</b> is non-zero, circumflex can never match if the PCRE_MULTILINE | 
 |  option is unset. Inside a character class, circumflex has an entirely different | 
  option is unset. Inside a character class, circumflex has an entirely different | 
 |  meaning | 
  meaning | 
| 
 Line 924  constrained to match only at the start of the subject,
 | 
 Line 1098  constrained to match only at the start of the subject,
 | 
 |  to be anchored.) | 
  to be anchored.) | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| A dollar character is an assertion that is true only if the current matching | The dollar character is an assertion that is true only if the current matching | 
| point is at the end of the subject string, or immediately before a newline | point is at the end of the subject string, or immediately before a newline at | 
| at the end of the string (by default). Dollar need not be the last character of | the end of the string (by default). Note, however, that it does not actually | 
| the pattern if a number of alternatives are involved, but it should be the last | match the newline. Dollar need not be the last character of the pattern if a | 
| item in any branch in which it appears. Dollar has no special meaning in a | number of alternatives are involved, but it should be the last item in any | 
| character class. | branch in which it appears. Dollar has no special meaning in a character class. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  The meaning of dollar can be changed so that it matches only at the very end of | 
  The meaning of dollar can be changed so that it matches only at the very end of | 
| 
 Line 958  Note that the sequences \A, \Z, and \z can be used to 
 | 
 Line 1132  Note that the sequences \A, \Z, and \z can be used to 
 | 
 |  end of the subject in both modes, and if all branches of a pattern start with | 
  end of the subject in both modes, and if all branches of a pattern start with | 
 |  \A it is always anchored, whether or not PCRE_MULTILINE is set. | 
  \A it is always anchored, whether or not PCRE_MULTILINE is set. | 
 |  <a name="fullstopdot"></a></P> | 
  <a name="fullstopdot"></a></P> | 
| <br><a name="SEC6" href="#TOC1">FULL STOP (PERIOD, DOT) AND \N</a><br> | <br><a name="SEC7" href="#TOC1">FULL STOP (PERIOD, DOT) AND \N</a><br> | 
 |  <P> | 
  <P> | 
 |  Outside a character class, a dot in the pattern matches any one character in | 
  Outside a character class, a dot in the pattern matches any one character in | 
 |  the subject string except (by default) a character that signifies the end of a | 
  the subject string except (by default) a character that signifies the end of a | 
| line. In UTF-8 mode, the matched character may be more than one byte long. | line. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  When a line ending is defined as a single character, dot never matches that | 
  When a line ending is defined as a single character, dot never matches that | 
| 
 Line 989  the PCRE_DOTALL option. In other words, it matches any
 | 
 Line 1163  the PCRE_DOTALL option. In other words, it matches any
 | 
 |  that signifies the end of a line. Perl also uses \N to match characters by | 
  that signifies the end of a line. Perl also uses \N to match characters by | 
 |  name; PCRE does not support this. | 
  name; PCRE does not support this. | 
 |  </P> | 
  </P> | 
| <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br> | <br><a name="SEC8" href="#TOC1">MATCHING A SINGLE DATA UNIT</a><br> | 
 |  <P> | 
  <P> | 
| Outside a character class, the escape sequence \C matches any one byte, both | Outside a character class, the escape sequence \C matches any one data unit, | 
| in and out of UTF-8 mode. Unlike a dot, it always matches line-ending | whether or not a UTF mode is set. In the 8-bit library, one data unit is one | 
| characters. The feature is provided in Perl in order to match individual bytes | byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is | 
| in UTF-8 mode, but it is unclear how it can usefully be used. Because \C | a 32-bit unit. Unlike a dot, \C always | 
| breaks up characters into individual bytes, matching one byte with \C in UTF-8 | matches line-ending characters. The feature is provided in Perl in order to | 
| mode means that the rest of the string may start with a malformed UTF-8 | match individual bytes in UTF-8 mode, but it is unclear how it can usefully be | 
| character. This has undefined results, because PCRE assumes that it is dealing | used. Because \C breaks up characters into individual data units, matching one | 
| with valid UTF-8 strings (and by default it checks this at the start of | unit with \C in a UTF mode means that the rest of the string may start with a | 
| processing unless the PCRE_NO_UTF8_CHECK option is used). | malformed UTF character. This has undefined results, because PCRE assumes that | 
|   | it is dealing with valid UTF strings (and by default it checks this at the | 
|   | start of processing unless the PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or | 
|   | PCRE_NO_UTF32_CHECK option is used). | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  PCRE does not allow \C to appear in lookbehind assertions | 
  PCRE does not allow \C to appear in lookbehind assertions | 
 |  <a href="#lookbehind">(described below)</a> | 
  <a href="#lookbehind">(described below)</a> | 
| in UTF-8 mode, because this would make it impossible to calculate the length of | in a UTF mode, because this would make it impossible to calculate the length of | 
 |  the lookbehind. | 
  the lookbehind. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| In general, the \C escape sequence is best avoided in UTF-8 mode. However, one | In general, the \C escape sequence is best avoided. However, one | 
| way of using it that avoids the problem of malformed UTF-8 characters is to | way of using it that avoids the problem of malformed UTF characters is to use a | 
| use a lookahead to check the length of the next character, as in this pattern | lookahead to check the length of the next character, as in this pattern, which | 
| (ignore white space and line breaks): | could be used with a UTF-8 string (ignore white space and line breaks): | 
 |  <pre> | 
  <pre> | 
 |    (?| (?=[\x00-\x7f])(\C) | | 
    (?| (?=[\x00-\x7f])(\C) | | 
 |        (?=[\x80-\x{7ff}])(\C)(\C) | | 
        (?=[\x80-\x{7ff}])(\C)(\C) | | 
| 
 Line 1026  character for values whose encoding uses 1, 2, 3, or 4
 | 
 Line 1203  character for values whose encoding uses 1, 2, 3, or 4
 | 
 |  character's individual bytes are then captured by the appropriate number of | 
  character's individual bytes are then captured by the appropriate number of | 
 |  groups. | 
  groups. | 
 |  <a name="characterclass"></a></P> | 
  <a name="characterclass"></a></P> | 
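The lookaheads in the pattern above pick a branch according to how many bytes the next character occupies in UTF-8. As a rough sketch of that mapping (plain Python, not PCRE; the helper name `utf8_length` is invented for this illustration), the following function returns the encoded length for a code point and cross-checks it against Python's own UTF-8 encoder:

```python
def utf8_length(code_point: int) -> int:
    """Return the number of bytes UTF-8 uses for a Unicode code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:      # same range as (?=[\x00-\x7f])
        return 1
    if code_point <= 0x7FF:     # same range as (?=[\x80-\x{7ff}])
        return 2
    if code_point <= 0xFFFF:    # three-byte encodings
        return 3
    return 4                    # four-byte encodings


# Cross-check a few sample code points against the built-in encoder.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```

The number returned corresponds to the number of (\C) groups in the matching branch of the pattern.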
| <br><a name="SEC8" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br> | <br><a name="SEC9" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br> | 
 |  <P> | 
  <P> | 
 |  An opening square bracket introduces a character class, terminated by a closing | 
  An opening square bracket introduces a character class, terminated by a closing | 
 |  square bracket. A closing square bracket on its own is not special by default. | 
  square bracket. A closing square bracket on its own is not special by default. | 
| 
 Line 1036  a member of the class, it should be the first data cha
 | 
 Line 1213  a member of the class, it should be the first data cha
 | 
 |  (after an initial circumflex, if present) or escaped with a backslash. | 
  (after an initial circumflex, if present) or escaped with a backslash. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| A character class matches a single character in the subject. In UTF-8 mode, the | A character class matches a single character in the subject. In a UTF mode, the | 
| character may be more than one byte long. A matched character must be in the | character may be more than one data unit long. A matched character must be in | 
| set of characters defined by the class, unless the first character in the class | the set of characters defined by the class, unless the first character in the | 
| definition is a circumflex, in which case the subject character must not be in | class definition is a circumflex, in which case the subject character must not | 
| the set defined by the class. If a circumflex is actually required as a member | be in the set defined by the class. If a circumflex is actually required as a | 
| of the class, ensure it is not the first character, or escape it with a | member of the class, ensure it is not the first character, or escape it with a | 
 |  backslash. | 
  backslash. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| 
 Line 1054  string, and therefore it fails if the current pointer 
 | 
 Line 1231  string, and therefore it fails if the current pointer 
 | 
 |  string. | 
  string. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| In UTF-8 mode, characters with values greater than 255 can be included in a | In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255 (0xffff) | 
| class as a literal string of bytes, or by using the \x{ escaping mechanism. | can be included in a class as a literal string of data units, or by using the | 
|   | \x{ escaping mechanism. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  When caseless matching is set, any letters in a class represent both their | 
  When caseless matching is set, any letters in a class represent both their | 
 |  upper case and lower case versions, so for example, a caseless [aeiou] matches | 
  upper case and lower case versions, so for example, a caseless [aeiou] matches | 
 |  "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a | 
  "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a | 
| caseful version would. In UTF-8 mode, PCRE always understands the concept of | caseful version would. In a UTF mode, PCRE always understands the concept of | 
 |  case for characters whose values are less than 128, so caseless matching is | 
  case for characters whose values are less than 128, so caseless matching is | 
 |  always possible. For characters with higher values, the concept of case is | 
  always possible. For characters with higher values, the concept of case is | 
 |  supported if PCRE is compiled with Unicode property support, but not otherwise. | 
  supported if PCRE is compiled with Unicode property support, but not otherwise. | 
| If you want to use caseless matching in UTF8-mode for characters 128 and above, | If you want to use caseless matching in a UTF mode for characters 128 and | 
| you must ensure that PCRE is compiled with Unicode property support as well as | above, you must ensure that PCRE is compiled with Unicode property support as | 
| with UTF-8 support. | well as with UTF support. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  Characters that might indicate line breaks are never treated in any special way | 
  Characters that might indicate line breaks are never treated in any special way | 
| 
 Line 1080  The minus (hyphen) character can be used to specify a 
 | 
 Line 1258  The minus (hyphen) character can be used to specify a 
 | 
 |  character class. For example, [d-m] matches any letter between d and m, | 
  character class. For example, [d-m] matches any letter between d and m, | 
 |  inclusive. If a minus character is required in a class, it must be escaped with | 
  inclusive. If a minus character is required in a class, it must be escaped with | 
 |  a backslash or appear in a position where it cannot be interpreted as | 
  a backslash or appear in a position where it cannot be interpreted as | 
| indicating a range, typically as the first or last character in the class. | indicating a range, typically as the first or last character in the class, or | 
|   | immediately after a range. For example, [b-d-z] matches letters in the range b | 
|   | to d, a hyphen character, or z. | 
 |  </P> | 
  </P> | 
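The [b-d-z] case can be tried out quickly. The sketch below uses Python's `re` module rather than PCRE, on the assumption (which holds for this particular class) that it treats a hyphen immediately after a range as a literal in the same way:

```python
import re

# A hyphen right after the range b-d cannot start another range,
# so it is taken as a literal hyphen, followed by a literal z.
hyphen_class = re.compile(r"[b-d-z]")

# Keep only the sample characters that the class accepts.
matched = [ch for ch in "abcdez-" if hyphen_class.fullmatch(ch)]
print(matched)
```

Escaping the hyphen, as in [b-d\-z], expresses the same class without relying on position.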
 |  <P> | 
  <P> | 
 |  It is not possible to have the literal character "]" as the end character of a | 
  It is not possible to have the literal character "]" as the end character of a | 
| 
 Line 1092  followed by two other characters. The octal or hexadec
 | 
 Line 1272  followed by two other characters. The octal or hexadec
 | 
 |  "]" can also be used to end a range. | 
  "]" can also be used to end a range. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |   | 
  An error is generated if a POSIX character class (see below) or an escape | 
 |   | 
  sequence other than one that defines a single character appears at a point | 
 |   | 
  where a range ending character is expected. For example, [z-\xff] is valid, | 
 |   | 
  but [A-\d] and [A-[:digit:]] are not. | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |  Ranges operate in the collating sequence of character values. They can also be | 
  Ranges operate in the collating sequence of character values. They can also be | 
| used for characters specified numerically, for example [\000-\037]. In UTF-8 | used for characters specified numerically, for example [\000-\037]. Ranges | 
| mode, ranges can include characters whose values are greater than 255, for | can include any characters that are valid for the current mode. | 
| example [\x{100}-\x{2ff}]. |   | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  If a range that includes letters is used when caseless matching is set, it | 
  If a range that includes letters is used when caseless matching is set, it | 
 |  matches the letters in either case. For example, [W-c] is equivalent to | 
  matches the letters in either case. For example, [W-c] is equivalent to | 
| [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if character | [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character | 
 |  tables for a French locale are in use, [\xc8-\xcb] matches accented E | 
  tables for a French locale are in use, [\xc8-\xcb] matches accented E | 
| characters in both cases. In UTF-8 mode, PCRE supports the concept of case for | characters in both cases. In UTF modes, PCRE supports the concept of case for | 
 |  characters with values greater than 128 only when it is compiled with Unicode | 
  characters with values greater than 128 only when it is compiled with Unicode | 
 |  property support. | 
  property support. | 
 |  </P> | 
  </P> | 
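The stated equivalence between a caseless [W-c] and the spelled-out class can be checked over the ASCII range. This is a sketch using Python's `re` module as a stand-in for PCRE; both engines accept "]" as a literal when it is the first character of a class, but other behaviour may differ:

```python
import re

# The range form and the spelled-out form from the text, both caseless.
range_class = re.compile(r"[W-c]", re.IGNORECASE)
spelled_out = re.compile(r"[][\\^_`wxyzabc]", re.IGNORECASE)

# Collect every ASCII character that each class accepts.
ascii_chars = [chr(i) for i in range(128)]
range_hits = [c for c in ascii_chars if range_class.fullmatch(c)]
spelled_hits = [c for c in ascii_chars if spelled_out.fullmatch(c)]

print(range_hits == spelled_hits)
```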
| 
 Line 1110  property support.
 | 
 Line 1295  property support.
 | 
 |  The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, | 
  The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, | 
 |  \V, \w, and \W may appear in a character class, and add the characters that | 
  \V, \w, and \W may appear in a character class, and add the characters that | 
 |  they match to the class. For example, [\dABCDEF] matches any hexadecimal | 
  they match to the class. For example, [\dABCDEF] matches any hexadecimal | 
| digit. In UTF-8 mode, the PCRE_UCP option affects the meanings of \d, \s, \w | digit. In UTF modes, the PCRE_UCP option affects the meanings of \d, \s, \w | 
 |  and their upper case partners, just as it does when they appear outside a | 
  and their upper case partners, just as it does when they appear outside a | 
 |  character class, as described in the section entitled | 
  character class, as described in the section entitled | 
 |  <a href="#genericchartypes">"Generic character types"</a> | 
  <a href="#genericchartypes">"Generic character types"</a> | 
| 
 Line 1132  something AND NOT ...".
 | 
 Line 1317  something AND NOT ...".
 | 
 |  The only metacharacters that are recognized in character classes are backslash, | 
  The only metacharacters that are recognized in character classes are backslash, | 
 |  hyphen (only where it can be interpreted as specifying a range), circumflex | 
  hyphen (only where it can be interpreted as specifying a range), circumflex | 
 |  (only at the start), opening square bracket (only when it can be interpreted as | 
  (only at the start), opening square bracket (only when it can be interpreted as | 
| introducing a POSIX class name - see the next section), and the terminating | introducing a POSIX class name, or for a special compatibility feature - see | 
| closing square bracket. However, escaping other non-alphanumeric characters | the next two sections), and the terminating closing square bracket. However, | 
| does no harm. | escaping other non-alphanumeric characters does no harm. | 
 |  </P> | 
  </P> | 
| <br><a name="SEC9" href="#TOC1">POSIX CHARACTER CLASSES</a><br> | <br><a name="SEC10" href="#TOC1">POSIX CHARACTER CLASSES</a><br> | 
 |  <P> | 
  <P> | 
 |  Perl supports the POSIX notation for character classes. This uses names | 
  Perl supports the POSIX notation for character classes. This uses names | 
 |  enclosed by [: and :] within the enclosing square brackets. PCRE also supports | 
  enclosed by [: and :] within the enclosing square brackets. PCRE also supports | 
| 
 Line 1157  are:
 | 
 Line 1342  are:
 | 
 |    lower    lower case letters | 
    lower    lower case letters | 
 |    print    printing characters, including space | 
    print    printing characters, including space | 
 |    punct    printing characters, excluding letters and digits and space | 
    punct    printing characters, excluding letters and digits and space | 
|   space    white space (not quite the same as \s) |   space    white space (the same as \s from PCRE 8.34) | 
 |    upper    upper case letters | 
    upper    upper case letters | 
 |    word     "word" characters (same as \w) | 
    word     "word" characters (same as \w) | 
 |    xdigit   hexadecimal digits | 
    xdigit   hexadecimal digits | 
 |  </pre> | 
  </pre> | 
| The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and | The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), | 
| space (32). Notice that this list includes the VT character (code 11). This | and space (32). If locale-specific matching is taking place, the list of space | 
| makes "space" different to \s, which does not include VT (for Perl | characters may be different; there may be fewer or more of them. "Space" used | 
| compatibility). | to be different to \s, which did not include VT, for Perl compatibility. | 
|   | However, Perl changed at release 5.18, and PCRE followed at release 8.34. | 
|   | "Space" and \s now match the same set of characters. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  The name "word" is a Perl extension, and "blank" is a GNU extension from Perl | 
  The name "word" is a Perl extension, and "blank" is a GNU extension from Perl | 
| 
 Line 1179  syntax [.ch.] and [=ch=] where "ch" is a "collating el
 | 
 Line 1366  syntax [.ch.] and [=ch=] where "ch" is a "collating el
 | 
 |  supported, and an error is given if they are encountered. | 
  supported, and an error is given if they are encountered. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| By default, in UTF-8 mode, characters with values greater than 128 do not match | By default, characters with values greater than 128 do not match any of the | 
| any of the POSIX character classes. However, if the PCRE_UCP option is passed | POSIX character classes. However, if the PCRE_UCP option is passed to | 
| to <b>pcre_compile()</b>, some of the classes are changed so that Unicode | <b>pcre_compile()</b>, some of the classes are changed so that Unicode character | 
| character properties are used. This is achieved by replacing the POSIX classes | properties are used. This is achieved by replacing certain POSIX classes by | 
| by other sequences, as follows: | other sequences, as follows: | 
 |  <pre> | 
  <pre> | 
 |    [:alnum:]  becomes  \p{Xan} | 
    [:alnum:]  becomes  \p{Xan} | 
 |    [:alpha:]  becomes  \p{L} | 
    [:alpha:]  becomes  \p{L} | 
| 
 Line 1194  by other sequences, as follows:
 | 
 Line 1381  by other sequences, as follows:
 | 
 |    [:upper:]  becomes  \p{Lu} | 
    [:upper:]  becomes  \p{Lu} | 
 |    [:word:]   becomes  \p{Xwd} | 
    [:word:]   becomes  \p{Xwd} | 
 |  </pre> | 
  </pre> | 
| Negated versions, such as [:^alpha:] use \P instead of \p. The other POSIX | Negated versions, such as [:^alpha:] use \P instead of \p. Three other POSIX | 
| classes are unchanged, and match only characters with code points less than | classes are handled specially in UCP mode: | 
| 128. |   | 
 |  </P> | 
  </P> | 
 |  <br><a name="SEC10" href="#TOC1">VERTICAL BAR</a><br> | 
   | 
 |  <P> | 
  <P> | 
 |   | 
  [:graph:] | 
 |   | 
  This matches characters that have glyphs that mark the page when printed. In | 
 |   | 
  Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf | 
 |   | 
  properties, except for: | 
 |   | 
  <pre> | 
 |   | 
    U+061C           Arabic Letter Mark | 
 |   | 
    U+180E           Mongolian Vowel Separator | 
 |   | 
    U+2066 - U+2069  Various "isolate"s | 
 |   | 
   | 
 |   | 
  </PRE> | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |   | 
  [:print:] | 
 |   | 
  This matches the same characters as [:graph:] plus space characters that are | 
 |   | 
  not controls, that is, characters with the Zs property. | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |   | 
  [:punct:] | 
 |   | 
  This matches all characters that have the Unicode P (punctuation) property, | 
 |   | 
  plus those characters whose code points are less than 128 that have the S | 
 |   | 
  (Symbol) property. | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |   | 
  The other POSIX classes are unchanged, and match only characters with code | 
 |   | 
  points less than 128. | 
 |   | 
  </P> | 
 |   | 
  <br><a name="SEC11" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br> | 
 |   | 
  <P> | 
 |   | 
  In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly | 
 |   | 
  syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of | 
 |   | 
  word". PCRE treats these items as follows: | 
 |   | 
  <pre> | 
 |   | 
    [[:<:]]  is converted to  \b(?=\w) | 
 |   | 
    [[:>:]]  is converted to  \b(?<=\w) | 
 |   | 
  </pre> | 
 |   | 
  Only these exact character sequences are recognized. A sequence such as | 
 |   | 
  [a[:<:]b] provokes an error for an unrecognized POSIX class name. This support is | 
 |   | 
  not compatible with Perl. It is provided to help migrations from other | 
 |   | 
  environments, and is best not used in any new patterns. Note that \b matches | 
 |   | 
  at the start and the end of a word (see | 
 |   | 
  <a href="#smallassertions">"Simple assertions"</a> | 
 |   | 
  above), and in a Perl-style pattern the preceding or following character | 
 |   | 
  normally shows which is wanted, without the need for the assertions that are | 
 |   | 
  used above in order to give exactly the POSIX behaviour. | 
 |   | 
  </P> | 
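The two converted forms, \b(?=\w) and \b(?<=\w), use only constructs that Python's `re` module also supports, so their effect can be sketched there. This shows the replacement patterns only; the [[:<:]] and [[:>:]] syntax itself is not assumed to be accepted by Python:

```python
import re

subject = "hi there"

# \b(?=\w): a word boundary with a word character after it (start of word).
word_starts = [m.start() for m in re.finditer(r"\b(?=\w)", subject)]

# \b(?<=\w): a word boundary with a word character before it (end of word).
word_ends = [m.start() for m in re.finditer(r"\b(?<=\w)", subject)]

print(word_starts, word_ends)
```

A bare \b would report all four positions, which is why the extra assertions are needed to reproduce the POSIX behaviour exactly.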
 |   | 
  <br><a name="SEC12" href="#TOC1">VERTICAL BAR</a><br> | 
 |   | 
  <P> | 
 |  Vertical bar characters are used to separate alternative patterns. For example, | 
  Vertical bar characters are used to separate alternative patterns. For example, | 
 |  the pattern | 
  the pattern | 
 |  <pre> | 
  <pre> | 
| 
 Line 1213  that succeeds is used. If the alternatives are within 
 | 
 Line 1445  that succeeds is used. If the alternatives are within 
 | 
 |  "succeeds" means matching the rest of the main pattern as well as the | 
  "succeeds" means matching the rest of the main pattern as well as the | 
 |  alternative in the subpattern. | 
  alternative in the subpattern. | 
 |  </P> | 
  </P> | 
| <br><a name="SEC11" href="#TOC1">INTERNAL OPTION SETTING</a><br> | <br><a name="SEC13" href="#TOC1">INTERNAL OPTION SETTING</a><br> | 
 |  <P> | 
  <P> | 
 |  The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and | 
  The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and | 
 |  PCRE_EXTENDED options (which are Perl-compatible) can be changed from within | 
  PCRE_EXTENDED options (which are Perl-compatible) can be changed from within | 
| 
 Line 1264  behaviour otherwise.
 | 
 Line 1496  behaviour otherwise.
 | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  <b>Note:</b> There are other PCRE-specific options that can be set by the | 
  <b>Note:</b> There are other PCRE-specific options that can be set by the | 
| application when the compile or match functions are called. In some cases the | application when the compiling or matching functions are called. In some cases | 
| pattern can contain special leading sequences such as (*CRLF) to override what | the pattern can contain special leading sequences such as (*CRLF) to override | 
| the application has set or what has been defaulted. Details are given in the | what the application has set or what has been defaulted. Details are given in | 
| section entitled | the section entitled | 
 |  <a href="#newlineseq">"Newline sequences"</a> | 
  <a href="#newlineseq">"Newline sequences"</a> | 
| above. There are also the (*UTF8) and (*UCP) leading sequences that can be used | above. There are also the (*UTF8), (*UTF16), (*UTF32), and (*UCP) leading | 
| to set UTF-8 and Unicode property modes; they are equivalent to setting the | sequences that can be used to set UTF and Unicode property modes; they are | 
| PCRE_UTF8 and the PCRE_UCP options, respectively. | equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP | 
|   | options, respectively. The (*UTF) sequence is a generic version that can be | 
|   | used with any of the libraries. However, the application can set the | 
|   | PCRE_NEVER_UTF option, which locks out the use of the (*UTF) sequences. | 
 |  <a name="subpattern"></a></P> | 
  <a name="subpattern"></a></P> | 
| <br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br> | <br><a name="SEC14" href="#TOC1">SUBPATTERNS</a><br> | 
 |  <P> | 
  <P> | 
 |  Subpatterns are delimited by parentheses (round brackets), which can be nested. | 
  Subpatterns are delimited by parentheses (round brackets), which can be nested. | 
 |  Turning part of a pattern into a subpattern does two things: | 
  Turning part of a pattern into a subpattern does two things: | 
| 
 Line 1289  match "cataract", "erpillar" or an empty string.
 | 
 Line 1524  match "cataract", "erpillar" or an empty string.
 | 
 |  <br> | 
  <br> | 
 |  2. It sets up the subpattern as a capturing subpattern. This means that, when | 
  2. It sets up the subpattern as a capturing subpattern. This means that, when | 
 |  the whole pattern matches, that portion of the subject string that matched the | 
  the whole pattern matches, that portion of the subject string that matched the | 
| subpattern is passed back to the caller via the <i>ovector</i> argument of | subpattern is passed back to the caller via the <i>ovector</i> argument of the | 
| <b>pcre_exec()</b>. Opening parentheses are counted from left to right (starting | matching function. (This applies only to the traditional matching functions; | 
| from 1) to obtain numbers for the capturing subpatterns. For example, if the | the DFA matching functions do not support capturing.) | 
| string "the red king" is matched against the pattern | </P> | 
|   | <P> | 
|   | Opening parentheses are counted from left to right (starting from 1) to obtain | 
|   | numbers for the capturing subpatterns. For example, if the string "the red | 
|   | king" is matched against the pattern | 
 |  <pre> | 
  <pre> | 
 |    the ((red|white) (king|queen)) | 
    the ((red|white) (king|queen)) | 
 |  </pre> | 
  </pre> | 
| 
 Line 1325  from left to right, and options are not reset until th
 | 
 Line 1564  from left to right, and options are not reset until th
 | 
 |  is reached, an option setting in one branch does affect subsequent branches, so | 
  is reached, an option setting in one branch does affect subsequent branches, so | 
 |  the above patterns match "SUNDAY" as well as "Saturday". | 
  the above patterns match "SUNDAY" as well as "Saturday". | 
 |  <a name="dupsubpatternnumber"></a></P> | 
  <a name="dupsubpatternnumber"></a></P> | 
| <br><a name="SEC13" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br> | <br><a name="SEC15" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br> | 
 |  <P> | 
  <P> | 
 |  Perl 5.10 introduced a feature whereby each alternative in a subpattern uses | 
  Perl 5.10 introduced a feature whereby each alternative in a subpattern uses | 
 |  the same numbers for its capturing parentheses. Such a subpattern starts with | 
  the same numbers for its capturing parentheses. Such a subpattern starts with | 
| 
 Line 1369  true if any of the subpatterns of that number have mat
 | 
 Line 1608  true if any of the subpatterns of that number have mat
 | 
 |  An alternative approach to using this "branch reset" feature is to use | 
  An alternative approach to using this "branch reset" feature is to use | 
 |  duplicate named subpatterns, as described in the next section. | 
  duplicate named subpatterns, as described in the next section. | 
 |  </P> | 
  </P> | 
| <br><a name="SEC14" href="#TOC1">NAMED SUBPATTERNS</a><br> | <br><a name="SEC16" href="#TOC1">NAMED SUBPATTERNS</a><br> | 
 |  <P> | 
  <P> | 
 |  Identifying capturing parentheses by number is simple, but it can be very hard | 
  Identifying capturing parentheses by number is simple, but it can be very hard | 
 |  to keep track of the numbers in complicated regular expressions. Furthermore, | 
  to keep track of the numbers in complicated regular expressions. Furthermore, | 
| 
 Line 1391  and
 | 
 Line 1630  and
 | 
 |  can be made by name as well as by number. | 
  can be made by name as well as by number. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| Names consist of up to 32 alphanumeric characters and underscores. Named | Names consist of up to 32 alphanumeric characters and underscores, but must | 
| capturing parentheses are still allocated numbers as well as names, exactly as | start with a non-digit. Named capturing parentheses are still allocated numbers | 
| if the names were not present. The PCRE API provides function calls for | as well as names, exactly as if the names were not present. The PCRE API | 
| extracting the name-to-number translation table from a compiled pattern. There | provides function calls for extracting the name-to-number translation table | 
| is also a convenience function for extracting a captured substring by name. | from a compiled pattern. There is also a convenience function for extracting a | 
|   | captured substring by name. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  By default, a name must be unique within a pattern, but it is possible to relax | 
  By default, a name must be unique within a pattern, but it is possible to relax | 
| 
 Line 1424  matched. This saves searching to find which numbered s
 | 
 Line 1664  matched. This saves searching to find which numbered s
 | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  If you make a back reference to a non-unique named subpattern from elsewhere in | 
  If you make a back reference to a non-unique named subpattern from elsewhere in | 
| the pattern, the one that corresponds to the first occurrence of the name is | the pattern, the subpatterns to which the name refers are checked in the order | 
| used. In the absence of duplicate numbers (see the previous section) this is | in which they appear in the overall pattern. The first one that is set is used | 
| the one with the lowest number. If you use a named reference in a condition | for the reference. For example, this pattern matches both "foofoo" and | 
|   | "barbar" but not "foobar" or "barfoo": | 
|   | <pre> | 
|   |   (?:(?<n>foo)|(?<n>bar))\k<n> | 
|   |  | 
|   | </PRE> | 
|   | </P> | 
|   | <P> | 
|   | If you make a subroutine call to a non-unique named subpattern, the one that | 
|   | corresponds to the first occurrence of the name is used. In the absence of | 
|   | duplicate numbers (see the previous section) this is the one with the lowest | 
|   | number. | 
|   | </P> | 
|   | <P> | 
|   | If you use a named reference in a condition | 
 |  test (see the | 
  test (see the | 
 |  <a href="#conditions">section about conditions</a> | 
  <a href="#conditions">section about conditions</a> | 
 |  below), either to check whether a subpattern has matched, or to check for | 
  below), either to check whether a subpattern has matched, or to check for | 
| 
 Line 1441  documentation.
 | 
 Line 1695  documentation.
 | 
 |  <b>Warning:</b> You cannot use different names to distinguish between two | 
  <b>Warning:</b> You cannot use different names to distinguish between two | 
 |  subpatterns with the same number because PCRE uses only the numbers when | 
  subpatterns with the same number because PCRE uses only the numbers when | 
 |  matching. For this reason, an error is given at compile time if different names | 
  matching. For this reason, an error is given at compile time if different names | 
| are given to subpatterns with the same number. However, you can give the same | are given to subpatterns with the same number. However, you can always give the | 
| name to subpatterns with the same number, even when PCRE_DUPNAMES is not set. | same name to subpatterns with the same number, even when PCRE_DUPNAMES is not | 
|   | set. | 
 |  </P> | 
  </P> | 
| <br><a name="SEC15" href="#TOC1">REPETITION</a><br> | <br><a name="SEC17" href="#TOC1">REPETITION</a><br> | 
 |  <P> | 
  <P> | 
 |  Repetition is specified by quantifiers, which can follow any of the following | 
  Repetition is specified by quantifiers, which can follow any of the following | 
 |  items: | 
  items: | 
| 
 Line 1452  items:
 | 
 Line 1707  items:
 | 
 |    a literal data character | 
    a literal data character | 
 |    the dot metacharacter | 
    the dot metacharacter | 
 |    the \C escape sequence | 
    the \C escape sequence | 
|   the \X escape sequence (in UTF-8 mode with Unicode properties) |   the \X escape sequence | 
 |    the \R escape sequence | 
    the \R escape sequence | 
 |    an escape such as \d or \pL that matches a single character | 
    an escape such as \d or \pL that matches a single character | 
 |    a character class | 
    a character class | 
| 
 Line 1484  quantifier, is taken as a literal character. For examp
 | 
 Line 1739  quantifier, is taken as a literal character. For examp
 | 
 |  quantifier, but a literal string of four characters. | 
  quantifier, but a literal string of four characters. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual | In UTF modes, quantifiers apply to characters rather than to individual data | 
| bytes. Thus, for example, \x{100}{2} matches two UTF-8 characters, each of | units. Thus, for example, \x{100}{2} matches two characters, each of | 
| which is represented by a two-byte sequence. Similarly, when Unicode property | which is represented by a two-byte sequence in a UTF-8 string. Similarly, | 
| support is available, \X{3} matches three Unicode extended sequences, each of | \X{3} matches three Unicode extended grapheme clusters, each of which may be | 
| which may be several bytes long (and they may be of different lengths). | several data units long (and they may be of different lengths). | 
 |  </P> | 
  </P> | 
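The difference between counting characters and counting data units can be seen with U+0100. The sketch below uses Python rather than PCRE, and writes the character as \u0100 because Python's `re` module does not use the \x{100} notation:

```python
import re

two_chars = "\u0100\u0100"

# The {2} quantifier applies to the whole character, not to one byte of it.
assert re.fullmatch("\u0100{2}", two_chars)

# Two characters, each encoded as a two-byte sequence in UTF-8.
char_count = len(two_chars)
byte_count = len(two_chars.encode("utf-8"))
print(char_count, byte_count)
```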
 |  <P> | 
  <P> | 
 |  The quantifier {0} is permitted, causing the expression to behave as if the | 
  The quantifier {0} is permitted, causing the expression to behave as if the | 
| 
 Line 1577  worth setting PCRE_DOTALL in order to obtain this opti
 | 
 Line 1832  worth setting PCRE_DOTALL in order to obtain this opti
 | 
 |  alternatively using ^ to indicate anchoring explicitly. | 
  alternatively using ^ to indicate anchoring explicitly. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| However, there is one situation where the optimization cannot be used. When .* | However, there are some cases where the optimization cannot be used. When .* | 
 |  is inside capturing parentheses that are the subject of a back reference | 
  is inside capturing parentheses that are the subject of a back reference | 
 |  elsewhere in the pattern, a match at the start may fail where a later one | 
  elsewhere in the pattern, a match at the start may fail where a later one | 
 |  succeeds. Consider, for example: | 
  succeeds. Consider, for example: | 
| 
 Line 1588  If the subject is "xyz123abc123" the match point is th
 | 
 Line 1843  If the subject is "xyz123abc123" the match point is th
 | 
 |  this reason, such a pattern is not implicitly anchored. | 
  this reason, such a pattern is not implicitly anchored. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |   | 
  Another case where implicit anchoring is not applied is when the leading .* is | 
 |   | 
  inside an atomic group. Once again, a match at the start may fail where a later | 
 |   | 
  one succeeds. Consider this pattern: | 
 |   | 
  <pre> | 
 |   | 
    (?>.*?a)b | 
 |   | 
  </pre> | 
 |   | 
  It matches "ab" in the subject "aab". The use of the backtracking control verbs | 
 |   | 
  (*PRUNE) and (*SKIP) also disables this optimization. | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |  When a capturing subpattern is repeated, the value captured is the substring | 
  When a capturing subpattern is repeated, the value captured is the substring | 
 |  that matched the final iteration. For example, after | 
  that matched the final iteration. For example, after | 
 |  <pre> | 
  <pre> | 
| 
 Line 1602  example, after
 | 
 Line 1867  example, after
 | 
 |  </pre> | 
  </pre> | 
 |  matches "aba" the value of the second captured substring is "b". | 
  matches "aba" the value of the second captured substring is "b". | 
 |  <a name="atomicgroup"></a></P> | 
  <a name="atomicgroup"></a></P> | 
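<P>
As a rough illustration of the "final iteration" rule, the sketch below uses
Python's <b>re</b> module, which behaves the same way for this case. The
pattern (a|(b))+ is an assumption chosen to fit the description above (the
original example is not shown in this excerpt).
</P>

```python
import re

# Outer group 1 is repeated; group 2 is nested in its second alternative.
m = re.fullmatch(r'(a|(b))+', 'aba')

# Group 1 holds what the final iteration matched. Group 2 did not take
# part in the final iteration, so it keeps the value from the iteration
# in which it was last set.
groups = m.groups()
print(groups)
```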
| <br><a name="SEC16" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br> | <br><a name="SEC18" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br> | 
 |  <P> | 
  <P> | 
 |  With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") | 
  With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") | 
 |  repetition, failure of what follows normally causes the repeated item to be | 
  repetition, failure of what follows normally causes the repeated item to be | 
| 
 Line 1706  an atomic group, like this:
 | 
 Line 1971  an atomic group, like this:
 | 
 |  </pre> | 
  </pre> | 
 |  sequences of non-digits cannot be broken, and failure happens quickly. | 
  sequences of non-digits cannot be broken, and failure happens quickly. | 
 |  <a name="backreferences"></a></P> | 
  <a name="backreferences"></a></P> | 
| <br><a name="SEC17" href="#TOC1">BACK REFERENCES</a><br> | <br><a name="SEC19" href="#TOC1">BACK REFERENCES</a><br> | 
 |  <P> | 
  <P> | 
 |  Outside a character class, a backslash followed by a digit greater than 0 (and | 
  Outside a character class, a backslash followed by a digit greater than 0 (and | 
 |  possibly further digits) is a back reference to a capturing subpattern earlier | 
  possibly further digits) is a back reference to a capturing subpattern earlier | 
| 
 Line 1805  Because there may be many capturing parentheses in a p
 | 
 Line 2070  Because there may be many capturing parentheses in a p
 | 
 |  following a backslash are taken as part of a potential back reference number. | 
  following a backslash are taken as part of a potential back reference number. | 
 |  If the pattern continues with a digit character, some delimiter must be used to | 
  If the pattern continues with a digit character, some delimiter must be used to | 
 |  terminate the back reference. If the PCRE_EXTENDED option is set, this can be | 
  terminate the back reference. If the PCRE_EXTENDED option is set, this can be | 
| whitespace. Otherwise, the \g{ syntax or an empty comment (see | white space. Otherwise, the \g{ syntax or an empty comment (see | 
 |  <a href="#comments">"Comments"</a> | 
  <a href="#comments">"Comments"</a> | 
 |  below) can be used. | 
  below) can be used. | 
 |  </P> | 
  </P> | 
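<P>
The empty-comment delimiter can be sketched with Python's <b>re</b> module,
which also accepts (?#...) comments. This is an illustration only: Python
rejects the ambiguous form outright instead of interpreting it as PCRE would,
and the helper name compiles is hypothetical.
</P>

```python
import re


def compiles(pattern):
    """Return True if the pattern is accepted by the regex compiler."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Intended meaning: back reference 1 followed by the literal digit "1".
# Written naively, the digits run together and are read as reference 11.
ambiguous_ok = compiles(r'(a)\11')

# An empty comment terminates the back reference number.
delimited = re.fullmatch(r'(a)\1(?#)1', 'aa1')

print(ambiguous_ok, delimited is not None)
```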
| 
 Line 1834  as an
 | 
 Line 2099  as an
 | 
 |  Once the whole group has been matched, a subsequent matching failure cannot | 
  Once the whole group has been matched, a subsequent matching failure cannot | 
 |  cause backtracking into the middle of the group. | 
  cause backtracking into the middle of the group. | 
 |  <a name="bigassertions"></a></P> | 
  <a name="bigassertions"></a></P> | 
| <br><a name="SEC18" href="#TOC1">ASSERTIONS</a><br> | <br><a name="SEC20" href="#TOC1">ASSERTIONS</a><br> | 
 |  <P> | 
  <P> | 
 |  An assertion is a test on the characters following or preceding the current | 
  An assertion is a test on the characters following or preceding the current | 
 |  matching point that does not actually consume any characters. The simple | 
  matching point that does not actually consume any characters. The simple | 
| 
 Line 1851  except that it does not cause the current matching pos
 | 
 Line 2116  except that it does not cause the current matching pos
 | 
 |  Assertion subpatterns are not capturing subpatterns. If such an assertion | 
  Assertion subpatterns are not capturing subpatterns. If such an assertion | 
 |  contains capturing subpatterns within it, these are counted for the purposes of | 
  contains capturing subpatterns within it, these are counted for the purposes of | 
 |  numbering the capturing subpatterns in the whole pattern. However, substring | 
  numbering the capturing subpatterns in the whole pattern. However, substring | 
| capturing is carried out only for positive assertions, because it does not make | capturing is carried out only for positive assertions. (Perl sometimes, but not | 
| sense for negative assertions. | always, does do capturing in negative assertions.) | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  For compatibility with Perl, assertion subpatterns may be repeated; though | 
  For compatibility with Perl, assertion subpatterns may be repeated; though | 
| 
 Line 1950  match. If there are insufficient characters before the
 | 
 Line 2215  match. If there are insufficient characters before the
 | 
 |  assertion fails. | 
  assertion fails. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| In UTF-8 mode, PCRE does not allow the \C escape (which matches a single byte, | In a UTF mode, PCRE does not allow the \C escape (which matches a single data | 
| even in UTF-8 mode) to appear in lookbehind assertions, because it makes it | unit even in a UTF mode) to appear in lookbehind assertions, because it makes | 
| impossible to calculate the length of the lookbehind. The \X and \R escapes, | it impossible to calculate the length of the lookbehind. The \X and \R | 
| which can match different numbers of bytes, are also not permitted. | escapes, which can match different numbers of data units, are also not | 
|   | permitted. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  <a href="#subpatternsassubroutines">"Subroutine"</a> | 
  <a href="#subpatternsassubroutines">"Subroutine"</a> | 
| 
 Line 2023  preceded by "foo", while
 | 
 Line 2289  preceded by "foo", while
 | 
 |  is another pattern that matches "foo" preceded by three digits and any three | 
  is another pattern that matches "foo" preceded by three digits and any three | 
 |  characters that are not "999". | 
  characters that are not "999". | 
 |  <a name="conditions"></a></P> | 
  <a name="conditions"></a></P> | 
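<P>
A pattern fitting the description above can be checked with Python's
<b>re</b> module, which supports fixed-length lookbehind in the same way for
this case. The exact pattern text is an assumption reconstructed from the
description, and the names PATTERN and matches are illustrative.
</P>

```python
import re

# "foo" preceded by six characters: three digits, then any three
# characters that are not "999". Both lookbehinds are fixed-length.
PATTERN = re.compile(r'(?<=\d{3}...)(?<!999)foo')


def matches(subject):
    """Return True if the pattern matches anywhere in the subject."""
    return PATTERN.search(subject) is not None


for subject in ("123abcfoo", "123999foo", "abc123foo", "12foo"):
    print(subject, matches(subject))
```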
| <br><a name="SEC19" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br> | <br><a name="SEC21" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br> | 
 |  <P> | 
  <P> | 
 |  It is possible to cause the matching process to obey a subpattern | 
  It is possible to cause the matching process to obey a subpattern | 
 |  conditionally or to choose between two alternative subpatterns, depending on | 
  conditionally or to choose between two alternative subpatterns, depending on | 
| 
 Line 2097  Checking for a used subpattern by name
 | 
 Line 2363  Checking for a used subpattern by name
 | 
 |  <P> | 
  <P> | 
 |  Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used | 
  Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used | 
 |  subpattern by name. For compatibility with earlier versions of PCRE, which had | 
  subpattern by name. For compatibility with earlier versions of PCRE, which had | 
| this facility before Perl, the syntax (?(name)...) is also recognized. However, | this facility before Perl, the syntax (?(name)...) is also recognized. | 
| there is a possible ambiguity with this syntax, because subpattern names may |   | 
| consist entirely of digits. PCRE looks first for a named subpattern; if it |   | 
| cannot find one and the name consists entirely of digits, PCRE looks for a |   | 
| subpattern of that number, which must be greater than zero. Using subpattern |   | 
| names that consist entirely of digits is not recommended. |   | 
 |  </P> | 
  </P> | 
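<P>
The (?(name)...) form can be tried in Python's <b>re</b> module, which accepts
the same condition syntax. Note that Python spells the named group itself as
(?P&lt;OPEN&gt;...) rather than (?&lt;OPEN&gt;...). The group name OPEN and the
helper balanced are assumptions for this sketch, which accepts a run of
non-parenthesis characters that is either bare or fully parenthesized.
</P>

```python
import re

# Optional opening parenthesis captured as OPEN; the closing parenthesis
# is required only if OPEN took part in the match.
PATTERN = re.compile(r'(?P<OPEN>\()?[^()]+(?(OPEN)\))')


def balanced(subject):
    """Return True if the whole subject matches the pattern."""
    return PATTERN.fullmatch(subject) is not None


for subject in ("(abc)", "abc", "(abc", "abc)"):
    print(subject, balanced(subject))
```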
 |  <P> | 
  <P> | 
 |  Rewriting the above example to use a named subpattern gives this: | 
  Rewriting the above example to use a named subpattern gives this: | 
| 
 Line 2146  point in the pattern; the idea of DEFINE is that it ca
 | 
 Line 2407  point in the pattern; the idea of DEFINE is that it ca
 | 
 |  subroutines that can be referenced from elsewhere. (The use of | 
  subroutines that can be referenced from elsewhere. (The use of | 
 |  <a href="#subpatternsassubroutines">subroutines</a> | 
  <a href="#subpatternsassubroutines">subroutines</a> | 
 |  is described below.) For example, a pattern to match an IPv4 address such as | 
  is described below.) For example, a pattern to match an IPv4 address such as | 
| "192.168.23.245" could be written like this (ignore whitespace and line | "192.168.23.245" could be written like this (ignore white space and line | 
 |  breaks): | 
  breaks): | 
 |  <pre> | 
  <pre> | 
 |    (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) | 
    (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) | 
| 
 Line 2178  subject is matched against the first alternative; othe
 | 
 Line 2439  subject is matched against the first alternative; othe
 | 
 |  against the second. This pattern matches strings in one of the two forms | 
  against the second. This pattern matches strings in one of the two forms | 
 |  dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. | 
  dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. | 
 |  <a name="comments"></a></P> | 
  <a name="comments"></a></P> | 
| <br><a name="SEC20" href="#TOC1">COMMENTS</a><br> | <br><a name="SEC22" href="#TOC1">COMMENTS</a><br> | 
 |  <P> | 
  <P> | 
 |  There are two ways of including comments in patterns that are processed by | 
  There are two ways of including comments in patterns that are processed by | 
 |  PCRE. In both cases, the start of the comment must not be in a character class, | 
  PCRE. In both cases, the start of the comment must not be in a character class, | 
| 
 Line 2192  closing parenthesis. Nested parentheses are not permit
 | 
 Line 2453  closing parenthesis. Nested parentheses are not permit
 | 
 |  option is set, an unescaped # character also introduces a comment, which in | 
  option is set, an unescaped # character also introduces a comment, which in | 
 |  this case continues to immediately after the next newline character or | 
  this case continues to immediately after the next newline character or | 
 |  character sequence in the pattern. Which characters are interpreted as newlines | 
  character sequence in the pattern. Which characters are interpreted as newlines | 
| is controlled by the options passed to <b>pcre_compile()</b> or by a special | is controlled by the options passed to a compiling function or by a special | 
 |  sequence at the start of the pattern, as described in the section entitled | 
  sequence at the start of the pattern, as described in the section entitled | 
 |  <a href="#newlines">"Newline conventions"</a> | 
  <a href="#newlines">"Newline conventions"</a> | 
 |  above. Note that the end of this type of comment is a literal newline sequence | 
  above. Note that the end of this type of comment is a literal newline sequence | 
| 
 Line 2207  a newline in the pattern. The sequence \n is still lit
 | 
 Line 2468  a newline in the pattern. The sequence \n is still lit
 | 
 |  it does not terminate the comment. Only an actual character with the code value | 
  it does not terminate the comment. Only an actual character with the code value | 
 |  0x0a (the default newline) does so. | 
  0x0a (the default newline) does so. | 
 |  <a name="recursion"></a></P> | 
  <a name="recursion"></a></P> | 
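<P>
The difference between a literal newline and the two-character sequence \n
inside a # comment can be seen in Python's <b>re</b> module, whose re.VERBOSE
flag plays a role similar to PCRE_EXTENDED. This is a sketch of the analogous
behaviour in another engine, not a test of PCRE itself.
</P>

```python
import re

# Raw string: the pattern contains a backslash followed by "n", not a
# newline character, so the comment runs to the end of the pattern.
literal_escape = re.fullmatch(
    r'abc # note \n still inside the comment', 'abc', re.VERBOSE)

# Ordinary string: "\n" is a real newline (code value 0x0a), which ends
# the comment, so "def" is part of the pattern again.
real_newline = re.fullmatch('abc # note\ndef', 'abcdef', re.VERBOSE)

print(literal_escape is not None, real_newline is not None)
```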
| <br><a name="SEC21" href="#TOC1">RECURSIVE PATTERNS</a><br> | <br><a name="SEC23" href="#TOC1">RECURSIVE PATTERNS</a><br> | 
 |  <P> | 
  <P> | 
 |  Consider the problem of matching a string in parentheses, allowing for | 
  Consider the problem of matching a string in parentheses, allowing for | 
 |  unlimited nested parentheses. Without the use of recursion, the best that can | 
  unlimited nested parentheses. Without the use of recursion, the best that can | 
| 
 Line 2422  now match "b" and so the whole match succeeds. In Perl
 | 
 Line 2683  now match "b" and so the whole match succeeds. In Perl
 | 
 |  match because inside the recursive call \1 cannot access the externally set | 
  match because inside the recursive call \1 cannot access the externally set | 
 |  value. | 
  value. | 
 |  <a name="subpatternsassubroutines"></a></P> | 
  <a name="subpatternsassubroutines"></a></P> | 
| <br><a name="SEC22" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br> | <br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br> | 
 |  <P> | 
  <P> | 
 |  If the syntax for a recursive subpattern call (either by number or by | 
  If the syntax for a recursive subpattern call (either by number or by | 
 |  name) is used outside the parentheses to which it refers, it operates like a | 
  name) is used outside the parentheses to which it refers, it operates like a | 
| 
 Line 2463  different calls. For example, consider this pattern:
 | 
 Line 2724  different calls. For example, consider this pattern:
 | 
 |  It matches "abcabc". It does not match "abcABC" because the change of | 
  It matches "abcabc". It does not match "abcABC" because the change of | 
 |  processing option does not affect the called subpattern. | 
  processing option does not affect the called subpattern. | 
 |  <a name="onigurumasubroutines"></a></P> | 
  <a name="onigurumasubroutines"></a></P> | 
| <br><a name="SEC23" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br> | <br><a name="SEC25" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br> | 
 |  <P> | 
  <P> | 
 |  For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or | 
  For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or | 
 |  a number enclosed either in angle brackets or single quotes, is an alternative | 
  a number enclosed either in angle brackets or single quotes, is an alternative | 
| 
 Line 2481  plus or a minus sign it is taken as a relative referen
 | 
 Line 2742  plus or a minus sign it is taken as a relative referen
 | 
 |  Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i> | 
  Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i> | 
 |  synonymous. The former is a back reference; the latter is a subroutine call. | 
  synonymous. The former is a back reference; the latter is a subroutine call. | 
 |  </P> | 
  </P> | 
| <br><a name="SEC24" href="#TOC1">CALLOUTS</a><br> | <br><a name="SEC26" href="#TOC1">CALLOUTS</a><br> | 
 |  <P> | 
  <P> | 
 |  Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl | 
  Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl | 
 |  code to be obeyed in the middle of matching a regular expression. This makes it | 
  code to be obeyed in the middle of matching a regular expression. This makes it | 
| 
 Line 2491  same pair of parentheses when there is a repetition.
 | 
 Line 2752  same pair of parentheses when there is a repetition.
 | 
 |  <P> | 
  <P> | 
 |  PCRE provides a similar feature, but of course it cannot obey arbitrary Perl | 
  PCRE provides a similar feature, but of course it cannot obey arbitrary Perl | 
 |  code. The feature is called "callout". The caller of PCRE provides an external | 
  code. The feature is called "callout". The caller of PCRE provides an external | 
| function by putting its entry point in the global variable <i>pcre_callout</i>. | function by putting its entry point in the global variable <i>pcre_callout</i> | 
|   | (8-bit library) or <i>pcre[16|32]_callout</i> (16-bit or 32-bit library). | 
 |  By default, this variable contains NULL, which disables all calling out. | 
  By default, this variable contains NULL, which disables all calling out. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| 
 Line 2502  For example, this pattern has two callout points:
 | 
 Line 2764  For example, this pattern has two callout points:
 | 
 |  <pre> | 
  <pre> | 
 |    (?C1)abc(?C2)def | 
    (?C1)abc(?C2)def | 
 |  </pre> | 
  </pre> | 
| If the PCRE_AUTO_CALLOUT flag is passed to <b>pcre_compile()</b>, callouts are | If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, callouts are | 
 |  automatically installed before each item in the pattern. They are all numbered | 
  automatically installed before each item in the pattern. They are all numbered | 
| 255. | 255. If there is a conditional group in the pattern whose condition is an | 
|   | assertion, an additional callout is inserted just before the condition. An | 
|   | explicit callout may also be set at this position, as in this example: | 
|   | <pre> | 
|   |   (?(?C9)(?=a)abc|def) | 
|   | </pre> | 
|   | Note that this applies only to assertion conditions, not to other types of | 
|   | condition. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| During matching, when PCRE reaches a callout point (and <i>pcre_callout</i> is | During matching, when PCRE reaches a callout point, the external function is | 
| set), the external function is called. It is provided with the number of the | called. It is provided with the number of the callout, the position in the | 
| callout, the position in the pattern, and, optionally, one item of data | pattern, and, optionally, one item of data originally supplied by the caller of | 
| originally supplied by the caller of <b>pcre_exec()</b>. The callout function | the matching function. The callout function may cause matching to proceed, to | 
| may cause matching to proceed, to backtrack, or to fail altogether. A complete | backtrack, or to fail altogether. | 
| description of the interface to the callout function is given in the | </P> | 
|   | <P> | 
|   | By default, PCRE implements a number of optimizations at compile time and | 
|   | matching time, and one side-effect is that sometimes callouts are skipped. If | 
|   | you need all possible callouts to happen, you need to set options that disable | 
|   | the relevant optimizations. More details, and a complete description of the | 
|   | interface to the callout function, are given in the | 
 |  <a href="pcrecallout.html"><b>pcrecallout</b></a> | 
  <a href="pcrecallout.html"><b>pcrecallout</b></a> | 
 |  documentation. | 
  documentation. | 
 |  <a name="backtrackcontrol"></a></P> | 
  <a name="backtrackcontrol"></a></P> | 
| <br><a name="SEC25" href="#TOC1">BACKTRACKING CONTROL</a><br> | <br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br> | 
 |  <P> | 
  <P> | 
 |  Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which | 
  Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which | 
| are described in the Perl documentation as "experimental and subject to change | are still described in the Perl documentation as "experimental and subject to | 
| or removal in a future version of Perl". It goes on to say: "Their usage in | change or removal in a future version of Perl". It goes on to say: "Their usage | 
| production code should be noted to avoid problems during upgrades." The same | in production code should be noted to avoid problems during upgrades." The same | 
 |  remarks apply to the PCRE features described in this section. | 
  remarks apply to the PCRE features described in this section. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| Since these verbs are specifically related to backtracking, most of them can be | The new verbs make use of what was previously invalid syntax: an opening | 
| used only when the pattern is to be matched using <b>pcre_exec()</b>, which uses | parenthesis followed by an asterisk. They are generally of the form | 
| a backtracking algorithm. With the exception of (*FAIL), which behaves like a | (*VERB) or (*VERB:NAME). Some may take either form, possibly behaving | 
| failing negative assertion, they cause an error if encountered by | differently depending on whether or not a name is present. A name is any | 
| <b>pcre_dfa_exec()</b>. | sequence of characters that does not include a closing parenthesis. The maximum | 
|   | length of a name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit | 
|   | libraries. If the name is empty, that is, if the closing parenthesis | 
|   | immediately follows the colon, the effect is as if the colon were not there. | 
|   | Any number of these verbs may occur in a pattern. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| If any of these verbs are used in an assertion or in a subpattern that is | Since these verbs are specifically related to backtracking, most of them can be | 
| called as a subroutine (whether or not recursively), their effect is confined | used only when the pattern is to be matched using one of the traditional | 
| to that subpattern; it does not extend to the surrounding pattern, with one | matching functions, because these use a backtracking algorithm. With the | 
| exception: the name from a *(MARK), (*PRUNE), or (*THEN) that is encountered in | exception of (*FAIL), which behaves like a failing negative assertion, the | 
| a successful positive assertion <i>is</i> passed back when a match succeeds | backtracking control verbs cause an error if encountered by a DFA matching | 
| (compare capturing parentheses in assertions). Note that such subpatterns are | function. | 
| processed as anchored at the point where they are tested. Note also that Perl's |   | 
| treatment of subroutines is different in some cases. |   | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| The new verbs make use of what was previously invalid syntax: an opening | The behaviour of these verbs in | 
| parenthesis followed by an asterisk. They are generally of the form | <a href="#btrepeat">repeated groups,</a> | 
| (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, | <a href="#btassert">assertions,</a> | 
| depending on whether or not an argument is present. A name is any sequence of | and in | 
| characters that does not include a closing parenthesis. If the name is empty, | <a href="#btsub">subpatterns called as subroutines</a> | 
| that is, if the closing parenthesis immediately follows the colon, the effect | (whether or not recursively) is documented below. | 
| is as if the colon were not there. Any number of these verbs may occur in a | <a name="nooptimize"></a></P> | 
| pattern. | <br><b> | 
| </P> | Optimizations that affect backtracking verbs | 
|   | </b><br> | 
 |  <P> | 
  <P> | 
 |  PCRE contains some optimizations that are used to speed up matching by running | 
  PCRE contains some optimizations that are used to speed up matching by running | 
 |  some checks at the start of each match attempt. For example, it may know the | 
  some checks at the start of each match attempt. For example, it may know the | 
 |  minimum length of matching subject, or that a particular character must be | 
  minimum length of matching subject, or that a particular character must be | 
| present. When one of these optimizations suppresses the running of a match, any | present. When one of these optimizations bypasses the running of a match, any | 
 |  included backtracking verbs will not, of course, be processed. You can suppress | 
  included backtracking verbs will not, of course, be processed. You can suppress | 
 |  the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option | 
  the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option | 
 |  when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the | 
  when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the | 
| pattern with (*NO_START_OPT). | pattern with (*NO_START_OPT). There is more discussion of this option in the | 
|   | section entitled | 
|   | <a href="pcreapi.html#execoptions">"Option bits for <b>pcre_exec()</b>"</a> | 
|   | in the | 
|   | <a href="pcreapi.html"><b>pcreapi</b></a> | 
|   | documentation. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  Experiments with Perl suggest that it too has similar optimizations, sometimes | 
  Experiments with Perl suggest that it too has similar optimizations, sometimes | 
| 
 Line 2577  followed by a name.
 | 
 Line 2860  followed by a name.
 | 
 |  This verb causes the match to end successfully, skipping the remainder of the | 
  This verb causes the match to end successfully, skipping the remainder of the | 
 |  pattern. However, when it is inside a subpattern that is called as a | 
  pattern. However, when it is inside a subpattern that is called as a | 
 |  subroutine, only that subpattern is ended successfully. Matching then continues | 
  subroutine, only that subpattern is ended successfully. Matching then continues | 
| at the outer level. If (*ACCEPT) is inside capturing parentheses, the data so | at the outer level. If (*ACCEPT) is triggered in a positive assertion, the | 
| far is captured. For example: | assertion succeeds; in a negative assertion, the assertion fails. | 
|   | </P> | 
|   | <P> | 
|   | If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For | 
|   | example: | 
 |  <pre> | 
  <pre> | 
 |    A((?:A|B(*ACCEPT)|C)D) | 
    A((?:A|B(*ACCEPT)|C)D) | 
 |  </pre> | 
  </pre> | 
| 
 Line 2612  A name is always required with this verb. There may be
 | 
 Line 2899  A name is always required with this verb. There may be
 | 
 |  (*MARK) as you like in a pattern, and their names do not have to be unique. | 
  (*MARK) as you like in a pattern, and their names do not have to be unique. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| When a match succeeds, the name of the last-encountered (*MARK) on the matching | When a match succeeds, the name of the last-encountered (*MARK:NAME), | 
| path is passed back to the caller via the <i>pcre_extra</i> data structure, as | (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to the | 
| described in the | caller as described in the section entitled | 
| <a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a> | <a href="pcreapi.html#extradata">"Extra data for <b>pcre_exec()</b>"</a> | 
 |  in the | 
  in the | 
 |  <a href="pcreapi.html"><b>pcreapi</b></a> | 
  <a href="pcreapi.html"><b>pcreapi</b></a> | 
 |  documentation. Here is an example of <b>pcretest</b> output, where the /K | 
  documentation. Here is an example of <b>pcretest</b> output, where the /K | 
| 
 Line 2635  of obtaining this information than putting each altern
 | 
 Line 2922  of obtaining this information than putting each altern
 | 
 |  capturing parentheses. | 
  capturing parentheses. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| If (*MARK) is encountered in a positive assertion, its name is recorded and | If a verb with a name is encountered in a positive assertion that is true, the | 
| passed back if it is the last-encountered. This does not happen for negative | name is recorded and passed back if it is the last-encountered. This does not | 
| assertions. | happen for negative assertions or failing positive assertions. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| After a partial match or a failed match, the name of the last encountered | After a partial match or a failed match, the last encountered name in the | 
| (*MARK) in the entire match process is returned. For example: | entire match process is returned. For example: | 
 |  <pre> | 
  <pre> | 
 |      re> /X(*MARK:A)Y|X(*MARK:B)Z/K | 
      re> /X(*MARK:A)Y|X(*MARK:B)Z/K | 
 |    data> XP | 
    data> XP | 
 |    No match, mark = B | 
    No match, mark = B | 
 |  </pre> | 
  </pre> | 
 |  Note that in this unanchored example the mark is retained from the match | 
  Note that in this unanchored example the mark is retained from the match | 
| attempt that started at the letter "X". Subsequent match attempts starting at | attempt that started at the letter "X" in the subject. Subsequent match | 
| "P" and then with an empty string do not get as far as the (*MARK) item, but | attempts starting at "P" and then with an empty string do not get as far as the | 
| nevertheless do not reset it. | (*MARK) item, but nevertheless do not reset it. | 
 |  </P> | 
  </P> | 
 |   | 
  <P> | 
 |   | 
  If you are interested in (*MARK) values after failed matches, you should | 
 |   | 
  probably set the PCRE_NO_START_OPTIMIZE option | 
 |   | 
  <a href="#nooptimize">(see above)</a> | 
 |   | 
  to ensure that the match is always attempted. | 
 |   | 
  </P> | 
 |  <br><b> | 
  <br><b> | 
 |  Verbs that act after backtracking | 
  Verbs that act after backtracking | 
 |  </b><br> | 
  </b><br> | 
| 
 Line 2659  Verbs that act after backtracking
 | 
 Line 2952  Verbs that act after backtracking
 | 
 |  The following verbs do nothing when they are encountered. Matching continues | 
  The following verbs do nothing when they are encountered. Matching continues | 
 |  with what follows, but if there is no subsequent match, causing a backtrack to | 
  with what follows, but if there is no subsequent match, causing a backtrack to | 
 |  the verb, a failure is forced. That is, backtracking cannot pass to the left of | 
  the verb, a failure is forced. That is, backtracking cannot pass to the left of | 
| the verb. However, when one of these verbs appears inside an atomic group, its | the verb. However, when one of these verbs appears inside an atomic group or an | 
| effect is confined to that group, because once the group has been matched, | assertion that is true, its effect is confined to that group, because once the | 
| there is never any backtracking into it. In this situation, backtracking can | group has been matched, there is never any backtracking into it. In this | 
| "jump back" to the left of the entire atomic group. (Remember also, as stated | situation, backtracking can "jump back" to the left of the entire atomic group | 
| above, that this localization also applies in subroutine calls and assertions.) | or assertion. (Remember also, as stated above, that this localization also | 
|   | applies in subroutine calls.) | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |  These verbs differ in exactly what kind of failure occurs when backtracking | 
  These verbs differ in exactly what kind of failure occurs when backtracking | 
| reaches them. | reaches them. The behaviour described below is what happens when the verb is | 
|   | not in a subroutine or an assertion. Subsequent sections cover these special | 
|   | cases. | 
 |  <pre> | 
  <pre> | 
 |    (*COMMIT) | 
    (*COMMIT) | 
 |  </pre> | 
  </pre> | 
 |  This verb, which may not be followed by a name, causes the whole match to fail | 
  This verb, which may not be followed by a name, causes the whole match to fail | 
| outright if the rest of the pattern does not match. Even if the pattern is | outright if there is a later matching failure that causes backtracking to reach | 
| unanchored, no further attempts to find a match by advancing the starting point | it. Even if the pattern is unanchored, no further attempts to find a match by | 
| take place. Once (*COMMIT) has been passed, <b>pcre_exec()</b> is committed to | advancing the starting point take place. If (*COMMIT) is the only backtracking | 
| finding a match at the current starting point, or not at all. For example: | verb that is encountered, once it has been passed <b>pcre_exec()</b> is | 
|   | committed to finding a match at the current starting point, or not at all. For | 
|   | example: | 
 |  <pre> | 
  <pre> | 
 |    a+(*COMMIT)b | 
    a+(*COMMIT)b | 
 |  </pre> | 
  </pre> | 
| 
 Line 2685  recently passed (*MARK) in the path is passed back whe
 | 
 Line 2983  recently passed (*MARK) in the path is passed back whe
 | 
 |  match failure. | 
  match failure. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
 |   | 
  If there is more than one backtracking verb in a pattern, a different one that | 
 |   | 
  follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a | 
 |   | 
  match does not always guarantee that a match must be at this starting point. | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |  Note that (*COMMIT) at the start of a pattern is not the same as an anchor, | 
  Note that (*COMMIT) at the start of a pattern is not the same as an anchor, | 
 |  unless PCRE's start-of-match optimizations are turned off, as shown in this | 
  unless PCRE's start-of-match optimizations are turned off, as shown in this | 
 |  <b>pcretest</b> example: | 
  <b>pcretest</b> example: | 
| 
 Line 2704  starting points.
 | 
 Line 3007  starting points.
 | 
 |    (*PRUNE) or (*PRUNE:NAME) | 
    (*PRUNE) or (*PRUNE:NAME) | 
 |  </pre> | 
  </pre> | 
 |  This verb causes the match to fail at the current starting position in the | 
  This verb causes the match to fail at the current starting position in the | 
| subject if the rest of the pattern does not match. If the pattern is | subject if there is a later matching failure that causes backtracking to reach | 
| unanchored, the normal "bumpalong" advance to the next starting character then | it. If the pattern is unanchored, the normal "bumpalong" advance to the next | 
| happens. Backtracking can occur as usual to the left of (*PRUNE), before it is | starting character then happens. Backtracking can occur as usual to the left of | 
| reached, or when matching to the right of (*PRUNE), but if there is no match to | (*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but | 
| the right, backtracking cannot cross (*PRUNE). In simple cases, the use of | if there is no match to the right, backtracking cannot cross (*PRUNE). In | 
| (*PRUNE) is just an alternative to an atomic group or possessive quantifier, | simple cases, the use of (*PRUNE) is just an alternative to an atomic group or | 
| but there are some uses of (*PRUNE) that cannot be expressed in any other way. | possessive quantifier, but there are some uses of (*PRUNE) that cannot be | 
| The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an | expressed in any other way. In an anchored pattern (*PRUNE) has the same effect | 
| anchored pattern (*PRUNE) has the same effect as (*COMMIT). | as (*COMMIT). | 
|   | </P> | 
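The difference between (*PRUNE) and (*COMMIT) in an unanchored pattern, and their equivalence in an anchored one, can be sketched with a toy matcher. The code below is an assumption-laden model rather than PCRE's algorithm: `VerbTriggered`, `match_at` and `search` are hypothetical names, and the matcher handles only the pattern a+(VERB)b.

```python
class VerbTriggered(Exception):
    """Raised when backtracking reaches the verb placed after a+."""


def match_at(subject, start):
    """Toy matcher for a+(VERB)b at one offset; returns end offset or None."""
    pos = start
    while pos < len(subject) and subject[pos] == "a":
        pos += 1                      # greedy a+
    if pos == start:
        return None                   # a+ failed before the verb was passed
    if subject.startswith("b", pos):
        return pos + 1
    raise VerbTriggered               # b failed: backtracking hits the verb


def search(subject, verb, anchored=False):
    """verb is "PRUNE" or "COMMIT"; anchored=True tries offset 0 only."""
    starts = [0] if anchored else range(len(subject) + 1)
    for start in starts:
        try:
            end = match_at(subject, start)
        except VerbTriggered:
            if verb == "COMMIT":
                return None           # the whole match fails outright
            continue                  # PRUNE: fail here, then bump along
        if end is not None:
            return (start, end)
    return None


print(search("aacaab", "PRUNE"))                  # (3, 6)
print(search("aacaab", "COMMIT"))                 # None
print(search("aacaab", "PRUNE", anchored=True))   # None
```

With only one starting point available, the "bumpalong" that distinguishes (*PRUNE) from (*COMMIT) never happens, which is why the two verbs coincide in the anchored case.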
|   | <P> | 
|   | The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). | 
|   | It is like (*MARK:NAME) in that the name is remembered for passing back to the | 
|   | caller. However, (*SKIP:NAME) searches only for names set with (*MARK). | 
 |  <pre> | 
  <pre> | 
 |    (*SKIP) | 
    (*SKIP) | 
 |  </pre> | 
  </pre> | 
| 
 Line 2733  instead of skipping on to "c".
 | 
 Line 3041  instead of skipping on to "c".
 | 
 |  <pre> | 
  <pre> | 
 |    (*SKIP:NAME) | 
    (*SKIP:NAME) | 
 |  </pre> | 
  </pre> | 
| When (*SKIP) has an associated name, its behaviour is modified. If the | When (*SKIP) has an associated name, its behaviour is modified. When it is | 
| following pattern fails to match, the previous path through the pattern is | triggered, the previous path through the pattern is searched for the most | 
| searched for the most recent (*MARK) that has the same name. If one is found, | recent (*MARK) that has the same name. If one is found, the "bumpalong" advance | 
| the "bumpalong" advance is to the subject position that corresponds to that | is to the subject position that corresponds to that (*MARK) instead of to where | 
| (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a | (*SKIP) was encountered. If no (*MARK) with a matching name is found, the | 
| matching name is found, the (*SKIP) is ignored. | (*SKIP) is ignored. | 
|   | </P> | 
|   | <P> | 
|   | Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores | 
|   | names that are set by (*PRUNE:NAME) or (*THEN:NAME). | 
 |  <pre> | 
  <pre> | 
 |    (*THEN) or (*THEN:NAME) | 
    (*THEN) or (*THEN:NAME) | 
 |  </pre> | 
  </pre> | 
| This verb causes a skip to the next innermost alternative if the rest of the | This verb causes a skip to the next innermost alternative when backtracking | 
| pattern does not match. That is, it cancels pending backtracking, but only | reaches it. That is, it cancels any further backtracking within the current | 
| within the current alternative. Its name comes from the observation that it can | alternative. Its name comes from the observation that it can be used for a | 
| be used for a pattern-based if-then-else block: | pattern-based if-then-else block: | 
 |  <pre> | 
  <pre> | 
 |    ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... | 
    ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... | 
 |  </pre> | 
  </pre> | 
 |  If the COND1 pattern matches, FOO is tried (and possibly further items after | 
  If the COND1 pattern matches, FOO is tried (and possibly further items after | 
 |  the end of the group if FOO succeeds); on failure, the matcher skips to the | 
  the end of the group if FOO succeeds); on failure, the matcher skips to the | 
| second alternative and tries COND2, without backtracking into COND1. The | second alternative and tries COND2, without backtracking into COND1. If that | 
| behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN). | succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no | 
| If (*THEN) is not inside an alternation, it acts like (*PRUNE). | more alternatives, so there is a backtrack to whatever came before the entire | 
|   | group. If (*THEN) is not inside an alternation, it acts like (*PRUNE). | 
 |  </P> | 
  </P> | 
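The if-then-else reading of (*THEN) can be illustrated with a small model in which each alternative is a condition followed by a literal body. This is a simplified sketch under stated assumptions: `a_plus`, `literal`, `match_group` and `ALTS` are hypothetical names, the condition is allowed to match in several ways so that backtracking into it is observable, and anything following the group is ignored.

```python
def a_plus(s, pos):
    """Yield each possible end of a+ at pos, longest first (greedy order)."""
    end = pos
    while end < len(s) and s[end] == "a":
        end += 1
    for e in range(end, pos, -1):
        yield e


def literal(text):
    """Build a condition that matches the literal text in exactly one way."""
    def cond(s, pos):
        if s.startswith(text, pos):
            yield pos + len(text)
    return cond


def match_group(s, pos, alternatives, use_then):
    """Try (COND (*THEN) BODY | ...); return (alternative index, end) or None."""
    for index, (cond, body) in enumerate(alternatives):
        for mid in cond(s, pos):
            if s.startswith(body, mid):
                return (index, mid + len(body))
            if use_then:
                break                 # (*THEN): skip to the next alternative
            # without (*THEN): backtrack into cond and try a shorter match
    return None


# Model of ( a+ (*THEN) ab | aa (*THEN) b )
ALTS = [(a_plus, "ab"), (literal("aa"), "b")]

print(match_group("aab", 0, ALTS, use_then=False))   # (0, 3)
print(match_group("aab", 0, ALTS, use_then=True))    # (1, 3)
```

Without the verb, the first alternative succeeds by backtracking into a+; with it, the failure of the body sends the matcher straight to the second alternative.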
 |  <P> | 
  <P> | 
| Note that a subpattern that does not contain a | character is just a part of | The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). | 
| the enclosing alternative; it is not a nested alternation with only one | It is like (*MARK:NAME) in that the name is remembered for passing back to the | 
|   | caller. However, (*SKIP:NAME) searches only for names set with (*MARK). | 
|   | </P> | 
|   | <P> | 
|   | A subpattern that does not contain a | character is just a part of the | 
|   | enclosing alternative; it is not a nested alternation with only one | 
 |  alternative. The effect of (*THEN) extends beyond such a subpattern to the | 
  alternative. The effect of (*THEN) extends beyond such a subpattern to the | 
 |  enclosing alternative. Consider this pattern, where A, B, etc. are complex | 
  enclosing alternative. Consider this pattern, where A, B, etc. are complex | 
 |  pattern fragments that do not contain any | characters at this level: | 
  pattern fragments that do not contain any | characters at this level: | 
| 
 Line 2777  because there are no more alternatives to try. In this
 | 
 Line 3095  because there are no more alternatives to try. In this
 | 
 |  backtrack into A. | 
  backtrack into A. | 
 |  </P> | 
  </P> | 
 |  <P> | 
  <P> | 
| Note also that a conditional subpattern is not considered as having two | Note that a conditional subpattern is not considered as having two | 
 |  alternatives, because only one is ever used. In other words, the | character in | 
  alternatives, because only one is ever used. In other words, the | character in | 
 |  a conditional subpattern has a different meaning. Ignoring white space, | 
  a conditional subpattern has a different meaning. Ignoring white space, | 
 |  consider: | 
  consider: | 
| 
 Line 2801  unanchored pattern). (*SKIP) is similar, except that t
 | 
 Line 3119  unanchored pattern). (*SKIP) is similar, except that t
 | 
 |  than one character. (*COMMIT) is the strongest, causing the entire match to | 
  than one character. (*COMMIT) is the strongest, causing the entire match to | 
 |  fail. | 
  fail. | 
 |  </P> | 
  </P> | 
 |   | 
  <br><b> | 
 |   | 
  More than one backtracking verb | 
 |   | 
  </b><br> | 
 |  <P> | 
  <P> | 
| If more than one such verb is present in a pattern, the "strongest" one wins. | If more than one backtracking verb is present in a pattern, the one that is | 
| For example, consider this pattern, where A, B, etc. are complex pattern | backtracked onto first acts. For example, consider this pattern, where A, B, | 
| fragments: | etc. are complex pattern fragments: | 
 |  <pre> | 
  <pre> | 
|   (A(*COMMIT)B(*THEN)C|D) |   (A(*COMMIT)B(*THEN)C|ABD) | 
 |  </pre> | 
  </pre> | 
| Once A has matched, PCRE is committed to this match, at the current starting | If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to | 
| position. If subsequently B matches, but C does not, the normal (*THEN) action | fail. However, if A and B match, but C fails, the backtrack to (*THEN) causes | 
| of trying the next alternative (that is, D) does not happen because (*COMMIT) | the next alternative (ABD) to be tried. This behaviour is consistent, but is | 
| overrides. | not always the same as Perl's. It means that if two or more backtracking verbs | 
|   | appear in succession, all but the last of them have no effect. Consider this | 
|   | example: | 
|   | <pre> | 
|   |   ...(*COMMIT)(*PRUNE)... | 
|   | </pre> | 
|   | If there is a matching failure to the right, backtracking onto (*PRUNE) causes | 
|   | it to be triggered, and its action is taken. There can never be a backtrack | 
|   | onto (*COMMIT). | 
|   | <a name="btrepeat"></a></P> | 
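The rule that the verb backtracked onto first is the one that acts can be traced with a toy matcher for (A(*COMMIT)B(*THEN)C|ABD). This is a sketch only: the fragments A, B, C, D are assumed to be the single letters "a", "b", "c", "d", and the names `Commit`, `match_at` and `search` are hypothetical.

```python
class Commit(Exception):
    """Signals that backtracking reached (*COMMIT) (hypothetical helper)."""


def match_at(s, start):
    """Toy matcher for (a(*COMMIT)b(*THEN)c|abd) at one offset."""
    # First alternative: A (*COMMIT) B (*THEN) C
    if s.startswith("a", start):
        # (*COMMIT) has now been passed
        if not s.startswith("b", start + 1):
            raise Commit              # B failed: (*COMMIT) is reached first
        # (*THEN) has now been passed
        if s.startswith("c", start + 2):
            return start + 3
        # C failed: (*THEN) is reached first, so try the next alternative
    # Second alternative: A B D
    if s.startswith("abd", start):
        return start + 3
    return None


def search(s):
    """Unanchored search; a triggered (*COMMIT) ends the whole attempt."""
    for start in range(len(s) + 1):
        try:
            end = match_at(s, start)
        except Commit:
            return None
        if end is not None:
            return (start, end)
    return None


print(search("abd"))     # (0, 3)
print(search("acabd"))   # None
```

For "abd", C fails after both verbs have been passed, so the backtrack lands on (*THEN) and the second alternative matches. For "acabd", B fails while only (*COMMIT) has been passed, so the whole match fails even though "abd" occurs later in the subject.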
|   | <br><b> | 
|   | Backtracking verbs in repeated groups | 
|   | </b><br> | 
|   | <P> | 
|   | PCRE differs from Perl in its handling of backtracking verbs in repeated | 
|   | groups. For example, consider: | 
|   | <pre> | 
|   |   /(a(*COMMIT)b)+ac/ | 
|   | </pre> | 
|   | If the subject is "abac", Perl matches, but PCRE fails because the (*COMMIT) in | 
|   | the second repeat of the group acts. | 
|   | <a name="btassert"></a></P> | 
|   | <br><b> | 
|   | Backtracking verbs in assertions | 
|   | </b><br> | 
|   | <P> | 
|   | (*FAIL) in an assertion has its normal effect: it forces an immediate backtrack. | 
 |  </P> | 
  </P> | 
 |  <br><a name="SEC26" href="#TOC1">SEE ALSO</a><br> | 
   | 
 |  <P> | 
  <P> | 
 |   | 
  (*ACCEPT) in a positive assertion causes the assertion to succeed without any | 
 |   | 
  further processing. In a negative assertion, (*ACCEPT) causes the assertion to | 
 |   | 
  fail without any further processing. | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |   | 
  The other backtracking verbs are not treated specially if they appear in a | 
 |   | 
  positive assertion. In particular, (*THEN) skips to the next alternative in the | 
 |   | 
  innermost enclosing group that has alternations, whether or not this is within | 
 |   | 
  the assertion. | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |   | 
  Negative assertions are, however, different, in order to ensure that changing a | 
 |   | 
  positive assertion into a negative assertion changes its result. Backtracking | 
 |   | 
  into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative assertion to be true, | 
 |   | 
  without considering any further alternative branches in the assertion. | 
 |   | 
  Backtracking into (*THEN) causes it to skip to the next enclosing alternative | 
 |   | 
  within the assertion (the normal behaviour), but if the assertion does not have | 
 |   | 
  such an alternative, (*THEN) behaves like (*PRUNE). | 
 |   | 
  <a name="btsub"></a></P> | 
 |   | 
  <br><b> | 
 |   | 
  Backtracking verbs in subroutines | 
 |   | 
  </b><br> | 
 |   | 
  <P> | 
 |   | 
  These behaviours occur whether or not the subpattern is called recursively. | 
 |   | 
  Perl's treatment of subroutines is different in some cases. | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |   | 
  (*FAIL) in a subpattern called as a subroutine has its normal effect: it forces | 
 |   | 
  an immediate backtrack. | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |   | 
  (*ACCEPT) in a subpattern called as a subroutine causes the subroutine match to | 
 |   | 
  succeed without any further processing. Matching then continues after the | 
 |   | 
  subroutine call. | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |   | 
  (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine cause | 
 |   | 
  the subroutine match to fail. | 
 |   | 
  </P> | 
 |   | 
  <P> | 
 |   | 
  (*THEN) skips to the next alternative in the innermost enclosing group within | 
 |   | 
  the subpattern that has alternatives. If there is no such group within the | 
 |   | 
  subpattern, (*THEN) causes the subroutine match to fail. | 
 |   | 
  </P> | 
 |   | 
  <br><a name="SEC28" href="#TOC1">SEE ALSO</a><br> | 
 |   | 
  <P> | 
 |  <b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), | 
  <b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), | 
| <b>pcresyntax</b>(3), <b>pcre</b>(3). | <b>pcresyntax</b>(3), <b>pcre</b>(3), <b>pcre16(3)</b>, <b>pcre32(3)</b>. | 
 |  </P> | 
  </P> | 
| <br><a name="SEC27" href="#TOC1">AUTHOR</a><br> | <br><a name="SEC29" href="#TOC1">AUTHOR</a><br> | 
 |  <P> | 
  <P> | 
 |  Philip Hazel | 
  Philip Hazel | 
 |  <br> | 
  <br> | 
| 
 Line 2827  University Computing Service
 | 
 Line 3219  University Computing Service
 | 
 |  Cambridge CB2 3QH, England. | 
  Cambridge CB2 3QH, England. | 
 |  <br> | 
  <br> | 
 |  </P> | 
  </P> | 
| <br><a name="SEC28" href="#TOC1">REVISION</a><br> | <br><a name="SEC30" href="#TOC1">REVISION</a><br> | 
 |  <P> | 
  <P> | 
| Last updated: 29 November 2011 | Last updated: 03 December 2013 | 
 |  <br> | 
  <br> | 
| Copyright © 1997-2011 University of Cambridge. | Copyright © 1997-2013 University of Cambridge. | 
 |  <br> | 
  <br> | 
 |  <p> | 
  <p> | 
 |  Return to the <a href="index.html">PCRE index page</a>. | 
  Return to the <a href="index.html">PCRE index page</a>. |