version 1.1.1.1, 2012/02/21 23:05:52
|
version 1.1.1.4, 2013/07/22 08:25:57
|
Line 14 man page, in case the conversion went wrong.
|
Line 14 man page, in case the conversion went wrong.
|
<br> |
<br> |
<ul> |
<ul> |
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a> |
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a> |
<li><a name="TOC2" href="#SEC2">NEWLINE CONVENTIONS</a> | <li><a name="TOC2" href="#SEC2">SPECIAL START-OF-PATTERN ITEMS</a> |
<li><a name="TOC3" href="#SEC3">CHARACTERS AND METACHARACTERS</a> | <li><a name="TOC3" href="#SEC3">EBCDIC CHARACTER CODES</a> |
<li><a name="TOC4" href="#SEC4">BACKSLASH</a> | <li><a name="TOC4" href="#SEC4">CHARACTERS AND METACHARACTERS</a> |
<li><a name="TOC5" href="#SEC5">CIRCUMFLEX AND DOLLAR</a> | <li><a name="TOC5" href="#SEC5">BACKSLASH</a> |
<li><a name="TOC6" href="#SEC6">FULL STOP (PERIOD, DOT) AND \N</a> | <li><a name="TOC6" href="#SEC6">CIRCUMFLEX AND DOLLAR</a> |
<li><a name="TOC7" href="#SEC7">MATCHING A SINGLE BYTE</a> | <li><a name="TOC7" href="#SEC7">FULL STOP (PERIOD, DOT) AND \N</a> |
<li><a name="TOC8" href="#SEC8">SQUARE BRACKETS AND CHARACTER CLASSES</a> | <li><a name="TOC8" href="#SEC8">MATCHING A SINGLE DATA UNIT</a> |
<li><a name="TOC9" href="#SEC9">POSIX CHARACTER CLASSES</a> | <li><a name="TOC9" href="#SEC9">SQUARE BRACKETS AND CHARACTER CLASSES</a> |
<li><a name="TOC10" href="#SEC10">VERTICAL BAR</a> | <li><a name="TOC10" href="#SEC10">POSIX CHARACTER CLASSES</a> |
<li><a name="TOC11" href="#SEC11">INTERNAL OPTION SETTING</a> | <li><a name="TOC11" href="#SEC11">VERTICAL BAR</a> |
<li><a name="TOC12" href="#SEC12">SUBPATTERNS</a> | <li><a name="TOC12" href="#SEC12">INTERNAL OPTION SETTING</a> |
<li><a name="TOC13" href="#SEC13">DUPLICATE SUBPATTERN NUMBERS</a> | <li><a name="TOC13" href="#SEC13">SUBPATTERNS</a> |
<li><a name="TOC14" href="#SEC14">NAMED SUBPATTERNS</a> | <li><a name="TOC14" href="#SEC14">DUPLICATE SUBPATTERN NUMBERS</a> |
<li><a name="TOC15" href="#SEC15">REPETITION</a> | <li><a name="TOC15" href="#SEC15">NAMED SUBPATTERNS</a> |
<li><a name="TOC16" href="#SEC16">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a> | <li><a name="TOC16" href="#SEC16">REPETITION</a> |
<li><a name="TOC17" href="#SEC17">BACK REFERENCES</a> | <li><a name="TOC17" href="#SEC17">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a> |
<li><a name="TOC18" href="#SEC18">ASSERTIONS</a> | <li><a name="TOC18" href="#SEC18">BACK REFERENCES</a> |
<li><a name="TOC19" href="#SEC19">CONDITIONAL SUBPATTERNS</a> | <li><a name="TOC19" href="#SEC19">ASSERTIONS</a> |
<li><a name="TOC20" href="#SEC20">COMMENTS</a> | <li><a name="TOC20" href="#SEC20">CONDITIONAL SUBPATTERNS</a> |
<li><a name="TOC21" href="#SEC21">RECURSIVE PATTERNS</a> | <li><a name="TOC21" href="#SEC21">COMMENTS</a> |
<li><a name="TOC22" href="#SEC22">SUBPATTERNS AS SUBROUTINES</a> | <li><a name="TOC22" href="#SEC22">RECURSIVE PATTERNS</a> |
<li><a name="TOC23" href="#SEC23">ONIGURUMA SUBROUTINE SYNTAX</a> | <li><a name="TOC23" href="#SEC23">SUBPATTERNS AS SUBROUTINES</a> |
<li><a name="TOC24" href="#SEC24">CALLOUTS</a> | <li><a name="TOC24" href="#SEC24">ONIGURUMA SUBROUTINE SYNTAX</a> |
<li><a name="TOC25" href="#SEC25">BACKTRACKING CONTROL</a> | <li><a name="TOC25" href="#SEC25">CALLOUTS</a> |
<li><a name="TOC26" href="#SEC26">SEE ALSO</a> | <li><a name="TOC26" href="#SEC26">BACKTRACKING CONTROL</a> |
<li><a name="TOC27" href="#SEC27">AUTHOR</a> | <li><a name="TOC27" href="#SEC27">SEE ALSO</a> |
<li><a name="TOC28" href="#SEC28">REVISION</a> | <li><a name="TOC28" href="#SEC28">AUTHOR</a> |
| <li><a name="TOC29" href="#SEC29">REVISION</a> |
</ul> |
</ul> |
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br> |
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br> |
<P> |
<P> |
Line 60 published by O'Reilly, covers regular expressions in g
|
Line 61 published by O'Reilly, covers regular expressions in g
|
description of PCRE's regular expressions is intended as reference material. |
description of PCRE's regular expressions is intended as reference material. |
</P> |
</P> |
<P> |
<P> |
|
This document discusses the patterns that are supported by PCRE when one of its |
|
main matching functions, <b>pcre_exec()</b> (8-bit) or <b>pcre[16|32]_exec()</b> |
|
(16- or 32-bit), is used. PCRE also has alternative matching functions, |
|
<b>pcre_dfa_exec()</b> and <b>pcre[16|32]_dfa_exec()</b>, which match using a |
|
different algorithm that is not Perl-compatible. Some of the features discussed |
|
below are not available when DFA matching is used. The advantages and |
|
disadvantages of the alternative functions, and how they differ from the normal |
|
functions, are discussed in the |
|
<a href="pcrematching.html"><b>pcrematching</b></a> |
|
page. |
|
</P> |
|
<br><a name="SEC2" href="#TOC1">SPECIAL START-OF-PATTERN ITEMS</a><br> |
|
<P> |
|
A number of options that can be passed to <b>pcre_compile()</b> can also be set |
|
by special items at the start of a pattern. These are not Perl-compatible, but |
|
are provided to make these options accessible to pattern writers who are not |
|
able to change the program that processes the pattern. Any number of these |
|
items may appear, but they must all be together right at the start of the |
|
pattern string, and the letters must be in upper case. |
|
</P> |
|
<br><b> |
|
UTF support |
|
</b><br> |
|
<P> |
The original operation of PCRE was on strings of one-byte characters. However, |
The original operation of PCRE was on strings of one-byte characters. However, |
there is now also support for UTF-8 character strings. To use this, | there is now also support for UTF-8 strings in the original library, an |
PCRE must be built to include UTF-8 support, and you must call | extra library that supports 16-bit and UTF-16 character strings, and a |
<b>pcre_compile()</b> or <b>pcre_compile2()</b> with the PCRE_UTF8 option. There | third library that supports 32-bit and UTF-32 character strings. To use these |
is also a special sequence that can be given at the start of a pattern: | features, PCRE must be built to include appropriate support. When using UTF |
| strings you must either call the compiling function with the PCRE_UTF8, |
| PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of |
| these special sequences: |
<pre> |
<pre> |
(*UTF8) |
(*UTF8) |
|
(*UTF16) |
|
(*UTF32) |
|
(*UTF) |
</pre> |
</pre> |
Starting a pattern with this sequence is equivalent to setting the PCRE_UTF8 | (*UTF) is a generic sequence that can be used with any of the libraries. |
option. This feature is not Perl-compatible. How setting UTF-8 mode affects | Starting a pattern with such a sequence is equivalent to setting the relevant |
pattern matching is mentioned in several places below. There is also a summary | option. How setting a UTF mode affects pattern matching is mentioned in several |
of UTF-8 features in the | places below. There is also a summary of features in the |
<a href="pcreunicode.html"><b>pcreunicode</b></a> |
<a href="pcreunicode.html"><b>pcreunicode</b></a> |
page. |
page. |
</P> |
</P> |
<P> |
<P> |
Another special sequence that may appear at the start of a pattern or in | Some applications that allow their users to supply patterns may wish to |
combination with (*UTF8) is: | restrict them to non-UTF data for security reasons. If the PCRE_NEVER_UTF |
| option is set at compile time, (*UTF) etc. are not allowed, and their |
| appearance causes an error. |
| </P> |
| <br><b> |
| Unicode property support |
| </b><br> |
| <P> |
| Another special sequence that may appear at the start of a pattern is |
<pre> |
<pre> |
(*UCP) |
(*UCP) |
</pre> |
</pre> |
Line 86 such as \d and \w to use Unicode properties to determi
|
Line 125 such as \d and \w to use Unicode properties to determi
|
instead of recognizing only characters with codes less than 128 via a lookup |
instead of recognizing only characters with codes less than 128 via a lookup |
table. |
table. |
</P> |
</P> |
|
<br><b> |
|
Disabling start-up optimizations |
|
</b><br> |
<P> |
<P> |
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the |
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the |
PCRE_NO_START_OPTIMIZE option either at compile or matching time. There are | PCRE_NO_START_OPTIMIZE option either at compile or matching time. |
also some more of these special sequences that are concerned with the handling | |
of newlines; they are described below. | |
</P> | |
<P> | |
The remainder of this document discusses the patterns that are supported by | |
PCRE when its main matching function, <b>pcre_exec()</b>, is used. | |
From release 6.0, PCRE offers a second matching function, | |
<b>pcre_dfa_exec()</b>, which matches using a different algorithm that is not | |
Perl-compatible. Some of the features discussed below are not available when | |
<b>pcre_dfa_exec()</b> is used. The advantages and disadvantages of the | |
alternative function, and how it differs from the normal function, are | |
discussed in the | |
<a href="pcrematching.html"><b>pcrematching</b></a> | |
page. | |
<a name="newlines"></a></P> |
<a name="newlines"></a></P> |
<br><a name="SEC2" href="#TOC1">NEWLINE CONVENTIONS</a><br> | <br><b> |
| Newline conventions |
| </b><br> |
<P> |
<P> |
PCRE supports five different conventions for indicating line breaks in |
PCRE supports five different conventions for indicating line breaks in |
strings: a single CR (carriage return) character, a single LF (linefeed) |
strings: a single CR (carriage return) character, a single LF (linefeed) |
Line 126 string with one of the following five sequences:
|
Line 156 string with one of the following five sequences:
|
(*ANYCRLF) any of the three above |
(*ANYCRLF) any of the three above |
(*ANY) all Unicode newline sequences |
(*ANY) all Unicode newline sequences |
</pre> |
</pre> |
These override the default and the options given to <b>pcre_compile()</b> or | These override the default and the options given to the compiling function. For |
<b>pcre_compile2()</b>. For example, on a Unix system where LF is the default | example, on a Unix system where LF is the default newline sequence, the pattern |
newline sequence, the pattern | |
<pre> |
<pre> |
(*CR)a.b |
(*CR)a.b |
</pre> |
</pre> |
changes the convention to CR. That pattern matches "a\nb" because LF is no |
changes the convention to CR. That pattern matches "a\nb" because LF is no |
longer a newline. Note that these special settings, which are not | longer a newline. If more than one of these settings is present, the last one |
Perl-compatible, are recognized only at the very start of a pattern, and that | |
they must be in upper case. If more than one of them is present, the last one | |
is used. |
is used. |
</P> |
</P> |
<P> |
<P> |
The newline convention affects the interpretation of the dot metacharacter when | The newline convention affects where the circumflex and dollar assertions are |
PCRE_DOTALL is not set, and also the behaviour of \N. However, it does not | true. It also affects the interpretation of the dot metacharacter when |
affect what the \R escape sequence matches. By default, this is any Unicode | PCRE_DOTALL is not set, and the behaviour of \N. However, it does not affect |
newline sequence, for Perl compatibility. However, this can be changed; see the | what the \R escape sequence matches. By default, this is any Unicode newline |
| sequence, for Perl compatibility. However, this can be changed; see the |
description of \R in the section entitled |
description of \R in the section entitled |
<a href="#newlineseq">"Newline sequences"</a> |
<a href="#newlineseq">"Newline sequences"</a> |
below. A change of \R setting can be combined with a change of newline |
below. A change of \R setting can be combined with a change of newline |
convention. |
convention. |
</P> |
</P> |
<br><a name="SEC3" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br> | <br><b> |
| Setting match and recursion limits |
| </b><br> |
<P> |
<P> |
|
The caller of <b>pcre_exec()</b> can set a limit on the number of times the |
|
internal <b>match()</b> function is called and on the maximum depth of |
|
recursive calls. These facilities are provided to catch runaway matches that |
|
are provoked by patterns with huge matching trees (a typical example is a |
|
pattern with nested unlimited repeats) and to avoid running out of system stack |
|
by too much recursion. When one of these limits is reached, <b>pcre_exec()</b> |
|
gives an error return. The limits can also be set by items at the start of the |
|
pattern of the form |
|
<pre> |
|
(*LIMIT_MATCH=d) |
|
(*LIMIT_RECURSION=d) |
|
</pre> |
|
where d is any number of decimal digits. However, the value of the setting must |
|
be less than the value set by the caller of <b>pcre_exec()</b> for it to have |
|
any effect. In other words, the pattern writer can lower the limit set by the |
|
programmer, but not raise it. If there is more than one setting of one of these |
|
limits, the lower value is used. |
|
</P> |
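<P>
The rule that a pattern can lower a limit but not raise it amounts to taking a
minimum over the caller's value and every setting in the pattern. The Python
sketch below only illustrates that rule and is not PCRE's implementation: the
function name effective_limits and its arguments are hypothetical, and the
scanner recognizes only the start-of-pattern items named in this document.
</P>

```python
import re

# A leading "(*NAME)" or "(*NAME=digits)" item; names must be upper case.
_ITEM = re.compile(r"\(\*([A-Z_0-9]+)(?:=(\d+))?\)")

# Start-of-pattern option items that take no value (names from this document).
_OPTION_ITEMS = {
    "UTF8", "UTF16", "UTF32", "UTF", "UCP", "NO_START_OPT",
    "CR", "LF", "CRLF", "ANYCRLF", "ANY", "BSR_ANYCRLF", "BSR_UNICODE",
}

def effective_limits(pattern, caller_match, caller_recursion):
    """Return (match_limit, recursion_limit) after applying leading items."""
    limits = {"LIMIT_MATCH": caller_match, "LIMIT_RECURSION": caller_recursion}
    pos = 0
    while True:
        m = _ITEM.match(pattern, pos)
        if m is None:
            break
        name, digits = m.group(1), m.group(2)
        if name in limits and digits is not None:
            # A setting can only lower the limit; repeated settings keep the lowest.
            limits[name] = min(limits[name], int(digits))
        elif name in _OPTION_ITEMS and digits is None:
            pass  # another start-of-pattern item; keep scanning
        else:
            break  # not a recognized leading item; the pattern proper starts here
        pos = m.end()
    return limits["LIMIT_MATCH"], limits["LIMIT_RECURSION"]
```

With a caller limit of 1000, a pattern starting (*LIMIT_MATCH=5000) leaves the
limit at 1000, whereas (*LIMIT_MATCH=100)(*LIMIT_MATCH=50) reduces it to 50.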
|
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br> |
|
<P> |
|
PCRE can be compiled to run in an environment that uses EBCDIC as its character |
|
code rather than ASCII or Unicode (typically a mainframe system). In the |
|
sections below, character code values are ASCII or Unicode; in an EBCDIC |
|
environment these characters may have different code values, and there are no |
|
code points greater than 255. |
|
</P> |
|
<br><a name="SEC4" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br> |
|
<P> |
A regular expression is a pattern that is matched against a subject string from |
A regular expression is a pattern that is matched against a subject string from |
left to right. Most characters stand for themselves in a pattern, and match the |
left to right. Most characters stand for themselves in a pattern, and match the |
corresponding characters in the subject. As a trivial example, the pattern |
corresponding characters in the subject. As a trivial example, the pattern |
Line 158 corresponding characters in the subject. As a trivial
|
Line 216 corresponding characters in the subject. As a trivial
|
</pre> |
</pre> |
matches a portion of a subject string that is identical to itself. When |
matches a portion of a subject string that is identical to itself. When |
caseless matching is specified (the PCRE_CASELESS option), letters are matched |
caseless matching is specified (the PCRE_CASELESS option), letters are matched |
independently of case. In UTF-8 mode, PCRE always understands the concept of | independently of case. In a UTF mode, PCRE always understands the concept of |
case for characters whose values are less than 128, so caseless matching is |
case for characters whose values are less than 128, so caseless matching is |
always possible. For characters with higher values, the concept of case is |
always possible. For characters with higher values, the concept of case is |
supported if PCRE is compiled with Unicode property support, but not otherwise. |
supported if PCRE is compiled with Unicode property support, but not otherwise. |
If you want to use caseless matching for characters 128 and above, you must |
If you want to use caseless matching for characters 128 and above, you must |
ensure that PCRE is compiled with Unicode property support as well as with |
ensure that PCRE is compiled with Unicode property support as well as with |
UTF-8 support. | UTF support. |
</P> |
</P> |
<P> |
<P> |
The power of regular expressions comes from the ability to include alternatives |
The power of regular expressions comes from the ability to include alternatives |
Line 205 a character class the only metacharacters are:
|
Line 263 a character class the only metacharacters are:
|
</pre> |
</pre> |
The following sections describe the use of each of the metacharacters. |
The following sections describe the use of each of the metacharacters. |
</P> |
</P> |
<br><a name="SEC4" href="#TOC1">BACKSLASH</a><br> | <br><a name="SEC5" href="#TOC1">BACKSLASH</a><br> |
<P> |
<P> |
The backslash character has several uses. Firstly, if it is followed by a |
The backslash character has several uses. Firstly, if it is followed by a |
character that is not a number or a letter, it takes away any special meaning |
character that is not a number or a letter, it takes away any special meaning |
Line 220 non-alphanumeric with backslash to specify that it sta
|
Line 278 non-alphanumeric with backslash to specify that it sta
|
particular, if you want to match a backslash, you write \\. |
particular, if you want to match a backslash, you write \\. |
</P> |
</P> |
<P> |
<P> |
In UTF-8 mode, only ASCII numbers and letters have any special meaning after a | In a UTF mode, only ASCII numbers and letters have any special meaning after a |
backslash. All other characters (in particular, those whose codepoints are |
backslash. All other characters (in particular, those whose codepoints are |
greater than 127) are treated as literals. |
greater than 127) are treated as literals. |
</P> |
</P> |
<P> |
<P> |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the | If a pattern is compiled with the PCRE_EXTENDED option, white space in the |
pattern (other than in a character class) and characters between a # outside |
pattern (other than in a character class) and characters between a # outside |
a character class and the next newline are ignored. An escaping backslash can |
a character class and the next newline are ignored. An escaping backslash can |
be used to include a whitespace or # character as part of the pattern. | be used to include a white space or # character as part of the pattern. |
</P> |
</P> |
<P> |
<P> |
If you want to remove the special meaning from a sequence of characters, you |
If you want to remove the special meaning from a sequence of characters, you |
Line 262 one of the following escape sequences than the binary
|
Line 320 one of the following escape sequences than the binary
|
\a alarm, that is, the BEL character (hex 07) |
\a alarm, that is, the BEL character (hex 07) |
\cx "control-x", where x is any ASCII character |
\cx "control-x", where x is any ASCII character |
\e escape (hex 1B) |
\e escape (hex 1B) |
\f formfeed (hex 0C) | \f form feed (hex 0C) |
\n linefeed (hex 0A) |
\n linefeed (hex 0A) |
\r carriage return (hex 0D) |
\r carriage return (hex 0D) |
\t tab (hex 09) |
\t tab (hex 09) |
Line 271 one of the following escape sequences than the binary
|
Line 329 one of the following escape sequences than the binary
|
\x{hhh..} character with hex code hhh.. (non-JavaScript mode) |
\x{hhh..} character with hex code hhh.. (non-JavaScript mode) |
\uhhhh character with hex code hhhh (JavaScript mode only) |
\uhhhh character with hex code hhhh (JavaScript mode only) |
</pre> |
</pre> |
The precise effect of \cx is as follows: if x is a lower case letter, it | The precise effect of \cx on ASCII characters is as follows: if x is a lower |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. | case letter, it is converted to upper case. Then bit 6 of the character (hex |
Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({ is 7B), while | 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A), |
\c; becomes hex 7B (; is 3B). If the byte following \c has a value greater | but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If the |
than 127, a compile-time error occurs. This locks out non-ASCII characters in | data item (byte or 16-bit value) following \c has a value greater than 127, a |
both byte mode and UTF-8 mode. (When PCRE is compiled in EBCDIC mode, all byte | compile-time error occurs. This locks out non-ASCII characters in all modes. |
values are valid. A lower case letter is converted to upper case, and then the | |
0xc0 bits are flipped.) | |
</P> |
</P> |
<P> |
<P> |
|
The \c facility was designed for use with ASCII characters, but with the |
|
extension to Unicode it is even less useful than it once was. It is, however, |
|
recognized when PCRE is compiled in EBCDIC mode, where data items are always |
|
bytes. In this mode, all values are valid after \c. If the next character is a |
|
lower case letter, it is converted to upper case. Then the 0xc0 bits of the |
|
byte are inverted. Thus \cA becomes hex 01, as in ASCII (A is C1), but because |
|
the EBCDIC letters are disjoint, \cZ becomes hex 29 (Z is E9), and other |
|
characters also generate different values. |
|
</P> |
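<P>
The arithmetic behind \cx can be checked with a few lines of Python. This is a
sketch of the rule as described above, not code taken from PCRE; the function
names are hypothetical, and the EBCDIC helper assumes it is given the EBCDIC
byte value of a character that has already been converted to upper case.
</P>

```python
def control_ascii(ch):
    """Code produced by \\cx when x is the ASCII character ch."""
    code = ord(ch)
    if code > 127:
        raise ValueError("non-ASCII character after \\c")
    if "a" <= ch <= "z":
        code -= 0x20      # convert lower case to upper case
    return code ^ 0x40    # invert bit 6 (hex 40)

def control_ebcdic(byte):
    """Code produced by \\cx in EBCDIC mode for an upper case EBCDIC byte."""
    return byte ^ 0xC0    # invert the 0xc0 bits

print(hex(control_ascii("z")))    # z is 7A, Z is 5A
print(hex(control_ascii("{")))    # { is 7B
print(hex(control_ebcdic(0xC1)))  # EBCDIC A is C1
print(hex(control_ebcdic(0xE9)))  # EBCDIC Z is E9
```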
|
<P> |
By default, after \x, from zero to two hexadecimal digits are read (letters |
By default, after \x, from zero to two hexadecimal digits are read (letters |
can be in upper or lower case). Any number of hexadecimal digits may appear |
can be in upper or lower case). Any number of hexadecimal digits may appear |
between \x{ and }, but the value of the character code must be less than 256 | between \x{ and }, but the character code is constrained as follows: |
in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum | <pre> |
value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest | 8-bit non-UTF mode less than 0x100 |
Unicode code point, which is 10FFFF. | 8-bit UTF-8 mode less than 0x10ffff and a valid codepoint |
| 16-bit non-UTF mode less than 0x10000 |
| 16-bit UTF-16 mode less than 0x10ffff and a valid codepoint |
| 32-bit non-UTF mode less than 0x80000000 |
| 32-bit UTF-32 mode less than 0x10ffff and a valid codepoint |
| </pre> |
| Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called |
| "surrogate" codepoints), and 0xffef. |
</P> |
</P> |
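<P>
The constraints on \x{hhh..} in the newer revision can be read as a small
predicate. The Python sketch below is an illustration under stated assumptions:
the function name is hypothetical, the exclusion of 0xffef is copied literally
from the text, and the UTF bound is treated as inclusive because 0x10ffff is the
largest Unicode code point (the wording "less than 0x10ffff" leaves this open).
</P>

```python
# Exclusive upper bounds for the non-UTF modes, keyed by data unit width.
NON_UTF_LIMIT = {8: 0x100, 16: 0x10000, 32: 0x80000000}

def hex_code_allowed(value, width, utf):
    """Return True if \\x{value} is acceptable for the given library and mode."""
    if not utf:
        return value < NON_UTF_LIMIT[width]
    if value > 0x10FFFF:            # assumption: 0x10ffff itself is accepted
        return False
    if 0xD800 <= value <= 0xDFFF:   # surrogate codepoints are invalid
        return False
    return value != 0xFFEF          # exclusion as stated in the text

print(hex_code_allowed(0x100, 8, False))   # too large for 8-bit non-UTF mode
print(hex_code_allowed(0x20AC, 16, True))  # an ordinary code point in UTF-16 mode
```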
<P> |
<P> |
If characters other than hexadecimal digits appear between \x{ and }, or if |
If characters other than hexadecimal digits appear between \x{ and }, or if |
Line 300 as just described only when it is followed by two hexa
|
Line 373 as just described only when it is followed by two hexa
|
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
code points greater than 256 is provided by \u, which must be followed by |
code points greater than 256 is provided by \u, which must be followed by |
four hexadecimal digits; otherwise it matches a literal "u" character. |
four hexadecimal digits; otherwise it matches a literal "u" character. |
|
Character codes specified by \u in JavaScript mode are constrained in the same |
|
way as those specified by \x in non-JavaScript mode. |
</P> |
</P> |
<P> |
<P> |
Characters whose value is less than 256 can be defined by either of the two |
Characters whose value is less than 256 can be defined by either of the two |
Line 328 following the discussion of
|
Line 403 following the discussion of
|
Inside a character class, or if the decimal number is greater than 9 and there |
Inside a character class, or if the decimal number is greater than 9 and there |
have not been that many capturing subpatterns, PCRE re-reads up to three octal |
have not been that many capturing subpatterns, PCRE re-reads up to three octal |
digits following the backslash, and uses them to generate a data character. Any |
digits following the backslash, and uses them to generate a data character. Any |
subsequent digits stand for themselves. In non-UTF-8 mode, the value of a | subsequent digits stand for themselves. The value of the character is |
character specified in octal must be less than \400. In UTF-8 mode, values up | constrained in the same way as characters specified in hexadecimal. |
to \777 are permitted. For example: | For example: |
<pre> |
<pre> |
\040 is another way of writing a space | \040 is another way of writing an ASCII space |
\40 is the same, provided there are fewer than 40 previous capturing subpatterns |
\40 is the same, provided there are fewer than 40 previous capturing subpatterns |
\7 is always a back reference |
\7 is always a back reference |
\11 might be a back reference, or another way of writing a tab |
\11 might be a back reference, or another way of writing a tab |
\011 is always a tab |
\011 is always a tab |
\0113 is a tab followed by the character "3" |
\0113 is a tab followed by the character "3" |
\113 might be a back reference, otherwise the character with octal code 113 |
\113 might be a back reference, otherwise the character with octal code 113 |
\377 might be a back reference, otherwise the byte consisting entirely of 1 bits | \377 might be a back reference, otherwise the value 255 (decimal) |
\81 is either a back reference, or a binary zero followed by the two characters "8" and "1" |
\81 is either a back reference, or a binary zero followed by the two characters "8" and "1" |
</pre> |
</pre> |
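The examples in the list above follow one decision rule, which the Python sketch
below reproduces. It is an illustration only, not PCRE's parser: the function
name and return format are hypothetical, the argument is the run of digits that
follows the backslash, and the limit on the resulting character value is not
checked.

```python
def interpret_digits(digits, captures, in_class=False):
    """Classify backslash + digits as ("backref", n, rest) or ("char", code, rest).

    captures is the number of capturing subpatterns seen so far.
    """
    if not in_class and digits[0] != "0":
        n = int(digits)
        # A single digit is always a back reference; a larger number is one
        # only if there have been that many capturing subpatterns.
        if n < 10 or n <= captures:
            return ("backref", n, "")
    # Otherwise re-read up to three octal digits as a character code.
    i = 0
    while i < 3 and i < len(digits) and digits[i] in "01234567":
        i += 1
    code = int(digits[:i], 8) if i else 0   # no octal digit: binary zero
    return ("char", code, digits[i:])       # remaining digits stand for themselves

print(interpret_digits("40", 39))   # fewer than 40 groups: a space
print(interpret_digits("0113", 0))  # tab followed by "3"
print(interpret_digits("81", 0))    # binary zero followed by "8" and "1"
```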
Note that octal values of 100 or greater must not be introduced by a leading |
Note that octal values of 100 or greater must not be introduced by a leading |
Line 399 Another use of backslash is for specifying generic cha
|
Line 474 Another use of backslash is for specifying generic cha
|
<pre> |
<pre> |
\d any decimal digit |
\d any decimal digit |
\D any character that is not a decimal digit |
\D any character that is not a decimal digit |
\h any horizontal whitespace character | \h any horizontal white space character |
\H any character that is not a horizontal whitespace character | \H any character that is not a horizontal white space character |
\s any whitespace character | \s any white space character |
\S any character that is not a whitespace character | \S any character that is not a white space character |
\v any vertical whitespace character | \v any vertical white space character |
\V any character that is not a vertical whitespace character | \V any character that is not a vertical white space character |
\w any "word" character |
\w any "word" character |
\W any "non-word" character |
\W any "non-word" character |
</pre> |
</pre> |
Line 443 accented letters, and these are then matched by \w. Th
|
Line 518 accented letters, and these are then matched by \w. Th
|
Unicode is discouraged. |
Unicode is discouraged. |
</P> |
</P> |
<P> |
<P> |
By default, in UTF-8 mode, characters with values greater than 128 never match | By default, in a UTF mode, characters with values greater than 128 never match |
\d, \s, or \w, and always match \D, \S, and \W. These sequences retain |
\d, \s, or \w, and always match \D, \S, and \W. These sequences retain |
their original meanings from before UTF-8 support was available, mainly for | their original meanings from before UTF support was available, mainly for |
efficiency reasons. However, if PCRE is compiled with Unicode property support, |
efficiency reasons. However, if PCRE is compiled with Unicode property support, |
and the PCRE_UCP option is set, the behaviour is changed so that Unicode |
and the PCRE_UCP option is set, the behaviour is changed so that Unicode |
properties are used to determine character types, as follows: |
properties are used to determine character types, as follows: |
Line 463 is noticeably slower when PCRE_UCP is set.
|
Line 538 is noticeably slower when PCRE_UCP is set.
|
<P> |
<P> |
The sequences \h, \H, \v, and \V are features that were added to Perl at |
The sequences \h, \H, \v, and \V are features that were added to Perl at |
release 5.10. In contrast to the other sequences, which match only ASCII |
release 5.10. In contrast to the other sequences, which match only ASCII |
characters by default, these always match certain high-valued codepoints in | characters by default, these always match certain high-valued codepoints, |
UTF-8 mode, whether or not PCRE_UCP is set. The horizontal space characters | whether or not PCRE_UCP is set. The horizontal space characters are: |
are: | |
<pre> |
<pre> |
U+0009 Horizontal tab | U+0009 Horizontal tab (HT) |
U+0020 Space |
U+0020 Space |
U+00A0 Non-break space |
U+00A0 Non-break space |
U+1680 Ogham space mark |
U+1680 Ogham space mark |
Line 489 are:
|
Line 563 are:
|
</pre> |
</pre> |
The vertical space characters are: |
The vertical space characters are: |
<pre> |
<pre> |
U+000A Linefeed | U+000A Linefeed (LF) |
U+000B Vertical tab | U+000B Vertical tab (VT) |
U+000C Formfeed | U+000C Form feed (FF) |
U+000D Carriage return | U+000D Carriage return (CR) |
U+0085 Next line | U+0085 Next line (NEL) |
U+2028 Line separator |
U+2028 Line separator |
U+2029 Paragraph separator |
U+2029 Paragraph separator |
<a name="newlineseq"></a></PRE> | </pre> |
</P> | In 8-bit, non-UTF-8 mode, only the characters with codepoints less than 256 are |
| relevant. |
| <a name="newlineseq"></a></P> |
<br><b> |
<br><b> |
Newline sequences |
Newline sequences |
</b><br> |
</b><br> |
<P> |
<P> |
Outside a character class, by default, the escape sequence \R matches any |
Outside a character class, by default, the escape sequence \R matches any |
Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the following: | Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the |
| following: |
<pre> |
<pre> |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
</pre> |
</pre> |
Line 511 This is an example of an "atomic group", details of wh
|
Line 588 This is an example of an "atomic group", details of wh
|
<a href="#atomicgroup">below.</a> |
<a href="#atomicgroup">below.</a> |
This particular group matches either the two-character sequence CR followed by |
This particular group matches either the two-character sequence CR followed by |
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, |
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, |
U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next | U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next |
line, U+0085). The two-character sequence is treated as a single unit that |
line, U+0085). The two-character sequence is treated as a single unit that |
cannot be split. |
cannot be split. |
</P> |
</P> |
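<P>
The behaviour of the group above can be mimicked by a function that tries the
CRLF pair first and never falls back to matching CR alone. The Python sketch
below is a hypothetical helper, not part of PCRE; its wide flag is an assumed
switch that adds the LS and PS characters, which the documentation says are also
matched outside 8-bit non-UTF-8 mode.
</P>

```python
def newline_length(subject, pos, wide=False):
    """Length of the newline sequence at subject[pos], or 0 if there is none."""
    if subject.startswith("\r\n", pos):
        return 2                      # CRLF is a single unit that is never split
    singles = "\n\x0b\f\r\x85"        # LF, VT, FF, CR, NEL
    if wide:
        singles += "\u2028\u2029"     # LS and PS
    if pos < len(subject) and subject[pos] in singles:
        return 1
    return 0

print(newline_length("a\r\nb", 1))            # CRLF counts as one sequence of length 2
print(newline_length("\u2028", 0))            # not a newline when wide is False
print(newline_length("\u2028", 0, wide=True))
```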
<P> |
<P> |
In UTF-8 mode, two additional characters whose codepoints are greater than 255 | In other modes, two additional characters whose codepoints are greater than 255 |
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). |
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). |
Unicode character property support is not needed for these characters to be |
Unicode character property support is not needed for these characters to be |
recognized. |
recognized. |
Line 533 one of the following sequences:
|
Line 610 one of the following sequences:
|
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_UNICODE) any Unicode newline sequence |
(*BSR_UNICODE) any Unicode newline sequence |
</pre> |
</pre> |
These override the default and the options given to <b>pcre_compile()</b> or | These override the default and the options given to the compiling function, but |
<b>pcre_compile2()</b>, but they can be overridden by options given to | they can themselves be overridden by options given to a matching function. Note |
<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. Note that these special settings, | that these special settings, which are not Perl-compatible, are recognized only |
which are not Perl-compatible, are recognized only at the very start of a | at the very start of a pattern, and that they must be in upper case. If more |
pattern, and that they must be in upper case. If more than one of them is | than one of them is present, the last one is used. They can be combined with a |
present, the last one is used. They can be combined with a change of newline | change of newline convention; for example, a pattern can start with: |
convention; for example, a pattern can start with: | |
<pre> |
<pre> |
(*ANY)(*BSR_ANYCRLF) |
(*ANY)(*BSR_ANYCRLF) |
</pre> |
</pre> |
They can also be combined with the (*UTF8) or (*UCP) special sequences. Inside | They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) or |
a character class, \R is treated as an unrecognized escape sequence, and so | (*UCP) special sequences. Inside a character class, \R is treated as an |
matches the letter "R" by default, but causes an error if PCRE_EXTRA is set. | unrecognized escape sequence, and so matches the letter "R" by default, but |
| causes an error if PCRE_EXTRA is set. |
<a name="uniextseq"></a></P> |
<a name="uniextseq"></a></P> |
<br><b> |
<br><b> |
Unicode character properties |
Unicode character properties |
Line 553 Unicode character properties
|
Line 630 Unicode character properties
|
<P> |
<P> |
When PCRE is built with Unicode character property support, three additional |
When PCRE is built with Unicode character property support, three additional |
escape sequences that match characters with specific properties are available. |
escape sequences that match characters with specific properties are available. |
When not in UTF-8 mode, these sequences are of course limited to testing | When in 8-bit non-UTF-8 mode, these sequences are of course limited to testing |
characters whose codepoints are less than 256, but they do work in this mode. |
characters whose codepoints are less than 256, but they do work in this mode. |
The extra escape sequences are: |
The extra escape sequences are: |
<pre> |
<pre> |
\p{<i>xx</i>} a character with the <i>xx</i> property |
\p{<i>xx</i>} a character with the <i>xx</i> property |
\P{<i>xx</i>} a character without the <i>xx</i> property |
\P{<i>xx</i>} a character without the <i>xx</i> property |
\X an extended Unicode sequence | \X a Unicode extended grapheme cluster |
</pre> |
</pre> |
The property names represented by <i>xx</i> above are limited to the Unicode |
The property names represented by <i>xx</i> above are limited to the Unicode |
script names, the general category properties, "Any", which matches any |
script names, the general category properties, "Any", which matches any |
Line 587 Armenian,
|
Line 664 Armenian,
|
Avestan, |
Avestan, |
Balinese, |
Balinese, |
Bamum, |
Bamum, |
|
Batak, |
Bengali, |
Bengali, |
Bopomofo, |
Bopomofo, |
|
Brahmi, |
Braille, |
Braille, |
Buginese, |
Buginese, |
Buhid, |
Buhid, |
Canadian_Aboriginal, |
Canadian_Aboriginal, |
Carian, |
Carian, |
|
Chakma, |
Cham, |
Cham, |
Cherokee, |
Cherokee, |
Common, |
Common, |
Line 636 Lisu,
|
Line 716 Lisu,
|
Lycian, |
Lycian, |
Lydian, |
Lydian, |
Malayalam, |
Malayalam, |
|
Mandaic, |
Meetei_Mayek, |
Meetei_Mayek, |
|
Meroitic_Cursive, |
|
Meroitic_Hieroglyphs, |
|
Miao, |
Mongolian, |
Mongolian, |
Myanmar, |
Myanmar, |
New_Tai_Lue, |
New_Tai_Lue, |
Line 655 Rejang,
|
Line 739 Rejang,
|
Runic, |
Runic, |
Samaritan, |
Samaritan, |
Saurashtra, |
Saurashtra, |
|
Sharada, |
Shavian, |
Shavian, |
Sinhala, |
Sinhala, |
|
Sora_Sompeng, |
Sundanese, |
Sundanese, |
Syloti_Nagri, |
Syloti_Nagri, |
Syriac, |
Syriac, |
Line 665 Tagbanwa,
|
Line 751 Tagbanwa,
|
Tai_Le, |
Tai_Le, |
Tai_Tham, |
Tai_Tham, |
Tai_Viet, |
Tai_Viet, |
|
Takri, |
Tamil, |
Tamil, |
Telugu, |
Telugu, |
Thaana, |
Thaana, |
Line 742 a modifier or "other".
|
Line 829 a modifier or "other".
|
</P> |
</P> |
<P> |
<P> |
The Cs (Surrogate) property applies only to characters in the range U+D800 to |
The Cs (Surrogate) property applies only to characters in the range U+D800 to |
U+DFFF. Such characters are not valid in UTF-8 strings (see RFC 3629) and so | U+DFFF. Such characters are not valid in Unicode strings and so |
cannot be tested by PCRE, unless UTF-8 validity checking has been turned off | cannot be tested by PCRE, unless UTF validity checking has been turned off |
(see the discussion of PCRE_NO_UTF8_CHECK in the | (see the discussion of PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK and |
| PCRE_NO_UTF32_CHECK in the |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
page). Perl does not support the Cs property. |
page). Perl does not support the Cs property. |
</P> |
</P> |
Line 760 Unicode table.
|
Line 848 Unicode table.
|
</P> |
</P> |
<P> |
<P> |
Specifying caseless matching does not affect these escape sequences. For |
Specifying caseless matching does not affect these escape sequences. For |
example, \p{Lu} always matches only upper case letters. | example, \p{Lu} always matches only upper case letters. This is different from |
| the behaviour of current versions of Perl. |
</P> |
</P> |
<P> |
<P> |
The \X escape matches any number of Unicode characters that form an extended | Matching characters by Unicode property is not fast, because PCRE has to do a |
Unicode sequence. \X is equivalent to | multistage table lookup in order to find a character's property. That is why |
| the traditional escape sequences such as \d and \w do not use Unicode |
| properties in PCRE by default, though you can make them do so by setting the |
| PCRE_UCP option or by starting the pattern with (*UCP). |
| </P> |
| <br><b> |
| Extended grapheme clusters |
| </b><br> |
| <P> |
| The \X escape matches any number of Unicode characters that form an "extended |
| grapheme cluster", and treats the sequence as an atomic group |
| <a href="#atomicgroup">(see below).</a> |
| Up to and including release 8.31, PCRE matched an earlier, simpler definition |
| that was equivalent to |
<pre> |
<pre> |
(?>\PM\pM*) |
(?>\PM\pM*) |
</pre> |
</pre> |
That is, it matches a character without the "mark" property, followed by zero | That is, it matched a character without the "mark" property, followed by zero |
or more characters with the "mark" property, and treats the sequence as an | or more characters with the "mark" property. Characters with the "mark" |
atomic group | property are typically non-spacing accents that affect the preceding character. |
<a href="#atomicgroup">(see below).</a> | |
Characters with the "mark" property are typically accents that affect the | |
preceding character. None of them have codepoints less than 256, so in | |
non-UTF-8 mode \X matches any one character. | |
</P> |
</P> |
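<P>
The earlier, simpler definition can be sketched outside PCRE. The following
Python fragment is only an illustration, not PCRE code: it assumes Python's
standard <b>unicodedata</b> module as the source of the "mark" property, and
the function names are hypothetical. It groups each character without the
"mark" property with the marks that follow it, which is what the pattern above
matched on each successive call.
</P>

```python
import unicodedata


def is_mark(ch):
    # The general categories Mn, Mc and Me together make up the "mark" (M) property.
    return unicodedata.category(ch).startswith("M")


def simple_clusters(text):
    # Group each non-mark character with the mark characters that follow it.
    # A mark with no preceding base character is kept as an item of its own
    # here; the pattern itself would simply fail to match at that position.
    clusters = []
    for ch in text:
        if is_mark(ch) and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "e" followed by U+0301 (combining acute accent) forms one unit; "a" is another.
print(simple_clusters("e\u0301a"))
```

<P>
This sketch does not implement the grapheme breaking rules used by later
releases; it covers only the older definition.
</P>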
<P> |
<P> |
Note that recent versions of Perl have changed \X to match what Unicode calls | This simple definition was extended in Unicode to include more complicated |
an "extended grapheme cluster", which has a more complicated definition. | kinds of composite character by giving each character a grapheme breaking |
| property, and creating rules that use these properties to define the boundaries |
| of extended grapheme clusters. In releases of PCRE later than 8.31, \X matches |
| one of these clusters. |
</P> |
</P> |
<P> |
<P> |
Matching characters by Unicode property is not fast, because PCRE has to search | \X always matches at least one character. Then it decides whether to add |
a structure that contains data for over fifteen thousand characters. That is | additional characters according to the following rules for ending a cluster: |
why the traditional escape sequences such as \d and \w do not use Unicode | </P> |
properties in PCRE by default, though you can make them do so by setting the | <P> |
PCRE_UCP option for <b>pcre_compile()</b> or by starting the pattern with | 1. End at the end of the subject string. |
(*UCP). | </P> |
| <P> |
| 2. Do not end between CR and LF; otherwise end after any control character. |
| </P> |
| <P> |
| 3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters |
| are of five types: L, V, T, LV, and LVT. An L character may be followed by an |
| L, V, LV, or LVT character; an LV or V character may be followed by a V or T |
| character; an LVT or T character may be followed only by a T character. |
| </P> |
| <P> |
| 4. Do not end before extending characters or spacing marks. Characters with |
| the "mark" property always have the "extend" grapheme breaking property. |
| </P> |
| <P> |
| 5. Do not end after prepend characters. |
| </P> |
| <P> |
| 6. Otherwise, end the cluster. |
<a name="extraprops"></a></P> |
<a name="extraprops"></a></P> |
<br><b> |
<br><b> |
PCRE's additional properties |
PCRE's additional properties |
</b><br> |
</b><br> |
<P> |
<P> |
As well as the standard Unicode properties described in the previous | As well as the standard Unicode properties described above, PCRE supports four |
section, PCRE supports four more that make it possible to convert traditional | more that make it possible to convert traditional escape sequences such as \w |
escape sequences such as \w and \s and POSIX character classes to use Unicode | and \s and POSIX character classes to use Unicode properties. PCRE uses these |
properties. PCRE uses these non-standard, non-Perl properties internally when | non-standard, non-Perl properties internally when PCRE_UCP is set. However, |
PCRE_UCP is set. They are: | they may also be used explicitly. These properties are: |
<pre> |
<pre> |
Xan Any alphanumeric character |
Xan Any alphanumeric character |
Xps Any POSIX space character |
Xps Any POSIX space character |
Line 804 PCRE_UCP is set. They are:
|
Line 923 PCRE_UCP is set. They are:
|
Xwd Any Perl "word" character |
Xwd Any Perl "word" character |
</pre> |
</pre> |
Xan matches characters that have either the L (letter) or the N (number) |
Xan matches characters that have either the L (letter) or the N (number) |
property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or | property. Xps matches the characters tab, linefeed, vertical tab, form feed, or |
carriage return, and any other character that has the Z (separator) property. |
carriage return, and any other character that has the Z (separator) property. |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
same characters as Xan, plus underscore. |
same characters as Xan, plus underscore. |
|
</P> |
|
<P> |
|
There is another non-standard property, Xuc, which matches any character that |
|
can be represented by a Universal Character Name in C++ and other programming |
|
languages. These are the characters $, @, ` (grave accent), and all characters |
|
with Unicode code points greater than or equal to U+00A0, except for the |
|
surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are |
|
excluded. (Universal Character Names are of the form \uHHHH or \UHHHHHHHH |
|
where H is a hexadecimal digit. Note that the Xuc property does not match these |
|
sequences but the characters that they represent.) |
<a name="resetmatchstart"></a></P> |
<a name="resetmatchstart"></a></P> |
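<P>
The definitions of these additional properties can be written out as simple
predicates. The following Python sketch is an illustration under stated
assumptions, not PCRE's implementation: it takes general categories from
Python's <b>unicodedata</b> module, whose Unicode version may differ from the
one PCRE was built with, and the function names are hypothetical.
</P>

```python
import unicodedata

# Tab, linefeed, vertical tab, form feed and carriage return.
POSIX_SPACE = {"\t", "\n", "\x0b", "\x0c", "\r"}


def xan(ch):
    # Alphanumeric: the L (letter) or N (number) property.
    return unicodedata.category(ch)[0] in ("L", "N")


def xps(ch):
    # POSIX space: the five characters above, or the Z (separator) property.
    return ch in POSIX_SPACE or unicodedata.category(ch)[0] == "Z"


def xsp(ch):
    # The same as Xps, except that vertical tab is excluded.
    return ch != "\x0b" and xps(ch)


def xwd(ch):
    # Perl "word" character: the same as Xan, plus underscore.
    return ch == "_" or xan(ch)


def xuc(ch):
    # $, @, grave accent, and code points from U+00A0 upwards, surrogates excluded.
    cp = ord(ch)
    if ch in "$@`":
        return True
    return cp >= 0xA0 and not (0xD800 <= cp <= 0xDFFF)


print(xwd("_"), xan("_"), xps("\x0b"), xsp("\x0b"), xuc("$"), xuc("A"))
```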
<br><b> |
<br><b> |
Resetting the match start |
Resetting the match start |
Line 865 escape sequence" error is generated instead.
|
Line 994 escape sequence" error is generated instead.
|
A word boundary is a position in the subject string where the current character |
A word boundary is a position in the subject string where the current character |
and the previous character do not both match \w or \W (i.e. one matches |
and the previous character do not both match \w or \W (i.e. one matches |
\w and the other matches \W), or the start or end of the string if the |
\w and the other matches \W), or the start or end of the string if the |
first or last character matches \w, respectively. In UTF-8 mode, the meanings | first or last character matches \w, respectively. In a UTF mode, the meanings |
of \w and \W can be changed by setting the PCRE_UCP option. When this is |
of \w and \W can be changed by setting the PCRE_UCP option. When this is |
done, it also affects \b and \B. Neither PCRE nor Perl has a separate "start |
done, it also affects \b and \B. Neither PCRE nor Perl has a separate "start |
of word" or "end of word" metasequence. However, whatever follows \b normally |
of word" or "end of word" metasequence. However, whatever follows \b normally |
Line 904 If all the alternatives of a pattern begin with \G, th
|
Line 1033 If all the alternatives of a pattern begin with \G, th
|
to the starting match position, and the "anchored" flag is set in the compiled |
to the starting match position, and the "anchored" flag is set in the compiled |
regular expression. |
regular expression. |
</P> |
</P> |
<br><a name="SEC5" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br> | <br><a name="SEC6" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br> |
<P> |
<P> |
|
The circumflex and dollar metacharacters are zero-width assertions. That is, |
|
they test for a particular condition being true without consuming any |
|
characters from the subject string. |
|
</P> |
|
<P> |
Outside a character class, in the default matching mode, the circumflex |
Outside a character class, in the default matching mode, the circumflex |
character is an assertion that is true only if the current matching point is | character is an assertion that is true only if the current matching point is at |
at the start of the subject string. If the <i>startoffset</i> argument of | the start of the subject string. If the <i>startoffset</i> argument of |
<b>pcre_exec()</b> is non-zero, circumflex can never match if the PCRE_MULTILINE |
<b>pcre_exec()</b> is non-zero, circumflex can never match if the PCRE_MULTILINE |
option is unset. Inside a character class, circumflex has an entirely different |
option is unset. Inside a character class, circumflex has an entirely different |
meaning |
meaning |
Line 924 constrained to match only at the start of the subject,
|
Line 1058 constrained to match only at the start of the subject,
|
to be anchored.) |
to be anchored.) |
</P> |
</P> |
<P> |
<P> |
A dollar character is an assertion that is true only if the current matching | The dollar character is an assertion that is true only if the current matching |
point is at the end of the subject string, or immediately before a newline | point is at the end of the subject string, or immediately before a newline at |
at the end of the string (by default). Dollar need not be the last character of | the end of the string (by default). Note, however, that it does not actually |
the pattern if a number of alternatives are involved, but it should be the last | match the newline. Dollar need not be the last character of the pattern if a |
item in any branch in which it appears. Dollar has no special meaning in a | number of alternatives are involved, but it should be the last item in any |
character class. | branch in which it appears. Dollar has no special meaning in a character class. |
</P> |
</P> |
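<P>
The default behaviour of dollar can be observed in any engine that follows the
Perl convention. The sketch below uses Python's <b>re</b> module, which is not
PCRE but treats an unadorned dollar in the same way by default; the function
name is hypothetical. Note that the reported match ends before the final
newline, which is not part of the matched text.
</P>

```python
import re


def dollar_span(subject):
    # Return the (start, end) offsets matched by abc$ in subject, or None.
    m = re.search(r"abc$", subject)
    return m.span() if m else None


print(dollar_span("abc"))        # (0, 3): at the very end of the subject
print(dollar_span("abc\n"))      # (0, 3): before a newline that ends the subject
print(dollar_span("abc\nxyz"))   # None: the newline is not at the end
```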
<P> |
<P> |
The meaning of dollar can be changed so that it matches only at the very end of |
The meaning of dollar can be changed so that it matches only at the very end of |
Line 958 Note that the sequences \A, \Z, and \z can be used to
|
Line 1092 Note that the sequences \A, \Z, and \z can be used to
|
end of the subject in both modes, and if all branches of a pattern start with |
end of the subject in both modes, and if all branches of a pattern start with |
\A it is always anchored, whether or not PCRE_MULTILINE is set. |
\A it is always anchored, whether or not PCRE_MULTILINE is set. |
<a name="fullstopdot"></a></P> |
<a name="fullstopdot"></a></P> |
<br><a name="SEC6" href="#TOC1">FULL STOP (PERIOD, DOT) AND \N</a><br> | <br><a name="SEC7" href="#TOC1">FULL STOP (PERIOD, DOT) AND \N</a><br> |
<P> |
<P> |
Outside a character class, a dot in the pattern matches any one character in |
Outside a character class, a dot in the pattern matches any one character in |
the subject string except (by default) a character that signifies the end of a |
the subject string except (by default) a character that signifies the end of a |
line. In UTF-8 mode, the matched character may be more than one byte long. | line. |
</P> |
</P> |
<P> |
<P> |
When a line ending is defined as a single character, dot never matches that |
When a line ending is defined as a single character, dot never matches that |
Line 989 the PCRE_DOTALL option. In other words, it matches any
|
Line 1123 the PCRE_DOTALL option. In other words, it matches any
|
that signifies the end of a line. Perl also uses \N to match characters by |
that signifies the end of a line. Perl also uses \N to match characters by |
name; PCRE does not support this. |
name; PCRE does not support this. |
</P> |
</P> |
<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br> | <br><a name="SEC8" href="#TOC1">MATCHING A SINGLE DATA UNIT</a><br> |
<P> |
<P> |
Outside a character class, the escape sequence \C matches any one byte, both | Outside a character class, the escape sequence \C matches any one data unit, |
in and out of UTF-8 mode. Unlike a dot, it always matches line-ending | whether or not a UTF mode is set. In the 8-bit library, one data unit is one |
characters. The feature is provided in Perl in order to match individual bytes | byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is |
in UTF-8 mode, but it is unclear how it can usefully be used. Because \C | a 32-bit unit. Unlike a dot, \C always |
breaks up characters into individual bytes, matching one byte with \C in UTF-8 | matches line-ending characters. The feature is provided in Perl in order to |
mode means that the rest of the string may start with a malformed UTF-8 | match individual bytes in UTF-8 mode, but it is unclear how it can usefully be |
character. This has undefined results, because PCRE assumes that it is dealing | used. Because \C breaks up characters into individual data units, matching one |
with valid UTF-8 strings (and by default it checks this at the start of | unit with \C in a UTF mode means that the rest of the string may start with a |
processing unless the PCRE_NO_UTF8_CHECK option is used). | malformed UTF character. This has undefined results, because PCRE assumes that |
| it is dealing with valid UTF strings (and by default it checks this at the |
| start of processing unless the PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or |
| PCRE_NO_UTF32_CHECK option is used). |
</P> |
</P> |
<P> |
<P> |
PCRE does not allow \C to appear in lookbehind assertions |
PCRE does not allow \C to appear in lookbehind assertions |
<a href="#lookbehind">(described below)</a> |
<a href="#lookbehind">(described below)</a> |
in UTF-8 mode, because this would make it impossible to calculate the length of | in a UTF mode, because this would make it impossible to calculate the length of |
the lookbehind. |
the lookbehind. |
</P> |
</P> |
<P> |
<P> |
In general, the \C escape sequence is best avoided in UTF-8 mode. However, one | In general, the \C escape sequence is best avoided. However, one |
way of using it that avoids the problem of malformed UTF-8 characters is to | way of using it that avoids the problem of malformed UTF characters is to use a |
use a lookahead to check the length of the next character, as in this pattern | lookahead to check the length of the next character, as in this pattern, which |
(ignore white space and line breaks): | could be used with a UTF-8 string (ignore white space and line breaks): |
<pre> |
<pre> |
(?| (?=[\x00-\x7f])(\C) | |
(?| (?=[\x00-\x7f])(\C) | |
(?=[\x80-\x{7ff}])(\C)(\C) | |
(?=[\x80-\x{7ff}])(\C)(\C) | |
Line 1026 character for values whose encoding uses 1, 2, 3, or 4
|
Line 1163 character for values whose encoding uses 1, 2, 3, or 4
|
character's individual bytes are then captured by the appropriate number of |
character's individual bytes are then captured by the appropriate number of |
groups. |
groups. |
<a name="characterclass"></a></P> |
<a name="characterclass"></a></P> |
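<P>
The lookahead technique relies on the fact that the value of a character
determines how many bytes its UTF-8 encoding occupies. The following Python
sketch is an illustration of that relationship, not PCRE code, and the function
names are hypothetical. It applies the standard UTF-8 boundaries (U+007F,
U+07FF and U+FFFF) and returns the individual bytes that the capturing groups
would receive; it assumes a valid character, so surrogates are not handled.
</P>

```python
def utf8_length(ch):
    # Number of bytes in the UTF-8 encoding, chosen by code point range.
    cp = ord(ch)
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


def utf8_bytes(ch):
    # The individual bytes of one character, one per captured data unit.
    data = ch.encode("utf-8")
    # Cross-check the range table against the encoder.
    assert len(data) == utf8_length(ch)
    return list(data)


for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(hex(ord(ch)), [hex(b) for b in utf8_bytes(ch)])
```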
<br><a name="SEC8" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br> | <br><a name="SEC9" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br> |
<P> |
<P> |
An opening square bracket introduces a character class, terminated by a closing |
An opening square bracket introduces a character class, terminated by a closing |
square bracket. A closing square bracket on its own is not special by default. |
square bracket. A closing square bracket on its own is not special by default. |
Line 1036 a member of the class, it should be the first data cha
|
Line 1173 a member of the class, it should be the first data cha
|
(after an initial circumflex, if present) or escaped with a backslash. |
(after an initial circumflex, if present) or escaped with a backslash. |
</P> |
</P> |
<P> |
<P> |
A character class matches a single character in the subject. In UTF-8 mode, the | A character class matches a single character in the subject. In a UTF mode, the |
character may be more than one byte long. A matched character must be in the | character may be more than one data unit long. A matched character must be in |
set of characters defined by the class, unless the first character in the class | the set of characters defined by the class, unless the first character in the |
definition is a circumflex, in which case the subject character must not be in | class definition is a circumflex, in which case the subject character must not |
the set defined by the class. If a circumflex is actually required as a member | be in the set defined by the class. If a circumflex is actually required as a |
of the class, ensure it is not the first character, or escape it with a | member of the class, ensure it is not the first character, or escape it with a |
backslash. |
backslash. |
</P> |
</P> |
<P> |
<P> |
Line 1054 string, and therefore it fails if the current pointer
|
Line 1191 string, and therefore it fails if the current pointer
|
string. |
string. |
</P> |
</P> |
<P> |
<P> |
In UTF-8 mode, characters with values greater than 255 can be included in a | In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255 (0xffff) |
class as a literal string of bytes, or by using the \x{ escaping mechanism. | can be included in a class as a literal string of data units, or by using the |
| \x{ escaping mechanism. |
</P> |
</P> |
<P> |
<P> |
When caseless matching is set, any letters in a class represent both their |
When caseless matching is set, any letters in a class represent both their |
upper case and lower case versions, so for example, a caseless [aeiou] matches |
upper case and lower case versions, so for example, a caseless [aeiou] matches |
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a |
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a |
caseful version would. In UTF-8 mode, PCRE always understands the concept of | caseful version would. In a UTF mode, PCRE always understands the concept of |
case for characters whose values are less than 128, so caseless matching is |
case for characters whose values are less than 128, so caseless matching is |
always possible. For characters with higher values, the concept of case is |
always possible. For characters with higher values, the concept of case is |
supported if PCRE is compiled with Unicode property support, but not otherwise. |
supported if PCRE is compiled with Unicode property support, but not otherwise. |
If you want to use caseless matching in UTF8-mode for characters 128 and above, | If you want to use caseless matching in a UTF mode for characters 128 and |
you must ensure that PCRE is compiled with Unicode property support as well as | above, you must ensure that PCRE is compiled with Unicode property support as |
with UTF-8 support. | well as with UTF support. |
</P> |
</P> |
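<P>
The effect of caseless matching on a class and on its negation can be checked
with a short test. The sketch below uses Python's <b>re</b> module rather than
PCRE, on the assumption that the two agree for these ASCII letters; the helper
name is hypothetical.
</P>

```python
import re


def in_class(char_class, ch, caseless=False):
    # True if the single character ch is matched by the given character class.
    flags = re.IGNORECASE if caseless else 0
    return re.fullmatch(char_class, ch, flags) is not None


print(in_class("[aeiou]", "A", caseless=True))    # caseless [aeiou] against "A"
print(in_class("[^aeiou]", "A", caseless=True))   # caseless [^aeiou] against "A"
print(in_class("[^aeiou]", "A"))                  # caseful [^aeiou] against "A"
```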
<P> |
<P> |
Characters that might indicate line breaks are never treated in any special way |
Characters that might indicate line breaks are never treated in any special way |
Line 1093 followed by two other characters. The octal or hexadec
|
Line 1231 followed by two other characters. The octal or hexadec
|
</P> |
</P> |
<P> |
<P> |
Ranges operate in the collating sequence of character values. They can also be |
Ranges operate in the collating sequence of character values. They can also be |
used for characters specified numerically, for example [\000-\037]. In UTF-8 | used for characters specified numerically, for example [\000-\037]. Ranges |
mode, ranges can include characters whose values are greater than 255, for | can include any characters that are valid for the current mode. |
example [\x{100}-\x{2ff}]. | |
</P> |
</P> |
<P> |
<P> |
If a range that includes letters is used when caseless matching is set, it |
If a range that includes letters is used when caseless matching is set, it |
matches the letters in either case. For example, [W-c] is equivalent to |
matches the letters in either case. For example, [W-c] is equivalent to |
[][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if character | [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character |
tables for a French locale are in use, [\xc8-\xcb] matches accented E |
tables for a French locale are in use, [\xc8-\xcb] matches accented E |
characters in both cases. In UTF-8 mode, PCRE supports the concept of case for | characters in both cases. In UTF modes, PCRE supports the concept of case for |
characters with values greater than 127 only when it is compiled with Unicode |
characters with values greater than 127 only when it is compiled with Unicode |
property support. |
property support. |
</P> |
</P> |
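<P>
The [W-c] example can be verified by listing every ASCII character that the
range matches with and without caseless matching. The sketch below uses
Python's <b>re</b> module as a stand-in for PCRE, on the assumption that the
two behave alike for ASCII ranges; the function name is hypothetical.
</P>

```python
import re


def ascii_members(char_class, caseless=False):
    # The set of ASCII characters matched by the given character class.
    flags = re.IGNORECASE if caseless else 0
    pattern = re.compile(char_class, flags)
    return {chr(c) for c in range(128) if pattern.fullmatch(chr(c))}


caseful = ascii_members("[W-c]")
caseless = ascii_members("[W-c]", caseless=True)

# Caseless matching adds the other case of each letter in the range.
print(sorted(caseless - caseful))
```

<P>
The characters added by caseless matching are the lower case forms of W, X, Y
and Z and the upper case forms of a, b and c; the punctuation characters in the
range are unaffected.
</P>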
Line 1110 property support.
|
Line 1247 property support.
|
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, |
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, |
\V, \w, and \W may appear in a character class, and add the characters that |
\V, \w, and \W may appear in a character class, and add the characters that |
they match to the class. For example, [\dABCDEF] matches any hexadecimal |
they match to the class. For example, [\dABCDEF] matches any hexadecimal |
digit. In UTF-8 mode, the PCRE_UCP option affects the meanings of \d, \s, \w | digit. In UTF modes, the PCRE_UCP option affects the meanings of \d, \s, \w |
and their upper case partners, just as it does when they appear outside a |
and their upper case partners, just as it does when they appear outside a |
character class, as described in the section entitled |
character class, as described in the section entitled |
<a href="#genericchartypes">"Generic character types"</a> |
<a href="#genericchartypes">"Generic character types"</a> |
Line 1136 introducing a POSIX class name - see the next section)
|
Line 1273 introducing a POSIX class name - see the next section)
|
closing square bracket. However, escaping other non-alphanumeric characters |
closing square bracket. However, escaping other non-alphanumeric characters |
does no harm. |
does no harm. |
</P> |
</P> |
<br><a name="SEC9" href="#TOC1">POSIX CHARACTER CLASSES</a><br> | <br><a name="SEC10" href="#TOC1">POSIX CHARACTER CLASSES</a><br> |
<P> |
<P> |
Perl supports the POSIX notation for character classes. This uses names |
Perl supports the POSIX notation for character classes. This uses names |
enclosed by [: and :] within the enclosing square brackets. PCRE also supports |
enclosed by [: and :] within the enclosing square brackets. PCRE also supports |
Line 1179 syntax [.ch.] and [=ch=] where "ch" is a "collating el
|
Line 1316 syntax [.ch.] and [=ch=] where "ch" is a "collating el
|
supported, and an error is given if they are encountered. |
supported, and an error is given if they are encountered. |
</P> |
</P> |
<P> |
<P> |
By default, in UTF-8 mode, characters with values greater than 127 do not match | By default, in UTF modes, characters with values greater than 127 do not match |
any of the POSIX character classes. However, if the PCRE_UCP option is passed |
any of the POSIX character classes. However, if the PCRE_UCP option is passed |
to <b>pcre_compile()</b>, some of the classes are changed so that Unicode |
to <b>pcre_compile()</b>, some of the classes are changed so that Unicode |
character properties are used. This is achieved by replacing the POSIX classes |
character properties are used. This is achieved by replacing the POSIX classes |
Line 1198 Negated versions, such as [:^alpha:] use \P instead of
|
Line 1335 Negated versions, such as [:^alpha:] use \P instead of
|
classes are unchanged, and match only characters with code points less than |
classes are unchanged, and match only characters with code points less than |
128. |
128. |
</P> |
</P> |
<br><a name="SEC10" href="#TOC1">VERTICAL BAR</a><br> | <br><a name="SEC11" href="#TOC1">VERTICAL BAR</a><br> |
<P> |
<P> |
Vertical bar characters are used to separate alternative patterns. For example, |
Vertical bar characters are used to separate alternative patterns. For example, |
the pattern |
the pattern |
Line 1213 that succeeds is used. If the alternatives are within
|
Line 1350 that succeeds is used. If the alternatives are within
|
"succeeds" means matching the rest of the main pattern as well as the |
"succeeds" means matching the rest of the main pattern as well as the |
alternative in the subpattern. |
alternative in the subpattern. |
</P> |
</P> |
<br><a name="SEC11" href="#TOC1">INTERNAL OPTION SETTING</a><br> | <br><a name="SEC12" href="#TOC1">INTERNAL OPTION SETTING</a><br> |
<P> |
<P> |
The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and |
The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and |
PCRE_EXTENDED options (which are Perl-compatible) can be changed from within |
PCRE_EXTENDED options (which are Perl-compatible) can be changed from within |
Line 1264 behaviour otherwise.
|
Line 1401 behaviour otherwise.
|
</P> |
</P> |
<P> |
<P> |
<b>Note:</b> There are other PCRE-specific options that can be set by the |
<b>Note:</b> There are other PCRE-specific options that can be set by the |
application when the compile or match functions are called. In some cases the | application when the compiling or matching functions are called. In some cases |
pattern can contain special leading sequences such as (*CRLF) to override what | the pattern can contain special leading sequences such as (*CRLF) to override |
the application has set or what has been defaulted. Details are given in the | what the application has set or what has been defaulted. Details are given in |
section entitled | the section entitled |
<a href="#newlineseq">"Newline sequences"</a> |
<a href="#newlineseq">"Newline sequences"</a> |
above. There are also the (*UTF8) and (*UCP) leading sequences that can be used | above. There are also the (*UTF8), (*UTF16), (*UTF32), and (*UCP) leading |
to set UTF-8 and Unicode property modes; they are equivalent to setting the | sequences that can be used to set UTF and Unicode property modes; they are |
PCRE_UTF8 and the PCRE_UCP options, respectively. | equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP |
| options, respectively. The (*UTF) sequence is a generic version that can be |
| used with any of the libraries. However, the application can set the |
| PCRE_NEVER_UTF option, which locks out the use of the (*UTF) sequences. |
<a name="subpattern"></a></P> |
<a name="subpattern"></a></P> |
<br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br> | <br><a name="SEC13" href="#TOC1">SUBPATTERNS</a><br> |
<P> |
<P> |
Subpatterns are delimited by parentheses (round brackets), which can be nested. |
Subpatterns are delimited by parentheses (round brackets), which can be nested. |
Turning part of a pattern into a subpattern does two things: |
Turning part of a pattern into a subpattern does two things: |
Line 1289 match "cataract", "erpillar" or an empty string.
|
Line 1429 match "cataract", "erpillar" or an empty string.
|
<br> |
<br> |
2. It sets up the subpattern as a capturing subpattern. This means that, when |
2. It sets up the subpattern as a capturing subpattern. This means that, when |
the whole pattern matches, that portion of the subject string that matched the |
the whole pattern matches, that portion of the subject string that matched the |
subpattern is passed back to the caller via the <i>ovector</i> argument of | subpattern is passed back to the caller via the <i>ovector</i> argument of the |
<b>pcre_exec()</b>. Opening parentheses are counted from left to right (starting | matching function. (This applies only to the traditional matching functions; |
from 1) to obtain numbers for the capturing subpatterns. For example, if the | the DFA matching functions do not support capturing.) |
string "the red king" is matched against the pattern | </P> |
| <P> |
| Opening parentheses are counted from left to right (starting from 1) to obtain |
| numbers for the capturing subpatterns. For example, if the string "the red |
| king" is matched against the pattern |
<pre> |
<pre> |
the ((red|white) (king|queen)) |
the ((red|white) (king|queen)) |
</pre> |
</pre> |
Line 1325 from left to right, and options are not reset until th
|
Line 1469 from left to right, and options are not reset until th
|
is reached, an option setting in one branch does affect subsequent branches, so |
is reached, an option setting in one branch does affect subsequent branches, so |
the above patterns match "SUNDAY" as well as "Saturday". |
the above patterns match "SUNDAY" as well as "Saturday". |
<a name="dupsubpatternnumber"></a></P> |
<a name="dupsubpatternnumber"></a></P> |
<br><a name="SEC13" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br> | <br><a name="SEC14" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br> |
<P> |
<P> |
Perl 5.10 introduced a feature whereby each alternative in a subpattern uses |
Perl 5.10 introduced a feature whereby each alternative in a subpattern uses |
the same numbers for its capturing parentheses. Such a subpattern starts with |
the same numbers for its capturing parentheses. Such a subpattern starts with |
Line 1369 true if any of the subpatterns of that number have mat
|
Line 1513 true if any of the subpatterns of that number have mat
|
An alternative approach to using this "branch reset" feature is to use |
An alternative approach to using this "branch reset" feature is to use |
duplicate named subpatterns, as described in the next section. |
duplicate named subpatterns, as described in the next section. |
</P> |
</P> |
<br><a name="SEC14" href="#TOC1">NAMED SUBPATTERNS</a><br> | <br><a name="SEC15" href="#TOC1">NAMED SUBPATTERNS</a><br> |
<P> |
<P> |
Identifying capturing parentheses by number is simple, but it can be very hard |
Identifying capturing parentheses by number is simple, but it can be very hard |
to keep track of the numbers in complicated regular expressions. Furthermore, |
to keep track of the numbers in complicated regular expressions. Furthermore, |
Line 1444 matching. For this reason, an error is given at compil
|
Line 1588 matching. For this reason, an error is given at compil
|
are given to subpatterns with the same number. However, you can give the same |
are given to subpatterns with the same number. However, you can give the same |
name to subpatterns with the same number, even when PCRE_DUPNAMES is not set. |
name to subpatterns with the same number, even when PCRE_DUPNAMES is not set. |
</P> |
</P> |
<br><a name="SEC15" href="#TOC1">REPETITION</a><br> | <br><a name="SEC16" href="#TOC1">REPETITION</a><br> |
<P> |
<P> |
Repetition is specified by quantifiers, which can follow any of the following |
Repetition is specified by quantifiers, which can follow any of the following |
items: |
items: |
Line 1452 items:
|
Line 1596 items:
|
a literal data character |
a literal data character |
the dot metacharacter |
the dot metacharacter |
the \C escape sequence |
the \C escape sequence |
the \X escape sequence (in UTF-8 mode with Unicode properties) | the \X escape sequence |
the \R escape sequence |
the \R escape sequence |
an escape such as \d or \pL that matches a single character |
an escape such as \d or \pL that matches a single character |
a character class |
a character class |
Line 1484 quantifier, is taken as a literal character. For examp
|
Line 1628 quantifier, is taken as a literal character. For examp
|
quantifier, but a literal string of four characters. |
quantifier, but a literal string of four characters. |
</P> |
</P> |
<P> |
<P> |
In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual | In UTF modes, quantifiers apply to characters rather than to individual data |
bytes. Thus, for example, \x{100}{2} matches two UTF-8 characters, each of | units. Thus, for example, \x{100}{2} matches two characters, each of |
which is represented by a two-byte sequence. Similarly, when Unicode property | which is represented by a two-byte sequence in a UTF-8 string. Similarly, |
support is available, \X{3} matches three Unicode extended sequences, each of | \X{3} matches three Unicode extended grapheme clusters, each of which may be |
which may be several bytes long (and they may be of different lengths). | several data units long (and they may be of different lengths). |
</P> |
</P> |
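<P>
The distinction between characters and data units can be sketched outside PCRE.
The following is only an illustration using Python's <b>re</b> module (an
assumption made for this example, not part of PCRE), which also applies
quantifiers to whole characters. The variable names are hypothetical.
</P>

```python
import re

# U+0100 is one character, but it occupies two bytes when encoded as UTF-8.
ch = "\u0100"
utf8_len = len(ch.encode("utf-8"))

# The {2} quantifier repeats the whole character, not one of its bytes.
m = re.fullmatch("\u0100{2}", ch * 2)

matched_chars = len(m.group(0))                  # characters matched
matched_bytes = len(m.group(0).encode("utf-8"))  # data units in UTF-8
```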
<P> |
<P> |
The quantifier {0} is permitted, causing the expression to behave as if the |
The quantifier {0} is permitted, causing the expression to behave as if the |
Line 1577 worth setting PCRE_DOTALL in order to obtain this opti
|
Line 1721 worth setting PCRE_DOTALL in order to obtain this opti
|
alternatively using ^ to indicate anchoring explicitly. |
alternatively using ^ to indicate anchoring explicitly. |
</P> |
</P> |
<P> |
<P> |
However, there is one situation where the optimization cannot be used. When .* | However, there are some cases where the optimization cannot be used. When .* |
is inside capturing parentheses that are the subject of a back reference |
is inside capturing parentheses that are the subject of a back reference |
elsewhere in the pattern, a match at the start may fail where a later one |
elsewhere in the pattern, a match at the start may fail where a later one |
succeeds. Consider, for example: |
succeeds. Consider, for example: |
Line 1588 If the subject is "xyz123abc123" the match point is th
|
Line 1732 If the subject is "xyz123abc123" the match point is th
|
this reason, such a pattern is not implicitly anchored. |
this reason, such a pattern is not implicitly anchored. |
</P> |
</P> |
<P> |
<P> |
|
Another case where implicit anchoring is not applied is when the leading .* is |
|
inside an atomic group. Once again, a match at the start may fail where a later |
|
one succeeds. Consider this pattern: |
|
<pre> |
|
(?>.*?a)b |
|
</pre> |
|
It matches "ab" in the subject "aab". The use of the backtracking control verbs |
|
(*PRUNE) and (*SKIP) also disables this optimization. |
|
</P> |
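<P>
The effect described above can be reproduced in a rough sketch with Python's
<b>re</b> module (an assumption for illustration; it is not PCRE). To stay
portable, the atomic group is emulated with a lookahead plus a back reference,
a common substitute for (?>...). Recent Python versions also accept the atomic
syntax directly.
</P>

```python
import re

subject = "aab"

# Ordinary group: when "b" fails at offset 1, the lazy .*? is allowed to
# grow, so a match that starts at offset 0 is eventually found.
plain = re.search(r"(?:.*?a)b", subject)

# Emulated atomic group: the lookahead captures what .*?a matches and is
# never re-entered, so the attempt at offset 0 fails outright and the
# match is found only by advancing the starting point to offset 1.
atomic = re.search(r"(?=(.*?a))\1b", subject)

plain_span = plain.span()
atomic_span = atomic.span()
atomic_text = atomic.group(0)
```

<P>
Because the atomic version succeeds only at a later starting point, a pattern
of this shape cannot safely be treated as anchored.
</P>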
|
<P> |
When a capturing subpattern is repeated, the value captured is the substring |
When a capturing subpattern is repeated, the value captured is the substring |
that matched the final iteration. For example, after |
that matched the final iteration. For example, after |
<pre> |
<pre> |
Line 1602 example, after
|
Line 1756 example, after
|
</pre> |
</pre> |
matches "aba" the value of the second captured substring is "b". |
matches "aba" the value of the second captured substring is "b". |
<a name="atomicgroup"></a></P> |
<a name="atomicgroup"></a></P> |
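<P>
As a small illustration of the "final iteration" rule, here is a sketch using
Python's <b>re</b> module, which behaves the same way for this simple case. The
pattern and subject are hypothetical and are not taken from the text above.
</P>

```python
import re

# The group is repeated three times; only the last iteration is retained.
m = re.fullmatch(r"(\d)+", "123")

whole = m.group(0)   # everything the repeated group consumed
last = m.group(1)    # the substring matched by the final iteration
```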
<br><a name="SEC16" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br> | <br><a name="SEC17" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br> |
<P> |
<P> |
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
repetition, failure of what follows normally causes the repeated item to be |
repetition, failure of what follows normally causes the repeated item to be |
Line 1706 an atomic group, like this:
|
Line 1860 an atomic group, like this:
|
</pre> |
</pre> |
sequences of non-digits cannot be broken, and failure happens quickly. |
sequences of non-digits cannot be broken, and failure happens quickly. |
<a name="backreferences"></a></P> |
<a name="backreferences"></a></P> |
<br><a name="SEC17" href="#TOC1">BACK REFERENCES</a><br> | <br><a name="SEC18" href="#TOC1">BACK REFERENCES</a><br> |
<P> |
<P> |
Outside a character class, a backslash followed by a digit greater than 0 (and |
Outside a character class, a backslash followed by a digit greater than 0 (and |
possibly further digits) is a back reference to a capturing subpattern earlier |
possibly further digits) is a back reference to a capturing subpattern earlier |
Line 1805 Because there may be many capturing parentheses in a p
|
Line 1959 Because there may be many capturing parentheses in a p
|
following a backslash are taken as part of a potential back reference number. |
following a backslash are taken as part of a potential back reference number. |
If the pattern continues with a digit character, some delimiter must be used to |
If the pattern continues with a digit character, some delimiter must be used to |
terminate the back reference. If the PCRE_EXTENDED option is set, this can be |
terminate the back reference. If the PCRE_EXTENDED option is set, this can be |
whitespace. Otherwise, the \g{ syntax or an empty comment (see | white space. Otherwise, the \g{ syntax or an empty comment (see |
<a href="#comments">"Comments"</a> |
<a href="#comments">"Comments"</a> |
below) can be used. |
below) can be used. |
</P> |
</P> |
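<P>
The need for a delimiter can be sketched with Python's <b>re</b> module (an
assumption for illustration; it is not PCRE and does not support the \g{
syntax in patterns). The hypothetical task is to match a doubled character
followed by a literal "0".
</P>

```python
import re

# Written naively, the trailing digit is absorbed into the reference number.
# Python reads \10 as group 10, which does not exist here, and rejects it.
try:
    re.compile(r"(\w)\10")
    naive_compiles = True
except re.error:
    naive_compiles = False

# Option 1: in extended (verbose) mode, white space ends the reference.
with_space = re.compile(r"(\w)\1 0", re.VERBOSE)

# Option 2: an empty comment separates the reference from the digit.
with_comment = re.compile(r"(\w)\1(?#)0")

space_ok = with_space.fullmatch("aa0") is not None
comment_ok = with_comment.fullmatch("aa0") is not None
mismatch = with_comment.fullmatch("ab0") is not None
```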
Line 1834 as an
|
Line 1988 as an
|
Once the whole group has been matched, a subsequent matching failure cannot |
Once the whole group has been matched, a subsequent matching failure cannot |
cause backtracking into the middle of the group. |
cause backtracking into the middle of the group. |
<a name="bigassertions"></a></P> |
<a name="bigassertions"></a></P> |
<br><a name="SEC18" href="#TOC1">ASSERTIONS</a><br> | <br><a name="SEC19" href="#TOC1">ASSERTIONS</a><br> |
<P> |
<P> |
An assertion is a test on the characters following or preceding the current |
An assertion is a test on the characters following or preceding the current |
matching point that does not actually consume any characters. The simple |
matching point that does not actually consume any characters. The simple |
Line 1851 except that it does not cause the current matching pos
|
Line 2005 except that it does not cause the current matching pos
|
Assertion subpatterns are not capturing subpatterns. If such an assertion |
Assertion subpatterns are not capturing subpatterns. If such an assertion |
contains capturing subpatterns within it, these are counted for the purposes of |
contains capturing subpatterns within it, these are counted for the purposes of |
numbering the capturing subpatterns in the whole pattern. However, substring |
numbering the capturing subpatterns in the whole pattern. However, substring |
capturing is carried out only for positive assertions, because it does not make | capturing is carried out only for positive assertions. (Perl sometimes, but not |
sense for negative assertions. | always, does do capturing in negative assertions.) |
</P> |
</P> |
<P> |
<P> |
For compatibility with Perl, assertion subpatterns may be repeated; though |
For compatibility with Perl, assertion subpatterns may be repeated; though |
Line 1950 match. If there are insufficient characters before the
|
Line 2104 match. If there are insufficient characters before the
|
assertion fails. |
assertion fails. |
</P> |
</P> |
<P> |
<P> |
In UTF-8 mode, PCRE does not allow the \C escape (which matches a single byte, | In a UTF mode, PCRE does not allow the \C escape (which matches a single data |
even in UTF-8 mode) to appear in lookbehind assertions, because it makes it | unit even in a UTF mode) to appear in lookbehind assertions, because it makes |
impossible to calculate the length of the lookbehind. The \X and \R escapes, | it impossible to calculate the length of the lookbehind. The \X and \R |
which can match different numbers of bytes, are also not permitted. | escapes, which can match different numbers of data units, are also not |
| permitted. |
</P> |
</P> |
<P> |
<P> |
<a href="#subpatternsassubroutines">"Subroutine"</a> |
<a href="#subpatternsassubroutines">"Subroutine"</a> |
Line 2023 preceded by "foo", while
|
Line 2178 preceded by "foo", while
|
is another pattern that matches "foo" preceded by three digits and any three |
is another pattern that matches "foo" preceded by three digits and any three |
characters that are not "999". |
characters that are not "999". |
<a name="conditions"></a></P> |
<a name="conditions"></a></P> |
<br><a name="SEC19" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br> | <br><a name="SEC20" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br> |
<P> |
<P> |
It is possible to cause the matching process to obey a subpattern |
It is possible to cause the matching process to obey a subpattern |
conditionally or to choose between two alternative subpatterns, depending on |
conditionally or to choose between two alternative subpatterns, depending on |
Line 2146 point in the pattern; the idea of DEFINE is that it ca
|
Line 2301 point in the pattern; the idea of DEFINE is that it ca
|
subroutines that can be referenced from elsewhere. (The use of |
subroutines that can be referenced from elsewhere. (The use of |
<a href="#subpatternsassubroutines">subroutines</a> |
<a href="#subpatternsassubroutines">subroutines</a> |
is described below.) For example, a pattern to match an IPv4 address such as |
is described below.) For example, a pattern to match an IPv4 address such as |
"192.168.23.245" could be written like this (ignore whitespace and line | "192.168.23.245" could be written like this (ignore white space and line |
breaks): |
breaks): |
<pre> |
<pre> |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
Line 2178 subject is matched against the first alternative; othe
|
Line 2333 subject is matched against the first alternative; othe
|
against the second. This pattern matches strings in one of the two forms |
against the second. This pattern matches strings in one of the two forms |
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. |
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. |
<a name="comments"></a></P> |
<a name="comments"></a></P> |
<br><a name="SEC20" href="#TOC1">COMMENTS</a><br> | <br><a name="SEC21" href="#TOC1">COMMENTS</a><br> |
<P> |
<P> |
There are two ways of including comments in patterns that are processed by |
There are two ways of including comments in patterns that are processed by |
PCRE. In both cases, the start of the comment must not be in a character class, |
PCRE. In both cases, the start of the comment must not be in a character class, |
Line 2192 closing parenthesis. Nested parentheses are not permit
|
Line 2347 closing parenthesis. Nested parentheses are not permit
|
option is set, an unescaped # character also introduces a comment, which in |
option is set, an unescaped # character also introduces a comment, which in |
this case continues to immediately after the next newline character or |
this case continues to immediately after the next newline character or |
character sequence in the pattern. Which characters are interpreted as newlines |
character sequence in the pattern. Which characters are interpreted as newlines |
is controlled by the options passed to <b>pcre_compile()</b> or by a special | is controlled by the options passed to a compiling function or by a special |
sequence at the start of the pattern, as described in the section entitled |
sequence at the start of the pattern, as described in the section entitled |
<a href="#newlines">"Newline conventions"</a> |
<a href="#newlines">"Newline conventions"</a> |
above. Note that the end of this type of comment is a literal newline sequence |
above. Note that the end of this type of comment is a literal newline sequence |
Line 2207 a newline in the pattern. The sequence \n is still lit
|
Line 2362 a newline in the pattern. The sequence \n is still lit
|
it does not terminate the comment. Only an actual character with the code value |
it does not terminate the comment. Only an actual character with the code value |
0x0a (the default newline) does so. |
0x0a (the default newline) does so. |
<a name="recursion"></a></P> |
<a name="recursion"></a></P> |
<br><a name="SEC21" href="#TOC1">RECURSIVE PATTERNS</a><br> | <br><a name="SEC22" href="#TOC1">RECURSIVE PATTERNS</a><br> |
<P> |
<P> |
Consider the problem of matching a string in parentheses, allowing for |
Consider the problem of matching a string in parentheses, allowing for |
unlimited nested parentheses. Without the use of recursion, the best that can |
unlimited nested parentheses. Without the use of recursion, the best that can |
Line 2422 now match "b" and so the whole match succeeds. In Perl
|
Line 2577 now match "b" and so the whole match succeeds. In Perl
|
match because inside the recursive call \1 cannot access the externally set |
match because inside the recursive call \1 cannot access the externally set |
value. |
value. |
<a name="subpatternsassubroutines"></a></P> |
<a name="subpatternsassubroutines"></a></P> |
<br><a name="SEC22" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br> | <br><a name="SEC23" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br> |
<P> |
<P> |
If the syntax for a recursive subpattern call (either by number or by |
If the syntax for a recursive subpattern call (either by number or by |
name) is used outside the parentheses to which it refers, it operates like a |
name) is used outside the parentheses to which it refers, it operates like a |
Line 2463 different calls. For example, consider this pattern:
|
Line 2618 different calls. For example, consider this pattern:
|
It matches "abcabc". It does not match "abcABC" because the change of |
It matches "abcabc". It does not match "abcABC" because the change of |
processing option does not affect the called subpattern. |
processing option does not affect the called subpattern. |
<a name="onigurumasubroutines"></a></P> |
<a name="onigurumasubroutines"></a></P> |
<br><a name="SEC23" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br> | <br><a name="SEC24" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br> |
<P> |
<P> |
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or |
For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or |
a number enclosed either in angle brackets or single quotes, is an alternative |
a number enclosed either in angle brackets or single quotes, is an alternative |
Line 2481 plus or a minus sign it is taken as a relative referen
|
Line 2636 plus or a minus sign it is taken as a relative referen
|
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i> |
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are <i>not</i> |
synonymous. The former is a back reference; the latter is a subroutine call. |
synonymous. The former is a back reference; the latter is a subroutine call. |
</P> |
</P> |
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br> | <br><a name="SEC25" href="#TOC1">CALLOUTS</a><br> |
<P> |
<P> |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl |
code to be obeyed in the middle of matching a regular expression. This makes it |
code to be obeyed in the middle of matching a regular expression. This makes it |
Line 2491 same pair of parentheses when there is a repetition.
|
Line 2646 same pair of parentheses when there is a repetition.
|
<P> |
<P> |
PCRE provides a similar feature, but of course it cannot obey arbitrary Perl |
PCRE provides a similar feature, but of course it cannot obey arbitrary Perl |
code. The feature is called "callout". The caller of PCRE provides an external |
code. The feature is called "callout". The caller of PCRE provides an external |
function by putting its entry point in the global variable <i>pcre_callout</i>. | function by putting its entry point in the global variable <i>pcre_callout</i> |
| (8-bit library) or <i>pcre[16|32]_callout</i> (16-bit or 32-bit library). |
By default, this variable contains NULL, which disables all calling out. |
By default, this variable contains NULL, which disables all calling out. |
</P> |
</P> |
<P> |
<P> |
Line 2502 For example, this pattern has two callout points:
|
Line 2658 For example, this pattern has two callout points:
|
<pre> |
<pre> |
(?C1)abc(?C2)def |
(?C1)abc(?C2)def |
</pre> |
</pre> |
If the PCRE_AUTO_CALLOUT flag is passed to <b>pcre_compile()</b>, callouts are | If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, callouts are |
automatically installed before each item in the pattern. They are all numbered |
automatically installed before each item in the pattern. They are all numbered |
255. | 255. If there is a conditional group in the pattern whose condition is an |
| assertion, an additional callout is inserted just before the condition. An |
| explicit callout may also be set at this position, as in this example: |
| <pre> |
| (?(?C9)(?=a)abc|def) |
| </pre> |
| Note that this applies only to assertion conditions, not to other types of |
| condition. |
</P> |
</P> |
<P> |
<P> |
During matching, when PCRE reaches a callout point (and <i>pcre_callout</i> is | During matching, when PCRE reaches a callout point, the external function is |
set), the external function is called. It is provided with the number of the | called. It is provided with the number of the callout, the position in the |
callout, the position in the pattern, and, optionally, one item of data | pattern, and, optionally, one item of data originally supplied by the caller of |
originally supplied by the caller of <b>pcre_exec()</b>. The callout function | the matching function. The callout function may cause matching to proceed, to |
may cause matching to proceed, to backtrack, or to fail altogether. A complete | backtrack, or to fail altogether. A complete description of the interface to |
description of the interface to the callout function is given in the | the callout function is given in the |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
documentation. |
documentation. |
<a name="backtrackcontrol"></a></P> |
<a name="backtrackcontrol"></a></P> |
<br><a name="SEC25" href="#TOC1">BACKTRACKING CONTROL</a><br> | <br><a name="SEC26" href="#TOC1">BACKTRACKING CONTROL</a><br> |
<P> |
<P> |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which |
are described in the Perl documentation as "experimental and subject to change | are still described in the Perl documentation as "experimental and subject to |
or removal in a future version of Perl". It goes on to say: "Their usage in | change or removal in a future version of Perl". It goes on to say: "Their usage |
production code should be noted to avoid problems during upgrades." The same | in production code should be noted to avoid problems during upgrades." The same |
remarks apply to the PCRE features described in this section. |
remarks apply to the PCRE features described in this section. |
</P> |
</P> |
<P> |
<P> |
Since these verbs are specifically related to backtracking, most of them can be | The new verbs make use of what was previously invalid syntax: an opening |
used only when the pattern is to be matched using <b>pcre_exec()</b>, which uses | parenthesis followed by an asterisk. They are generally of the form |
a backtracking algorithm. With the exception of (*FAIL), which behaves like a | (*VERB) or (*VERB:NAME). Some may take either form, possibly behaving |
failing negative assertion, they cause an error if encountered by | differently depending on whether or not a name is present. A name is any |
<b>pcre_dfa_exec()</b>. | sequence of characters that does not include a closing parenthesis. The maximum |
| length of a name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit |
| libraries. If the name is empty, that is, if the closing parenthesis |
| immediately follows the colon, the effect is as if the colon were not there. |
| Any number of these verbs may occur in a pattern. |
</P> |
</P> |
<P> |
<P> |
If any of these verbs are used in an assertion or in a subpattern that is | Since these verbs are specifically related to backtracking, most of them can be |
called as a subroutine (whether or not recursively), their effect is confined | used only when the pattern is to be matched using one of the traditional |
to that subpattern; it does not extend to the surrounding pattern, with one | matching functions, because these use a backtracking algorithm. With the |
exception: the name from a (*MARK), (*PRUNE), or (*THEN) that is encountered in | exception of (*FAIL), which behaves like a failing negative assertion, the |
a successful positive assertion <i>is</i> passed back when a match succeeds | backtracking control verbs cause an error if encountered by a DFA matching |
(compare capturing parentheses in assertions). Note that such subpatterns are | function. |
processed as anchored at the point where they are tested. Note also that Perl's | |
treatment of subroutines is different in some cases. | |
</P> |
</P> |
<P> |
<P> |
The new verbs make use of what was previously invalid syntax: an opening | The behaviour of these verbs in |
parenthesis followed by an asterisk. They are generally of the form | <a href="#btrepeat">repeated groups,</a> |
(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, | <a href="#btassert">assertions,</a> |
depending on whether or not an argument is present. A name is any sequence of | and in |
characters that does not include a closing parenthesis. If the name is empty, | <a href="#btsub">subpatterns called as subroutines</a> |
that is, if the closing parenthesis immediately follows the colon, the effect | (whether or not recursively) is documented below. |
is as if the colon were not there. Any number of these verbs may occur in a | <a name="nooptimize"></a></P> |
pattern. | <br><b> |
</P> | Optimizations that affect backtracking verbs |
| </b><br> |
<P> |
<P> |
PCRE contains some optimizations that are used to speed up matching by running |
PCRE contains some optimizations that are used to speed up matching by running |
some checks at the start of each match attempt. For example, it may know the |
some checks at the start of each match attempt. For example, it may know the |
minimum length of matching subject, or that a particular character must be |
minimum length of matching subject, or that a particular character must be |
present. When one of these optimizations suppresses the running of a match, any | present. When one of these optimizations bypasses the running of a match, any |
included backtracking verbs will not, of course, be processed. You can suppress |
included backtracking verbs will not, of course, be processed. You can suppress |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the |
when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the |
pattern with (*NO_START_OPT). | pattern with (*NO_START_OPT). There is more discussion of this option in the |
| section entitled |
| <a href="pcreapi.html#execoptions">"Option bits for <b>pcre_exec()</b>"</a> |
| in the |
| <a href="pcreapi.html"><b>pcreapi</b></a> |
| documentation. |
</P> |
</P> |
<P> |
<P> |
Experiments with Perl suggest that it too has similar optimizations, sometimes |
Experiments with Perl suggest that it too has similar optimizations, sometimes |
Line 2577 followed by a name.
|
Line 2748 followed by a name.
|
This verb causes the match to end successfully, skipping the remainder of the |
This verb causes the match to end successfully, skipping the remainder of the |
pattern. However, when it is inside a subpattern that is called as a |
pattern. However, when it is inside a subpattern that is called as a |
subroutine, only that subpattern is ended successfully. Matching then continues |
subroutine, only that subpattern is ended successfully. Matching then continues |
at the outer level. If (*ACCEPT) is inside capturing parentheses, the data so | at the outer level. If (*ACCEPT) is triggered in a positive assertion, the |
far is captured. For example: | assertion succeeds; in a negative assertion, the assertion fails. |
| </P> |
| <P> |
| If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For |
| example: |
<pre> |
<pre> |
A((?:A|B(*ACCEPT)|C)D) |
A((?:A|B(*ACCEPT)|C)D) |
</pre> |
</pre> |
Line 2612 A name is always required with this verb. There may be
|
Line 2787 A name is always required with this verb. There may be
|
(*MARK) as you like in a pattern, and their names do not have to be unique. |
(*MARK) as you like in a pattern, and their names do not have to be unique. |
</P> |
</P> |
<P> |
<P> |
When a match succeeds, the name of the last-encountered (*MARK) on the matching | When a match succeeds, the name of the last-encountered (*MARK:NAME), |
path is passed back to the caller via the <i>pcre_extra</i> data structure, as | (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to the |
described in the | caller as described in the section entitled |
<a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a> | <a href="pcreapi.html#extradata">"Extra data for <b>pcre_exec()</b>"</a> |
in the |
in the |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
documentation. Here is an example of <b>pcretest</b> output, where the /K |
documentation. Here is an example of <b>pcretest</b> output, where the /K |
Line 2635 of obtaining this information than putting each altern
|
Line 2810 of obtaining this information than putting each altern
|
capturing parentheses. |
capturing parentheses. |
</P> |
</P> |
<P> |
<P> |
If (*MARK) is encountered in a positive assertion, its name is recorded and | If a verb with a name is encountered in a positive assertion that is true, the |
passed back if it is the last-encountered. This does not happen for negative | name is recorded and passed back if it is the last-encountered. This does not |
assertions. | happen for negative assertions or failing positive assertions. |
</P> |
</P> |
<P> |
<P> |
After a partial match or a failed match, the name of the last encountered | After a partial match or a failed match, the last encountered name in the |
(*MARK) in the entire match process is returned. For example: | entire match process is returned. For example: |
<pre> |
<pre> |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
data> XP |
data> XP |
No match, mark = B |
No match, mark = B |
</pre> |
</pre> |
Note that in this unanchored example the mark is retained from the match |
Note that in this unanchored example the mark is retained from the match |
attempt that started at the letter "X". Subsequent match attempts starting at | attempt that started at the letter "X" in the subject. Subsequent match |
"P" and then with an empty string do not get as far as the (*MARK) item, but | attempts starting at "P" and then with an empty string do not get as far as the |
nevertheless do not reset it. | (*MARK) item, but nevertheless do not reset it. |
</P> |
</P> |
|
<P> |
|
If you are interested in (*MARK) values after failed matches, you should |
|
probably set the PCRE_NO_START_OPTIMIZE option |
|
<a href="#nooptimize">(see above)</a> |
|
to ensure that the match is always attempted. |
|
</P> |
<br><b> |
<br><b> |
Verbs that act after backtracking |
Verbs that act after backtracking |
</b><br> |
</b><br> |
Line 2659 Verbs that act after backtracking
|
Line 2840 Verbs that act after backtracking
|
The following verbs do nothing when they are encountered. Matching continues |
The following verbs do nothing when they are encountered. Matching continues |
with what follows, but if there is no subsequent match, causing a backtrack to |
with what follows, but if there is no subsequent match, causing a backtrack to |
the verb, a failure is forced. That is, backtracking cannot pass to the left of |
the verb, a failure is forced. That is, backtracking cannot pass to the left of |
the verb. However, when one of these verbs appears inside an atomic group, its | the verb. However, when one of these verbs appears inside an atomic group or an |
effect is confined to that group, because once the group has been matched, | assertion that is true, its effect is confined to that group, because once the |
there is never any backtracking into it. In this situation, backtracking can | group has been matched, there is never any backtracking into it. In this |
"jump back" to the left of the entire atomic group. (Remember also, as stated | situation, backtracking can "jump back" to the left of the entire atomic group |
above, that this localization also applies in subroutine calls and assertions.) | or assertion. (Remember also, as stated above, that this localization also |
| applies in subroutine calls.) |
</P> |
</P> |
<P> |
<P> |
These verbs differ in exactly what kind of failure occurs when backtracking |
These verbs differ in exactly what kind of failure occurs when backtracking |
reaches them. | reaches them. The behaviour described below is what happens when the verb is |
| not in a subroutine or an assertion. Subsequent sections cover these special |
| cases. |
<pre> |
<pre> |
(*COMMIT) |
(*COMMIT) |
</pre> |
</pre> |
This verb, which may not be followed by a name, causes the whole match to fail |
This verb, which may not be followed by a name, causes the whole match to fail |
outright if the rest of the pattern does not match. Even if the pattern is | outright if there is a later matching failure that causes backtracking to reach |
unanchored, no further attempts to find a match by advancing the starting point | it. Even if the pattern is unanchored, no further attempts to find a match by |
take place. Once (*COMMIT) has been passed, <b>pcre_exec()</b> is committed to | advancing the starting point take place. If (*COMMIT) is the only backtracking |
finding a match at the current starting point, or not at all. For example: | verb that is encountered, once it has been passed <b>pcre_exec()</b> is |
| committed to finding a match at the current starting point, or not at all. For |
| example: |
<pre> |
<pre> |
a+(*COMMIT)b |
a+(*COMMIT)b |
</pre> |
</pre> |
Line 2685 recently passed (*MARK) in the path is passed back whe
|
Line 2871 recently passed (*MARK) in the path is passed back whe
|
match failure. |
match failure. |
</P> |
</P> |
<P> |
<P> |
|
If there is more than one backtracking verb in a pattern, a different one that |
|
follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a |
|
match does not always guarantee that a match must be at this starting point. |
|
</P> |
|
<P> |
Note that (*COMMIT) at the start of a pattern is not the same as an anchor, |
Note that (*COMMIT) at the start of a pattern is not the same as an anchor, |
unless PCRE's start-of-match optimizations are turned off, as shown in this |
unless PCRE's start-of-match optimizations are turned off, as shown in this |
<b>pcretest</b> example: |
<b>pcretest</b> example: |
Line 2704 starting points.
|
Line 2895 starting points.
|
(*PRUNE) or (*PRUNE:NAME) |
(*PRUNE) or (*PRUNE:NAME) |
</pre> |
</pre> |
This verb causes the match to fail at the current starting position in the |
This verb causes the match to fail at the current starting position in the |
subject if the rest of the pattern does not match. If the pattern is | subject if there is a later matching failure that causes backtracking to reach |
unanchored, the normal "bumpalong" advance to the next starting character then | it. If the pattern is unanchored, the normal "bumpalong" advance to the next |
happens. Backtracking can occur as usual to the left of (*PRUNE), before it is | starting character then happens. Backtracking can occur as usual to the left of |
reached, or when matching to the right of (*PRUNE), but if there is no match to | (*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but |
the right, backtracking cannot cross (*PRUNE). In simple cases, the use of | if there is no match to the right, backtracking cannot cross (*PRUNE). In |
(*PRUNE) is just an alternative to an atomic group or possessive quantifier, | simple cases, the use of (*PRUNE) is just an alternative to an atomic group or |
but there are some uses of (*PRUNE) that cannot be expressed in any other way. | possessive quantifier, but there are some uses of (*PRUNE) that cannot be |
The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an | expressed in any other way. In an anchored pattern (*PRUNE) has the same effect |
anchored pattern (*PRUNE) has the same effect as (*COMMIT). | as (*COMMIT). |
| </P> |
| <P> |
| The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). |
| It is like (*MARK:NAME) in that the name is remembered for passing back to the |
| caller. However, (*SKIP:NAME) searches only for names set with (*MARK). |
<pre> |
<pre> |
(*SKIP) |
(*SKIP) |
</pre> |
</pre> |
Line 2733 instead of skipping on to "c".
|
Line 2929 instead of skipping on to "c".
|
<pre> |
<pre> |
(*SKIP:NAME) |
(*SKIP:NAME) |
</pre> |
</pre> |
When (*SKIP) has an associated name, its behaviour is modified. If the | When (*SKIP) has an associated name, its behaviour is modified. When it is |
following pattern fails to match, the previous path through the pattern is | triggered, the previous path through the pattern is searched for the most |
searched for the most recent (*MARK) that has the same name. If one is found, | recent (*MARK) that has the same name. If one is found, the "bumpalong" advance |
the "bumpalong" advance is to the subject position that corresponds to that | is to the subject position that corresponds to that (*MARK) instead of to where |
(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a | (*SKIP) was encountered. If no (*MARK) with a matching name is found, the |
matching name is found, the (*SKIP) is ignored. | (*SKIP) is ignored. |
| </P> |
| <P> |
| Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores |
| names that are set by (*PRUNE:NAME) or (*THEN:NAME). |
<pre> |
<pre> |
(*THEN) or (*THEN:NAME) |
(*THEN) or (*THEN:NAME) |
</pre> |
</pre> |
This verb causes a skip to the next innermost alternative if the rest of the | This verb causes a skip to the next innermost alternative when backtracking |
pattern does not match. That is, it cancels pending backtracking, but only | reaches it. That is, it cancels any further backtracking within the current |
within the current alternative. Its name comes from the observation that it can | alternative. Its name comes from the observation that it can be used for a |
be used for a pattern-based if-then-else block: | pattern-based if-then-else block: |
<pre> |
<pre> |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
</pre> |
</pre> |
If the COND1 pattern matches, FOO is tried (and possibly further items after |
If the COND1 pattern matches, FOO is tried (and possibly further items after |
the end of the group if FOO succeeds); on failure, the matcher skips to the |
the end of the group if FOO succeeds); on failure, the matcher skips to the |
second alternative and tries COND2, without backtracking into COND1. The | second alternative and tries COND2, without backtracking into COND1. If that |
behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN). | succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no |
If (*THEN) is not inside an alternation, it acts like (*PRUNE). | more alternatives, so there is a backtrack to whatever came before the entire |
| group. If (*THEN) is not inside an alternation, it acts like (*PRUNE). |
</P> |
</P> |
<P> |
<P> |
Note that a subpattern that does not contain a | character is just a part of | The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). |
the enclosing alternative; it is not a nested alternation with only one | It is like (*MARK:NAME) in that the name is remembered for passing back to the |
| caller. However, (*SKIP:NAME) searches only for names set with (*MARK). |
| </P> |
| <P> |
| A subpattern that does not contain a | character is just a part of the |
| enclosing alternative; it is not a nested alternation with only one |
alternative. The effect of (*THEN) extends beyond such a subpattern to the |
alternative. The effect of (*THEN) extends beyond such a subpattern to the |
enclosing alternative. Consider this pattern, where A, B, etc. are complex |
enclosing alternative. Consider this pattern, where A, B, etc. are complex |
pattern fragments that do not contain any | characters at this level: |
pattern fragments that do not contain any | characters at this level: |
Line 2777 because there are no more alternatives to try. In this
|
Line 2983 because there are no more alternatives to try. In this
|
backtrack into A. |
backtrack into A. |
</P> |
</P> |
<P> |
<P> |
Note also that a conditional subpattern is not considered as having two | Note that a conditional subpattern is not considered as having two |
alternatives, because only one is ever used. In other words, the | character in |
alternatives, because only one is ever used. In other words, the | character in |
a conditional subpattern has a different meaning. Ignoring white space, |
a conditional subpattern has a different meaning. Ignoring white space, |
consider: |
consider: |
Line 2801 unanchored pattern). (*SKIP) is similar, except that t
|
Line 3007 unanchored pattern). (*SKIP) is similar, except that t
|
than one character. (*COMMIT) is the strongest, causing the entire match to |
than one character. (*COMMIT) is the strongest, causing the entire match to |
fail. |
fail. |
</P> |
</P> |
|
<br><b> |
|
More than one backtracking verb |
|
</b><br> |
<P> |
<P> |
If more than one such verb is present in a pattern, the "strongest" one wins. | If more than one backtracking verb is present in a pattern, the one that is |
For example, consider this pattern, where A, B, etc. are complex pattern | backtracked onto first acts. For example, consider this pattern, where A, B, |
fragments: | etc. are complex pattern fragments: |
<pre> |
<pre> |
(A(*COMMIT)B(*THEN)C|D) | (A(*COMMIT)B(*THEN)C|ABD) |
</pre> |
</pre> |
Once A has matched, PCRE is committed to this match, at the current starting | If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to |
position. If subsequently B matches, but C does not, the normal (*THEN) action | fail. However, if A and B match, but C fails, the backtrack to (*THEN) causes |
of trying the next alternative (that is, D) does not happen because (*COMMIT) | the next alternative (ABD) to be tried. This behaviour is consistent, but is |
overrides. | not always the same as Perl's. It means that if two or more backtracking verbs |
| appear in succession, all but the last of them has no effect. Consider this |
| example: |
| <pre> |
| ...(*COMMIT)(*PRUNE)... |
| </pre> |
| If there is a matching failure to the right, backtracking onto (*PRUNE) causes |
| it to be triggered, and its action is taken. There can never be a backtrack |
| onto (*COMMIT). |
| <a name="btrepeat"></a></P> |
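<P>
The rule that the verb backtracked onto first is the one that acts can be
illustrated with a toy model. The Python sketch below is not PCRE and calls no
PCRE API. It is a hypothetical, minimal backtracking matcher (the names
match_here, search, Commit and Prune are invented for this illustration) that
understands only literal strings, a greedy single-character "x+" item, and the
(*COMMIT) and (*PRUNE) verbs. Each verb is modelled as an exception that is
raised when a failure to its right backtracks onto it.
</P>

```python
class Commit(Exception):
    """Backtracking reached (*COMMIT): abandon the whole match."""


class Prune(Exception):
    """Backtracking reached (*PRUNE): abandon this starting position only."""


# Hypothetical token spelling for this toy model, not a PCRE data structure.
VERBS = {"(*COMMIT)": Commit, "(*PRUNE)": Prune}


def match_here(items, subject, pos):
    """Return the end offset of a match of `items` at `pos`, or None."""
    if not items:
        return pos
    head, rest = items[0], items[1:]
    if head in VERBS:
        # A verb does nothing when first encountered ...
        end = match_here(rest, subject, pos)
        if end is None:
            # ... but a failure to its right backtracks onto it and triggers it.
            raise VERBS[head]()
        return end
    if head.endswith("+"):
        ch = head[0]
        count = 0
        while pos + count < len(subject) and subject[pos + count] == ch:
            count += 1
        # Greedy repeat: try the longest run first, then give characters back.
        for n in range(count, 0, -1):
            end = match_here(rest, subject, pos + n)
            if end is not None:
                return end
        return None
    if subject.startswith(head, pos):
        return match_here(rest, subject, pos + len(head))
    return None


def search(items, subject):
    """Unanchored search with the usual bumpalong; returns (start, end) or None."""
    for start in range(len(subject) + 1):
        try:
            end = match_here(items, subject, start)
        except Prune:
            continue          # fail at this start only, then bump along
        except Commit:
            return None       # no further starting points are tried
        if end is not None:
            return (start, end)
    return None


print(search(["a+", "(*COMMIT)", "b"], "xxaab"))              # (2, 5)
print(search(["a+", "(*COMMIT)", "b"], "aacaab"))             # None
print(search(["a+", "(*PRUNE)", "b"], "aacaab"))              # (3, 6)
print(search(["a+", "(*COMMIT)", "(*PRUNE)", "b"], "aacaab")) # (3, 6)
```

<P>
In the last call the (*PRUNE) exception is raised first and passes straight
through the (*COMMIT) frame, so in this model (*COMMIT) never acts, which
mirrors the ...(*COMMIT)(*PRUNE)... case described above.
</P>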
| <br><b> |
| Backtracking verbs in repeated groups |
| </b><br> |
| <P> |
| PCRE differs from Perl in its handling of backtracking verbs in repeated |
| groups. For example, consider: |
| <pre> |
| /(a(*COMMIT)b)+ac/ |
| </pre> |
| If the subject is "abac", Perl matches, but PCRE fails because the (*COMMIT) in |
| the second repeat of the group acts. |
| <a name="btassert"></a></P> |
| <br><b> |
| Backtracking verbs in assertions |
| </b><br> |
| <P> |
| (*FAIL) in an assertion has its normal effect: it forces an immediate backtrack. |
</P> |
</P> |
<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br> |
|
<P> |
<P> |
|
(*ACCEPT) in a positive assertion causes the assertion to succeed without any |
|
further processing. In a negative assertion, (*ACCEPT) causes the assertion to |
|
fail without any further processing. |
|
</P> |
|
<P> |
|
The other backtracking verbs are not treated specially if they appear in a |
|
positive assertion. In particular, (*THEN) skips to the next alternative in the |
|
innermost enclosing group that has alternations, whether or not this is within |
|
the assertion. |
|
</P> |
|
<P> |
|
Negative assertions are, however, different, in order to ensure that changing a |
|
positive assertion into a negative assertion changes its result. Backtracking |
|
into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative assertion to be true, |
|
without considering any further alternative branches in the assertion. |
|
Backtracking into (*THEN) causes it to skip to the next enclosing alternative |
|
within the assertion (the normal behaviour), but if the assertion does not have |
|
such an alternative, (*THEN) behaves like (*PRUNE). |
|
<a name="btsub"></a></P> |
|
<br><b> |
|
Backtracking verbs in subroutines |
|
</b><br> |
|
<P> |
|
These behaviours occur whether or not the subpattern is called recursively. |
|
Perl's treatment of subroutines is different in some cases. |
|
</P> |
|
<P> |
|
(*FAIL) in a subpattern called as a subroutine has its normal effect: it forces |
|
an immediate backtrack. |
|
</P> |
|
<P> |
|
(*ACCEPT) in a subpattern called as a subroutine causes the subroutine match to |
|
succeed without any further processing. Matching then continues after the |
|
subroutine call. |
|
</P> |
|
<P> |
|
(*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine cause |
|
the subroutine match to fail. |
|
</P> |
|
<P> |
|
(*THEN) skips to the next alternative in the innermost enclosing group within |
|
the subpattern that has alternatives. If there is no such group within the |
|
subpattern, (*THEN) causes the subroutine match to fail. |
|
</P> |
|
<br><a name="SEC27" href="#TOC1">SEE ALSO</a><br> |
|
<P> |
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), |
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), |
<b>pcresyntax</b>(3), <b>pcre</b>(3). | <b>pcresyntax</b>(3), <b>pcre</b>(3), <b>pcre16</b>(3), <b>pcre32</b>(3). |
</P> |
</P> |
<br><a name="SEC27" href="#TOC1">AUTHOR</a><br> | <br><a name="SEC28" href="#TOC1">AUTHOR</a><br> |
<P> |
<P> |
Philip Hazel |
Philip Hazel |
<br> |
<br> |
Line 2827 University Computing Service
|
Line 3107 University Computing Service
|
Cambridge CB2 3QH, England. |
Cambridge CB2 3QH, England. |
<br> |
<br> |
</P> |
</P> |
<br><a name="SEC28" href="#TOC1">REVISION</a><br> | <br><a name="SEC29" href="#TOC1">REVISION</a><br> |
<P> |
<P> |
Last updated: 29 November 2011 | Last updated: 26 April 2013 |
<br> |
<br> |
Copyright © 1997-2011 University of Cambridge. | Copyright © 1997-2013 University of Cambridge. |
<br> |
<br> |
<p> |
<p> |
Return to the <a href="index.html">PCRE index page</a>. |
Return to the <a href="index.html">PCRE index page</a>. |