version 1.1.1.3, 2012/10/09 09:19:17
|
version 1.1.1.4, 2013/07/22 08:25:57
|
Line 1
|
Line 1
|
.TH PCREPATTERN 3 "04 May 2012" "PCRE 8.31" | .TH PCREPATTERN 3 "26 April 2013" "PCRE 8.33" |
.SH NAME |
.SH NAME |
PCRE - Perl-compatible regular expressions |
PCRE - Perl-compatible regular expressions |
.SH "PCRE REGULAR EXPRESSION DETAILS" |
.SH "PCRE REGULAR EXPRESSION DETAILS" |
Line 20 have copious examples. Jeffrey Friedl's "Mastering Reg
|
Line 20 have copious examples. Jeffrey Friedl's "Mastering Reg
|
published by O'Reilly, covers regular expressions in great detail. This |
published by O'Reilly, covers regular expressions in great detail. This |
description of PCRE's regular expressions is intended as reference material. |
description of PCRE's regular expressions is intended as reference material. |
.P |
.P |
|
This document discusses the patterns that are supported by PCRE when one of its |
|
main matching functions, \fBpcre_exec()\fP (8-bit) or \fBpcre[16|32]_exec()\fP |
|
(16- or 32-bit), is used. PCRE also has alternative matching functions, |
|
\fBpcre_dfa_exec()\fP and \fBpcre[16|32]_dfa_exec()\fP, which match using a |
|
different algorithm that is not Perl-compatible. Some of the features discussed |
|
below are not available when DFA matching is used. The advantages and |
|
disadvantages of the alternative functions, and how they differ from the normal |
|
functions, are discussed in the |
|
.\" HREF |
|
\fBpcrematching\fP |
|
.\" |
|
page. |
|
. |
|
. |
|
.SH "SPECIAL START-OF-PATTERN ITEMS" |
|
.rs |
|
.sp |
|
A number of options that can be passed to \fBpcre_compile()\fP can also be set |
|
by special items at the start of a pattern. These are not Perl-compatible, but |
|
are provided to make these options accessible to pattern writers who are not |
|
able to change the program that processes the pattern. Any number of these |
|
items may appear, but they must all be together right at the start of the |
|
pattern string, and the letters must be in upper case. |
|
. |
|
. |
|
.SS "UTF support" |
|
.rs |
|
.sp |
The original operation of PCRE was on strings of one-byte characters. However, |
The original operation of PCRE was on strings of one-byte characters. However, |
there is now also support for UTF-8 strings in the original library, and a | there is now also support for UTF-8 strings in the original library, an |
second library that supports 16-bit and UTF-16 character strings. To use these | extra library that supports 16-bit and UTF-16 character strings, and a |
| third library that supports 32-bit and UTF-32 character strings. To use these |
features, PCRE must be built to include appropriate support. When using UTF |
features, PCRE must be built to include appropriate support. When using UTF |
strings you must either call the compiling function with the PCRE_UTF8 or | strings you must either call the compiling function with the PCRE_UTF8, |
PCRE_UTF16 option, or the pattern must start with one of these special | PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of |
sequences: | these special sequences: |
.sp |
.sp |
(*UTF8) |
(*UTF8) |
(*UTF16) |
(*UTF16) |
|
(*UTF32) |
|
(*UTF) |
.sp |
.sp |
|
(*UTF) is a generic sequence that can be used with any of the libraries. |
Starting a pattern with such a sequence is equivalent to setting the relevant |
Starting a pattern with such a sequence is equivalent to setting the relevant |
option. This feature is not Perl-compatible. How setting a UTF mode affects | option. How setting a UTF mode affects pattern matching is mentioned in several |
pattern matching is mentioned in several places below. There is also a summary | places below. There is also a summary of features in the |
of features in the | |
.\" HREF |
.\" HREF |
\fBpcreunicode\fP |
\fBpcreunicode\fP |
.\" |
.\" |
page. |
page. |
.P |
.P |
Another special sequence that may appear at the start of a pattern or in | Some applications that allow their users to supply patterns may wish to |
combination with (*UTF8) or (*UTF16) is: | restrict them to non-UTF data for security reasons. If the PCRE_NEVER_UTF |
| option is set at compile time, (*UTF) etc. are not allowed, and their |
| appearance causes an error. |
| . |
| . |
| .SS "Unicode property support" |
| .rs |
.sp |
.sp |
|
Another special sequence that may appear at the start of a pattern is |
|
.sp |
(*UCP) |
(*UCP) |
.sp |
.sp |
This has the same effect as setting the PCRE_UCP option: it causes sequences |
This has the same effect as setting the PCRE_UCP option: it causes sequences |
such as \ed and \ew to use Unicode properties to determine character types, |
such as \ed and \ew to use Unicode properties to determine character types, |
instead of recognizing only characters with codes less than 128 via a lookup |
instead of recognizing only characters with codes less than 128 via a lookup |
table. |
table. |
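The difference between table-based and property-based character types can be illustrated without PCRE itself. The sketch below uses Python's re module purely as a stand-in, which is an assumption of this example and not part of PCRE: its re.ASCII flag restricts \d to codes below 128, roughly like PCRE without PCRE_UCP, while its default for str patterns consults Unicode data, roughly like a pattern that starts with (*UCP).

```python
import re

# U+0663 ARABIC-INDIC DIGIT THREE: a decimal digit with a code above 127.
ARABIC_THREE = "\u0663"

def is_digit_unicode(ch):
    """Analogue of \\d with Unicode properties (Python's default for str)."""
    return re.fullmatch(r"\d", ch) is not None

def is_digit_ascii_only(ch):
    """Analogue of \\d limited to codes below 128 (Python's re.ASCII flag)."""
    return re.fullmatch(r"\d", ch, re.ASCII) is not None

print(is_digit_unicode(ARABIC_THREE))     # property-based lookup
print(is_digit_ascii_only(ARABIC_THREE))  # lookup limited to ASCII
```

The helper names are hypothetical. The point is only that the same escape sequence accepts a different set of characters depending on which mode is in force.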
.P | . |
| . |
| .SS "Disabling start-up optimizations" |
| .rs |
| .sp |
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the |
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the |
PCRE_NO_START_OPTIMIZE option either at compile or matching time. There are | PCRE_NO_START_OPTIMIZE option either at compile or matching time. |
also some more of these special sequences that are concerned with the handling | |
of newlines; they are described below. | |
.P | |
The remainder of this document discusses the patterns that are supported by | |
PCRE when one of its main matching functions, \fBpcre_exec()\fP (8-bit) or | |
\fBpcre16_exec()\fP (16-bit), is used. PCRE also has alternative matching | |
functions, \fBpcre_dfa_exec()\fP and \fBpcre16_dfa_exec()\fP, which match using | |
a different algorithm that is not Perl-compatible. Some of the features | |
discussed below are not available when DFA matching is used. The advantages and | |
disadvantages of the alternative functions, and how they differ from the normal | |
functions, are discussed in the | |
.\" HREF | |
\fBpcrematching\fP | |
.\" | |
page. | |
. |
. |
. |
. |
.\" HTML <a name="newlines"></a> |
.\" HTML <a name="newlines"></a> |
.SH "NEWLINE CONVENTIONS" | .SS "Newline conventions" |
.rs |
.rs |
.sp |
.sp |
PCRE supports five different conventions for indicating line breaks in |
PCRE supports five different conventions for indicating line breaks in |
Line 103 example, on a Unix system where LF is the default newl
|
Line 131 example, on a Unix system where LF is the default newl
|
(*CR)a.b |
(*CR)a.b |
.sp |
.sp |
changes the convention to CR. That pattern matches "a\enb" because LF is no |
changes the convention to CR. That pattern matches "a\enb" because LF is no |
longer a newline. Note that these special settings, which are not | longer a newline. If more than one of these settings is present, the last one |
Perl-compatible, are recognized only at the very start of a pattern, and that | |
they must be in upper case. If more than one of them is present, the last one | |
is used. |
is used. |
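The effect of a newline setting on the dot metacharacter can be modelled in a few lines. This is a simplified sketch, not PCRE code: it covers only the single-character conventions CR and LF, assumes PCRE_DOTALL is not set, and the names dot_matches and NEWLINE_CHAR are hypothetical.

```python
# Simplified model: only the single-character conventions are covered.
NEWLINE_CHAR = {"CR": "\r", "LF": "\n"}

def dot_matches(ch, settings, default="LF"):
    """Return True if an unanchored dot would match ch.

    settings lists the newline items found at the start of the
    pattern, in order; when several are present the last one wins.
    """
    convention = settings[-1] if settings else default
    return ch != NEWLINE_CHAR[convention]

# With (*CR) in force, LF is an ordinary character, so "a.b" can match "a\nb".
print(dot_matches("\n", ["CR"]))
```

With no setting and an LF default, the same call returns False, which is why the pattern needs (*CR) to match "a\enb" on a Unix system.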
.P |
.P |
The newline convention affects the interpretation of the dot metacharacter when | The newline convention affects where the circumflex and dollar assertions are |
PCRE_DOTALL is not set, and also the behaviour of \eN. However, it does not | true. It also affects the interpretation of the dot metacharacter when |
affect what the \eR escape sequence matches. By default, this is any Unicode | PCRE_DOTALL is not set, and the behaviour of \eN. However, it does not affect |
newline sequence, for Perl compatibility. However, this can be changed; see the | what the \eR escape sequence matches. By default, this is any Unicode newline |
| sequence, for Perl compatibility. However, this can be changed; see the |
description of \eR in the section entitled |
description of \eR in the section entitled |
.\" HTML <a href="#newlineseq"> |
.\" HTML <a href="#newlineseq"> |
.\" </a> |
.\" </a> |
Line 121 below. A change of \eR setting can be combined with a
|
Line 148 below. A change of \eR setting can be combined with a
|
convention. |
convention. |
. |
. |
. |
. |
|
.SS "Setting match and recursion limits" |
|
.rs |
|
.sp |
|
The caller of \fBpcre_exec()\fP can set a limit on the number of times the |
|
internal \fBmatch()\fP function is called and on the maximum depth of |
|
recursive calls. These facilities are provided to catch runaway matches that |
|
are provoked by patterns with huge matching trees (a typical example is a |
|
pattern with nested unlimited repeats) and to avoid running out of system stack |
|
by too much recursion. When one of these limits is reached, \fBpcre_exec()\fP |
|
gives an error return. The limits can also be set by items at the start of the |
|
pattern of the form |
|
.sp |
|
(*LIMIT_MATCH=d) |
|
(*LIMIT_RECURSION=d) |
|
.sp |
|
where d is any number of decimal digits. However, the value of the setting must |
|
be less than the value set by the caller of \fBpcre_exec()\fP for it to have |
|
any effect. In other words, the pattern writer can lower the limit set by the |
|
programmer, but not raise it. If there is more than one setting of one of these |
|
limits, the lower value is used. |
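The rule that a pattern can lower a limit but never raise it amounts to taking a minimum. A minimal sketch, assuming the caller's limit and the values of any (*LIMIT_MATCH=d) or (*LIMIT_RECURSION=d) items are already available as integers; effective_limit is a hypothetical helper, not a PCRE function.

```python
def effective_limit(caller_limit, pattern_settings):
    """Return the limit in force after applying the pattern's settings.

    A setting has an effect only if it is below the current limit, so
    values above the caller's limit are ignored and, among several
    settings, the lowest one wins.
    """
    limit = caller_limit
    for d in pattern_settings:
        if d < limit:
            limit = d
    return limit

print(effective_limit(1000, [500, 2000, 300]))
```

A single setting of 5000 against a caller limit of 1000 leaves the limit at 1000.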
|
. |
|
. |
|
.SH "EBCDIC CHARACTER CODES" |
|
.rs |
|
.sp |
|
PCRE can be compiled to run in an environment that uses EBCDIC as its character |
|
code rather than ASCII or Unicode (typically a mainframe system). In the |
|
sections below, character code values are ASCII or Unicode; in an EBCDIC |
|
environment these characters may have different code values, and there are no |
|
code points greater than 255. |
|
. |
|
. |
.SH "CHARACTERS AND METACHARACTERS" |
.SH "CHARACTERS AND METACHARACTERS" |
.rs |
.rs |
.sp |
.sp |
Line 246 one of the following escape sequences than the binary
|
Line 305 one of the following escape sequences than the binary
|
\ex{hhh..} character with hex code hhh.. (non-JavaScript mode) |
\ex{hhh..} character with hex code hhh.. (non-JavaScript mode) |
\euhhhh character with hex code hhhh (JavaScript mode only) |
\euhhhh character with hex code hhhh (JavaScript mode only) |
.sp |
.sp |
The precise effect of \ecx is as follows: if x is a lower case letter, it | The precise effect of \ecx on ASCII characters is as follows: if x is a lower |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. | case letter, it is converted to upper case. Then bit 6 of the character (hex |
Thus \ecz becomes hex 1A (z is 7A), but \ec{ becomes hex 3B ({ is 7B), while | 40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A), |
\ec; becomes hex 7B (; is 3B). If the byte following \ec has a value greater | but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the |
than 127, a compile-time error occurs. This locks out non-ASCII characters in | data item (byte or 16-bit value) following \ec has a value greater than 127, a |
all modes. (When PCRE is compiled in EBCDIC mode, all byte values are valid. A | compile-time error occurs. This locks out non-ASCII characters in all modes. |
lower case letter is converted to upper case, and then the 0xc0 bits are | |
flipped.) | |
.P |
.P |
|
The \ec facility was designed for use with ASCII characters, but with the |
|
extension to Unicode it is even less useful than it once was. It is, however, |
|
recognized when PCRE is compiled in EBCDIC mode, where data items are always |
|
bytes. In this mode, all values are valid after \ec. If the next character is a |
|
lower case letter, it is converted to upper case. Then the 0xc0 bits of the |
|
byte are inverted. Thus \ecA becomes hex 01, as in ASCII (A is C1), but because |
|
the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other |
|
characters also generate different values. |
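The two \ec computations described above can be written out as follows. This is a sketch of the arithmetic only, not PCRE source. The positions of the EBCDIC lower case letters (0x81 to 0x89, 0x91 to 0x99, 0xA2 to 0xA9, each 0x40 below its upper case form) are an assumption taken from the common EBCDIC layout, and both function names are hypothetical.

```python
def ctrl_ascii(ch):
    """Value of \\c followed by ch in an ASCII or Unicode build."""
    code = ord(ch)
    if code > 127:
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= ch <= "z":
        code = ord(ch.upper())   # lower case is converted to upper case
    return code ^ 0x40           # invert bit 6 (hex 40)

# Assumed EBCDIC positions of a-i, j-r and s-z.
EBCDIC_LOWER = set(range(0x81, 0x8A)) | set(range(0x91, 0x9A)) | set(range(0xA2, 0xAA))

def ctrl_ebcdic(byte):
    """Value of \\c followed by the EBCDIC byte in an EBCDIC build."""
    if byte in EBCDIC_LOWER:
        byte |= 0x40             # lower case to upper case in EBCDIC
    return byte ^ 0xC0           # invert the 0xc0 bits

print(hex(ctrl_ascii("z")), hex(ctrl_ascii("{")), hex(ctrl_ascii(";")))
print(hex(ctrl_ebcdic(0xC1)), hex(ctrl_ebcdic(0xE9)))
```

The ASCII results reproduce the values quoted in the text (1A, 3B and 7B), and the EBCDIC results reproduce 01 for A and 29 for Z.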
|
.P |
By default, after \ex, from zero to two hexadecimal digits are read (letters |
By default, after \ex, from zero to two hexadecimal digits are read (letters |
can be in upper or lower case). Any number of hexadecimal digits may appear |
can be in upper or lower case). Any number of hexadecimal digits may appear |
between \ex{ and }, but the character code is constrained as follows: |
between \ex{ and }, but the character code is constrained as follows: |
Line 263 between \ex{ and }, but the character code is constrai
|
Line 329 between \ex{ and }, but the character code is constrai
|
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint |
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint |
16-bit non-UTF mode less than 0x10000 |
16-bit non-UTF mode less than 0x10000 |
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint |
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint |
|
32-bit non-UTF mode less than 0x80000000 |
|
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint |
.sp |
.sp |
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called |
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called |
"surrogate" codepoints). | "surrogate" codepoints), and 0xffef. |
.P |
.P |
If characters other than hexadecimal digits appear between \ex{ and }, or if |
If characters other than hexadecimal digits appear between \ex{ and }, or if |
there is no terminating }, this form of escape is not recognized. Instead, the |
there is no terminating }, this form of escape is not recognized. Instead, the |
Line 313 subsequent digits stand for themselves. The value of t
|
Line 381 subsequent digits stand for themselves. The value of t
|
constrained in the same way as characters specified in hexadecimal. |
constrained in the same way as characters specified in hexadecimal. |
For example: |
For example: |
.sp |
.sp |
\e040 is another way of writing a space | \e040 is another way of writing an ASCII space |
.\" JOIN |
.\" JOIN |
\e40 is the same, provided there are fewer than 40 |
\e40 is the same, provided there are fewer than 40 |
previous capturing subpatterns |
previous capturing subpatterns |
Line 471 release 5.10. In contrast to the other sequences, whic
|
Line 539 release 5.10. In contrast to the other sequences, whic
|
characters by default, these always match certain high-valued codepoints, |
characters by default, these always match certain high-valued codepoints, |
whether or not PCRE_UCP is set. The horizontal space characters are: |
whether or not PCRE_UCP is set. The horizontal space characters are: |
.sp |
.sp |
U+0009 Horizontal tab | U+0009 Horizontal tab (HT) |
U+0020 Space |
U+0020 Space |
U+00A0 Non-break space |
U+00A0 Non-break space |
U+1680 Ogham space mark |
U+1680 Ogham space mark |
Line 493 whether or not PCRE_UCP is set. The horizontal space c
|
Line 561 whether or not PCRE_UCP is set. The horizontal space c
|
.sp |
.sp |
The vertical space characters are: |
The vertical space characters are: |
.sp |
.sp |
U+000A Linefeed | U+000A Linefeed (LF) |
U+000B Vertical tab | U+000B Vertical tab (VT) |
U+000C Form feed | U+000C Form feed (FF) |
U+000D Carriage return | U+000D Carriage return (CR) |
U+0085 Next line | U+0085 Next line (NEL) |
U+2028 Line separator |
U+2028 Line separator |
U+2029 Paragraph separator |
U+2029 Paragraph separator |
.sp |
.sp |
Line 551 change of newline convention; for example, a pattern c
|
Line 619 change of newline convention; for example, a pattern c
|
.sp |
.sp |
(*ANY)(*BSR_ANYCRLF) |
(*ANY)(*BSR_ANYCRLF) |
.sp |
.sp |
They can also be combined with the (*UTF8), (*UTF16), or (*UCP) special | They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) or |
sequences. Inside a character class, \eR is treated as an unrecognized escape | (*UCP) special sequences. Inside a character class, \eR is treated as an |
sequence, and so matches the letter "R" by default, but causes an error if | unrecognized escape sequence, and so matches the letter "R" by default, but |
PCRE_EXTRA is set. | causes an error if PCRE_EXTRA is set. |
. |
. |
. |
. |
.\" HTML <a name="uniextseq"></a> |
.\" HTML <a name="uniextseq"></a> |
Line 569 The extra escape sequences are:
|
Line 637 The extra escape sequences are:
|
.sp |
.sp |
\ep{\fIxx\fP} a character with the \fIxx\fP property |
\ep{\fIxx\fP} a character with the \fIxx\fP property |
\eP{\fIxx\fP} a character without the \fIxx\fP property |
\eP{\fIxx\fP} a character without the \fIxx\fP property |
\eX an extended Unicode sequence | \eX a Unicode extended grapheme cluster |
.sp |
.sp |
The property names represented by \fIxx\fP above are limited to the Unicode |
The property names represented by \fIxx\fP above are limited to the Unicode |
script names, the general category properties, "Any", which matches any |
script names, the general category properties, "Any", which matches any |
Line 762 a modifier or "other".
|
Line 830 a modifier or "other".
|
The Cs (Surrogate) property applies only to characters in the range U+D800 to |
The Cs (Surrogate) property applies only to characters in the range U+D800 to |
U+DFFF. Such characters are not valid in Unicode strings and so |
U+DFFF. Such characters are not valid in Unicode strings and so |
cannot be tested by PCRE, unless UTF validity checking has been turned off |
cannot be tested by PCRE, unless UTF validity checking has been turned off |
(see the discussion of PCRE_NO_UTF8_CHECK and PCRE_NO_UTF16_CHECK in the | (see the discussion of PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK and |
| PCRE_NO_UTF32_CHECK in the |
.\" HREF |
.\" HREF |
\fBpcreapi\fP |
\fBpcreapi\fP |
.\" |
.\" |
Line 777 Instead, this property is assumed for any code point t
|
Line 846 Instead, this property is assumed for any code point t
|
Unicode table. |
Unicode table. |
.P |
.P |
Specifying caseless matching does not affect these escape sequences. For |
Specifying caseless matching does not affect these escape sequences. For |
example, \ep{Lu} always matches only upper case letters. | example, \ep{Lu} always matches only upper case letters. This is different from |
| the behaviour of current versions of Perl. |
.P |
.P |
The \eX escape matches any number of Unicode characters that form an extended | Matching characters by Unicode property is not fast, because PCRE has to do a |
Unicode sequence. \eX is equivalent to | multistage table lookup in order to find a character's property. That is why |
| the traditional escape sequences such as \ed and \ew do not use Unicode |
| properties in PCRE by default, though you can make them do so by setting the |
| PCRE_UCP option or by starting the pattern with (*UCP). |
| . |
| . |
| .SS Extended grapheme clusters |
| .rs |
.sp |
.sp |
(?>\ePM\epM*) | The \eX escape matches any number of Unicode characters that form an "extended |
.sp | grapheme cluster", and treats the sequence as an atomic group |
That is, it matches a character without the "mark" property, followed by zero | |
or more characters with the "mark" property, and treats the sequence as an | |
atomic group | |
.\" HTML <a href="#atomicgroup"> |
.\" HTML <a href="#atomicgroup"> |
.\" </a> |
.\" </a> |
(see below). |
(see below). |
.\" |
.\" |
Characters with the "mark" property are typically accents that affect the | Up to and including release 8.31, PCRE matched an earlier, simpler definition |
preceding character. None of them have codepoints less than 256, so in | that was equivalent to |
8-bit non-UTF-8 mode \eX matches any one character. | .sp |
| (?>\ePM\epM*) |
| .sp |
| That is, it matched a character without the "mark" property, followed by zero |
| or more characters with the "mark" property. Characters with the "mark" |
| property are typically non-spacing accents that affect the preceding character. |
.P |
.P |
Note that recent versions of Perl have changed \eX to match what Unicode calls | This simple definition was extended in Unicode to include more complicated |
an "extended grapheme cluster", which has a more complicated definition. | kinds of composite character by giving each character a grapheme breaking |
| property, and creating rules that use these properties to define the boundaries |
| of extended grapheme clusters. In releases of PCRE later than 8.31, \eX matches |
| one of these clusters. |
.P |
.P |
Matching characters by Unicode property is not fast, because PCRE has to search | \eX always matches at least one character. Then it decides whether to add |
a structure that contains data for over fifteen thousand characters. That is | additional characters according to the following rules for ending a cluster: |
why the traditional escape sequences such as \ed and \ew do not use Unicode | .P |
properties in PCRE by default, though you can make them do so by setting the | 1. End at the end of the subject string. |
PCRE_UCP option or by starting the pattern with (*UCP). | .P |
| 2. Do not end between CR and LF; otherwise end after any control character. |
| .P |
| 3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters |
| are of five types: L, V, T, LV, and LVT. An L character may be followed by an |
| L, V, LV, or LVT character; an LV or V character may be followed by a V or T |
| character; an LVT or T character may be followed only by a T character. |
| .P |
| 4. Do not end before extending characters or spacing marks. Characters with |
| the "mark" property always have the "extend" grapheme breaking property. |
| .P |
| 5. Do not end after prepend characters. |
| .P |
| 6. Otherwise, end the cluster. |
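Rule 3 is essentially a small lookup table. The sketch below encodes only the Hangul successor relation stated in that rule. It is not a full grapheme cluster implementation, and the names HANGUL_FOLLOWERS and hangul_may_follow are hypothetical.

```python
# For each Hangul type, the types that may follow it without a break.
HANGUL_FOLLOWERS = {
    "L":   {"L", "V", "LV", "LVT"},
    "LV":  {"V", "T"},
    "V":   {"V", "T"},
    "LVT": {"T"},
    "T":   {"T"},
}

def hangul_may_follow(prev_type, next_type):
    """Return True if the cluster must not end between the two characters."""
    return next_type in HANGUL_FOLLOWERS.get(prev_type, set())

print(hangul_may_follow("L", "LV"))   # an L may be followed by an LV
print(hangul_may_follow("T", "V"))    # a T may be followed only by a T
```

A real matcher would first map each code point to its grapheme breaking property and apply the remaining rules in order. This table covers only the Hangul step.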
. |
. |
. |
. |
.\" HTML <a name="extraprops"></a> |
.\" HTML <a name="extraprops"></a> |
.SS PCRE's additional properties |
.SS PCRE's additional properties |
.rs |
.rs |
.sp |
.sp |
As well as the standard Unicode properties described in the previous | As well as the standard Unicode properties described above, PCRE supports four |
section, PCRE supports four more that make it possible to convert traditional | more that make it possible to convert traditional escape sequences such as \ew |
escape sequences such as \ew and \es and POSIX character classes to use Unicode | and \es and POSIX character classes to use Unicode properties. PCRE uses these |
properties. PCRE uses these non-standard, non-Perl properties internally when | non-standard, non-Perl properties internally when PCRE_UCP is set. However, |
PCRE_UCP is set. They are: | they may also be used explicitly. These properties are: |
.sp |
.sp |
Xan Any alphanumeric character |
Xan Any alphanumeric character |
Xps Any POSIX space character |
Xps Any POSIX space character |
Line 825 property. Xps matches the characters tab, linefeed, ve
|
Line 920 property. Xps matches the characters tab, linefeed, ve
|
carriage return, and any other character that has the Z (separator) property. |
carriage return, and any other character that has the Z (separator) property. |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
same characters as Xan, plus underscore. |
same characters as Xan, plus underscore. |
|
.P |
|
There is another non-standard property, Xuc, which matches any character that |
|
can be represented by a Universal Character Name in C++ and other programming |
|
languages. These are the characters $, @, ` (grave accent), and all characters |
|
with Unicode code points greater than or equal to U+00A0, except for the |
|
surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are |
|
excluded. (Universal Character Names are of the form \euHHHH or \eUHHHHHHHH |
|
where H is a hexadecimal digit. Note that the Xuc property does not match these |
|
sequences but the characters that they represent.) |
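The definition of Xuc can be expressed as a simple predicate on code points. This is a sketch based only on the description above. The function name is_xuc and the upper bound of U+10FFFF are assumptions of the example, not details taken from PCRE.

```python
def is_xuc(cp):
    """Return True if code point cp has the Xuc property as described."""
    if cp in (0x24, 0x40, 0x60):        # $, @ and ` (grave accent)
        return True
    if 0xD800 <= cp <= 0xDFFF:          # surrogates are excluded
        return False
    return 0xA0 <= cp <= 0x10FFFF       # everything from U+00A0 upwards

print(is_xuc(ord("$")), is_xuc(ord("A")), is_xuc(0x00E9))
```

Note that ordinary ASCII letters and digits fail the test, as the text points out.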
. |
. |
. |
. |
.\" HTML <a name="resetmatchstart"></a> |
.\" HTML <a name="resetmatchstart"></a> |
Line 930 regular expression.
|
Line 1034 regular expression.
|
.SH "CIRCUMFLEX AND DOLLAR" |
.SH "CIRCUMFLEX AND DOLLAR" |
.rs |
.rs |
.sp |
.sp |
|
The circumflex and dollar metacharacters are zero-width assertions. That is, |
|
they test for a particular condition being true without consuming any |
|
characters from the subject string. |
|
.P |
Outside a character class, in the default matching mode, the circumflex |
Outside a character class, in the default matching mode, the circumflex |
character is an assertion that is true only if the current matching point is | character is an assertion that is true only if the current matching point is at |
at the start of the subject string. If the \fIstartoffset\fP argument of | the start of the subject string. If the \fIstartoffset\fP argument of |
\fBpcre_exec()\fP is non-zero, circumflex can never match if the PCRE_MULTILINE |
\fBpcre_exec()\fP is non-zero, circumflex can never match if the PCRE_MULTILINE |
option is unset. Inside a character class, circumflex has an entirely different |
option is unset. Inside a character class, circumflex has an entirely different |
meaning |
meaning |
Line 949 constrained to match only at the start of the subject,
|
Line 1057 constrained to match only at the start of the subject,
|
"anchored" pattern. (There are also other constructs that can cause a pattern |
"anchored" pattern. (There are also other constructs that can cause a pattern |
to be anchored.) |
to be anchored.) |
.P |
.P |
A dollar character is an assertion that is true only if the current matching | The dollar character is an assertion that is true only if the current matching |
point is at the end of the subject string, or immediately before a newline | point is at the end of the subject string, or immediately before a newline at |
at the end of the string (by default). Dollar need not be the last character of | the end of the string (by default). Note, however, that it does not actually |
the pattern if a number of alternatives are involved, but it should be the last | match the newline. Dollar need not be the last character of the pattern if a |
item in any branch in which it appears. Dollar has no special meaning in a | number of alternatives are involved, but it should be the last item in any |
character class. | branch in which it appears. Dollar has no special meaning in a character class. |
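The default behaviour of dollar can be observed with any engine that shares it. The sketch below uses Python's re module as a stand-in, which is an assumption of this example and not PCRE's API: by default its $ is also true at the very end of the string or just before a final newline, and it does not consume that newline.

```python
import re

# Dollar is true before the final newline, but the newline is not
# part of the match: the match ends at offset 3, not 4.
m = re.search(r"abc$", "abc\n")
print(m.span())

# A newline that is not the last character does not satisfy dollar.
no_match = re.search(r"abc$", "abc\n\n")
print(no_match)

# Python's \Z is true only at the very end of the string.
end_only = re.search(r"abc\Z", "abc\n")
print(end_only)
```

In this stand-in, \Z plays the role of a dollar that matches only at the very end of the string.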
.P |
.P |
The meaning of dollar can be changed so that it matches only at the very end of |
The meaning of dollar can be changed so that it matches only at the very end of |
the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This |
the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This |
Line 1015 name; PCRE does not support this.
|
Line 1123 name; PCRE does not support this.
|
.sp |
.sp |
Outside a character class, the escape sequence \eC matches any one data unit, |
Outside a character class, the escape sequence \eC matches any one data unit, |
whether or not a UTF mode is set. In the 8-bit library, one data unit is one |
whether or not a UTF mode is set. In the 8-bit library, one data unit is one |
byte; in the 16-bit library it is a 16-bit unit. Unlike a dot, \eC always | byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is |
| a 32-bit unit. Unlike a dot, \eC always |
matches line-ending characters. The feature is provided in Perl in order to |
matches line-ending characters. The feature is provided in Perl in order to |
match individual bytes in UTF-8 mode, but it is unclear how it can usefully be |
match individual bytes in UTF-8 mode, but it is unclear how it can usefully be |
used. Because \eC breaks up characters into individual data units, matching one |
used. Because \eC breaks up characters into individual data units, matching one |
unit with \eC in a UTF mode means that the rest of the string may start with a |
unit with \eC in a UTF mode means that the rest of the string may start with a |
malformed UTF character. This has undefined results, because PCRE assumes that |
malformed UTF character. This has undefined results, because PCRE assumes that |
it is dealing with valid UTF strings (and by default it checks this at the |
it is dealing with valid UTF strings (and by default it checks this at the |
start of processing unless the PCRE_NO_UTF8_CHECK or PCRE_NO_UTF16_CHECK option | start of processing unless the PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or |
is used). | PCRE_NO_UTF32_CHECK option is used). |
.P |
.P |
PCRE does not allow \eC to appear in lookbehind assertions |
PCRE does not allow \eC to appear in lookbehind assertions |
.\" HTML <a href="#lookbehind"> |
.\" HTML <a href="#lookbehind"> |
Line 1082 circumflex is not an assertion; it still consumes a ch
|
Line 1191 circumflex is not an assertion; it still consumes a ch
|
string, and therefore it fails if the current pointer is at the end of the |
string, and therefore it fails if the current pointer is at the end of the |
string. |
string. |
.P |
.P |
In UTF-8 (UTF-16) mode, characters with values greater than 255 (0xffff) can be | In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255 (0xffff) |
included in a class as a literal string of data units, or by using the \ex{ | can be included in a class as a literal string of data units, or by using the |
escaping mechanism. | \ex{ escaping mechanism. |
.P |
.P |
When caseless matching is set, any letters in a class represent both their |
When caseless matching is set, any letters in a class represent both their |
upper case and lower case versions, so for example, a caseless [aeiou] matches |
upper case and lower case versions, so for example, a caseless [aeiou] matches |
Line 1297 the section entitled
|
Line 1406 the section entitled
|
.\" </a> |
.\" </a> |
"Newline sequences" |
"Newline sequences" |
.\" |
.\" |
above. There are also the (*UTF8), (*UTF16), and (*UCP) leading sequences that | above. There are also the (*UTF8), (*UTF16), (*UTF32), and (*UCP) leading |
can be used to set UTF and Unicode property modes; they are equivalent to | sequences that can be used to set UTF and Unicode property modes; they are |
setting the PCRE_UTF8, PCRE_UTF16, and the PCRE_UCP options, respectively. | equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP |
| options, respectively. The (*UTF) sequence is a generic version that can be |
| used with any of the libraries. However, the application can set the |
| PCRE_NEVER_UTF option, which locks out the use of the (*UTF) sequences. |
. |
. |
. |
. |
.\" HTML <a name="subpattern"></a> |
.\" HTML <a name="subpattern"></a> |
Line 1534 quantifier, but a literal string of four characters.
|
Line 1646 quantifier, but a literal string of four characters.
|
In UTF modes, quantifiers apply to characters rather than to individual data |
In UTF modes, quantifiers apply to characters rather than to individual data |
units. Thus, for example, \ex{100}{2} matches two characters, each of |
units. Thus, for example, \ex{100}{2} matches two characters, each of |
which is represented by a two-byte sequence in a UTF-8 string. Similarly, |
which is represented by a two-byte sequence in a UTF-8 string. Similarly, |
\eX{3} matches three Unicode extended sequences, each of which may be several | \eX{3} matches three Unicode extended grapheme clusters, each of which may be |
data units long (and they may be of different lengths). | several data units long (and they may be of different lengths). |
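The distinction between characters and data units is easy to check for the \ex{100} example. A minimal sketch that uses Python's built-in UTF-8 encoder to count data units; the helper name utf8_units is hypothetical.

```python
def utf8_units(text):
    """Number of 8-bit data units needed to hold text in UTF-8."""
    return len(text.encode("utf-8"))

ch = chr(0x100)               # the character written as \x{100}
print(utf8_units(ch))         # data units in one such character
print(utf8_units(ch * 2))     # data units matched by \x{100}{2}
print(len(ch * 2))            # characters matched by \x{100}{2}
```

The quantifier {2} therefore counts two characters, even though the subject holds four bytes.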
.P |
.P |
The quantifier {0} is permitted, causing the expression to behave as if the |
The quantifier {0} is permitted, causing the expression to behave as if the |
previous item and the quantifier were not present. This may be useful for |
previous item and the quantifier were not present. This may be useful for |
Line 1621 In cases where it is known that the subject string con
|
Line 1733 In cases where it is known that the subject string con
|
worth setting PCRE_DOTALL in order to obtain this optimization, or |
worth setting PCRE_DOTALL in order to obtain this optimization, or |
alternatively using ^ to indicate anchoring explicitly. |
alternatively using ^ to indicate anchoring explicitly. |
.P |
.P |
However, there is one situation where the optimization cannot be used. When .* | However, there are some cases where the optimization cannot be used. When .* |
is inside capturing parentheses that are the subject of a back reference |
is inside capturing parentheses that are the subject of a back reference |
elsewhere in the pattern, a match at the start may fail where a later one |
elsewhere in the pattern, a match at the start may fail where a later one |
succeeds. Consider, for example: |
succeeds. Consider, for example: |
Line 1631 succeeds. Consider, for example:
|
Line 1743 succeeds. Consider, for example:
|
If the subject is "xyz123abc123" the match point is the fourth character. For |
If the subject is "xyz123abc123" the match point is the fourth character. For |
this reason, such a pattern is not implicitly anchored. |
this reason, such a pattern is not implicitly anchored. |
.P |
.P |
|
Another case where implicit anchoring is not applied is when the leading .* is |
|
inside an atomic group. Once again, a match at the start may fail where a later |
|
one succeeds. Consider this pattern: |
|
.sp |
|
(?>.*?a)b |
|
.sp |
|
It matches "ab" in the subject "aab". The use of the backtracking control verbs |
|
(*PRUNE) and (*SKIP) also disables this optimization.
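The atomic-group case can be sketched as follows, using Python's standard re module as a stand-in for PCRE. Native (?>...) syntax is assumed to need Python 3.11 or later; on older versions the sketch falls back to the lookahead-plus-back-reference idiom, which is also atomic. The name find_match is hypothetical.

```python
import re
import sys

# Native atomic groups are assumed to need Python 3.11+. Older versions
# use a lookahead plus back reference, which cannot be backtracked into.
if sys.version_info >= (3, 11):
    ATOMIC = r"(?>.*?a)b"
else:
    ATOMIC = r"(?=(.*?a))\1b"


def find_match(subject):
    """Return (start offset, matched text) of the first match, or None."""
    m = re.search(ATOMIC, subject)
    return None if m is None else (m.start(), m.group(0))


print(find_match("aab"))
```

At offset 0 the group matches the first "a" and cannot be re-entered when "b" fails against the second "a", so the match is found only at offset 1. Implicit anchoring would therefore lose the match.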
|
.P |
When a capturing subpattern is repeated, the value captured is the substring |
When a capturing subpattern is repeated, the value captured is the substring |
that matched the final iteration. For example, after |
that matched the final iteration. For example, after |
.sp |
.sp |
Line 1899 except that it does not cause the current matching pos
|
Line 2020 except that it does not cause the current matching pos
|
Assertion subpatterns are not capturing subpatterns. If such an assertion |
Assertion subpatterns are not capturing subpatterns. If such an assertion |
contains capturing subpatterns within it, these are counted for the purposes of |
contains capturing subpatterns within it, these are counted for the purposes of |
numbering the capturing subpatterns in the whole pattern. However, substring |
numbering the capturing subpatterns in the whole pattern. However, substring |
capturing is carried out only for positive assertions, because it does not make | capturing is carried out only for positive assertions. (Perl sometimes, but not |
sense for negative assertions. | always, does do capturing in negative assertions.) |
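A small sketch of the numbering and capturing rules, using Python's standard re module as a stand-in; it happens to follow the same rule as the one described here for PCRE, but that is an observation about Python, not a statement about the PCRE API. The two patterns are illustrative only.

```python
import re

# Group 1 sits inside a positive lookahead, group 2 is an ordinary group.
positive = re.compile(r"(?=(\w+)@)\w+@(\w+)")
# Group 1 sits inside a negative lookahead, group 2 is an ordinary group.
negative = re.compile(r"(?!(\d))(\w)")

p = positive.match("user@host")
n = negative.match("a")

# Groups inside assertions are still counted when numbering.
print(positive.groups, negative.groups)
# The positive assertion sets its group; the negative one leaves it unset.
print(p.group(1), p.group(2))
print(n.group(1), n.group(2))
```

Both patterns report two capturing groups. Only the group inside the positive assertion holds a substring after the match.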
.P |
.P |
For compatibility with Perl, assertion subpatterns may be repeated; though |
For compatibility with Perl, assertion subpatterns may be repeated; though |
it makes no sense to assert the same thing several times, the side effect of |
it makes no sense to assert the same thing several times, the side effect of |
Line 2552 same pair of parentheses when there is a repetition.
|
Line 2673 same pair of parentheses when there is a repetition.
|
PCRE provides a similar feature, but of course it cannot obey arbitrary Perl |
PCRE provides a similar feature, but of course it cannot obey arbitrary Perl |
code. The feature is called "callout". The caller of PCRE provides an external |
code. The feature is called "callout". The caller of PCRE provides an external |
function by putting its entry point in the global variable \fIpcre_callout\fP |
function by putting its entry point in the global variable \fIpcre_callout\fP |
(8-bit library) or \fIpcre16_callout\fP (16-bit library). By default, this | (8-bit library) or \fIpcre[16|32]_callout\fP (16-bit or 32-bit library). |
variable contains NULL, which disables all calling out. | By default, this variable contains NULL, which disables all calling out. |
.P |
.P |
Within a regular expression, (?C) indicates the points at which the external |
Within a regular expression, (?C) indicates the points at which the external |
function is to be called. If you want to identify different callout points, you |
function is to be called. If you want to identify different callout points, you |
Line 2564 For example, this pattern has two callout points:
|
Line 2685 For example, this pattern has two callout points:
|
.sp |
.sp |
If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, callouts are |
If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, callouts are |
automatically installed before each item in the pattern. They are all numbered |
automatically installed before each item in the pattern. They are all numbered |
255. | 255. If there is a conditional group in the pattern whose condition is an |
| assertion, an additional callout is inserted just before the condition. An |
| explicit callout may also be set at this position, as in this example: |
| .sp |
| (?(?C9)(?=a)abc|def) |
| .sp |
| Note that this applies only to assertion conditions, not to other types of |
| condition. |
.P |
.P |
During matching, when PCRE reaches a callout point, the external function is |
During matching, when PCRE reaches a callout point, the external function is |
called. It is provided with the number of the callout, the position in the |
called. It is provided with the number of the callout, the position in the |
Line 2583 documentation.
|
Line 2711 documentation.
|
.rs |
.rs |
.sp |
.sp |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which |
are described in the Perl documentation as "experimental and subject to change | are still described in the Perl documentation as "experimental and subject to |
or removal in a future version of Perl". It goes on to say: "Their usage in | change or removal in a future version of Perl". It goes on to say: "Their usage |
production code should be noted to avoid problems during upgrades." The same | in production code should be noted to avoid problems during upgrades." The same |
remarks apply to the PCRE features described in this section. |
remarks apply to the PCRE features described in this section. |
.P |
.P |
|
The new verbs make use of what was previously invalid syntax: an opening |
|
parenthesis followed by an asterisk. They are generally of the form |
|
(*VERB) or (*VERB:NAME). Some may take either form, possibly behaving |
|
differently depending on whether or not a name is present. A name is any |
|
sequence of characters that does not include a closing parenthesis. The maximum |
|
length of name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit |
|
libraries. If the name is empty, that is, if the closing parenthesis |
|
immediately follows the colon, the effect is as if the colon were not there. |
|
Any number of these verbs may occur in a pattern. |
|
.P |
Since these verbs are specifically related to backtracking, most of them can be |
Since these verbs are specifically related to backtracking, most of them can be |
used only when the pattern is to be matched using one of the traditional |
used only when the pattern is to be matched using one of the traditional |
matching functions, which use a backtracking algorithm. With the exception of | matching functions, because these use a backtracking algorithm. With the |
(*FAIL), which behaves like a failing negative assertion, they cause an error | exception of (*FAIL), which behaves like a failing negative assertion, the |
if encountered by a DFA matching function. | backtracking control verbs cause an error if encountered by a DFA matching |
| function. |
.P |
.P |
If any of these verbs are used in an assertion or in a subpattern that is | The behaviour of these verbs in |
called as a subroutine (whether or not recursively), their effect is confined | .\" HTML <a href="#btrepeat"> |
to that subpattern; it does not extend to the surrounding pattern, with one | .\" </a> |
exception: the name from a (*MARK), (*PRUNE), or (*THEN) that is encountered in | repeated groups, |
a successful positive assertion \fIis\fP passed back when a match succeeds | .\" |
(compare capturing parentheses in assertions). Note that such subpatterns are | .\" HTML <a href="#btassert"> |
processed as anchored at the point where they are tested. Note also that Perl's | .\" </a> |
treatment of subroutines and assertions is different in some cases. | assertions, |
.P | .\" |
The new verbs make use of what was previously invalid syntax: an opening | and in |
parenthesis followed by an asterisk. They are generally of the form | .\" HTML <a href="#btsub"> |
(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, | .\" </a> |
depending on whether or not an argument is present. A name is any sequence of | subpatterns called as subroutines |
characters that does not include a closing parenthesis. The maximum length of | .\" |
name is 255 in the 8-bit library and 65535 in the 16-bit library. If the name | (whether or not recursively) is documented below. |
is empty, that is, if the closing parenthesis immediately follows the colon, | |
the effect is as if the colon were not there. Any number of these verbs may | |
occur in a pattern. | |
. |
. |
. |
. |
.\" HTML <a name="nooptimize"></a> |
.\" HTML <a name="nooptimize"></a> |
Line 2621 occur in a pattern.
|
Line 2757 occur in a pattern.
|
PCRE contains some optimizations that are used to speed up matching by running |
PCRE contains some optimizations that are used to speed up matching by running |
some checks at the start of each match attempt. For example, it may know the |
some checks at the start of each match attempt. For example, it may know the |
minimum length of matching subject, or that a particular character must be |
minimum length of matching subject, or that a particular character must be |
present. When one of these optimizations suppresses the running of a match, any | present. When one of these optimizations bypasses the running of a match, any |
included backtracking verbs will not, of course, be processed. You can suppress |
included backtracking verbs will not, of course, be processed. You can suppress |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the |
when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the |
Line 2652 followed by a name.
|
Line 2788 followed by a name.
|
This verb causes the match to end successfully, skipping the remainder of the |
This verb causes the match to end successfully, skipping the remainder of the |
pattern. However, when it is inside a subpattern that is called as a |
pattern. However, when it is inside a subpattern that is called as a |
subroutine, only that subpattern is ended successfully. Matching then continues |
subroutine, only that subpattern is ended successfully. Matching then continues |
at the outer level. If (*ACCEPT) is inside capturing parentheses, the data so | at the outer level. If (*ACCEPT) is triggered in a positive assertion, the |
far is captured. For example: | assertion succeeds; in a negative assertion, the assertion fails. |
| .P |
| If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For |
| example: |
.sp |
.sp |
A((?:A|B(*ACCEPT)|C)D) |
A((?:A|B(*ACCEPT)|C)D) |
.sp |
.sp |
Line 2686 starting point (see (*SKIP) below).
|
Line 2825 starting point (see (*SKIP) below).
|
A name is always required with this verb. There may be as many instances of |
A name is always required with this verb. There may be as many instances of |
(*MARK) as you like in a pattern, and their names do not have to be unique. |
(*MARK) as you like in a pattern, and their names do not have to be unique. |
.P |
.P |
When a match succeeds, the name of the last-encountered (*MARK) on the matching | When a match succeeds, the name of the last-encountered (*MARK:NAME), |
path is passed back to the caller as described in the section entitled | (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to the |
| caller as described in the section entitled |
.\" HTML <a href="pcreapi.html#extradata"> |
.\" HTML <a href="pcreapi.html#extradata"> |
.\" </a> |
.\" </a> |
"Extra data for \fBpcre_exec()\fP" |
"Extra data for \fBpcre_exec()\fP" |
Line 2712 indicates which of the two alternatives matched. This
|
Line 2852 indicates which of the two alternatives matched. This
|
of obtaining this information than putting each alternative in its own |
of obtaining this information than putting each alternative in its own |
capturing parentheses. |
capturing parentheses. |
.P |
.P |
If (*MARK) is encountered in a positive assertion, its name is recorded and | If a verb with a name is encountered in a positive assertion that is true, the |
passed back if it is the last-encountered. This does not happen for negative | name is recorded and passed back if it is the last-encountered. This does not |
assertions. | happen for negative assertions or failing positive assertions. |
.P |
.P |
After a partial match or a failed match, the name of the last encountered | After a partial match or a failed match, the last encountered name in the |
(*MARK) in the entire match process is returned. For example: | entire match process is returned. For example: |
.sp |
.sp |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
data> XP |
data> XP |
Line 2743 to ensure that the match is always attempted.
|
Line 2883 to ensure that the match is always attempted.
|
The following verbs do nothing when they are encountered. Matching continues |
The following verbs do nothing when they are encountered. Matching continues |
with what follows, but if there is no subsequent match, causing a backtrack to |
with what follows, but if there is no subsequent match, causing a backtrack to |
the verb, a failure is forced. That is, backtracking cannot pass to the left of |
the verb, a failure is forced. That is, backtracking cannot pass to the left of |
the verb. However, when one of these verbs appears inside an atomic group, its | the verb. However, when one of these verbs appears inside an atomic group or an |
effect is confined to that group, because once the group has been matched, | assertion that is true, its effect is confined to that group, because once the |
there is never any backtracking into it. In this situation, backtracking can | group has been matched, there is never any backtracking into it. In this |
"jump back" to the left of the entire atomic group. (Remember also, as stated | situation, backtracking can "jump back" to the left of the entire atomic group |
above, that this localization also applies in subroutine calls and assertions.) | or assertion. (Remember also, as stated above, that this localization also |
| applies in subroutine calls.) |
.P |
.P |
These verbs differ in exactly what kind of failure occurs when backtracking |
These verbs differ in exactly what kind of failure occurs when backtracking |
reaches them. | reaches them. The behaviour described below is what happens when the verb is |
| not in a subroutine or an assertion. Subsequent sections cover these special |
| cases. |
.sp |
.sp |
(*COMMIT) |
(*COMMIT) |
.sp |
.sp |
This verb, which may not be followed by a name, causes the whole match to fail |
This verb, which may not be followed by a name, causes the whole match to fail |
outright if the rest of the pattern does not match. Even if the pattern is | outright if there is a later matching failure that causes backtracking to reach |
unanchored, no further attempts to find a match by advancing the starting point | it. Even if the pattern is unanchored, no further attempts to find a match by |
take place. Once (*COMMIT) has been passed, \fBpcre_exec()\fP is committed to | advancing the starting point take place. If (*COMMIT) is the only backtracking |
finding a match at the current starting point, or not at all. For example: | verb that is encountered, once it has been passed \fBpcre_exec()\fP is |
| committed to finding a match at the current starting point, or not at all. For |
| example: |
.sp |
.sp |
a+(*COMMIT)b |
a+(*COMMIT)b |
.sp |
.sp |
Line 2767 dynamic anchor, or "I've started, so I must finish." T
|
Line 2912 dynamic anchor, or "I've started, so I must finish." T
|
recently passed (*MARK) in the path is passed back when (*COMMIT) forces a |
recently passed (*MARK) in the path is passed back when (*COMMIT) forces a |
match failure. |
match failure. |
.P |
.P |
|
If there is more than one backtracking verb in a pattern, a different one that |
|
follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a |
|
match does not always guarantee that a match must be at this starting point. |
|
.P |
Note that (*COMMIT) at the start of a pattern is not the same as an anchor, |
Note that (*COMMIT) at the start of a pattern is not the same as an anchor, |
unless PCRE's start-of-match optimizations are turned off, as shown in this |
unless PCRE's start-of-match optimizations are turned off, as shown in this |
\fBpcretest\fP example: |
\fBpcretest\fP example: |
Line 2786 starting points.
|
Line 2935 starting points.
|
(*PRUNE) or (*PRUNE:NAME) |
(*PRUNE) or (*PRUNE:NAME) |
.sp |
.sp |
This verb causes the match to fail at the current starting position in the |
This verb causes the match to fail at the current starting position in the |
subject if the rest of the pattern does not match. If the pattern is | subject if there is a later matching failure that causes backtracking to reach |
unanchored, the normal "bumpalong" advance to the next starting character then | it. If the pattern is unanchored, the normal "bumpalong" advance to the next |
happens. Backtracking can occur as usual to the left of (*PRUNE), before it is | starting character then happens. Backtracking can occur as usual to the left of |
reached, or when matching to the right of (*PRUNE), but if there is no match to | (*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but |
the right, backtracking cannot cross (*PRUNE). In simple cases, the use of | if there is no match to the right, backtracking cannot cross (*PRUNE). In |
(*PRUNE) is just an alternative to an atomic group or possessive quantifier, | simple cases, the use of (*PRUNE) is just an alternative to an atomic group or |
but there are some uses of (*PRUNE) that cannot be expressed in any other way. | possessive quantifier, but there are some uses of (*PRUNE) that cannot be |
The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an | expressed in any other way. In an anchored pattern (*PRUNE) has the same effect |
anchored pattern (*PRUNE) has the same effect as (*COMMIT). | as (*COMMIT). |
| .P |
| The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). |
| It is like (*MARK:NAME) in that the name is remembered for passing back to the |
| caller. However, (*SKIP:NAME) searches only for names set with (*MARK). |
.sp |
.sp |
(*SKIP) |
(*SKIP) |
.sp |
.sp |
Line 2815 instead of skipping on to "c".
|
Line 2968 instead of skipping on to "c".
|
.sp |
.sp |
(*SKIP:NAME) |
(*SKIP:NAME) |
.sp |
.sp |
When (*SKIP) has an associated name, its behaviour is modified. If the | When (*SKIP) has an associated name, its behaviour is modified. When it is |
following pattern fails to match, the previous path through the pattern is | triggered, the previous path through the pattern is searched for the most |
searched for the most recent (*MARK) that has the same name. If one is found, | recent (*MARK) that has the same name. If one is found, the "bumpalong" advance |
the "bumpalong" advance is to the subject position that corresponds to that | is to the subject position that corresponds to that (*MARK) instead of to where |
(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a | (*SKIP) was encountered. If no (*MARK) with a matching name is found, the |
matching name is found, the (*SKIP) is ignored. | (*SKIP) is ignored. |
| .P |
| Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores |
| names that are set by (*PRUNE:NAME) or (*THEN:NAME). |
.sp |
.sp |
(*THEN) or (*THEN:NAME) |
(*THEN) or (*THEN:NAME) |
.sp |
.sp |
This verb causes a skip to the next innermost alternative if the rest of the | This verb causes a skip to the next innermost alternative when backtracking |
pattern does not match. That is, it cancels pending backtracking, but only | reaches it. That is, it cancels any further backtracking within the current |
within the current alternative. Its name comes from the observation that it can | alternative. Its name comes from the observation that it can be used for a |
be used for a pattern-based if-then-else block: | pattern-based if-then-else block: |
.sp |
.sp |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
.sp |
.sp |
If the COND1 pattern matches, FOO is tried (and possibly further items after |
If the COND1 pattern matches, FOO is tried (and possibly further items after |
the end of the group if FOO succeeds); on failure, the matcher skips to the |
the end of the group if FOO succeeds); on failure, the matcher skips to the |
second alternative and tries COND2, without backtracking into COND1. The | second alternative and tries COND2, without backtracking into COND1. If that |
behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN). | succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no |
If (*THEN) is not inside an alternation, it acts like (*PRUNE). | more alternatives, so there is a backtrack to whatever came before the entire |
| group. If (*THEN) is not inside an alternation, it acts like (*PRUNE). |
.P |
.P |
Note that a subpattern that does not contain a | character is just a part of | The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). |
the enclosing alternative; it is not a nested alternation with only one | It is like (*MARK:NAME) in that the name is remembered for passing back to the |
| caller. However, (*SKIP:NAME) searches only for names set with (*MARK). |
| .P |
| A subpattern that does not contain a | character is just a part of the |
| enclosing alternative; it is not a nested alternation with only one |
alternative. The effect of (*THEN) extends beyond such a subpattern to the |
alternative. The effect of (*THEN) extends beyond such a subpattern to the |
enclosing alternative. Consider this pattern, where A, B, etc. are complex |
enclosing alternative. Consider this pattern, where A, B, etc. are complex |
pattern fragments that do not contain any | characters at this level: |
pattern fragments that do not contain any | characters at this level: |
Line 2857 in C, matching moves to (*FAIL), which causes the whol
|
Line 3018 in C, matching moves to (*FAIL), which causes the whol
|
because there are no more alternatives to try. In this case, matching does now |
because there are no more alternatives to try. In this case, matching does now |
backtrack into A. |
backtrack into A. |
.P |
.P |
Note also that a conditional subpattern is not considered as having two | Note that a conditional subpattern is not considered as having two |
alternatives, because only one is ever used. In other words, the | character in |
alternatives, because only one is ever used. In other words, the | character in |
a conditional subpattern has a different meaning. Ignoring white space, |
a conditional subpattern has a different meaning. Ignoring white space, |
consider: |
consider: |
Line 2879 starting position, but allowing an advance to the next
|
Line 3040 starting position, but allowing an advance to the next
|
unanchored pattern). (*SKIP) is similar, except that the advance may be more |
unanchored pattern). (*SKIP) is similar, except that the advance may be more |
than one character. (*COMMIT) is the strongest, causing the entire match to |
than one character. (*COMMIT) is the strongest, causing the entire match to |
fail. |
fail. |
.P | . |
If more than one such verb is present in a pattern, the "strongest" one wins. | . |
For example, consider this pattern, where A, B, etc. are complex pattern | .SS "More than one backtracking verb" |
fragments: | .rs |
.sp |
.sp |
(A(*COMMIT)B(*THEN)C|D) | If more than one backtracking verb is present in a pattern, the one that is |
| backtracked onto first acts. For example, consider this pattern, where A, B, |
| etc. are complex pattern fragments: |
.sp |
.sp |
Once A has matched, PCRE is committed to this match, at the current starting | (A(*COMMIT)B(*THEN)C|ABD) |
position. If subsequently B matches, but C does not, the normal (*THEN) action | .sp |
of trying the next alternative (that is, D) does not happen because (*COMMIT) | If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to |
overrides. | fail. However, if A and B match, but C fails, the backtrack to (*THEN) causes |
| the next alternative (ABD) to be tried. This behaviour is consistent, but is |
| not always the same as Perl's. It means that if two or more backtracking verbs |
| appear in succession, all but the last of them have no effect. Consider this |
| example: |
| .sp |
| ...(*COMMIT)(*PRUNE)... |
| .sp |
| If there is a matching failure to the right, backtracking onto (*PRUNE) causes |
| it to be triggered, and its action is taken. There can never be a backtrack |
| onto (*COMMIT). |
. |
. |
. |
. |
|
.\" HTML <a name="btrepeat"></a> |
|
.SS "Backtracking verbs in repeated groups" |
|
.rs |
|
.sp |
|
PCRE differs from Perl in its handling of backtracking verbs in repeated |
|
groups. For example, consider: |
|
.sp |
|
/(a(*COMMIT)b)+ac/ |
|
.sp |
|
If the subject is "abac", Perl matches, but PCRE fails because the (*COMMIT) in |
|
the second repeat of the group acts. |
|
. |
|
. |
|
.\" HTML <a name="btassert"></a> |
|
.SS "Backtracking verbs in assertions" |
|
.rs |
|
.sp |
|
(*FAIL) in an assertion has its normal effect: it forces an immediate backtrack. |
|
.P |
|
(*ACCEPT) in a positive assertion causes the assertion to succeed without any |
|
further processing. In a negative assertion, (*ACCEPT) causes the assertion to |
|
fail without any further processing. |
|
.P |
|
The other backtracking verbs are not treated specially if they appear in a |
|
positive assertion. In particular, (*THEN) skips to the next alternative in the |
|
innermost enclosing group that has alternations, whether or not this is within |
|
the assertion. |
|
.P |
|
Negative assertions are, however, different, in order to ensure that changing a |
|
positive assertion into a negative assertion changes its result. Backtracking |
|
into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative assertion to be true, |
|
without considering any further alternative branches in the assertion. |
|
Backtracking into (*THEN) causes it to skip to the next enclosing alternative |
|
within the assertion (the normal behaviour), but if the assertion does not have |
|
such an alternative, (*THEN) behaves like (*PRUNE). |
|
. |
|
. |
|
.\" HTML <a name="btsub"></a> |
|
.SS "Backtracking verbs in subroutines" |
|
.rs |
|
.sp |
|
These behaviours occur whether or not the subpattern is called recursively. |
|
Perl's treatment of subroutines is different in some cases. |
|
.P |
|
(*FAIL) in a subpattern called as a subroutine has its normal effect: it forces |
|
an immediate backtrack. |
|
.P |
|
(*ACCEPT) in a subpattern called as a subroutine causes the subroutine match to |
|
succeed without any further processing. Matching then continues after the |
|
subroutine call. |
|
.P |
|
(*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine cause |
|
the subroutine match to fail. |
|
.P |
|
(*THEN) skips to the next alternative in the innermost enclosing group within |
|
the subpattern that has alternatives. If there is no such group within the |
|
subpattern, (*THEN) causes the subroutine match to fail. |
|
. |
|
. |
.SH "SEE ALSO" |
.SH "SEE ALSO" |
.rs |
.rs |
.sp |
.sp |
\fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3), |
\fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3), |
\fBpcresyntax\fP(3), \fBpcre\fP(3), \fBpcre16(3)\fP. | \fBpcresyntax\fP(3), \fBpcre\fP(3), \fBpcre16(3)\fP, \fBpcre32(3)\fP. |
. |
. |
. |
. |
.SH AUTHOR |
.SH AUTHOR |
Line 2913 Cambridge CB2 3QH, England.
|
Line 3145 Cambridge CB2 3QH, England.
|
.rs |
.rs |
.sp |
.sp |
.nf |
.nf |
Last updated: 17 June 2012 | Last updated: 26 April 2013 |
Copyright (c) 1997-2012 University of Cambridge. | Copyright (c) 1997-2013 University of Cambridge. |
.fi |
.fi |