|
version 1.1.1.3, 2012/10/09 09:19:17
|
version 1.1.1.4, 2013/07/22 08:25:57
|
|
Line 1
|
Line 1
|
| .TH PCREPATTERN 3 "04 May 2012" "PCRE 8.31" | .TH PCREPATTERN 3 "26 April 2013" "PCRE 8.33" |
| .SH NAME |
.SH NAME |
| PCRE - Perl-compatible regular expressions |
PCRE - Perl-compatible regular expressions |
| .SH "PCRE REGULAR EXPRESSION DETAILS" |
.SH "PCRE REGULAR EXPRESSION DETAILS" |
|
Line 20 have copious examples. Jeffrey Friedl's "Mastering Reg
|
Line 20 have copious examples. Jeffrey Friedl's "Mastering Reg
|
| published by O'Reilly, covers regular expressions in great detail. This |
published by O'Reilly, covers regular expressions in great detail. This |
| description of PCRE's regular expressions is intended as reference material. |
description of PCRE's regular expressions is intended as reference material. |
| .P |
.P |
| |
This document discusses the patterns that are supported by PCRE when one of its |
| |
main matching functions, \fBpcre_exec()\fP (8-bit) or \fBpcre[16|32]_exec()\fP |
| |
(16- or 32-bit), is used. PCRE also has alternative matching functions, |
| |
\fBpcre_dfa_exec()\fP and \fBpcre[16|32]_dfa_exec()\fP, which match using a |
| |
different algorithm that is not Perl-compatible. Some of the features discussed |
| |
below are not available when DFA matching is used. The advantages and |
| |
disadvantages of the alternative functions, and how they differ from the normal |
| |
functions, are discussed in the |
| |
.\" HREF |
| |
\fBpcrematching\fP |
| |
.\" |
| |
page. |
| |
. |
| |
. |
| |
.SH "SPECIAL START-OF-PATTERN ITEMS" |
| |
.rs |
| |
.sp |
| |
A number of options that can be passed to \fBpcre_compile()\fP can also be set |
| |
by special items at the start of a pattern. These are not Perl-compatible, but |
| |
are provided to make these options accessible to pattern writers who are not |
| |
able to change the program that processes the pattern. Any number of these |
| |
items may appear, but they must all be together right at the start of the |
| |
pattern string, and the letters must be in upper case. |
| |
. |
| |
. |
| |
.SS "UTF support" |
| |
.rs |
| |
.sp |
| The original operation of PCRE was on strings of one-byte characters. However, |
The original operation of PCRE was on strings of one-byte characters. However, |
| there is now also support for UTF-8 strings in the original library, and a | there is now also support for UTF-8 strings in the original library, an |
| second library that supports 16-bit and UTF-16 character strings. To use these | extra library that supports 16-bit and UTF-16 character strings, and a |
| | third library that supports 32-bit and UTF-32 character strings. To use these |
| features, PCRE must be built to include appropriate support. When using UTF |
features, PCRE must be built to include appropriate support. When using UTF |
| strings you must either call the compiling function with the PCRE_UTF8 or | strings you must either call the compiling function with the PCRE_UTF8, |
| PCRE_UTF16 option, or the pattern must start with one of these special | PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of |
| sequences: | these special sequences: |
| .sp |
.sp |
| (*UTF8) |
(*UTF8) |
| (*UTF16) |
(*UTF16) |
| |
(*UTF32) |
| |
(*UTF) |
| .sp |
.sp |
| |
(*UTF) is a generic sequence that can be used with any of the libraries. |
| Starting a pattern with such a sequence is equivalent to setting the relevant |
Starting a pattern with such a sequence is equivalent to setting the relevant |
| option. This feature is not Perl-compatible. How setting a UTF mode affects | option. How setting a UTF mode affects pattern matching is mentioned in several |
| pattern matching is mentioned in several places below. There is also a summary | places below. There is also a summary of features in the |
| of features in the | |
| .\" HREF |
.\" HREF |
| \fBpcreunicode\fP |
\fBpcreunicode\fP |
| .\" |
.\" |
| page. |
page. |
| .P |
.P |
| Another special sequence that may appear at the start of a pattern or in | Some applications that allow their users to supply patterns may wish to |
| combination with (*UTF8) or (*UTF16) is: | restrict them to non-UTF data for security reasons. If the PCRE_NEVER_UTF |
| | option is set at compile time, (*UTF) etc. are not allowed, and their |
| | appearance causes an error. |
| | . |
| | . |
| | .SS "Unicode property support" |
| | .rs |
| .sp |
.sp |
| |
Another special sequence that may appear at the start of a pattern is |
| |
.sp |
| (*UCP) |
(*UCP) |
| .sp |
.sp |
| This has the same effect as setting the PCRE_UCP option: it causes sequences |
This has the same effect as setting the PCRE_UCP option: it causes sequences |
| such as \ed and \ew to use Unicode properties to determine character types, |
such as \ed and \ew to use Unicode properties to determine character types, |
| instead of recognizing only characters with codes less than 128 via a lookup |
instead of recognizing only characters with codes less than 128 via a lookup |
| table. |
table. |
| .P | . |
| | . |
| | .SS "Disabling start-up optimizations" |
| | .rs |
| | .sp |
| If a pattern starts with (*NO_START_OPT), it has the same effect as setting the |
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the |
| PCRE_NO_START_OPTIMIZE option either at compile or matching time. There are | PCRE_NO_START_OPTIMIZE option either at compile or matching time. |
| also some more of these special sequences that are concerned with the handling | |
| of newlines; they are described below. | |
| .P | |
| The remainder of this document discusses the patterns that are supported by | |
| PCRE when one of its main matching functions, \fBpcre_exec()\fP (8-bit) or | |
| \fBpcre16_exec()\fP (16-bit), is used. PCRE also has alternative matching | |
| functions, \fBpcre_dfa_exec()\fP and \fBpcre16_dfa_exec()\fP, which match using | |
| a different algorithm that is not Perl-compatible. Some of the features | |
| discussed below are not available when DFA matching is used. The advantages and | |
| disadvantages of the alternative functions, and how they differ from the normal | |
| functions, are discussed in the | |
| .\" HREF | |
| \fBpcrematching\fP | |
| .\" | |
| page. | |
| . |
. |
| . |
. |
| .\" HTML <a name="newlines"></a> |
.\" HTML <a name="newlines"></a> |
| .SH "NEWLINE CONVENTIONS" | .SS "Newline conventions" |
| .rs |
.rs |
| .sp |
.sp |
| PCRE supports five different conventions for indicating line breaks in |
PCRE supports five different conventions for indicating line breaks in |
|
Line 103 example, on a Unix system where LF is the default newl
|
Line 131 example, on a Unix system where LF is the default newl
|
| (*CR)a.b |
(*CR)a.b |
| .sp |
.sp |
| changes the convention to CR. That pattern matches "a\enb" because LF is no |
changes the convention to CR. That pattern matches "a\enb" because LF is no |
| longer a newline. Note that these special settings, which are not | longer a newline. If more than one of these settings is present, the last one |
| Perl-compatible, are recognized only at the very start of a pattern, and that | |
| they must be in upper case. If more than one of them is present, the last one | |
| is used. |
is used. |
| .P |
.P |
| The newline convention affects the interpretation of the dot metacharacter when | The newline convention affects where the circumflex and dollar assertions are |
| PCRE_DOTALL is not set, and also the behaviour of \eN. However, it does not | true. It also affects the interpretation of the dot metacharacter when |
| affect what the \eR escape sequence matches. By default, this is any Unicode | PCRE_DOTALL is not set, and the behaviour of \eN. However, it does not affect |
| newline sequence, for Perl compatibility. However, this can be changed; see the | what the \eR escape sequence matches. By default, this is any Unicode newline |
| | sequence, for Perl compatibility. However, this can be changed; see the |
| description of \eR in the section entitled |
description of \eR in the section entitled |
| .\" HTML <a href="#newlineseq"> |
.\" HTML <a href="#newlineseq"> |
| .\" </a> |
.\" </a> |
|
Line 121 below. A change of \eR setting can be combined with a
|
Line 148 below. A change of \eR setting can be combined with a
|
| convention. |
convention. |
| . |
. |
| . |
. |
| |
.SS "Setting match and recursion limits" |
| |
.rs |
| |
.sp |
| |
The caller of \fBpcre_exec()\fP can set a limit on the number of times the |
| |
internal \fBmatch()\fP function is called and on the maximum depth of |
| |
recursive calls. These facilities are provided to catch runaway matches that |
| |
are provoked by patterns with huge matching trees (a typical example is a |
| |
pattern with nested unlimited repeats) and to avoid running out of system stack |
| |
by too much recursion. When one of these limits is reached, \fBpcre_exec()\fP |
| |
gives an error return. The limits can also be set by items at the start of the |
| |
pattern of the form |
| |
.sp |
| |
(*LIMIT_MATCH=d) |
| |
(*LIMIT_RECURSION=d) |
| |
.sp |
| |
where d is any number of decimal digits. However, the value of the setting must |
| |
be less than the value set by the caller of \fBpcre_exec()\fP for it to have |
| |
any effect. In other words, the pattern writer can lower the limit set by the |
| |
programmer, but not raise it. If there is more than one setting of one of these |
| |
limits, the lower value is used. |
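The lowering rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not PCRE's own code; the function name effective_limit and both of its arguments are hypothetical.

```python
def effective_limit(caller_limit, pattern_settings):
    """Return the limit in force after (*LIMIT_MATCH=d)-style items.

    caller_limit     -- value set by the caller of pcre_exec() (hypothetical name)
    pattern_settings -- the d values found at the start of the pattern
    """
    limit = caller_limit
    for d in pattern_settings:
        # A setting has an effect only when it is less than the limit
        # already in force, so the pattern writer can lower the limit
        # but never raise it. With several settings, the lowest wins.
        if d < limit:
            limit = d
    return limit
```

For example, with a caller limit of 1000, the settings 500 and 2000 leave a limit of 500, and a lone setting of 5000 is ignored.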
| |
. |
| |
. |
| |
.SH "EBCDIC CHARACTER CODES" |
| |
.rs |
| |
.sp |
| |
PCRE can be compiled to run in an environment that uses EBCDIC as its character |
| |
code rather than ASCII or Unicode (typically a mainframe system). In the |
| |
sections below, character code values are ASCII or Unicode; in an EBCDIC |
| |
environment these characters may have different code values, and there are no |
| |
code points greater than 255. |
| |
. |
| |
. |
| .SH "CHARACTERS AND METACHARACTERS" |
.SH "CHARACTERS AND METACHARACTERS" |
| .rs |
.rs |
| .sp |
.sp |
|
Line 246 one of the following escape sequences than the binary
|
Line 305 one of the following escape sequences than the binary
|
| \ex{hhh..} character with hex code hhh.. (non-JavaScript mode) |
\ex{hhh..} character with hex code hhh.. (non-JavaScript mode) |
| \euhhhh character with hex code hhhh (JavaScript mode only) |
\euhhhh character with hex code hhhh (JavaScript mode only) |
| .sp |
.sp |
| The precise effect of \ecx is as follows: if x is a lower case letter, it | The precise effect of \ecx on ASCII characters is as follows: if x is a lower |
| is converted to upper case. Then bit 6 of the character (hex 40) is inverted. | case letter, it is converted to upper case. Then bit 6 of the character (hex |
| Thus \ecz becomes hex 1A (z is 7A), but \ec{ becomes hex 3B ({ is 7B), while | 40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A), |
| \ec; becomes hex 7B (; is 3B). If the byte following \ec has a value greater | but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the |
| than 127, a compile-time error occurs. This locks out non-ASCII characters in | data item (byte or 16-bit value) following \ec has a value greater than 127, a |
| all modes. (When PCRE is compiled in EBCDIC mode, all byte values are valid. A | compile-time error occurs. This locks out non-ASCII characters in all modes. |
| lower case letter is converted to upper case, and then the 0xc0 bits are | |
| flipped.) | |
| .P |
.P |
| |
The \ec facility was designed for use with ASCII characters, but with the |
| |
extension to Unicode it is even less useful than it once was. It is, however, |
| |
recognized when PCRE is compiled in EBCDIC mode, where data items are always |
| |
bytes. In this mode, all values are valid after \ec. If the next character is a |
| |
lower case letter, it is converted to upper case. Then the 0xc0 bits of the |
| |
byte are inverted. Thus \ecA becomes hex 01, as in ASCII (A is C1), but because |
| |
the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other |
| |
characters also generate different values. |
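The arithmetic for \ec in both environments can be checked with a short Python sketch. The function names are hypothetical, and the EBCDIC version uses Python's cp037 codec as a stand-in code page, which is an assumption: it agrees with the letter values quoted above (A is C1, Z is E9), but other characters may differ between EBCDIC code pages.

```python
def ctrl_ascii(ch):
    """Value of \\cx for an ASCII character x, as described above."""
    code = ord(ch)
    if code > 127:
        # PCRE reports a compile-time error for values greater than 127.
        raise ValueError("non-ASCII character after \\c")
    if "a" <= ch <= "z":
        code = ord(ch.upper())   # lower case is converted to upper case
    return code ^ 0x40           # then bit 6 (hex 40) is inverted


def ctrl_ebcdic(letter):
    """Value of \\cx in an EBCDIC environment (sketch, letters only)."""
    # Assumption: cp037 stands in for the EBCDIC code page in use.
    byte = letter.upper().encode("cp037")[0]
    return byte ^ 0xC0           # the 0xc0 bits of the byte are inverted
```

Here ctrl_ascii("z") gives hex 1A and ctrl_ascii("{") gives hex 3B, while ctrl_ebcdic("A") gives hex 01 and ctrl_ebcdic("Z") gives hex 29, matching the values in the text.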
| |
.P |
| By default, after \ex, from zero to two hexadecimal digits are read (letters |
By default, after \ex, from zero to two hexadecimal digits are read (letters |
| can be in upper or lower case). Any number of hexadecimal digits may appear |
can be in upper or lower case). Any number of hexadecimal digits may appear |
| between \ex{ and }, but the character code is constrained as follows: |
between \ex{ and }, but the character code is constrained as follows: |
|
Line 263 between \ex{ and }, but the character code is constrai
|
Line 329 between \ex{ and }, but the character code is constrai
|
| 8-bit UTF-8 mode less than 0x10ffff and a valid codepoint |
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint |
| 16-bit non-UTF mode less than 0x10000 |
16-bit non-UTF mode less than 0x10000 |
| 16-bit UTF-16 mode less than 0x10ffff and a valid codepoint |
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint |
| |
32-bit non-UTF mode less than 0x80000000 |
| |
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint |
| .sp |
.sp |
| Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called |
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called |
| "surrogate" codepoints). | "surrogate" codepoints), and 0xffef. |
| .P |
.P |
| If characters other than hexadecimal digits appear between \ex{ and }, or if |
If characters other than hexadecimal digits appear between \ex{ and }, or if |
| there is no terminating }, this form of escape is not recognized. Instead, the |
there is no terminating }, this form of escape is not recognized. Instead, the |
|
Line 313 subsequent digits stand for themselves. The value of t
|
Line 381 subsequent digits stand for themselves. The value of t
|
| constrained in the same way as characters specified in hexadecimal. |
constrained in the same way as characters specified in hexadecimal. |
| For example: |
For example: |
| .sp |
.sp |
| \e040 is another way of writing a space | \e040 is another way of writing an ASCII space |
| .\" JOIN |
.\" JOIN |
| \e40 is the same, provided there are fewer than 40 |
\e40 is the same, provided there are fewer than 40 |
| previous capturing subpatterns |
previous capturing subpatterns |
|
Line 471 release 5.10. In contrast to the other sequences, whic
|
Line 539 release 5.10. In contrast to the other sequences, whic
|
| characters by default, these always match certain high-valued codepoints, |
characters by default, these always match certain high-valued codepoints, |
| whether or not PCRE_UCP is set. The horizontal space characters are: |
whether or not PCRE_UCP is set. The horizontal space characters are: |
| .sp |
.sp |
| U+0009 Horizontal tab | U+0009 Horizontal tab (HT) |
| U+0020 Space |
U+0020 Space |
| U+00A0 Non-break space |
U+00A0 Non-break space |
| U+1680 Ogham space mark |
U+1680 Ogham space mark |
|
Line 493 whether or not PCRE_UCP is set. The horizontal space c
|
Line 561 whether or not PCRE_UCP is set. The horizontal space c
|
| .sp |
.sp |
| The vertical space characters are: |
The vertical space characters are: |
| .sp |
.sp |
| U+000A Linefeed | U+000A Linefeed (LF) |
| U+000B Vertical tab | U+000B Vertical tab (VT) |
| U+000C Form feed | U+000C Form feed (FF) |
| U+000D Carriage return | U+000D Carriage return (CR) |
| U+0085 Next line | U+0085 Next line (NEL) |
| U+2028 Line separator |
U+2028 Line separator |
| U+2029 Paragraph separator |
U+2029 Paragraph separator |
| .sp |
.sp |
|
Line 551 change of newline convention; for example, a pattern c
|
Line 619 change of newline convention; for example, a pattern c
|
| .sp |
.sp |
| (*ANY)(*BSR_ANYCRLF) |
(*ANY)(*BSR_ANYCRLF) |
| .sp |
.sp |
| They can also be combined with the (*UTF8), (*UTF16), or (*UCP) special | They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) or |
| sequences. Inside a character class, \eR is treated as an unrecognized escape | (*UCP) special sequences. Inside a character class, \eR is treated as an |
| sequence, and so matches the letter "R" by default, but causes an error if | unrecognized escape sequence, and so matches the letter "R" by default, but |
| PCRE_EXTRA is set. | causes an error if PCRE_EXTRA is set. |
| . |
. |
| . |
. |
| .\" HTML <a name="uniextseq"></a> |
.\" HTML <a name="uniextseq"></a> |
|
Line 569 The extra escape sequences are:
|
Line 637 The extra escape sequences are:
|
| .sp |
.sp |
| \ep{\fIxx\fP} a character with the \fIxx\fP property |
\ep{\fIxx\fP} a character with the \fIxx\fP property |
| \eP{\fIxx\fP} a character without the \fIxx\fP property |
\eP{\fIxx\fP} a character without the \fIxx\fP property |
| \eX an extended Unicode sequence | \eX a Unicode extended grapheme cluster |
| .sp |
.sp |
| The property names represented by \fIxx\fP above are limited to the Unicode |
The property names represented by \fIxx\fP above are limited to the Unicode |
| script names, the general category properties, "Any", which matches any |
script names, the general category properties, "Any", which matches any |
|
Line 762 a modifier or "other".
|
Line 830 a modifier or "other".
|
| The Cs (Surrogate) property applies only to characters in the range U+D800 to |
The Cs (Surrogate) property applies only to characters in the range U+D800 to |
| U+DFFF. Such characters are not valid in Unicode strings and so |
U+DFFF. Such characters are not valid in Unicode strings and so |
| cannot be tested by PCRE, unless UTF validity checking has been turned off |
cannot be tested by PCRE, unless UTF validity checking has been turned off |
| (see the discussion of PCRE_NO_UTF8_CHECK and PCRE_NO_UTF16_CHECK in the | (see the discussion of PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK and |
| | PCRE_NO_UTF32_CHECK in the |
| .\" HREF |
.\" HREF |
| \fBpcreapi\fP |
\fBpcreapi\fP |
| .\" |
.\" |
|
Line 777 Instead, this property is assumed for any code point t
|
Line 846 Instead, this property is assumed for any code point t
|
| Unicode table. |
Unicode table. |
| .P |
.P |
| Specifying caseless matching does not affect these escape sequences. For |
Specifying caseless matching does not affect these escape sequences. For |
| example, \ep{Lu} always matches only upper case letters. | example, \ep{Lu} always matches only upper case letters. This is different from |
| | the behaviour of current versions of Perl. |
| .P |
.P |
| The \eX escape matches any number of Unicode characters that form an extended | Matching characters by Unicode property is not fast, because PCRE has to do a |
| Unicode sequence. \eX is equivalent to | multistage table lookup in order to find a character's property. That is why |
| | the traditional escape sequences such as \ed and \ew do not use Unicode |
| | properties in PCRE by default, though you can make them do so by setting the |
| | PCRE_UCP option or by starting the pattern with (*UCP). |
| | . |
| | . |
| | .SS Extended grapheme clusters |
| | .rs |
| .sp |
.sp |
| (?>\ePM\epM*) | The \eX escape matches any number of Unicode characters that form an "extended |
| .sp | grapheme cluster", and treats the sequence as an atomic group |
| That is, it matches a character without the "mark" property, followed by zero | |
| or more characters with the "mark" property, and treats the sequence as an | |
| atomic group | |
| .\" HTML <a href="#atomicgroup"> |
.\" HTML <a href="#atomicgroup"> |
| .\" </a> |
.\" </a> |
| (see below). |
(see below). |
| .\" |
.\" |
| Characters with the "mark" property are typically accents that affect the | Up to and including release 8.31, PCRE matched an earlier, simpler definition |
| preceding character. None of them have codepoints less than 256, so in | that was equivalent to |
| 8-bit non-UTF-8 mode \eX matches any one character. | .sp |
| | (?>\ePM\epM*) |
| | .sp |
| | That is, it matched a character without the "mark" property, followed by zero |
| | or more characters with the "mark" property. Characters with the "mark" |
| | property are typically non-spacing accents that affect the preceding character. |
| .P |
.P |
| Note that recent versions of Perl have changed \eX to match what Unicode calls | This simple definition was extended in Unicode to include more complicated |
| an "extended grapheme cluster", which has a more complicated definition. | kinds of composite character by giving each character a grapheme breaking |
| | property, and creating rules that use these properties to define the boundaries |
| | of extended grapheme clusters. In releases of PCRE later than 8.31, \eX matches |
| | one of these clusters. |
| .P |
.P |
| Matching characters by Unicode property is not fast, because PCRE has to search | \eX always matches at least one character. Then it decides whether to add |
| a structure that contains data for over fifteen thousand characters. That is | additional characters according to the following rules for ending a cluster: |
| why the traditional escape sequences such as \ed and \ew do not use Unicode | .P |
| properties in PCRE by default, though you can make them do so by setting the | 1. End at the end of the subject string. |
| PCRE_UCP option or by starting the pattern with (*UCP). | .P |
| | 2. Do not end between CR and LF; otherwise end after any control character. |
| | .P |
| | 3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters |
| | are of five types: L, V, T, LV, and LVT. An L character may be followed by an |
| | L, V, LV, or LVT character; an LV or V character may be followed by a V or T |
| | character; an LVT or T character may be followed only by a T character. |
| | .P |
| | 4. Do not end before extending characters or spacing marks. Characters with |
| | the "mark" property always have the "extend" grapheme breaking property. |
| | .P |
| | 5. Do not end after prepend characters. |
| | .P |
| | 6. Otherwise, end the cluster. |
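Rule 3 above amounts to a small lookup table. The following Python sketch encodes only that rule, taking the Hangul type of each character as a string; the names HANGUL_FOLLOWERS and hangul_no_break are hypothetical, and the sketch does not look up the grapheme breaking property of real characters.

```python
# For each Hangul type, the types that may follow it without ending the
# cluster, exactly as listed in rule 3.
HANGUL_FOLLOWERS = {
    "L":   {"L", "V", "LV", "LVT"},
    "LV":  {"V", "T"},
    "V":   {"V", "T"},
    "LVT": {"T"},
    "T":   {"T"},
}


def hangul_no_break(prev_type, next_type):
    """True if rule 3 forbids ending the cluster between the two types."""
    return next_type in HANGUL_FOLLOWERS.get(prev_type, set())
```

So an L followed by an LV stays in one cluster, whereas a T followed by an L does not fall under rule 3 and is left to the remaining rules.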
| . |
. |
| . |
. |
| .\" HTML <a name="extraprops"></a> |
.\" HTML <a name="extraprops"></a> |
| .SS PCRE's additional properties |
.SS PCRE's additional properties |
| .rs |
.rs |
| .sp |
.sp |
| As well as the standard Unicode properties described in the previous | As well as the standard Unicode properties described above, PCRE supports four |
| section, PCRE supports four more that make it possible to convert traditional | more that make it possible to convert traditional escape sequences such as \ew |
| escape sequences such as \ew and \es and POSIX character classes to use Unicode | and \es and POSIX character classes to use Unicode properties. PCRE uses these |
| properties. PCRE uses these non-standard, non-Perl properties internally when | non-standard, non-Perl properties internally when PCRE_UCP is set. However, |
| PCRE_UCP is set. They are: | they may also be used explicitly. These properties are: |
| .sp |
.sp |
| Xan Any alphanumeric character |
Xan Any alphanumeric character |
| Xps Any POSIX space character |
Xps Any POSIX space character |
|
Line 825 property. Xps matches the characters tab, linefeed, ve
|
Line 920 property. Xps matches the characters tab, linefeed, ve
|
| carriage return, and any other character that has the Z (separator) property. |
carriage return, and any other character that has the Z (separator) property. |
| Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
| same characters as Xan, plus underscore. |
same characters as Xan, plus underscore. |
| |
.P |
| |
There is another non-standard property, Xuc, which matches any character that |
| |
can be represented by a Universal Character Name in C++ and other programming |
| |
languages. These are the characters $, @, ` (grave accent), and all characters |
| |
with Unicode code points greater than or equal to U+00A0, except for the |
| |
surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are |
| |
excluded. (Universal Character Names are of the form \euHHHH or \eUHHHHHHHH |
| |
where H is a hexadecimal digit. Note that the Xuc property does not match these |
| |
sequences but the characters that they represent.) |
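The set of characters described for Xuc can be written as a simple code point test. This Python sketch follows the description above only; the function name is_xuc is hypothetical, and the upper bound of U+10FFFF is the Unicode maximum rather than something stated in this section.

```python
def is_xuc(cp):
    """True if code point cp is in the Xuc set as described above."""
    if cp in (0x24, 0x40, 0x60):      # $, @ and ` (grave accent)
        return True
    if 0xD800 <= cp <= 0xDFFF:        # surrogates are excluded
        return False
    # Everything from U+00A0 upwards; assumed capped at the Unicode maximum.
    return 0xA0 <= cp <= 0x10FFFF
```

For instance, is_xuc(ord("$")) is true but is_xuc(ord("A")) is false, which shows that most base (ASCII) characters are excluded.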
| . |
. |
| . |
. |
| .\" HTML <a name="resetmatchstart"></a> |
.\" HTML <a name="resetmatchstart"></a> |
|
Line 930 regular expression.
|
Line 1034 regular expression.
|
| .SH "CIRCUMFLEX AND DOLLAR" |
.SH "CIRCUMFLEX AND DOLLAR" |
| .rs |
.rs |
| .sp |
.sp |
| |
The circumflex and dollar metacharacters are zero-width assertions. That is, |
| |
they test for a particular condition being true without consuming any |
| |
characters from the subject string. |
| |
.P |
| Outside a character class, in the default matching mode, the circumflex |
Outside a character class, in the default matching mode, the circumflex |
| character is an assertion that is true only if the current matching point is | character is an assertion that is true only if the current matching point is at |
| at the start of the subject string. If the \fIstartoffset\fP argument of | the start of the subject string. If the \fIstartoffset\fP argument of |
| \fBpcre_exec()\fP is non-zero, circumflex can never match if the PCRE_MULTILINE |
\fBpcre_exec()\fP is non-zero, circumflex can never match if the PCRE_MULTILINE |
| option is unset. Inside a character class, circumflex has an entirely different |
option is unset. Inside a character class, circumflex has an entirely different |
| meaning |
meaning |
|
Line 949 constrained to match only at the start of the subject,
|
Line 1057 constrained to match only at the start of the subject,
|
| "anchored" pattern. (There are also other constructs that can cause a pattern |
"anchored" pattern. (There are also other constructs that can cause a pattern |
| to be anchored.) |
to be anchored.) |
| .P |
.P |
| A dollar character is an assertion that is true only if the current matching | The dollar character is an assertion that is true only if the current matching |
| point is at the end of the subject string, or immediately before a newline | point is at the end of the subject string, or immediately before a newline at |
| at the end of the string (by default). Dollar need not be the last character of | the end of the string (by default). Note, however, that it does not actually |
| the pattern if a number of alternatives are involved, but it should be the last | match the newline. Dollar need not be the last character of the pattern if a |
| item in any branch in which it appears. Dollar has no special meaning in a | number of alternatives are involved, but it should be the last item in any |
| character class. | branch in which it appears. Dollar has no special meaning in a character class. |
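The default behaviour of dollar can be observed with Python's re module. This is a different engine from PCRE, used here only because it follows the same Perl-style default for dollar; it says nothing about PCRE options such as PCRE_DOLLAR_ENDONLY.

```python
import re

# Dollar is true immediately before a newline at the end of the string,
# but the newline is not part of the match.
m = re.search(r"abc$", "abc\n")
matched_text = m.group(0)
matched_span = m.span()

# Only a newline at the very end counts: with two trailing newlines,
# "abc" is no longer followed by the final newline.
no_match = re.search(r"abc$", "abc\n\n")
```

The match covers only "abc"; the trailing newline stays outside the matched span, and the second search finds nothing.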
| .P |
.P |
| The meaning of dollar can be changed so that it matches only at the very end of |
The meaning of dollar can be changed so that it matches only at the very end of |
| the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This |
the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This |
|
Line 1015 name; PCRE does not support this.
|
Line 1123 name; PCRE does not support this.
|
| .sp |
.sp |
| Outside a character class, the escape sequence \eC matches any one data unit, |
Outside a character class, the escape sequence \eC matches any one data unit, |
| whether or not a UTF mode is set. In the 8-bit library, one data unit is one |
whether or not a UTF mode is set. In the 8-bit library, one data unit is one |
| byte; in the 16-bit library it is a 16-bit unit. Unlike a dot, \eC always | byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is |
| | a 32-bit unit. Unlike a dot, \eC always |
| matches line-ending characters. The feature is provided in Perl in order to |
matches line-ending characters. The feature is provided in Perl in order to |
| match individual bytes in UTF-8 mode, but it is unclear how it can usefully be |
match individual bytes in UTF-8 mode, but it is unclear how it can usefully be |
| used. Because \eC breaks up characters into individual data units, matching one |
used. Because \eC breaks up characters into individual data units, matching one |
| unit with \eC in a UTF mode means that the rest of the string may start with a |
unit with \eC in a UTF mode means that the rest of the string may start with a |
| malformed UTF character. This has undefined results, because PCRE assumes that |
malformed UTF character. This has undefined results, because PCRE assumes that |
| it is dealing with valid UTF strings (and by default it checks this at the |
it is dealing with valid UTF strings (and by default it checks this at the |
| start of processing unless the PCRE_NO_UTF8_CHECK or PCRE_NO_UTF16_CHECK option | start of processing unless the PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or |
| is used). | PCRE_NO_UTF32_CHECK option is used). |
| .P |
.P |
| PCRE does not allow \eC to appear in lookbehind assertions |
PCRE does not allow \eC to appear in lookbehind assertions |
| .\" HTML <a href="#lookbehind"> |
.\" HTML <a href="#lookbehind"> |
|
Line 1082 circumflex is not an assertion; it still consumes a ch
|
Line 1191 circumflex is not an assertion; it still consumes a ch
|
| string, and therefore it fails if the current pointer is at the end of the |
string, and therefore it fails if the current pointer is at the end of the |
| string. |
string. |
| .P |
.P |
| In UTF-8 (UTF-16) mode, characters with values greater than 255 (0xffff) can be | In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255 (0xffff) |
| included in a class as a literal string of data units, or by using the \ex{ | can be included in a class as a literal string of data units, or by using the |
| escaping mechanism. | \ex{ escaping mechanism. |
| .P |
.P |
| When caseless matching is set, any letters in a class represent both their |
When caseless matching is set, any letters in a class represent both their |
| upper case and lower case versions, so for example, a caseless [aeiou] matches |
upper case and lower case versions, so for example, a caseless [aeiou] matches |
|
Line 1297 the section entitled
|
Line 1406 the section entitled
|
| .\" </a> |
.\" </a> |
| "Newline sequences" |
"Newline sequences" |
| .\" |
.\" |
| above. There are also the (*UTF8), (*UTF16), and (*UCP) leading sequences that | above. There are also the (*UTF8), (*UTF16), (*UTF32), and (*UCP) leading |
| can be used to set UTF and Unicode property modes; they are equivalent to | sequences that can be used to set UTF and Unicode property modes; they are |
| setting the PCRE_UTF8, PCRE_UTF16, and the PCRE_UCP options, respectively. | equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP |
| | options, respectively. The (*UTF) sequence is a generic version that can be |
| | used with any of the libraries. However, the application can set the |
| | PCRE_NEVER_UTF option, which locks out the use of the (*UTF) sequences. |
| . |
. |
| . |
. |
| .\" HTML <a name="subpattern"></a> |
.\" HTML <a name="subpattern"></a> |
|
Line 1534 quantifier, but a literal string of four characters.
|
Line 1646 quantifier, but a literal string of four characters.
|
| In UTF modes, quantifiers apply to characters rather than to individual data |
In UTF modes, quantifiers apply to characters rather than to individual data |
| units. Thus, for example, \ex{100}{2} matches two characters, each of |
units. Thus, for example, \ex{100}{2} matches two characters, each of |
| which is represented by a two-byte sequence in a UTF-8 string. Similarly, |
which is represented by a two-byte sequence in a UTF-8 string. Similarly, |
| \eX{3} matches three Unicode extended sequences, each of which may be several | \eX{3} matches three Unicode extended grapheme clusters, each of which may be |
| data units long (and they may be of different lengths). | several data units long (and they may be of different lengths). |
| .P |
.P |
| The quantifier {0} is permitted, causing the expression to behave as if the |
The quantifier {0} is permitted, causing the expression to behave as if the |
| previous item and the quantifier were not present. This may be useful for |
previous item and the quantifier were not present. This may be useful for |
|
Line 1621 In cases where it is known that the subject string con
|
Line 1733 In cases where it is known that the subject string con
|
| worth setting PCRE_DOTALL in order to obtain this optimization, or |
worth setting PCRE_DOTALL in order to obtain this optimization, or |
| alternatively using ^ to indicate anchoring explicitly. |
alternatively using ^ to indicate anchoring explicitly. |
| .P |
.P |
| However, there is one situation where the optimization cannot be used. When .* | However, there are some cases where the optimization cannot be used. When .* |
| is inside capturing parentheses that are the subject of a back reference |
is inside capturing parentheses that are the subject of a back reference |
| elsewhere in the pattern, a match at the start may fail where a later one |
elsewhere in the pattern, a match at the start may fail where a later one |
| succeeds. Consider, for example: |
succeeds. Consider, for example: |
|
Line 1631 succeeds. Consider, for example:
|
Line 1743 succeeds. Consider, for example:
|
| If the subject is "xyz123abc123" the match point is the fourth character. For |
If the subject is "xyz123abc123" the match point is the fourth character. For |
| this reason, such a pattern is not implicitly anchored. |
this reason, such a pattern is not implicitly anchored. |
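The match point in this example can be checked with a short sketch. Two assumptions apply: the pattern itself lies outside this diff hunk, so (.*)abc\1 is assumed here for illustration, and Python's re module is used as a stand-in backtracking matcher rather than PCRE. The helper names are hypothetical.

```python
import re

# Assumed pattern: the hunk above elides it, so (.*)abc\1 is used for illustration.
PATTERN = re.compile(r"(.*)abc\1")


def match_point(subject):
    """Return the 0-based offset where the first match starts, or None."""
    m = PATTERN.search(subject)
    return None if m is None else m.start()


def anchored_attempt(subject):
    """Try the pattern at offset 0 only, as an implicitly anchored pattern would."""
    return PATTERN.match(subject)


print(match_point("xyz123abc123"), anchored_attempt("xyz123abc123"))
```

Offset 3 is the fourth character. An attempt restricted to offset 0 finds nothing, which is why treating such a pattern as anchored would lose the match.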
| .P |
.P |
| |
Another case where implicit anchoring is not applied is when the leading .* is |
| |
inside an atomic group. Once again, a match at the start may fail where a later |
| |
one succeeds. Consider this pattern: |
| |
.sp |
| |
(?>.*?a)b |
| |
.sp |
| |
It matches "ab" in the subject "aab". The use of the backtracking control verbs |
| |
(*PRUNE) and (*SKIP) also disables this optimization. |
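The atomic-group case can be reproduced with a small sketch, again using Python's re module as a stand-in for PCRE (an assumption). Python accepts the (?>...) syntax only from version 3.11, so the sketch also uses the common lookahead-plus-backreference emulation of an atomic group, which is a swapped-in technique and not how PCRE implements it. All names are hypothetical.

```python
import re
import sys

# Emulation of (?>.*?a)b: a lookahead is never backtracked into once it succeeds.
EMULATED = re.compile(r"(?=(.*?a))\1b")
# Native atomic-group syntax, compiled only where Python supports it (3.11+).
NATIVE = re.compile(r"(?>.*?a)b") if sys.version_info >= (3, 11) else None
# The same pattern without the atomic group, for contrast.
PLAIN = re.compile(r".*?ab")


def first_match(pattern, subject):
    """Return (start offset, matched text) of the first match, or None."""
    m = pattern.search(subject)
    return None if m is None else (m.start(), m.group(0))


print(first_match(EMULATED, "aab"), first_match(PLAIN, "aab"))
```

With the atomic group, the attempt at offset 0 fails and the match is found at offset 1. Without it, the match starts at offset 0, so only the non-atomic form could safely be treated as anchored.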
| |
.P |
| When a capturing subpattern is repeated, the value captured is the substring |
When a capturing subpattern is repeated, the value captured is the substring |
| that matched the final iteration. For example, after |
that matched the final iteration. For example, after |
| .sp |
.sp |
|
Line 1899 except that it does not cause the current matching pos
|
Line 2020 except that it does not cause the current matching pos
|
| Assertion subpatterns are not capturing subpatterns. If such an assertion |
Assertion subpatterns are not capturing subpatterns. If such an assertion |
| contains capturing subpatterns within it, these are counted for the purposes of |
contains capturing subpatterns within it, these are counted for the purposes of |
| numbering the capturing subpatterns in the whole pattern. However, substring |
numbering the capturing subpatterns in the whole pattern. However, substring |
| capturing is carried out only for positive assertions, because it does not make | capturing is carried out only for positive assertions. (Perl sometimes, but not |
| sense for negative assertions. | always, does do capturing in negative assertions.) |
| .P |
.P |
| For compatibility with Perl, assertion subpatterns may be repeated; though |
For compatibility with Perl, assertion subpatterns may be repeated; though |
| it makes no sense to assert the same thing several times, the side effect of |
it makes no sense to assert the same thing several times, the side effect of |
|
Line 2552 same pair of parentheses when there is a repetition.
|
Line 2673 same pair of parentheses when there is a repetition.
|
| PCRE provides a similar feature, but of course it cannot obey arbitrary Perl |
PCRE provides a similar feature, but of course it cannot obey arbitrary Perl |
| code. The feature is called "callout". The caller of PCRE provides an external |
code. The feature is called "callout". The caller of PCRE provides an external |
| function by putting its entry point in the global variable \fIpcre_callout\fP |
function by putting its entry point in the global variable \fIpcre_callout\fP |
| (8-bit library) or \fIpcre16_callout\fP (16-bit library). By default, this | (8-bit library) or \fIpcre[16|32]_callout\fP (16-bit or 32-bit library). |
| variable contains NULL, which disables all calling out. | By default, this variable contains NULL, which disables all calling out. |
| .P |
.P |
| Within a regular expression, (?C) indicates the points at which the external |
Within a regular expression, (?C) indicates the points at which the external |
| function is to be called. If you want to identify different callout points, you |
function is to be called. If you want to identify different callout points, you |
|
Line 2564 For example, this pattern has two callout points:
|
Line 2685 For example, this pattern has two callout points:
|
| .sp |
.sp |
| If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, callouts are |
If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, callouts are |
| automatically installed before each item in the pattern. They are all numbered |
automatically installed before each item in the pattern. They are all numbered |
| 255. | 255. If there is a conditional group in the pattern whose condition is an |
| | assertion, an additional callout is inserted just before the condition. An |
| | explicit callout may also be set at this position, as in this example: |
| | .sp |
| | (?(?C9)(?=a)abc|def) |
| | .sp |
| | Note that this applies only to assertion conditions, not to other types of |
| | condition. |
| .P |
.P |
| During matching, when PCRE reaches a callout point, the external function is |
During matching, when PCRE reaches a callout point, the external function is |
| called. It is provided with the number of the callout, the position in the |
called. It is provided with the number of the callout, the position in the |
|
Line 2583 documentation.
|
Line 2711 documentation.
|
| .rs |
.rs |
| .sp |
.sp |
| Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which |
| are described in the Perl documentation as "experimental and subject to change | are still described in the Perl documentation as "experimental and subject to |
| or removal in a future version of Perl". It goes on to say: "Their usage in | change or removal in a future version of Perl". It goes on to say: "Their usage |
| production code should be noted to avoid problems during upgrades." The same | in production code should be noted to avoid problems during upgrades." The same |
| remarks apply to the PCRE features described in this section. |
remarks apply to the PCRE features described in this section. |
| .P |
.P |
| |
The new verbs make use of what was previously invalid syntax: an opening |
| |
parenthesis followed by an asterisk. They are generally of the form |
| |
(*VERB) or (*VERB:NAME). Some may take either form, possibly behaving |
| |
differently depending on whether or not a name is present. A name is any |
| |
sequence of characters that does not include a closing parenthesis. The maximum |
| |
length of name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit |
| |
libraries. If the name is empty, that is, if the closing parenthesis |
| |
immediately follows the colon, the effect is as if the colon were not there. |
| |
Any number of these verbs may occur in a pattern. |
| |
.P |
| Since these verbs are specifically related to backtracking, most of them can be |
Since these verbs are specifically related to backtracking, most of them can be |
| used only when the pattern is to be matched using one of the traditional |
used only when the pattern is to be matched using one of the traditional |
| matching functions, which use a backtracking algorithm. With the exception of | matching functions, because these use a backtracking algorithm. With the |
| (*FAIL), which behaves like a failing negative assertion, they cause an error | exception of (*FAIL), which behaves like a failing negative assertion, the |
| if encountered by a DFA matching function. | backtracking control verbs cause an error if encountered by a DFA matching |
| | function. |
| .P |
.P |
| If any of these verbs are used in an assertion or in a subpattern that is | The behaviour of these verbs in |
| called as a subroutine (whether or not recursively), their effect is confined | .\" HTML <a href="#btrepeat"> |
| to that subpattern; it does not extend to the surrounding pattern, with one | .\" </a> |
| exception: the name from a *(MARK), (*PRUNE), or (*THEN) that is encountered in | repeated groups, |
| a successful positive assertion \fIis\fP passed back when a match succeeds | .\" |
| (compare capturing parentheses in assertions). Note that such subpatterns are | .\" HTML <a href="#btassert"> |
| processed as anchored at the point where they are tested. Note also that Perl's | .\" </a> |
| treatment of subroutines and assertions is different in some cases. | assertions, |
| .P | .\" |
| The new verbs make use of what was previously invalid syntax: an opening | and in |
| parenthesis followed by an asterisk. They are generally of the form | .\" HTML <a href="#btsub"> |
| (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, | .\" </a> |
| depending on whether or not an argument is present. A name is any sequence of | subpatterns called as subroutines |
| characters that does not include a closing parenthesis. The maximum length of | .\" |
| name is 255 in the 8-bit library and 65535 in the 16-bit library. If the name | (whether or not recursively) is documented below. |
| is empty, that is, if the closing parenthesis immediately follows the colon, | |
| the effect is as if the colon were not there. Any number of these verbs may | |
| occur in a pattern. | |
| . |
. |
| . |
. |
| .\" HTML <a name="nooptimize"></a> |
.\" HTML <a name="nooptimize"></a> |
|
Line 2621 occur in a pattern.
|
Line 2757 occur in a pattern.
|
| PCRE contains some optimizations that are used to speed up matching by running |
PCRE contains some optimizations that are used to speed up matching by running |
| some checks at the start of each match attempt. For example, it may know the |
some checks at the start of each match attempt. For example, it may know the |
| minimum length of matching subject, or that a particular character must be |
minimum length of matching subject, or that a particular character must be |
| present. When one of these optimizations suppresses the running of a match, any | present. When one of these optimizations bypasses the running of a match, any |
| included backtracking verbs will not, of course, be processed. You can suppress |
included backtracking verbs will not, of course, be processed. You can suppress |
| the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
| when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the |
when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the |
|
Line 2652 followed by a name.
|
Line 2788 followed by a name.
|
| This verb causes the match to end successfully, skipping the remainder of the |
This verb causes the match to end successfully, skipping the remainder of the |
| pattern. However, when it is inside a subpattern that is called as a |
pattern. However, when it is inside a subpattern that is called as a |
| subroutine, only that subpattern is ended successfully. Matching then continues |
subroutine, only that subpattern is ended successfully. Matching then continues |
| at the outer level. If (*ACCEPT) is inside capturing parentheses, the data so | at the outer level. If (*ACCEPT) is triggered in a positive assertion, the |
| far is captured. For example: | assertion succeeds; in a negative assertion, the assertion fails. |
| | .P |
| | If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For |
| | example: |
| .sp |
.sp |
| A((?:A|B(*ACCEPT)|C)D) |
A((?:A|B(*ACCEPT)|C)D) |
| .sp |
.sp |
|
Line 2686 starting point (see (*SKIP) below).
|
Line 2825 starting point (see (*SKIP) below).
|
| A name is always required with this verb. There may be as many instances of |
A name is always required with this verb. There may be as many instances of |
| (*MARK) as you like in a pattern, and their names do not have to be unique. |
(*MARK) as you like in a pattern, and their names do not have to be unique. |
| .P |
.P |
| When a match succeeds, the name of the last-encountered (*MARK) on the matching | When a match succeeds, the name of the last-encountered (*MARK:NAME), |
| path is passed back to the caller as described in the section entitled | (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to the |
| | caller as described in the section entitled |
| .\" HTML <a href="pcreapi.html#extradata"> |
.\" HTML <a href="pcreapi.html#extradata"> |
| .\" </a> |
.\" </a> |
| "Extra data for \fBpcre_exec()\fP" |
"Extra data for \fBpcre_exec()\fP" |
|
Line 2712 indicates which of the two alternatives matched. This
|
Line 2852 indicates which of the two alternatives matched. This
|
| of obtaining this information than putting each alternative in its own |
of obtaining this information than putting each alternative in its own |
| capturing parentheses. |
capturing parentheses. |
| .P |
.P |
| If (*MARK) is encountered in a positive assertion, its name is recorded and | If a verb with a name is encountered in a positive assertion that is true, the |
| passed back if it is the last-encountered. This does not happen for negative | name is recorded and passed back if it is the last-encountered. This does not |
| assertions. | happen for negative assertions or failing positive assertions. |
| .P |
.P |
| After a partial match or a failed match, the name of the last encountered | After a partial match or a failed match, the last encountered name in the |
| (*MARK) in the entire match process is returned. For example: | entire match process is returned. For example: |
| .sp |
.sp |
| re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
| data> XP |
data> XP |
|
Line 2743 to ensure that the match is always attempted.
|
Line 2883 to ensure that the match is always attempted.
|
| The following verbs do nothing when they are encountered. Matching continues |
The following verbs do nothing when they are encountered. Matching continues |
| with what follows, but if there is no subsequent match, causing a backtrack to |
with what follows, but if there is no subsequent match, causing a backtrack to |
| the verb, a failure is forced. That is, backtracking cannot pass to the left of |
the verb, a failure is forced. That is, backtracking cannot pass to the left of |
| the verb. However, when one of these verbs appears inside an atomic group, its | the verb. However, when one of these verbs appears inside an atomic group or an |
| effect is confined to that group, because once the group has been matched, | assertion that is true, its effect is confined to that group, because once the |
| there is never any backtracking into it. In this situation, backtracking can | group has been matched, there is never any backtracking into it. In this |
| "jump back" to the left of the entire atomic group. (Remember also, as stated | situation, backtracking can "jump back" to the left of the entire atomic group |
| above, that this localization also applies in subroutine calls and assertions.) | or assertion. (Remember also, as stated above, that this localization also |
| | applies in subroutine calls.) |
| .P |
.P |
| These verbs differ in exactly what kind of failure occurs when backtracking |
These verbs differ in exactly what kind of failure occurs when backtracking |
| reaches them. | reaches them. The behaviour described below is what happens when the verb is |
| | not in a subroutine or an assertion. Subsequent sections cover these special |
| | cases. |
| .sp |
.sp |
| (*COMMIT) |
(*COMMIT) |
| .sp |
.sp |
| This verb, which may not be followed by a name, causes the whole match to fail |
This verb, which may not be followed by a name, causes the whole match to fail |
| outright if the rest of the pattern does not match. Even if the pattern is | outright if there is a later matching failure that causes backtracking to reach |
| unanchored, no further attempts to find a match by advancing the starting point | it. Even if the pattern is unanchored, no further attempts to find a match by |
| take place. Once (*COMMIT) has been passed, \fBpcre_exec()\fP is committed to | advancing the starting point take place. If (*COMMIT) is the only backtracking |
| finding a match at the current starting point, or not at all. For example: | verb that is encountered, once it has been passed \fBpcre_exec()\fP is |
| | committed to finding a match at the current starting point, or not at all. For |
| | example: |
| .sp |
.sp |
| a+(*COMMIT)b |
a+(*COMMIT)b |
| .sp |
.sp |
|
Line 2767 dynamic anchor, or "I've started, so I must finish." T
|
Line 2912 dynamic anchor, or "I've started, so I must finish." T
|
| recently passed (*MARK) in the path is passed back when (*COMMIT) forces a |
recently passed (*MARK) in the path is passed back when (*COMMIT) forces a |
| match failure. |
match failure. |
| .P |
.P |
| |
If there is more than one backtracking verb in a pattern, a different one that |
| |
follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a |
| |
match does not always guarantee that a match must be at this starting point. |
| |
.P |
| Note that (*COMMIT) at the start of a pattern is not the same as an anchor, |
Note that (*COMMIT) at the start of a pattern is not the same as an anchor, |
| unless PCRE's start-of-match optimizations are turned off, as shown in this |
unless PCRE's start-of-match optimizations are turned off, as shown in this |
| \fBpcretest\fP example: |
\fBpcretest\fP example: |
|
Line 2786 starting points.
|
Line 2935 starting points.
|
| (*PRUNE) or (*PRUNE:NAME) |
(*PRUNE) or (*PRUNE:NAME) |
| .sp |
.sp |
| This verb causes the match to fail at the current starting position in the |
This verb causes the match to fail at the current starting position in the |
| subject if the rest of the pattern does not match. If the pattern is | subject if there is a later matching failure that causes backtracking to reach |
| unanchored, the normal "bumpalong" advance to the next starting character then | it. If the pattern is unanchored, the normal "bumpalong" advance to the next |
| happens. Backtracking can occur as usual to the left of (*PRUNE), before it is | starting character then happens. Backtracking can occur as usual to the left of |
| reached, or when matching to the right of (*PRUNE), but if there is no match to | (*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but |
| the right, backtracking cannot cross (*PRUNE). In simple cases, the use of | if there is no match to the right, backtracking cannot cross (*PRUNE). In |
| (*PRUNE) is just an alternative to an atomic group or possessive quantifier, | simple cases, the use of (*PRUNE) is just an alternative to an atomic group or |
| but there are some uses of (*PRUNE) that cannot be expressed in any other way. | possessive quantifier, but there are some uses of (*PRUNE) that cannot be |
| The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an | expressed in any other way. In an anchored pattern (*PRUNE) has the same effect |
| anchored pattern (*PRUNE) has the same effect as (*COMMIT). | as (*COMMIT). |
| | .P |
| | The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). |
| | It is like (*MARK:NAME) in that the name is remembered for passing back to the |
| | caller. However, (*SKIP:NAME) searches only for names set with (*MARK). |
| .sp |
.sp |
| (*SKIP) |
(*SKIP) |
| .sp |
.sp |
|
Line 2815 instead of skipping on to "c".
|
Line 2968 instead of skipping on to "c".
|
| .sp |
.sp |
| (*SKIP:NAME) |
(*SKIP:NAME) |
| .sp |
.sp |
| When (*SKIP) has an associated name, its behaviour is modified. If the | When (*SKIP) has an associated name, its behaviour is modified. When it is |
| following pattern fails to match, the previous path through the pattern is | triggered, the previous path through the pattern is searched for the most |
| searched for the most recent (*MARK) that has the same name. If one is found, | recent (*MARK) that has the same name. If one is found, the "bumpalong" advance |
| the "bumpalong" advance is to the subject position that corresponds to that | is to the subject position that corresponds to that (*MARK) instead of to where |
| (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a | (*SKIP) was encountered. If no (*MARK) with a matching name is found, the |
| matching name is found, the (*SKIP) is ignored. | (*SKIP) is ignored. |
| | .P |
| | Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores |
| | names that are set by (*PRUNE:NAME) or (*THEN:NAME). |
| .sp |
.sp |
| (*THEN) or (*THEN:NAME) |
(*THEN) or (*THEN:NAME) |
| .sp |
.sp |
| This verb causes a skip to the next innermost alternative if the rest of the | This verb causes a skip to the next innermost alternative when backtracking |
| pattern does not match. That is, it cancels pending backtracking, but only | reaches it. That is, it cancels any further backtracking within the current |
| within the current alternative. Its name comes from the observation that it can | alternative. Its name comes from the observation that it can be used for a |
| be used for a pattern-based if-then-else block: | pattern-based if-then-else block: |
| .sp |
.sp |
| ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
| .sp |
.sp |
| If the COND1 pattern matches, FOO is tried (and possibly further items after |
If the COND1 pattern matches, FOO is tried (and possibly further items after |
| the end of the group if FOO succeeds); on failure, the matcher skips to the |
the end of the group if FOO succeeds); on failure, the matcher skips to the |
| second alternative and tries COND2, without backtracking into COND1. The | second alternative and tries COND2, without backtracking into COND1. If that |
| behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN). | succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no |
| If (*THEN) is not inside an alternation, it acts like (*PRUNE). | more alternatives, so there is a backtrack to whatever came before the entire |
| | group. If (*THEN) is not inside an alternation, it acts like (*PRUNE). |
| .P |
.P |
| Note that a subpattern that does not contain a | character is just a part of | The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). |
| the enclosing alternative; it is not a nested alternation with only one | It is like (*MARK:NAME) in that the name is remembered for passing back to the |
| | caller. However, (*SKIP:NAME) searches only for names set with (*MARK). |
| | .P |
| | A subpattern that does not contain a | character is just a part of the |
| | enclosing alternative; it is not a nested alternation with only one |
| alternative. The effect of (*THEN) extends beyond such a subpattern to the |
alternative. The effect of (*THEN) extends beyond such a subpattern to the |
| enclosing alternative. Consider this pattern, where A, B, etc. are complex |
enclosing alternative. Consider this pattern, where A, B, etc. are complex |
| pattern fragments that do not contain any | characters at this level: |
pattern fragments that do not contain any | characters at this level: |
|
Line 2857 in C, matching moves to (*FAIL), which causes the whol
|
Line 3018 in C, matching moves to (*FAIL), which causes the whol
|
| because there are no more alternatives to try. In this case, matching does now |
because there are no more alternatives to try. In this case, matching does now |
| backtrack into A. |
backtrack into A. |
| .P |
.P |
| Note also that a conditional subpattern is not considered as having two | Note that a conditional subpattern is not considered as having two |
| alternatives, because only one is ever used. In other words, the | character in |
alternatives, because only one is ever used. In other words, the | character in |
| a conditional subpattern has a different meaning. Ignoring white space, |
a conditional subpattern has a different meaning. Ignoring white space, |
| consider: |
consider: |
|
Line 2879 starting position, but allowing an advance to the next
|
Line 3040 starting position, but allowing an advance to the next
|
| unanchored pattern). (*SKIP) is similar, except that the advance may be more |
unanchored pattern). (*SKIP) is similar, except that the advance may be more |
| than one character. (*COMMIT) is the strongest, causing the entire match to |
than one character. (*COMMIT) is the strongest, causing the entire match to |
| fail. |
fail. |
| .P | . |
| If more than one such verb is present in a pattern, the "strongest" one wins. | . |
| For example, consider this pattern, where A, B, etc. are complex pattern | .SS "More than one backtracking verb" |
| fragments: | .rs |
| .sp |
.sp |
| (A(*COMMIT)B(*THEN)C|D) | If more than one backtracking verb is present in a pattern, the one that is |
| | backtracked onto first acts. For example, consider this pattern, where A, B, |
| | etc. are complex pattern fragments: |
| .sp |
.sp |
| Once A has matched, PCRE is committed to this match, at the current starting | (A(*COMMIT)B(*THEN)C|ABD) |
| position. If subsequently B matches, but C does not, the normal (*THEN) action | .sp |
| of trying the next alternative (that is, D) does not happen because (*COMMIT) | If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to |
| overrides. | fail. However, if A and B match, but C fails, the backtrack to (*THEN) causes |
| | the next alternative (ABD) to be tried. This behaviour is consistent, but is |
| | not always the same as Perl's. It means that if two or more backtracking verbs |
| | appear in succession, all but the last of them have no effect. Consider this |
| | example: |
| | .sp |
| | ...(*COMMIT)(*PRUNE)... |
| | .sp |
| | If there is a matching failure to the right, backtracking onto (*PRUNE) causes |
| | it to be triggered, and its action is taken. There can never be a backtrack |
| | onto (*COMMIT). |
| . |
. |
| . |
. |
| |
.\" HTML <a name="btrepeat"></a> |
| |
.SS "Backtracking verbs in repeated groups" |
| |
.rs |
| |
.sp |
| |
PCRE differs from Perl in its handling of backtracking verbs in repeated |
| |
groups. For example, consider: |
| |
.sp |
| |
/(a(*COMMIT)b)+ac/ |
| |
.sp |
| |
If the subject is "abac", Perl matches, but PCRE fails because the (*COMMIT) in |
| |
the second repeat of the group acts. |
| |
. |
| |
. |
| |
.\" HTML <a name="btassert"></a> |
| |
.SS "Backtracking verbs in assertions" |
| |
.rs |
| |
.sp |
| |
(*FAIL) in an assertion has its normal effect: it forces an immediate backtrack. |
| |
.P |
| |
(*ACCEPT) in a positive assertion causes the assertion to succeed without any |
| |
further processing. In a negative assertion, (*ACCEPT) causes the assertion to |
| |
fail without any further processing. |
| |
.P |
| |
The other backtracking verbs are not treated specially if they appear in a |
| |
positive assertion. In particular, (*THEN) skips to the next alternative in the |
| |
innermost enclosing group that has alternations, whether or not this is within |
| |
the assertion. |
| |
.P |
| |
Negative assertions are, however, different, in order to ensure that changing a |
| |
positive assertion into a negative assertion changes its result. Backtracking |
| |
into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative assertion to be true, |
| |
without considering any further alternative branches in the assertion. |
| |
Backtracking into (*THEN) causes it to skip to the next enclosing alternative |
| |
within the assertion (the normal behaviour), but if the assertion does not have |
| |
such an alternative, (*THEN) behaves like (*PRUNE). |
| |
. |
| |
. |
| |
.\" HTML <a name="btsub"></a> |
| |
.SS "Backtracking verbs in subroutines" |
| |
.rs |
| |
.sp |
| |
These behaviours occur whether or not the subpattern is called recursively. |
| |
Perl's treatment of subroutines is different in some cases. |
| |
.P |
| |
(*FAIL) in a subpattern called as a subroutine has its normal effect: it forces |
| |
an immediate backtrack. |
| |
.P |
| |
(*ACCEPT) in a subpattern called as a subroutine causes the subroutine match to |
| |
succeed without any further processing. Matching then continues after the |
| |
subroutine call. |
| |
.P |
| |
(*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine cause |
| |
the subroutine match to fail. |
| |
.P |
| |
(*THEN) skips to the next alternative in the innermost enclosing group within |
| |
the subpattern that has alternatives. If there is no such group within the |
| |
subpattern, (*THEN) causes the subroutine match to fail. |
| |
. |
| |
. |
| .SH "SEE ALSO" |
.SH "SEE ALSO" |
| .rs |
.rs |
| .sp |
.sp |
| \fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3), |
\fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3), |
| \fBpcresyntax\fP(3), \fBpcre\fP(3), \fBpcre16(3)\fP. | \fBpcresyntax\fP(3), \fBpcre\fP(3), \fBpcre16(3)\fP, \fBpcre32(3)\fP. |
| . |
. |
| . |
. |
| .SH AUTHOR |
.SH AUTHOR |
|
Line 2913 Cambridge CB2 3QH, England.
|
Line 3145 Cambridge CB2 3QH, England.
|
| .rs |
.rs |
| .sp |
.sp |
| .nf |
.nf |
| Last updated: 17 June 2012 | Last updated: 26 April 2013 |
| Copyright (c) 1997-2012 University of Cambridge. | Copyright (c) 1997-2013 University of Cambridge. |
| .fi |
.fi |