version 1.1.1.3, 2012/10/09 09:19:17
|
version 1.1.1.5, 2014/06/15 19:46:05
|
Line 1
|
Line 1
|
.TH PCREPATTERN 3 "04 May 2012" "PCRE 8.31" | .TH PCREPATTERN 3 "03 December 2013" "PCRE 8.34" |
.SH NAME |
.SH NAME |
PCRE - Perl-compatible regular expressions |
PCRE - Perl-compatible regular expressions |
.SH "PCRE REGULAR EXPRESSION DETAILS" |
.SH "PCRE REGULAR EXPRESSION DETAILS" |
Line 20 have copious examples. Jeffrey Friedl's "Mastering Reg
|
Line 20 have copious examples. Jeffrey Friedl's "Mastering Reg
|
published by O'Reilly, covers regular expressions in great detail. This |
published by O'Reilly, covers regular expressions in great detail. This |
description of PCRE's regular expressions is intended as reference material. |
description of PCRE's regular expressions is intended as reference material. |
.P |
.P |
|
This document discusses the patterns that are supported by PCRE when one of its |
|
main matching functions, \fBpcre_exec()\fP (8-bit) or \fBpcre[16|32]_exec()\fP |
|
(16- or 32-bit), is used. PCRE also has alternative matching functions, |
|
\fBpcre_dfa_exec()\fP and \fBpcre[16|32]_dfa_exec()\fP, which match using a |
|
different algorithm that is not Perl-compatible. Some of the features discussed |
|
below are not available when DFA matching is used. The advantages and |
|
disadvantages of the alternative functions, and how they differ from the normal |
|
functions, are discussed in the |
|
.\" HREF |
|
\fBpcrematching\fP |
|
.\" |
|
page. |
|
. |
|
. |
|
.SH "SPECIAL START-OF-PATTERN ITEMS" |
|
.rs |
|
.sp |
|
A number of options that can be passed to \fBpcre_compile()\fP can also be set |
|
by special items at the start of a pattern. These are not Perl-compatible, but |
|
are provided to make these options accessible to pattern writers who are not |
|
able to change the program that processes the pattern. Any number of these |
|
items may appear, but they must all be together right at the start of the |
|
pattern string, and the letters must be in upper case. |
|
. |
|
. |
|
.SS "UTF support" |
|
.rs |
|
.sp |
The original operation of PCRE was on strings of one-byte characters. However, |
The original operation of PCRE was on strings of one-byte characters. However, |
there is now also support for UTF-8 strings in the original library, and a | there is now also support for UTF-8 strings in the original library, an |
second library that supports 16-bit and UTF-16 character strings. To use these | extra library that supports 16-bit and UTF-16 character strings, and a |
| third library that supports 32-bit and UTF-32 character strings. To use these |
features, PCRE must be built to include appropriate support. When using UTF |
features, PCRE must be built to include appropriate support. When using UTF |
strings you must either call the compiling function with the PCRE_UTF8 or | strings you must either call the compiling function with the PCRE_UTF8, |
PCRE_UTF16 option, or the pattern must start with one of these special | PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of |
sequences: | these special sequences: |
.sp |
.sp |
(*UTF8) |
(*UTF8) |
(*UTF16) |
(*UTF16) |
|
(*UTF32) |
|
(*UTF) |
.sp |
.sp |
|
(*UTF) is a generic sequence that can be used with any of the libraries. |
Starting a pattern with such a sequence is equivalent to setting the relevant |
Starting a pattern with such a sequence is equivalent to setting the relevant |
option. This feature is not Perl-compatible. How setting a UTF mode affects | option. How setting a UTF mode affects pattern matching is mentioned in several |
pattern matching is mentioned in several places below. There is also a summary | places below. There is also a summary of features in the |
of features in the | |
.\" HREF |
.\" HREF |
\fBpcreunicode\fP |
\fBpcreunicode\fP |
.\" |
.\" |
page. |
page. |
.P |
.P |
Another special sequence that may appear at the start of a pattern or in | Some applications that allow their users to supply patterns may wish to |
combination with (*UTF8) or (*UTF16) is: | restrict them to non-UTF data for security reasons. If the PCRE_NEVER_UTF |
| option is set at compile time, (*UTF) etc. are not allowed, and their |
| appearance causes an error. |
| . |
| . |
| .SS "Unicode property support" |
| .rs |
.sp |
.sp |
(*UCP) | Another special sequence that may appear at the start of a pattern is (*UCP). |
.sp | |
This has the same effect as setting the PCRE_UCP option: it causes sequences |
This has the same effect as setting the PCRE_UCP option: it causes sequences |
such as \ed and \ew to use Unicode properties to determine character types, |
such as \ed and \ew to use Unicode properties to determine character types, |
instead of recognizing only characters with codes less than 128 via a lookup |
instead of recognizing only characters with codes less than 128 via a lookup |
table. |
table. |
.P | . |
| . |
| .SS "Disabling auto-possessification" |
| .rs |
| .sp |
| If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting |
| the PCRE_NO_AUTO_POSSESS option at compile time. This stops PCRE from making |
| quantifiers possessive when what follows cannot match the repeated item. For |
| example, by default a+b is treated as a++b. For more details, see the |
| .\" HREF |
| \fBpcreapi\fP |
| .\" |
| documentation. |
| . |
| . |
| .SS "Disabling start-up optimizations" |
| .rs |
| .sp |
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the |
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the |
PCRE_NO_START_OPTIMIZE option either at compile or matching time. There are | PCRE_NO_START_OPTIMIZE option either at compile or matching time. This disables |
also some more of these special sequences that are concerned with the handling | several optimizations for quickly reaching "no match" results. For more |
of newlines; they are described below. | details, see the |
.P | |
The remainder of this document discusses the patterns that are supported by | |
PCRE when one of its main matching functions, \fBpcre_exec()\fP (8-bit) or | |
\fBpcre16_exec()\fP (16-bit), is used. PCRE also has alternative matching | |
functions, \fBpcre_dfa_exec()\fP and \fBpcre16_dfa_exec()\fP, which match using | |
a different algorithm that is not Perl-compatible. Some of the features | |
discussed below are not available when DFA matching is used. The advantages and | |
disadvantages of the alternative functions, and how they differ from the normal | |
functions, are discussed in the | |
.\" HREF |
.\" HREF |
\fBpcrematching\fP | \fBpcreapi\fP |
.\" |
.\" |
page. | documentation. |
. |
. |
. |
. |
.\" HTML <a name="newlines"></a> |
.\" HTML <a name="newlines"></a> |
.SH "NEWLINE CONVENTIONS" | .SS "Newline conventions" |
.rs |
.rs |
.sp |
.sp |
PCRE supports five different conventions for indicating line breaks in |
PCRE supports five different conventions for indicating line breaks in |
Line 103 example, on a Unix system where LF is the default newl
|
Line 147 example, on a Unix system where LF is the default newl
|
(*CR)a.b |
(*CR)a.b |
.sp |
.sp |
changes the convention to CR. That pattern matches "a\enb" because LF is no |
changes the convention to CR. That pattern matches "a\enb" because LF is no |
longer a newline. Note that these special settings, which are not | longer a newline. If more than one of these settings is present, the last one |
Perl-compatible, are recognized only at the very start of a pattern, and that | |
they must be in upper case. If more than one of them is present, the last one | |
is used. |
is used. |
.P |
.P |
The newline convention affects the interpretation of the dot metacharacter when | The newline convention affects where the circumflex and dollar assertions are |
PCRE_DOTALL is not set, and also the behaviour of \eN. However, it does not | true. It also affects the interpretation of the dot metacharacter when |
affect what the \eR escape sequence matches. By default, this is any Unicode | PCRE_DOTALL is not set, and the behaviour of \eN. However, it does not affect |
newline sequence, for Perl compatibility. However, this can be changed; see the | what the \eR escape sequence matches. By default, this is any Unicode newline |
| sequence, for Perl compatibility. However, this can be changed; see the |
description of \eR in the section entitled |
description of \eR in the section entitled |
.\" HTML <a href="#newlineseq"> |
.\" HTML <a href="#newlineseq"> |
.\" </a> |
.\" </a> |
Line 121 below. A change of \eR setting can be combined with a
|
Line 164 below. A change of \eR setting can be combined with a
|
convention. |
convention. |
. |
. |
. |
. |
|
.SS "Setting match and recursion limits" |
|
.rs |
|
.sp |
|
The caller of \fBpcre_exec()\fP can set a limit on the number of times the |
|
internal \fBmatch()\fP function is called and on the maximum depth of |
|
recursive calls. These facilities are provided to catch runaway matches that |
|
are provoked by patterns with huge matching trees (a typical example is a |
|
pattern with nested unlimited repeats) and to avoid running out of system stack |
|
by too much recursion. When one of these limits is reached, \fBpcre_exec()\fP |
|
gives an error return. The limits can also be set by items at the start of the |
|
pattern of the form |
|
.sp |
|
(*LIMIT_MATCH=d) |
|
(*LIMIT_RECURSION=d) |
|
.sp |
|
where d is any number of decimal digits. However, the value of the setting must |
|
be less than the value set (or defaulted) by the caller of \fBpcre_exec()\fP |
|
for it to have any effect. In other words, the pattern writer can lower the |
|
limits set by the programmer, but not raise them. If there is more than one |
|
setting of one of these limits, the lower value is used. |
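
The rule for combining the caller's limit with start-of-pattern settings can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not PCRE's code; the function name effective_limit and its arguments are hypothetical.

```python
def effective_limit(caller_limit, pattern_values):
    """Return the limit that applies after start-of-pattern settings.

    caller_limit   -- value set (or defaulted) by the caller of pcre_exec()
    pattern_values -- the numbers d taken from (*LIMIT_MATCH=d) or
                      (*LIMIT_RECURSION=d) items, in pattern order
    Both names are hypothetical, chosen for this sketch.
    """
    limit = caller_limit
    for value in pattern_values:
        # A pattern setting has an effect only when it lowers the limit.
        if value < limit:
            limit = value
    return limit
```

With a caller limit of 1000, the settings 500, 2000 and 300 yield 300, while a lone setting of 5000 leaves the limit at 1000.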
|
. |
|
. |
|
.SH "EBCDIC CHARACTER CODES" |
|
.rs |
|
.sp |
|
PCRE can be compiled to run in an environment that uses EBCDIC as its character |
|
code rather than ASCII or Unicode (typically a mainframe system). In the |
|
sections below, character code values are ASCII or Unicode; in an EBCDIC |
|
environment these characters may have different code values, and there are no |
|
code points greater than 255. |
|
. |
|
. |
.SH "CHARACTERS AND METACHARACTERS" |
.SH "CHARACTERS AND METACHARACTERS" |
.rs |
.rs |
.sp |
.sp |
Line 198 In a UTF mode, only ASCII numbers and letters have any
|
Line 273 In a UTF mode, only ASCII numbers and letters have any
|
backslash. All other characters (in particular, those whose codepoints are |
backslash. All other characters (in particular, those whose codepoints are |
greater than 127) are treated as literals. |
greater than 127) are treated as literals. |
.P |
.P |
If a pattern is compiled with the PCRE_EXTENDED option, white space in the | If a pattern is compiled with the PCRE_EXTENDED option, most white space in the |
pattern (other than in a character class) and characters between a # outside | pattern (other than in a character class), and characters between a # outside a |
a character class and the next newline are ignored. An escaping backslash can | character class and the next newline, inclusive, are ignored. An escaping |
be used to include a white space or # character as part of the pattern. | backslash can be used to include a white space or # character as part of the |
| pattern. |
.P |
.P |
If you want to remove the special meaning from a sequence of characters, you |
If you want to remove the special meaning from a sequence of characters, you |
can do so by putting them between \eQ and \eE. This is different from Perl in |
can do so by putting them between \eQ and \eE. This is different from Perl in |
Line 241 one of the following escape sequences than the binary
|
Line 317 one of the following escape sequences than the binary
|
\en linefeed (hex 0A) |
\en linefeed (hex 0A) |
\er carriage return (hex 0D) |
\er carriage return (hex 0D) |
\et tab (hex 09) |
\et tab (hex 09) |
|
\e0dd character with octal code 0dd |
\eddd character with octal code ddd, or back reference |
\eddd character with octal code ddd, or back reference |
|
\eo{ddd..} character with octal code ddd.. |
\exhh character with hex code hh |
\exhh character with hex code hh |
\ex{hhh..} character with hex code hhh.. (non-JavaScript mode) |
\ex{hhh..} character with hex code hhh.. (non-JavaScript mode) |
\euhhhh character with hex code hhhh (JavaScript mode only) |
\euhhhh character with hex code hhhh (JavaScript mode only) |
.sp |
.sp |
The precise effect of \ecx is as follows: if x is a lower case letter, it | The precise effect of \ecx on ASCII characters is as follows: if x is a lower |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. | case letter, it is converted to upper case. Then bit 6 of the character (hex |
Thus \ecz becomes hex 1A (z is 7A), but \ec{ becomes hex 3B ({ is 7B), while | 40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A), |
\ec; becomes hex 7B (; is 3B). If the byte following \ec has a value greater | but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the |
than 127, a compile-time error occurs. This locks out non-ASCII characters in | data item (byte or 16-bit value) following \ec has a value greater than 127, a |
all modes. (When PCRE is compiled in EBCDIC mode, all byte values are valid. A | compile-time error occurs. This locks out non-ASCII characters in all modes. |
lower case letter is converted to upper case, and then the 0xc0 bits are | |
flipped.) | |
.P |
.P |
By default, after \ex, from zero to two hexadecimal digits are read (letters | The \ec facility was designed for use with ASCII characters, but with the |
can be in upper or lower case). Any number of hexadecimal digits may appear | extension to Unicode it is even less useful than it once was. It is, however, |
between \ex{ and }, but the character code is constrained as follows: | recognized when PCRE is compiled in EBCDIC mode, where data items are always |
.sp | bytes. In this mode, all values are valid after \ec. If the next character is a |
8-bit non-UTF mode less than 0x100 | lower case letter, it is converted to upper case. Then the 0xc0 bits of the |
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint | byte are inverted. Thus \ecA becomes hex 01, as in ASCII (A is C1), but because |
16-bit non-UTF mode less than 0x10000 | the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other |
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint | characters also generate different values. |
.sp | |
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called | |
"surrogate" codepoints). | |
.P |
.P |
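
The \ecx arithmetic described above can be checked with a short Python sketch. It is not PCRE's implementation, and the function names are hypothetical. The EBCDIC version takes a byte value, because Python strings are not EBCDIC; the lower case letter ranges it tests (0x81-0x89, 0x91-0x99, 0xA2-0xA9) are an assumption based on common EBCDIC code pages, whereas A = C1 and Z = E9 come from the text.

```python
def control_escape_ascii(x):
    """Value produced by \\cx for an ASCII character x."""
    code = ord(x)
    if code > 127:
        # PCRE reports a compile-time error for non-ASCII data items.
        raise ValueError("character after \\c must have a value below 128")
    if "a" <= x <= "z":
        code = ord(x.upper())   # lower case is converted to upper case
    return code ^ 0x40          # then bit 6 (hex 40) is inverted


def control_escape_ebcdic(byte):
    """Value produced by \\cx in EBCDIC mode, given the EBCDIC byte for x."""
    # Assumed lower case letter positions; upper case sits 0x40 higher.
    if 0x81 <= byte <= 0x89 or 0x91 <= byte <= 0x99 or 0xA2 <= byte <= 0xA9:
        byte |= 0x40
    return byte ^ 0xC0          # the 0xc0 bits are inverted
```

The sketch reproduces the values quoted above: \ecz gives hex 1A, \ec{ gives hex 3B and \ec; gives hex 7B in ASCII, while in EBCDIC the byte C1 (A) gives hex 01 and E9 (Z) gives hex 29.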
If characters other than hexadecimal digits appear between \ex{ and }, or if |
|
there is no terminating }, this form of escape is not recognized. Instead, the |
|
initial \ex will be interpreted as a basic hexadecimal escape, with no |
|
following digits, giving a character whose value is zero. |
|
.P |
|
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is |
|
as just described only when it is followed by two hexadecimal digits. |
|
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
|
code points greater than 255 is provided by \eu, which must be followed by |
|
four hexadecimal digits; otherwise it matches a literal "u" character. |
|
Character codes specified by \eu in JavaScript mode are constrained in the same |
|
way as those specified by \ex in non-JavaScript mode. |
|
.P |
|
Characters whose value is less than 256 can be defined by either of the two |
|
syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the |
|
way they are handled. For example, \exdc is exactly the same as \ex{dc} (or |
|
\eu00dc in JavaScript mode). |
|
.P |
|
After \e0 up to two further octal digits are read. If there are fewer than two |
After \e0 up to two further octal digits are read. If there are fewer than two |
digits, just those that are present are used. Thus the sequence \e0\ex\e07 |
digits, just those that are present are used. Thus the sequence \e0\ex\e07 |
specifies two binary zeros followed by a BEL character (code value 7). Make |
specifies two binary zeros followed by a BEL character (code value 7). Make |
sure you supply two digits after the initial zero if the pattern character that |
sure you supply two digits after the initial zero if the pattern character that |
follows is itself an octal digit. |
follows is itself an octal digit. |
.P |
.P |
The handling of a backslash followed by a digit other than 0 is complicated. | The escape \eo must be followed by a sequence of octal digits, enclosed in |
Outside a character class, PCRE reads it and any following digits as a decimal | braces. An error occurs if this is not the case. This escape is a recent |
number. If the number is less than 10, or if there have been at least that many | addition to Perl; it provides a way of specifying character code points as octal |
| numbers greater than 0777, and it also allows octal numbers and back references |
| to be unambiguously specified. |
| .P |
| For greater clarity and unambiguity, it is best to avoid following \e by a |
| digit greater than zero. Instead, use \eo{} or \ex{} to specify character |
| numbers, and \eg{} to specify back references. The following paragraphs |
| describe the old, ambiguous syntax. |
| .P |
| The handling of a backslash followed by a digit other than 0 is complicated, |
| and Perl has changed in recent releases, causing PCRE also to change. Outside a |
| character class, PCRE reads the digit and any following digits as a decimal |
| number. If the number is less than 8, or if there have been at least that many |
previous capturing left parentheses in the expression, the entire sequence is |
previous capturing left parentheses in the expression, the entire sequence is |
taken as a \fIback reference\fP. A description of how this works is given |
taken as a \fIback reference\fP. A description of how this works is given |
.\" HTML <a href="#backreferences"> |
.\" HTML <a href="#backreferences"> |
Line 306 following the discussion of
|
Line 373 following the discussion of
|
parenthesized subpatterns. |
parenthesized subpatterns. |
.\" |
.\" |
.P |
.P |
Inside a character class, or if the decimal number is greater than 9 and there | Inside a character class, or if the decimal number following \e is greater than |
have not been that many capturing subpatterns, PCRE re-reads up to three octal | 7 and there have not been that many capturing subpatterns, PCRE handles \e8 and |
digits following the backslash, and uses them to generate a data character. Any | \e9 as the literal characters "8" and "9", and otherwise re-reads up to three |
subsequent digits stand for themselves. The value of the character is | octal digits following the backslash, using them to generate a data character. |
constrained in the same way as characters specified in hexadecimal. | Any subsequent digits stand for themselves. For example: |
For example: | |
.sp |
.sp |
\e040 is another way of writing a space | \e040 is another way of writing an ASCII space |
.\" JOIN |
.\" JOIN |
\e40 is the same, provided there are fewer than 40 |
\e40 is the same, provided there are fewer than 40 |
previous capturing subpatterns |
previous capturing subpatterns |
Line 330 For example:
|
Line 396 For example:
|
\e377 might be a back reference, otherwise |
\e377 might be a back reference, otherwise |
the value 255 (decimal) |
the value 255 (decimal) |
.\" JOIN |
.\" JOIN |
\e81 is either a back reference, or a binary zero | \e81 is either a back reference, or the two |
followed by the two characters "8" and "1" | characters "8" and "1" |
.sp |
.sp |
Note that octal values of 100 or greater must not be introduced by a leading | Note that octal values of 100 or greater that are specified using this syntax |
zero, because no more than three octal digits are ever read. | must not be introduced by a leading zero, because no more than three octal |
| digits are ever read. |
.P |
.P |
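
The 8.34 rules for a backslash followed by a non-zero digit can be summarized in a small Python sketch. It follows the description above and is not PCRE's parser; the function name, its arguments and the shape of its result are hypothetical. The digits argument is the run of digits after the backslash, starting with a digit from 1 to 9, and group_count is the number of capturing left parentheses seen so far.

```python
import re


def interpret_backslash_digits(digits, group_count, in_class=False):
    """Classify backslash + digits; returns (kind, value, leftover digits)."""
    number = int(digits)  # the digits are first read as a decimal number
    if not in_class and (number < 8 or number <= group_count):
        return ("backref", number, "")
    if digits[0] in "89":
        # \8 and \9 are the literal characters "8" and "9".
        return ("literal", digits[0], digits[1:])
    # Otherwise re-read up to three octal digits as a data character;
    # any subsequent digits stand for themselves.
    octal = re.match(r"[0-7]{1,3}", digits).group()
    return ("char", int(octal, 8), digits[len(octal):])
```

For instance, "40" with three capturing groups gives the character with code 32, "81" with no groups gives the literal "8" followed by "1", and "7" is always a back reference outside a character class.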
|
By default, after \ex that is not followed by {, from zero to two hexadecimal |
|
digits are read (letters can be in upper or lower case). Any number of |
|
hexadecimal digits may appear between \ex{ and }. If a character other than |
|
a hexadecimal digit appears between \ex{ and }, or if there is no terminating |
|
}, an error occurs. |
|
.P |
|
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is |
|
as just described only when it is followed by two hexadecimal digits. |
|
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
|
code points greater than 255 is provided by \eu, which must be followed by |
|
four hexadecimal digits; otherwise it matches a literal "u" character. |
|
.P |
|
Characters whose value is less than 256 can be defined by either of the two |
|
syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the |
|
way they are handled. For example, \exdc is exactly the same as \ex{dc} (or |
|
\eu00dc in JavaScript mode). |
|
. |
|
. |
|
.SS "Constraints on character values" |
|
.rs |
|
.sp |
|
Characters that are specified using octal or hexadecimal numbers are |
|
limited to certain values, as follows: |
|
.sp |
|
8-bit non-UTF mode less than 0x100 |
|
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint |
|
16-bit non-UTF mode less than 0x10000 |
|
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint |
|
32-bit non-UTF mode less than 0x100000000 |
|
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint |
|
.sp |
|
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called |
|
"surrogate" codepoints), and 0xffef. |
|
. |
|
. |
|
.SS "Escape sequences in character classes" |
|
.rs |
|
.sp |
All the sequences that define a single character value can be used both inside |
All the sequences that define a single character value can be used both inside |
and outside character classes. In addition, inside a character class, \eb is |
and outside character classes. In addition, inside a character class, \eb is |
interpreted as the backspace character (hex 08). |
interpreted as the backspace character (hex 08). |
Line 426 classes. They each match one character of the appropri
|
Line 531 classes. They each match one character of the appropri
|
matching point is at the end of the subject string, all of them fail, because |
matching point is at the end of the subject string, all of them fail, because |
there is no character to match. |
there is no character to match. |
.P |
.P |
For compatibility with Perl, \es does not match the VT character (code 11). | For compatibility with Perl, \es formerly did not match the VT character (code |
This makes it different from the POSIX "space" class. The \es characters | 11), which made it different from the POSIX "space" class. However, Perl |
are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is | added VT at release 5.18, and PCRE followed suit at release 8.34. The default |
included in a Perl script, \es may match the VT character. In PCRE, it never | \es characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space |
does. | (32), which are defined as white space in the "C" locale. This list may vary if |
| locale-specific matching is taking place. For example, in some locales the |
| "non-breaking space" character (\exA0) is recognized as white space, and in |
| others the VT character is not. |
.P |
.P |
A "word" character is an underscore or any character that is a letter or digit. |
A "word" character is an underscore or any character that is a letter or digit. |
By default, the definition of letters and digits is controlled by PCRE's |
By default, the definition of letters and digits is controlled by PCRE's |
Line 445 in the
|
Line 553 in the
|
\fBpcreapi\fP |
\fBpcreapi\fP |
.\" |
.\" |
page). For example, in a French locale such as "fr_FR" in Unix-like systems, |
page). For example, in a French locale such as "fr_FR" in Unix-like systems, |
or "french" in Windows, some character codes greater than 128 are used for | or "french" in Windows, some character codes greater than 127 are used for |
accented letters, and these are then matched by \ew. The use of locales with |
accented letters, and these are then matched by \ew. The use of locales with |
Unicode is discouraged. |
Unicode is discouraged. |
.P |
.P |
By default, in a UTF mode, characters with values greater than 128 never match | By default, characters whose code points are greater than 127 never match \ed, |
\ed, \es, or \ew, and always match \eD, \eS, and \eW. These sequences retain | \es, or \ew, and always match \eD, \eS, and \eW, although this may vary for |
their original meanings from before UTF support was available, mainly for | characters in the range 128-255 when locale-specific matching is happening. |
efficiency reasons. However, if PCRE is compiled with Unicode property support, | These escape sequences retain their original meanings from before Unicode |
and the PCRE_UCP option is set, the behaviour is changed so that Unicode | support was available, mainly for efficiency reasons. If PCRE is compiled with |
properties are used to determine character types, as follows: | Unicode property support, and the PCRE_UCP option is set, the behaviour is |
| changed so that Unicode properties are used to determine character types, as |
| follows: |
.sp |
.sp |
\ed any character that \ep{Nd} matches (decimal digit) | \ed any character that matches \ep{Nd} (decimal digit) |
\es any character that \ep{Z} matches, plus HT, LF, FF, CR | \es any character that matches \ep{Z} or \eh or \ev |
\ew any character that \ep{L} or \ep{N} matches, plus underscore | \ew any character that matches \ep{L} or \ep{N}, plus underscore |
.sp |
.sp |
The upper case escapes match the inverse sets of characters. Note that \ed |
The upper case escapes match the inverse sets of characters. Note that \ed |
matches only decimal digits, whereas \ew matches any Unicode digit, as well as |
matches only decimal digits, whereas \ew matches any Unicode digit, as well as |
Line 468 is noticeably slower when PCRE_UCP is set.
|
Line 578 is noticeably slower when PCRE_UCP is set.
|
.P |
.P |
The sequences \eh, \eH, \ev, and \eV are features that were added to Perl at |
The sequences \eh, \eH, \ev, and \eV are features that were added to Perl at |
release 5.10. In contrast to the other sequences, which match only ASCII |
release 5.10. In contrast to the other sequences, which match only ASCII |
characters by default, these always match certain high-valued codepoints, | characters by default, these always match certain high-valued code points, |
whether or not PCRE_UCP is set. The horizontal space characters are: |
whether or not PCRE_UCP is set. The horizontal space characters are: |
.sp |
.sp |
U+0009 Horizontal tab | U+0009 Horizontal tab (HT) |
U+0020 Space |
U+0020 Space |
U+00A0 Non-break space |
U+00A0 Non-break space |
U+1680 Ogham space mark |
U+1680 Ogham space mark |
Line 493 whether or not PCRE_UCP is set. The horizontal space c
|
Line 603 whether or not PCRE_UCP is set. The horizontal space c
|
.sp |
.sp |
The vertical space characters are: |
The vertical space characters are: |
.sp |
.sp |
U+000A Linefeed | U+000A Linefeed (LF) |
U+000B Vertical tab | U+000B Vertical tab (VT) |
U+000C Form feed | U+000C Form feed (FF) |
U+000D Carriage return | U+000D Carriage return (CR) |
U+0085 Next line | U+0085 Next line (NEL) |
U+2028 Line separator |
U+2028 Line separator |
U+2029 Paragraph separator |
U+2029 Paragraph separator |
.sp |
.sp |
Line 551 change of newline convention; for example, a pattern c
|
Line 661 change of newline convention; for example, a pattern c
|
.sp |
.sp |
(*ANY)(*BSR_ANYCRLF) |
(*ANY)(*BSR_ANYCRLF) |
.sp |
.sp |
They can also be combined with the (*UTF8), (*UTF16), or (*UCP) special | They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) or |
sequences. Inside a character class, \eR is treated as an unrecognized escape | (*UCP) special sequences. Inside a character class, \eR is treated as an |
sequence, and so matches the letter "R" by default, but causes an error if | unrecognized escape sequence, and so matches the letter "R" by default, but |
PCRE_EXTRA is set. | causes an error if PCRE_EXTRA is set. |
. |
. |
. |
. |
.\" HTML <a name="uniextseq"></a> |
.\" HTML <a name="uniextseq"></a> |
Line 569 The extra escape sequences are:
|
Line 679 The extra escape sequences are:
|
.sp |
.sp |
\ep{\fIxx\fP} a character with the \fIxx\fP property |
\ep{\fIxx\fP} a character with the \fIxx\fP property |
\eP{\fIxx\fP} a character without the \fIxx\fP property |
\eP{\fIxx\fP} a character without the \fIxx\fP property |
\eX an extended Unicode sequence | \eX a Unicode extended grapheme cluster |
.sp |
.sp |
The property names represented by \fIxx\fP above are limited to the Unicode |
The property names represented by \fIxx\fP above are limited to the Unicode |
script names, the general category properties, "Any", which matches any |
script names, the general category properties, "Any", which matches any |
Line 762 a modifier or "other".
|
Line 872 a modifier or "other".
|
The Cs (Surrogate) property applies only to characters in the range U+D800 to |
The Cs (Surrogate) property applies only to characters in the range U+D800 to |
U+DFFF. Such characters are not valid in Unicode strings and so |
U+DFFF. Such characters are not valid in Unicode strings and so |
cannot be tested by PCRE, unless UTF validity checking has been turned off |
cannot be tested by PCRE, unless UTF validity checking has been turned off |
(see the discussion of PCRE_NO_UTF8_CHECK and PCRE_NO_UTF16_CHECK in the | (see the discussion of PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK and |
| PCRE_NO_UTF32_CHECK in the |
.\" HREF |
.\" HREF |
\fBpcreapi\fP |
\fBpcreapi\fP |
.\" |
.\" |
Line 777 Instead, this property is assumed for any code point t
|
Line 888 Instead, this property is assumed for any code point t
|
Unicode table. |
Unicode table. |
.P |
.P |
Specifying caseless matching does not affect these escape sequences. For |
Specifying caseless matching does not affect these escape sequences. For |
example, \ep{Lu} always matches only upper case letters. | example, \ep{Lu} always matches only upper case letters. This is different from |
| the behaviour of current versions of Perl. |
.P |
.P |
The \eX escape matches any number of Unicode characters that form an extended | Matching characters by Unicode property is not fast, because PCRE has to do a |
Unicode sequence. \eX is equivalent to | multistage table lookup in order to find a character's property. That is why |
| the traditional escape sequences such as \ed and \ew do not use Unicode |
| properties in PCRE by default, though you can make them do so by setting the |
| PCRE_UCP option or by starting the pattern with (*UCP). |
| . |
| . |
| .SS Extended grapheme clusters |
| .rs |
.sp |
.sp |
(?>\ePM\epM*) | The \eX escape matches any number of Unicode characters that form an "extended |
.sp | grapheme cluster", and treats the sequence as an atomic group |
That is, it matches a character without the "mark" property, followed by zero | |
or more characters with the "mark" property, and treats the sequence as an | |
atomic group | |
.\" HTML <a href="#atomicgroup"> |
.\" HTML <a href="#atomicgroup"> |
.\" </a> |
.\" </a> |
(see below). |
(see below). |
.\" |
.\" |
Characters with the "mark" property are typically accents that affect the | Up to and including release 8.31, PCRE matched an earlier, simpler definition |
preceding character. None of them have codepoints less than 256, so in | that was equivalent to |
8-bit non-UTF-8 mode \eX matches any one character. | .sp |
| (?>\ePM\epM*) |
| .sp |
| That is, it matched a character without the "mark" property, followed by zero |
| or more characters with the "mark" property. Characters with the "mark" |
| property are typically non-spacing accents that affect the preceding character. |
.P |
.P |
Note that recent versions of Perl have changed \eX to match what Unicode calls | This simple definition was extended in Unicode to include more complicated |
an "extended grapheme cluster", which has a more complicated definition. | kinds of composite character by giving each character a grapheme breaking |
| property, and creating rules that use these properties to define the boundaries |
| of extended grapheme clusters. In releases of PCRE later than 8.31, \eX matches |
| one of these clusters. |
.P |
.P |
Matching characters by Unicode property is not fast, because PCRE has to search | \eX always matches at least one character. Then it decides whether to add |
a structure that contains data for over fifteen thousand characters. That is | additional characters according to the following rules for ending a cluster: |
why the traditional escape sequences such as \ed and \ew do not use Unicode | .P |
properties in PCRE by default, though you can make them do so by setting the | 1. End at the end of the subject string. |
PCRE_UCP option or by starting the pattern with (*UCP). | .P |
| 2. Do not end between CR and LF; otherwise end after any control character. |
| .P |
| 3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters |
| are of five types: L, V, T, LV, and LVT. An L character may be followed by an |
| L, V, LV, or LVT character; an LV or V character may be followed by a V or T |
| character; an LVT or T character may be followed only by a T character. |
| .P |
| 4. Do not end before extending characters or spacing marks. Characters with |
| the "mark" property always have the "extend" grapheme breaking property. |
| .P |
| 5. Do not end after prepend characters. |
| .P |
| 6. Otherwise, end the cluster. |
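Rule 3 above can be read as a small transition table. The following is a minimal sketch in Python that only transcribes the Hangul rule as stated in the text; the names HANGUL_MAY_FOLLOW and hangul_keeps_cluster are hypothetical and are not part of PCRE, and the sketch does not look up a character's grapheme breaking property.

```python
# Allowed successors inside a Hangul syllable sequence, transcribed from
# rule 3. Keys and values are the five Hangul types named in the text.
# (Illustrative table only; not PCRE's internal data.)
HANGUL_MAY_FOLLOW = {
    "L": {"L", "V", "LV", "LVT"},
    "V": {"V", "T"},
    "LV": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}


def hangul_keeps_cluster(prev_type, next_type):
    """Return True if rule 3 says the cluster must not end between the
    two given Hangul types (hypothetical helper)."""
    return next_type in HANGUL_MAY_FOLLOW.get(prev_type, set())
```

For instance, an L followed by an LV stays in one cluster, whereas a T followed by a V does not fall under rule 3, so the later rules decide.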
. |
. |
. |
. |
.\" HTML <a name="extraprops"></a> |
.\" HTML <a name="extraprops"></a> |
.SS PCRE's additional properties |
.SS PCRE's additional properties |
.rs |
.rs |
.sp |
.sp |
As well as the standard Unicode properties described in the previous | As well as the standard Unicode properties described above, PCRE supports four |
section, PCRE supports four more that make it possible to convert traditional | more that make it possible to convert traditional escape sequences such as \ew |
escape sequences such as \ew and \es and POSIX character classes to use Unicode | and \es to use Unicode properties. PCRE uses these non-standard, non-Perl |
properties. PCRE uses these non-standard, non-Perl properties internally when | properties internally when PCRE_UCP is set. However, they may also be used |
PCRE_UCP is set. They are: | explicitly. These properties are: |
.sp |
.sp |
Xan Any alphanumeric character |
Xan Any alphanumeric character |
Xps Any POSIX space character |
Xps Any POSIX space character |
Line 823 PCRE_UCP is set. They are:
|
Line 960 PCRE_UCP is set. They are:
|
Xan matches characters that have either the L (letter) or the N (number) |
Xan matches characters that have either the L (letter) or the N (number) |
property. Xps matches the characters tab, linefeed, vertical tab, form feed, or |
property. Xps matches the characters tab, linefeed, vertical tab, form feed, or |
carriage return, and any other character that has the Z (separator) property. |
carriage return, and any other character that has the Z (separator) property. |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the | Xsp is the same as Xps; it used to exclude vertical tab, for Perl |
same characters as Xan, plus underscore. | compatibility, but Perl changed, and so PCRE followed at release 8.34. Xwd |
| matches the same characters as Xan, plus underscore. |
| .P |
| There is another non-standard property, Xuc, which matches any character that |
| can be represented by a Universal Character Name in C++ and other programming |
| languages. These are the characters $, @, ` (grave accent), and all characters |
| with Unicode code points greater than or equal to U+00A0, except for the |
| surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are |
| excluded. (Universal Character Names are of the form \euHHHH or \eUHHHHHHHH |
| where H is a hexadecimal digit. Note that the Xuc property does not match these |
| sequences but the characters that they represent.) |
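The set of characters described for Xuc is simple enough to express as a predicate. This is a sketch in Python, written directly from the description above rather than from PCRE's source; the function name is_xuc is an assumption made for illustration.

```python
def is_xuc(ch):
    """Return True if the single character ch is in the set described for
    the Xuc property: $, @, grave accent, or any code point >= U+00A0
    that is not a surrogate (U+D800 to U+DFFF). Hypothetical helper."""
    cp = ord(ch)
    if ch in "$@`":
        return True
    return cp >= 0xA0 and not (0xD800 <= cp <= 0xDFFF)
```

Under this reading, "$" and U+00E9 are accepted, while "A" and "_" are rejected because they are base ASCII characters outside the three listed exceptions.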
. |
. |
. |
. |
.\" HTML <a name="resetmatchstart"></a> |
.\" HTML <a name="resetmatchstart"></a> |
Line 930 regular expression.
|
Line 1077 regular expression.
|
.SH "CIRCUMFLEX AND DOLLAR" |
.SH "CIRCUMFLEX AND DOLLAR" |
.rs |
.rs |
.sp |
.sp |
|
The circumflex and dollar metacharacters are zero-width assertions. That is, |
|
they test for a particular condition being true without consuming any |
|
characters from the subject string. |
|
.P |
Outside a character class, in the default matching mode, the circumflex |
Outside a character class, in the default matching mode, the circumflex |
character is an assertion that is true only if the current matching point is | character is an assertion that is true only if the current matching point is at |
at the start of the subject string. If the \fIstartoffset\fP argument of | the start of the subject string. If the \fIstartoffset\fP argument of |
\fBpcre_exec()\fP is non-zero, circumflex can never match if the PCRE_MULTILINE |
\fBpcre_exec()\fP is non-zero, circumflex can never match if the PCRE_MULTILINE |
option is unset. Inside a character class, circumflex has an entirely different |
option is unset. Inside a character class, circumflex has an entirely different |
meaning |
meaning |
Line 949 constrained to match only at the start of the subject,
|
Line 1100 constrained to match only at the start of the subject,
|
"anchored" pattern. (There are also other constructs that can cause a pattern |
"anchored" pattern. (There are also other constructs that can cause a pattern |
to be anchored.) |
to be anchored.) |
.P |
.P |
A dollar character is an assertion that is true only if the current matching | The dollar character is an assertion that is true only if the current matching |
point is at the end of the subject string, or immediately before a newline | point is at the end of the subject string, or immediately before a newline at |
at the end of the string (by default). Dollar need not be the last character of | the end of the string (by default). Note, however, that it does not actually |
the pattern if a number of alternatives are involved, but it should be the last | match the newline. Dollar need not be the last character of the pattern if a |
item in any branch in which it appears. Dollar has no special meaning in a | number of alternatives are involved, but it should be the last item in any |
character class. | branch in which it appears. Dollar has no special meaning in a character class. |
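The default behaviour of dollar can be tried out with Python's re module. This is an illustration only: Python's engine is not PCRE, but its default treatment of $ (end of string, or just before a newline at the very end) is the same as the behaviour described here.

```python
import re

# Dollar asserts "end of subject, or just before a final newline".
m = re.search(r"abc$", "abc\n")
matched_text = m.group(0)   # the newline is not part of the match
matched_span = m.span()     # the match ends before the newline

# With two trailing newlines, the position after "abc" is neither the end
# of the string nor immediately before the final newline, so there is no
# match.
no_match = re.search(r"abc$", "abc\n\n")
```

Here matched_text is "abc" with span (0, 3), so the trailing newline is tested but not consumed, and no_match is None.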
.P |
.P |
The meaning of dollar can be changed so that it matches only at the very end of |
The meaning of dollar can be changed so that it matches only at the very end of |
the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This |
the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This |
Line 1015 name; PCRE does not support this.
|
Line 1166 name; PCRE does not support this.
|
.sp |
.sp |
Outside a character class, the escape sequence \eC matches any one data unit, |
Outside a character class, the escape sequence \eC matches any one data unit, |
whether or not a UTF mode is set. In the 8-bit library, one data unit is one |
whether or not a UTF mode is set. In the 8-bit library, one data unit is one |
byte; in the 16-bit library it is a 16-bit unit. Unlike a dot, \eC always | byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is |
| a 32-bit unit. Unlike a dot, \eC always |
matches line-ending characters. The feature is provided in Perl in order to |
matches line-ending characters. The feature is provided in Perl in order to |
match individual bytes in UTF-8 mode, but it is unclear how it can usefully be |
match individual bytes in UTF-8 mode, but it is unclear how it can usefully be |
used. Because \eC breaks up characters into individual data units, matching one |
used. Because \eC breaks up characters into individual data units, matching one |
unit with \eC in a UTF mode means that the rest of the string may start with a |
unit with \eC in a UTF mode means that the rest of the string may start with a |
malformed UTF character. This has undefined results, because PCRE assumes that |
malformed UTF character. This has undefined results, because PCRE assumes that |
it is dealing with valid UTF strings (and by default it checks this at the |
it is dealing with valid UTF strings (and by default it checks this at the |
start of processing unless the PCRE_NO_UTF8_CHECK or PCRE_NO_UTF16_CHECK option | start of processing unless the PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or |
is used). | PCRE_NO_UTF32_CHECK option is used). |
.P |
.P |
PCRE does not allow \eC to appear in lookbehind assertions |
PCRE does not allow \eC to appear in lookbehind assertions |
.\" HTML <a href="#lookbehind"> |
.\" HTML <a href="#lookbehind"> |
Line 1082 circumflex is not an assertion; it still consumes a ch
|
Line 1234 circumflex is not an assertion; it still consumes a ch
|
string, and therefore it fails if the current pointer is at the end of the |
string, and therefore it fails if the current pointer is at the end of the |
string. |
string. |
.P |
.P |
In UTF-8 (UTF-16) mode, characters with values greater than 255 (0xffff) can be | In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255 (0xffff) |
included in a class as a literal string of data units, or by using the \ex{ | can be included in a class as a literal string of data units, or by using the |
escaping mechanism. | \ex{ escaping mechanism. |
.P |
.P |
When caseless matching is set, any letters in a class represent both their |
When caseless matching is set, any letters in a class represent both their |
upper case and lower case versions, so for example, a caseless [aeiou] matches |
upper case and lower case versions, so for example, a caseless [aeiou] matches |
Line 1106 The minus (hyphen) character can be used to specify a
|
Line 1258 The minus (hyphen) character can be used to specify a
|
character class. For example, [d-m] matches any letter between d and m, |
character class. For example, [d-m] matches any letter between d and m, |
inclusive. If a minus character is required in a class, it must be escaped with |
inclusive. If a minus character is required in a class, it must be escaped with |
a backslash or appear in a position where it cannot be interpreted as |
a backslash or appear in a position where it cannot be interpreted as |
indicating a range, typically as the first or last character in the class. | indicating a range, typically as the first or last character in the class, or |
| immediately after a range. For example, [b-d-z] matches letters in the range b |
| to d, a hyphen character, or z. |
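The [b-d-z] example can be checked with a short Python sketch. Python's re module is a different engine from PCRE, but it also treats a hyphen that immediately follows a range as a literal, so it serves as an illustration of the rule rather than as a test of PCRE itself.

```python
import re

# A hyphen right after the range b-d cannot start another range, so it is
# taken literally; the class is: b, c, d, "-", z.
cls = re.compile(r"[b-d-z]")

candidates = "abcde-yz"
accepted = [c for c in candidates if cls.fullmatch(c)]
```

The list accepted contains b, c, d, the hyphen, and z; the letters a, e, and y are rejected because no range from d to z is formed.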
.P |
.P |
It is not possible to have the literal character "]" as the end character of a |
It is not possible to have the literal character "]" as the end character of a |
range. A pattern such as [W-]46] is interpreted as a class of two characters |
range. A pattern such as [W-]46] is interpreted as a class of two characters |
Line 1116 the end of range, so [W-\e]46] is interpreted as a cla
|
Line 1270 the end of range, so [W-\e]46] is interpreted as a cla
|
followed by two other characters. The octal or hexadecimal representation of |
followed by two other characters. The octal or hexadecimal representation of |
"]" can also be used to end a range. |
"]" can also be used to end a range. |
.P |
.P |
|
An error is generated if a POSIX character class (see below) or an escape |
|
sequence other than one that defines a single character appears at a point |
|
where a range ending character is expected. For example, [z-\exff] is valid, |
|
but [A-\ed] and [A-[:digit:]] are not. |
|
.P |
Ranges operate in the collating sequence of character values. They can also be |
Ranges operate in the collating sequence of character values. They can also be |
used for characters specified numerically, for example [\e000-\e037]. Ranges |
used for characters specified numerically, for example [\e000-\e037]. Ranges |
can include any characters that are valid for the current mode. |
can include any characters that are valid for the current mode. |
Line 1154 something AND NOT ...".
|
Line 1313 something AND NOT ...".
|
The only metacharacters that are recognized in character classes are backslash, |
The only metacharacters that are recognized in character classes are backslash, |
hyphen (only where it can be interpreted as specifying a range), circumflex |
hyphen (only where it can be interpreted as specifying a range), circumflex |
(only at the start), opening square bracket (only when it can be interpreted as |
(only at the start), opening square bracket (only when it can be interpreted as |
introducing a POSIX class name - see the next section), and the terminating | introducing a POSIX class name, or for a special compatibility feature - see |
closing square bracket. However, escaping other non-alphanumeric characters | the next two sections), and the terminating closing square bracket. However, |
does no harm. | escaping other non-alphanumeric characters does no harm. |
. |
. |
. |
. |
.SH "POSIX CHARACTER CLASSES" |
.SH "POSIX CHARACTER CLASSES" |
Line 1181 are:
|
Line 1340 are:
|
lower lower case letters |
lower lower case letters |
print printing characters, including space |
print printing characters, including space |
punct printing characters, excluding letters and digits and space |
punct printing characters, excluding letters and digits and space |
space white space (not quite the same as \es) | space white space (the same as \es from PCRE 8.34) |
upper upper case letters |
upper upper case letters |
word "word" characters (same as \ew) |
word "word" characters (same as \ew) |
xdigit hexadecimal digits |
xdigit hexadecimal digits |
.sp |
.sp |
The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and | The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), |
space (32). Notice that this list includes the VT character (code 11). This | and space (32). If locale-specific matching is taking place, the list of space |
makes "space" different to \es, which does not include VT (for Perl | characters may be different; there may be fewer or more of them. "Space" used |
compatibility). | to be different to \es, which did not include VT, for Perl compatibility. |
| However, Perl changed at release 5.18, and PCRE followed at release 8.34. |
| "Space" and \es now match the same set of characters. |
.P |
.P |
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl |
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl |
5.8. Another Perl extension is negation, which is indicated by a ^ character |
5.8. Another Perl extension is negation, which is indicated by a ^ character |
Line 1201 matches "1", "2", or any non-digit. PCRE (and Perl) al
|
Line 1362 matches "1", "2", or any non-digit. PCRE (and Perl) al
|
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not |
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not |
supported, and an error is given if they are encountered. |
supported, and an error is given if they are encountered. |
.P |
.P |
By default, in UTF modes, characters with values greater than 127 do not match | By default, characters with values greater than 127 do not match any of the |
any of the POSIX character classes. However, if the PCRE_UCP option is passed | POSIX character classes. However, if the PCRE_UCP option is passed to |
to \fBpcre_compile()\fP, some of the classes are changed so that Unicode | \fBpcre_compile()\fP, some of the classes are changed so that Unicode character |
character properties are used. This is achieved by replacing the POSIX classes | properties are used. This is achieved by replacing certain POSIX classes by |
by other sequences, as follows: | other sequences, as follows: |
.sp |
.sp |
[:alnum:] becomes \ep{Xan} |
[:alnum:] becomes \ep{Xan} |
[:alpha:] becomes \ep{L} |
[:alpha:] becomes \ep{L} |
Line 1216 by other sequences, as follows:
|
Line 1377 by other sequences, as follows:
|
[:upper:] becomes \ep{Lu} |
[:upper:] becomes \ep{Lu} |
[:word:] becomes \ep{Xwd} |
[:word:] becomes \ep{Xwd} |
.sp |
.sp |
Negated versions, such as [:^alpha:] use \eP instead of \ep. The other POSIX | Negated versions, such as [:^alpha:] use \eP instead of \ep. Three other POSIX |
classes are unchanged, and match only characters with code points less than | classes are handled specially in UCP mode: |
128. | .TP 10 |
| [:graph:] |
| This matches characters that have glyphs that mark the page when printed. In |
| Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf |
| properties, except for: |
| .sp |
| U+061C Arabic Letter Mark |
| U+180E Mongolian Vowel Separator |
| U+2066 - U+2069 Various "isolate"s |
| .sp |
| .TP 10 |
| [:print:] |
| This matches the same characters as [:graph:] plus space characters that are |
| not controls, that is, characters with the Zs property. |
| .TP 10 |
| [:punct:] |
| This matches all characters that have the Unicode P (punctuation) property, |
| plus those characters whose code points are less than 128 that have the S |
| (Symbol) property. |
| .P |
| The other POSIX classes are unchanged, and match only characters with code |
| points less than 128. |
. |
. |
. |
. |
|
.SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES" |
|
.rs |
|
.sp |
|
In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly |
|
syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of |
|
word". PCRE treats these items as follows: |
|
.sp |
|
[[:<:]] is converted to \eb(?=\ew) |
|
[[:>:]] is converted to \eb(?<=\ew) |
|
.sp |
|
Only these exact character sequences are recognized. A sequence such as |
|
[a[:<:]b] provokes an error for an unrecognized POSIX class name. This support is |
|
not compatible with Perl. It is provided to help migrations from other |
|
environments, and is best not used in any new patterns. Note that \eb matches |
|
at the start and the end of a word (see |
|
.\" HTML <a href="#smallassertions"> |
|
.\" </a> |
|
"Simple assertions" |
|
.\" |
|
above), and in a Perl-style pattern the preceding or following character |
|
normally shows which is wanted, without the need for the assertions that are |
|
used above in order to give exactly the POSIX behaviour. |
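The two replacement sequences can be tried on their own, since they use only \eb, a lookahead, and a lookbehind. The sketch below uses Python's re module, which does not recognize [[:<:]] or [[:>:]] itself, so it applies the converted forms directly; the variable names are illustrative.

```python
import re

# The sequences that the text says [[:<:]] and [[:>:]] are converted to.
START_OF_WORD = re.compile(r"\b(?=\w)")
END_OF_WORD = re.compile(r"\b(?<=\w)")

subject = "foo bar"

# Both patterns are zero-width, so each match is an empty string at a
# boundary position; record the offsets.
starts = [m.start() for m in START_OF_WORD.finditer(subject)]
ends = [m.start() for m in END_OF_WORD.finditer(subject)]
```

For "foo bar" the start-of-word offsets are 0 and 4 and the end-of-word offsets are 3 and 7, whereas a bare \eb would report all four positions.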
|
. |
|
. |
.SH "VERTICAL BAR" |
.SH "VERTICAL BAR" |
.rs |
.rs |
.sp |
.sp |
Line 1297 the section entitled
|
Line 1503 the section entitled
|
.\" </a> |
.\" </a> |
"Newline sequences" |
"Newline sequences" |
.\" |
.\" |
above. There are also the (*UTF8), (*UTF16), and (*UCP) leading sequences that | above. There are also the (*UTF8), (*UTF16), (*UTF32), and (*UCP) leading |
can be used to set UTF and Unicode property modes; they are equivalent to | sequences that can be used to set UTF and Unicode property modes; they are |
setting the PCRE_UTF8, PCRE_UTF16, and the PCRE_UCP options, respectively. | equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP |
| options, respectively. The (*UTF) sequence is a generic version that can be |
| used with any of the libraries. However, the application can set the |
| PCRE_NEVER_UTF option, which locks out the use of the (*UTF) sequences. |
. |
. |
. |
. |
.\" HTML <a name="subpattern"></a> |
.\" HTML <a name="subpattern"></a> |
Line 1435 conditions,
|
Line 1644 conditions,
|
.\" |
.\" |
can be made by name as well as by number. |
can be made by name as well as by number. |
.P |
.P |
Names consist of up to 32 alphanumeric characters and underscores. Named | Names consist of up to 32 alphanumeric characters and underscores, but must |
capturing parentheses are still allocated numbers as well as names, exactly as | start with a non-digit. Named capturing parentheses are still allocated numbers |
if the names were not present. The PCRE API provides function calls for | as well as names, exactly as if the names were not present. The PCRE API |
extracting the name-to-number translation table from a compiled pattern. There | provides function calls for extracting the name-to-number translation table |
is also a convenience function for extracting a captured substring by name. | from a compiled pattern. There is also a convenience function for extracting a |
| captured substring by name. |
.P |
.P |
By default, a name must be unique within a pattern, but it is possible to relax |
By default, a name must be unique within a pattern, but it is possible to relax |
this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate |
this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate |
Line 1465 for the first (and in this example, the only) subpatte
|
Line 1675 for the first (and in this example, the only) subpatte
|
matched. This saves searching to find which numbered subpattern it was. |
matched. This saves searching to find which numbered subpattern it was. |
.P |
.P |
If you make a back reference to a non-unique named subpattern from elsewhere in |
If you make a back reference to a non-unique named subpattern from elsewhere in |
the pattern, the one that corresponds to the first occurrence of the name is | the pattern, the subpatterns to which the name refers are checked in the order |
used. In the absence of duplicate numbers (see the previous section) this is | in which they appear in the overall pattern. The first one that is set is used |
the one with the lowest number. If you use a named reference in a condition | for the reference. For example, this pattern matches both "foofoo" and |
| "barbar" but not "foobar" or "barfoo": |
| .sp |
| (?:(?<n>foo)|(?<n>bar))\ek<n> |
| .sp |
| .P |
| If you make a subroutine call to a non-unique named subpattern, the one that |
| corresponds to the first occurrence of the name is used. In the absence of |
| duplicate numbers (see the previous section) this is the one with the lowest |
| number. |
| .P |
| If you use a named reference in a condition |
test (see the |
test (see the |
.\" |
.\" |
.\" HTML <a href="#conditions"> |
.\" HTML <a href="#conditions"> |
Line 1487 documentation.
|
Line 1708 documentation.
|
\fBWarning:\fP You cannot use different names to distinguish between two |
\fBWarning:\fP You cannot use different names to distinguish between two |
subpatterns with the same number because PCRE uses only the numbers when |
subpatterns with the same number because PCRE uses only the numbers when |
matching. For this reason, an error is given at compile time if different names |
matching. For this reason, an error is given at compile time if different names |
are given to subpatterns with the same number. However, you can give the same | are given to subpatterns with the same number. However, you can always give the |
name to subpatterns with the same number, even when PCRE_DUPNAMES is not set. | same name to subpatterns with the same number, even when PCRE_DUPNAMES is not |
| set. |
. |
. |
. |
. |
.SH REPETITION |
.SH REPETITION |
Line 1534 quantifier, but a literal string of four characters.
|
Line 1756 quantifier, but a literal string of four characters.
|
In UTF modes, quantifiers apply to characters rather than to individual data |
In UTF modes, quantifiers apply to characters rather than to individual data |
units. Thus, for example, \ex{100}{2} matches two characters, each of |
units. Thus, for example, \ex{100}{2} matches two characters, each of |
which is represented by a two-byte sequence in a UTF-8 string. Similarly, |
which is represented by a two-byte sequence in a UTF-8 string. Similarly, |
\eX{3} matches three Unicode extended sequences, each of which may be several | \eX{3} matches three Unicode extended grapheme clusters, each of which may be |
data units long (and they may be of different lengths). | several data units long (and they may be of different lengths). |
.P |
.P |
The quantifier {0} is permitted, causing the expression to behave as if the |
The quantifier {0} is permitted, causing the expression to behave as if the |
previous item and the quantifier were not present. This may be useful for |
previous item and the quantifier were not present. This may be useful for |
Line 1621 In cases where it is known that the subject string con
|
Line 1843 In cases where it is known that the subject string con
|
worth setting PCRE_DOTALL in order to obtain this optimization, or |
worth setting PCRE_DOTALL in order to obtain this optimization, or |
alternatively using ^ to indicate anchoring explicitly. |
alternatively using ^ to indicate anchoring explicitly. |
.P |
.P |
However, there is one situation where the optimization cannot be used. When .* | However, there are some cases where the optimization cannot be used. When .* |
is inside capturing parentheses that are the subject of a back reference |
is inside capturing parentheses that are the subject of a back reference |
elsewhere in the pattern, a match at the start may fail where a later one |
elsewhere in the pattern, a match at the start may fail where a later one |
succeeds. Consider, for example: |
succeeds. Consider, for example: |
Line 1631 succeeds. Consider, for example:
|
Line 1853 succeeds. Consider, for example:
|
If the subject is "xyz123abc123" the match point is the fourth character. For |
If the subject is "xyz123abc123" the match point is the fourth character. For |
this reason, such a pattern is not implicitly anchored. |
this reason, such a pattern is not implicitly anchored. |
.P |
.P |
|
Another case where implicit anchoring is not applied is when the leading .* is |
|
inside an atomic group. Once again, a match at the start may fail where a later |
|
one succeeds. Consider this pattern: |
|
.sp |
|
(?>.*?a)b |
|
.sp |
|
It matches "ab" in the subject "aab". The use of the backtracking control verbs |
|
(*PRUNE) and (*SKIP) also disables this optimization. |
|
.P |
When a capturing subpattern is repeated, the value captured is the substring |
When a capturing subpattern is repeated, the value captured is the substring |
that matched the final iteration. For example, after |
that matched the final iteration. For example, after |
.sp |
.sp |
Line 1899 except that it does not cause the current matching pos
|
Line 2130 except that it does not cause the current matching pos
|
Assertion subpatterns are not capturing subpatterns. If such an assertion |
Assertion subpatterns are not capturing subpatterns. If such an assertion |
contains capturing subpatterns within it, these are counted for the purposes of |
contains capturing subpatterns within it, these are counted for the purposes of |
numbering the capturing subpatterns in the whole pattern. However, substring |
numbering the capturing subpatterns in the whole pattern. However, substring |
capturing is carried out only for positive assertions, because it does not make | capturing is carried out only for positive assertions. (Perl sometimes, but not |
sense for negative assertions. | always, does do capturing in negative assertions.) |
.P |
.P |
For compatibility with Perl, assertion subpatterns may be repeated; though |
For compatibility with Perl, assertion subpatterns may be repeated; though |
it makes no sense to assert the same thing several times, the side effect of |
it makes no sense to assert the same thing several times, the side effect of |
Line 2150 This makes the fragment independent of the parentheses
|
Line 2381 This makes the fragment independent of the parentheses
|
.sp |
.sp |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used |
subpattern by name. For compatibility with earlier versions of PCRE, which had |
subpattern by name. For compatibility with earlier versions of PCRE, which had |
this facility before Perl, the syntax (?(name)...) is also recognized. However, | this facility before Perl, the syntax (?(name)...) is also recognized. |
there is a possible ambiguity with this syntax, because subpattern names may | |
consist entirely of digits. PCRE looks first for a named subpattern; if it | |
cannot find one and the name consists entirely of digits, PCRE looks for a | |
subpattern of that number, which must be greater than zero. Using subpattern | |
names that consist entirely of digits is not recommended. | |
.P |
.P |
Rewriting the above example to use a named subpattern gives this: |
Rewriting the above example to use a named subpattern gives this: |
.sp |
.sp |
Line 2552 same pair of parentheses when there is a repetition.
|
Line 2778 same pair of parentheses when there is a repetition.
|
PCRE provides a similar feature, but of course it cannot obey arbitrary Perl |
PCRE provides a similar feature, but of course it cannot obey arbitrary Perl |
code. The feature is called "callout". The caller of PCRE provides an external |
code. The feature is called "callout". The caller of PCRE provides an external |
function by putting its entry point in the global variable \fIpcre_callout\fP |
function by putting its entry point in the global variable \fIpcre_callout\fP |
(8-bit library) or \fIpcre16_callout\fP (16-bit library). By default, this | (8-bit library) or \fIpcre[16|32]_callout\fP (16-bit or 32-bit library). |
variable contains NULL, which disables all calling out. | By default, this variable contains NULL, which disables all calling out. |
.P |
.P |
Within a regular expression, (?C) indicates the points at which the external |
Within a regular expression, (?C) indicates the points at which the external |
function is to be called. If you want to identify different callout points, you |
function is to be called. If you want to identify different callout points, you |
Line 2564 For example, this pattern has two callout points:
|
Line 2790 For example, this pattern has two callout points:
|
.sp |
.sp |
If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, callouts are |
If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, callouts are |
automatically installed before each item in the pattern. They are all numbered |
automatically installed before each item in the pattern. They are all numbered |
255. | 255. If there is a conditional group in the pattern whose condition is an |
| assertion, an additional callout is inserted just before the condition. An |
| explicit callout may also be set at this position, as in this example: |
| .sp |
| (?(?C9)(?=a)abc|def) |
| .sp |
| Note that this applies only to assertion conditions, not to other types of |
| condition. |
.P |
.P |
During matching, when PCRE reaches a callout point, the external function is |
During matching, when PCRE reaches a callout point, the external function is |
called. It is provided with the number of the callout, the position in the |
called. It is provided with the number of the callout, the position in the |
pattern, and, optionally, one item of data originally supplied by the caller of |
pattern, and, optionally, one item of data originally supplied by the caller of |
the matching function. The callout function may cause matching to proceed, to |
the matching function. The callout function may cause matching to proceed, to |
backtrack, or to fail altogether. A complete description of the interface to | backtrack, or to fail altogether. |
the callout function is given in the | .P |
| By default, PCRE implements a number of optimizations at compile time and |
| matching time, and one side-effect is that sometimes callouts are skipped. If |
| you need all possible callouts to happen, you need to set options that disable |
| the relevant optimizations. More details, and a complete description of the |
| interface to the callout function, are given in the |
.\" HREF |
.\" HREF |
\fBpcrecallout\fP |
\fBpcrecallout\fP |
.\" |
.\" |
Line 2583 documentation.
|
Line 2821 documentation.
|
.rs |
.rs |
.sp |
.sp |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which |
are described in the Perl documentation as "experimental and subject to change | are still described in the Perl documentation as "experimental and subject to |
or removal in a future version of Perl". It goes on to say: "Their usage in | change or removal in a future version of Perl". It goes on to say: "Their usage |
production code should be noted to avoid problems during upgrades." The same | in production code should be noted to avoid problems during upgrades." The same |
remarks apply to the PCRE features described in this section. |
remarks apply to the PCRE features described in this section. |
.P |
.P |
|
The new verbs make use of what was previously invalid syntax: an opening |
|
parenthesis followed by an asterisk. They are generally of the form |
|
(*VERB) or (*VERB:NAME). Some may take either form, possibly behaving |
|
differently depending on whether or not a name is present. A name is any |
|
sequence of characters that does not include a closing parenthesis. The maximum |
|
length of name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit |
|
libraries. If the name is empty, that is, if the closing parenthesis |
|
immediately follows the colon, the effect is as if the colon were not there. |
|
Any number of these verbs may occur in a pattern. |
|
.P |
Since these verbs are specifically related to backtracking, most of them can be |
Since these verbs are specifically related to backtracking, most of them can be |
used only when the pattern is to be matched using one of the traditional |
used only when the pattern is to be matched using one of the traditional |
matching functions, which use a backtracking algorithm. With the exception of | matching functions, because these use a backtracking algorithm. With the |
(*FAIL), which behaves like a failing negative assertion, they cause an error | exception of (*FAIL), which behaves like a failing negative assertion, the |
if encountered by a DFA matching function. | backtracking control verbs cause an error if encountered by a DFA matching |
| function. |
.P |
.P |
If any of these verbs are used in an assertion or in a subpattern that is | The behaviour of these verbs in |
called as a subroutine (whether or not recursively), their effect is confined | .\" HTML <a href="#btrepeat"> |
to that subpattern; it does not extend to the surrounding pattern, with one | .\" </a> |
exception: the name from a (*MARK), (*PRUNE), or (*THEN) that is encountered in | repeated groups, |
a successful positive assertion \fIis\fP passed back when a match succeeds | .\" |
(compare capturing parentheses in assertions). Note that such subpatterns are | .\" HTML <a href="#btassert"> |
processed as anchored at the point where they are tested. Note also that Perl's | .\" </a> |
treatment of subroutines and assertions is different in some cases. | assertions, |
.P | .\" |
The new verbs make use of what was previously invalid syntax: an opening | and in |
parenthesis followed by an asterisk. They are generally of the form | .\" HTML <a href="#btsub"> |
(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, | .\" </a> |
depending on whether or not an argument is present. A name is any sequence of | subpatterns called as subroutines |
characters that does not include a closing parenthesis. The maximum length of | .\" |
name is 255 in the 8-bit library and 65535 in the 16-bit library. If the name | (whether or not recursively) is documented below. |
is empty, that is, if the closing parenthesis immediately follows the colon, | |
the effect is as if the colon were not there. Any number of these verbs may | |
occur in a pattern. | |
. |
. |
. |
. |
.\" HTML <a name="nooptimize"></a> |
.\" HTML <a name="nooptimize"></a> |
Line 2621 occur in a pattern.
|
Line 2867 occur in a pattern.
|
PCRE contains some optimizations that are used to speed up matching by running |
PCRE contains some optimizations that are used to speed up matching by running |
some checks at the start of each match attempt. For example, it may know the |
some checks at the start of each match attempt. For example, it may know the |
minimum length of matching subject, or that a particular character must be |
minimum length of matching subject, or that a particular character must be |
present. When one of these optimizations suppresses the running of a match, any | present. When one of these optimizations bypasses the running of a match, any |
included backtracking verbs will not, of course, be processed. You can suppress |
included backtracking verbs will not, of course, be processed. You can suppress |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the |
when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the |
Line 2652 followed by a name.
|
Line 2898 followed by a name.
|
This verb causes the match to end successfully, skipping the remainder of the |
This verb causes the match to end successfully, skipping the remainder of the |
pattern. However, when it is inside a subpattern that is called as a |
pattern. However, when it is inside a subpattern that is called as a |
subroutine, only that subpattern is ended successfully. Matching then continues |
subroutine, only that subpattern is ended successfully. Matching then continues |
at the outer level. If (*ACCEPT) is inside capturing parentheses, the data so | at the outer level. If (*ACCEPT) is triggered in a positive assertion, the |
far is captured. For example: | assertion succeeds; in a negative assertion, the assertion fails. |
| .P |
| If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For |
| example: |
.sp |
.sp |
A((?:A|B(*ACCEPT)|C)D) |
A((?:A|B(*ACCEPT)|C)D) |
.sp |
.sp |
Line 2686 starting point (see (*SKIP) below).
|
Line 2935 starting point (see (*SKIP) below).
|
A name is always required with this verb. There may be as many instances of |
A name is always required with this verb. There may be as many instances of |
(*MARK) as you like in a pattern, and their names do not have to be unique. |
(*MARK) as you like in a pattern, and their names do not have to be unique. |
.P |
.P |
When a match succeeds, the name of the last-encountered (*MARK) on the matching | When a match succeeds, the name of the last-encountered (*MARK:NAME), |
path is passed back to the caller as described in the section entitled | (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed back to the |
| caller as described in the section entitled |
.\" HTML <a href="pcreapi.html#extradata"> |
.\" HTML <a href="pcreapi.html#extradata"> |
.\" </a> |
.\" </a> |
"Extra data for \fBpcre_exec()\fP" |
"Extra data for \fBpcre_exec()\fP" |
Line 2712 indicates which of the two alternatives matched. This
|
Line 2962 indicates which of the two alternatives matched. This
|
of obtaining this information than putting each alternative in its own |
of obtaining this information than putting each alternative in its own |
capturing parentheses. |
capturing parentheses. |
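
For comparison, the capturing-parentheses approach mentioned above can be sketched as follows. This is only an illustration using Python's standard re module, which has no (*MARK) verb; it is not PCRE code. The group names A and B and the helper name which_alternative are assumptions chosen here to mirror mark names.

```python
import re

# Each alternative sits in its own named capturing group, so the
# name of the group that matched identifies the alternative.
# (A (*MARK:NAME) in PCRE reports this without extra captures.)
pattern = re.compile(r"(?P<A>XY)|(?P<B>XZ)")


def which_alternative(subject):
    """Return the name of the alternative that matched, or None."""
    m = pattern.search(subject)
    return m.lastgroup if m else None
```

Calling which_alternative("XZ") reports the second alternative, at the cost of one capturing group per alternative.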
.P |
.P |
If (*MARK) is encountered in a positive assertion, its name is recorded and | If a verb with a name is encountered in a positive assertion that is true, the |
passed back if it is the last-encountered. This does not happen for negative | name is recorded and passed back if it is the last-encountered. This does not |
assertions. | happen for negative assertions or failing positive assertions. |
.P |
.P |
After a partial match or a failed match, the name of the last encountered | After a partial match or a failed match, the last encountered name in the |
(*MARK) in the entire match process is returned. For example: | entire match process is returned. For example: |
.sp |
.sp |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
data> XP |
data> XP |
Line 2743 to ensure that the match is always attempted.
|
Line 2993 to ensure that the match is always attempted.
|
The following verbs do nothing when they are encountered. Matching continues |
The following verbs do nothing when they are encountered. Matching continues |
with what follows, but if there is no subsequent match, causing a backtrack to |
with what follows, but if there is no subsequent match, causing a backtrack to |
the verb, a failure is forced. That is, backtracking cannot pass to the left of |
the verb, a failure is forced. That is, backtracking cannot pass to the left of |
the verb. However, when one of these verbs appears inside an atomic group, its | the verb. However, when one of these verbs appears inside an atomic group or an |
effect is confined to that group, because once the group has been matched, | assertion that is true, its effect is confined to that group, because once the |
there is never any backtracking into it. In this situation, backtracking can | group has been matched, there is never any backtracking into it. In this |
"jump back" to the left of the entire atomic group. (Remember also, as stated | situation, backtracking can "jump back" to the left of the entire atomic group |
above, that this localization also applies in subroutine calls and assertions.) | or assertion. (Remember also, as stated above, that this localization also |
| applies in subroutine calls.) |
.P |
.P |
These verbs differ in exactly what kind of failure occurs when backtracking |
These verbs differ in exactly what kind of failure occurs when backtracking |
reaches them. | reaches them. The behaviour described below is what happens when the verb is |
| not in a subroutine or an assertion. Subsequent sections cover these special |
| cases. |
.sp |
.sp |
(*COMMIT) |
(*COMMIT) |
.sp |
.sp |
This verb, which may not be followed by a name, causes the whole match to fail |
This verb, which may not be followed by a name, causes the whole match to fail |
outright if the rest of the pattern does not match. Even if the pattern is | outright if there is a later matching failure that causes backtracking to reach |
unanchored, no further attempts to find a match by advancing the starting point | it. Even if the pattern is unanchored, no further attempts to find a match by |
take place. Once (*COMMIT) has been passed, \fBpcre_exec()\fP is committed to | advancing the starting point take place. If (*COMMIT) is the only backtracking |
finding a match at the current starting point, or not at all. For example: | verb that is encountered, once it has been passed \fBpcre_exec()\fP is |
| committed to finding a match at the current starting point, or not at all. For |
| example: |
.sp |
.sp |
a+(*COMMIT)b |
a+(*COMMIT)b |
.sp |
.sp |
Line 2767 dynamic anchor, or "I've started, so I must finish." T
|
Line 3022 dynamic anchor, or "I've started, so I must finish." T
|
recently passed (*MARK) in the path is passed back when (*COMMIT) forces a |
recently passed (*MARK) in the path is passed back when (*COMMIT) forces a |
match failure. |
match failure. |
.P |
.P |
|
If there is more than one backtracking verb in a pattern, a different one that |
|
follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a |
|
match does not always guarantee that a match must be at this starting point. |
|
.P |
Note that (*COMMIT) at the start of a pattern is not the same as an anchor, |
Note that (*COMMIT) at the start of a pattern is not the same as an anchor, |
unless PCRE's start-of-match optimizations are turned off, as shown in this |
unless PCRE's start-of-match optimizations are turned off, as shown in this |
\fBpcretest\fP example: |
\fBpcretest\fP example: |
Line 2786 starting points.
|
Line 3045 starting points.
|
(*PRUNE) or (*PRUNE:NAME) |
(*PRUNE) or (*PRUNE:NAME) |
.sp |
.sp |
This verb causes the match to fail at the current starting position in the |
This verb causes the match to fail at the current starting position in the |
subject if the rest of the pattern does not match. If the pattern is | subject if there is a later matching failure that causes backtracking to reach |
unanchored, the normal "bumpalong" advance to the next starting character then | it. If the pattern is unanchored, the normal "bumpalong" advance to the next |
happens. Backtracking can occur as usual to the left of (*PRUNE), before it is | starting character then happens. Backtracking can occur as usual to the left of |
reached, or when matching to the right of (*PRUNE), but if there is no match to | (*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but |
the right, backtracking cannot cross (*PRUNE). In simple cases, the use of | if there is no match to the right, backtracking cannot cross (*PRUNE). In |
(*PRUNE) is just an alternative to an atomic group or possessive quantifier, | simple cases, the use of (*PRUNE) is just an alternative to an atomic group or |
but there are some uses of (*PRUNE) that cannot be expressed in any other way. | possessive quantifier, but there are some uses of (*PRUNE) that cannot be |
The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an | expressed in any other way. In an anchored pattern (*PRUNE) has the same effect |
anchored pattern (*PRUNE) has the same effect as (*COMMIT). | as (*COMMIT). |
| .P |
| The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). |
| It is like (*MARK:NAME) in that the name is remembered for passing back to the |
| caller. However, (*SKIP:NAME) searches only for names set with (*MARK). |
.sp |
.sp |
(*SKIP) |
(*SKIP) |
.sp |
.sp |
Line 2815 instead of skipping on to "c".
|
Line 3078 instead of skipping on to "c".
|
.sp |
.sp |
(*SKIP:NAME) |
(*SKIP:NAME) |
.sp |
.sp |
When (*SKIP) has an associated name, its behaviour is modified. If the | When (*SKIP) has an associated name, its behaviour is modified. When it is |
following pattern fails to match, the previous path through the pattern is | triggered, the previous path through the pattern is searched for the most |
searched for the most recent (*MARK) that has the same name. If one is found, | recent (*MARK) that has the same name. If one is found, the "bumpalong" advance |
the "bumpalong" advance is to the subject position that corresponds to that | is to the subject position that corresponds to that (*MARK) instead of to where |
(*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a | (*SKIP) was encountered. If no (*MARK) with a matching name is found, the |
matching name is found, the (*SKIP) is ignored. | (*SKIP) is ignored. |
| .P |
| Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores |
| names that are set by (*PRUNE:NAME) or (*THEN:NAME). |
.sp |
.sp |
(*THEN) or (*THEN:NAME) |
(*THEN) or (*THEN:NAME) |
.sp |
.sp |
This verb causes a skip to the next innermost alternative if the rest of the | This verb causes a skip to the next innermost alternative when backtracking |
pattern does not match. That is, it cancels pending backtracking, but only | reaches it. That is, it cancels any further backtracking within the current |
within the current alternative. Its name comes from the observation that it can | alternative. Its name comes from the observation that it can be used for a |
be used for a pattern-based if-then-else block: | pattern-based if-then-else block: |
.sp |
.sp |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
.sp |
.sp |
If the COND1 pattern matches, FOO is tried (and possibly further items after |
If the COND1 pattern matches, FOO is tried (and possibly further items after |
the end of the group if FOO succeeds); on failure, the matcher skips to the |
the end of the group if FOO succeeds); on failure, the matcher skips to the |
second alternative and tries COND2, without backtracking into COND1. The | second alternative and tries COND2, without backtracking into COND1. If that |
behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN). | succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no |
If (*THEN) is not inside an alternation, it acts like (*PRUNE). | more alternatives, so there is a backtrack to whatever came before the entire |
| group. If (*THEN) is not inside an alternation, it acts like (*PRUNE). |
.P |
.P |
Note that a subpattern that does not contain a | character is just a part of | The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). |
the enclosing alternative; it is not a nested alternation with only one | It is like (*MARK:NAME) in that the name is remembered for passing back to the |
| caller. However, (*SKIP:NAME) searches only for names set with (*MARK). |
| .P |
| A subpattern that does not contain a | character is just a part of the |
| enclosing alternative; it is not a nested alternation with only one |
alternative. The effect of (*THEN) extends beyond such a subpattern to the |
alternative. The effect of (*THEN) extends beyond such a subpattern to the |
enclosing alternative. Consider this pattern, where A, B, etc. are complex |
enclosing alternative. Consider this pattern, where A, B, etc. are complex |
pattern fragments that do not contain any | characters at this level: |
pattern fragments that do not contain any | characters at this level: |
Line 2857 in C, matching moves to (*FAIL), which causes the whol
|
Line 3128 in C, matching moves to (*FAIL), which causes the whol
|
because there are no more alternatives to try. In this case, matching does now |
because there are no more alternatives to try. In this case, matching does now |
backtrack into A. |
backtrack into A. |
.P |
.P |
Note also that a conditional subpattern is not considered as having two | Note that a conditional subpattern is not considered as having two |
alternatives, because only one is ever used. In other words, the | character in |
alternatives, because only one is ever used. In other words, the | character in |
a conditional subpattern has a different meaning. Ignoring white space, |
a conditional subpattern has a different meaning. Ignoring white space, |
consider: |
consider: |
Line 2879 starting position, but allowing an advance to the next
|
Line 3150 starting position, but allowing an advance to the next
|
unanchored pattern). (*SKIP) is similar, except that the advance may be more |
unanchored pattern). (*SKIP) is similar, except that the advance may be more |
than one character. (*COMMIT) is the strongest, causing the entire match to |
than one character. (*COMMIT) is the strongest, causing the entire match to |
fail. |
fail. |
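
The difference in strength between these verbs can be sketched with a toy model. The code below is not PCRE and does not use any PCRE API; it hand-codes the single pattern a+(*VERB)b and records which starting positions are tried. The function name toy_match and its return shape are assumptions made for this illustration.

```python
def toy_match(subject, verb=None):
    """Toy model of matching a+(*VERB)b, unanchored.

    verb is None, "PRUNE", "SKIP" or "COMMIT".
    Returns ((start, end) or None, list of start positions tried).
    """
    tried = []
    start = 0
    while start < len(subject):
        tried.append(start)
        pos = start
        # Greedy a+ : consume the whole run of "a" characters.
        while pos < len(subject) and subject[pos] == "a":
            pos += 1
        reached_verb = pos > start  # the verb sits right after a+
        if reached_verb and pos < len(subject) and subject[pos] == "b":
            return (start, pos + 1), tried
        # Failure. Giving back some "a"s cannot help for this pattern,
        # because the character after a shorter run is "a", not "b".
        if reached_verb and verb == "COMMIT":
            return None, tried      # the entire match fails
        if reached_verb and verb == "SKIP":
            start = pos             # advance to where (*SKIP) was reached
        else:
            start += 1              # normal one-character bumpalong
    return None, tried
```

With the subject "aaacaab", the model tries starts 0, 1, 2, 3, 4 with no verb or with PRUNE, tries only 0, 3, 4 with SKIP, and gives up after start 0 with COMMIT.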
.P | . |
If more than one such verb is present in a pattern, the "strongest" one wins. | . |
For example, consider this pattern, where A, B, etc. are complex pattern | .SS "More than one backtracking verb" |
fragments: | .rs |
.sp |
.sp |
(A(*COMMIT)B(*THEN)C|D) | If more than one backtracking verb is present in a pattern, the one that is |
| backtracked onto first acts. For example, consider this pattern, where A, B, |
| etc. are complex pattern fragments: |
.sp |
.sp |
Once A has matched, PCRE is committed to this match, at the current starting | (A(*COMMIT)B(*THEN)C|ABD) |
position. If subsequently B matches, but C does not, the normal (*THEN) action | .sp |
of trying the next alternative (that is, D) does not happen because (*COMMIT) | If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to |
overrides. | fail. However, if A and B match, but C fails, the backtrack to (*THEN) causes |
| the next alternative (ABD) to be tried. This behaviour is consistent, but is |
| not always the same as Perl's. It means that if two or more backtracking verbs |
| appear in succession, all but the last of them have no effect. Consider this |
| example: |
| .sp |
| ...(*COMMIT)(*PRUNE)... |
| .sp |
| If there is a matching failure to the right, backtracking onto (*PRUNE) causes |
| it to be triggered, and its action is taken. There can never be a backtrack |
| onto (*COMMIT). |
. |
. |
. |
. |
|
.\" HTML <a name="btrepeat"></a> |
|
.SS "Backtracking verbs in repeated groups" |
|
.rs |
|
.sp |
|
PCRE differs from Perl in its handling of backtracking verbs in repeated |
|
groups. For example, consider: |
|
.sp |
|
/(a(*COMMIT)b)+ac/ |
|
.sp |
|
If the subject is "abac", Perl matches, but PCRE fails because the (*COMMIT) in |
|
the second repeat of the group acts. |
|
. |
|
. |
|
.\" HTML <a name="btassert"></a> |
|
.SS "Backtracking verbs in assertions" |
|
.rs |
|
.sp |
|
(*FAIL) in an assertion has its normal effect: it forces an immediate backtrack. |
|
.P |
|
(*ACCEPT) in a positive assertion causes the assertion to succeed without any |
|
further processing. In a negative assertion, (*ACCEPT) causes the assertion to |
|
fail without any further processing. |
|
.P |
|
The other backtracking verbs are not treated specially if they appear in a |
|
positive assertion. In particular, (*THEN) skips to the next alternative in the |
|
innermost enclosing group that has alternations, whether or not this is within |
|
the assertion. |
|
.P |
|
Negative assertions are, however, different, in order to ensure that changing a |
|
positive assertion into a negative assertion changes its result. Backtracking |
|
into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative assertion to be true, |
|
without considering any further alternative branches in the assertion. |
|
Backtracking into (*THEN) causes it to skip to the next enclosing alternative |
|
within the assertion (the normal behaviour), but if the assertion does not have |
|
such an alternative, (*THEN) behaves like (*PRUNE). |
|
. |
|
. |
|
.\" HTML <a name="btsub"></a> |
|
.SS "Backtracking verbs in subroutines" |
|
.rs |
|
.sp |
|
These behaviours occur whether or not the subpattern is called recursively. |
|
Perl's treatment of subroutines is different in some cases. |
|
.P |
|
(*FAIL) in a subpattern called as a subroutine has its normal effect: it forces |
|
an immediate backtrack. |
|
.P |
|
(*ACCEPT) in a subpattern called as a subroutine causes the subroutine match to |
|
succeed without any further processing. Matching then continues after the |
|
subroutine call. |
|
.P |
|
(*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine cause |
|
the subroutine match to fail. |
|
.P |
|
(*THEN) skips to the next alternative in the innermost enclosing group within |
|
the subpattern that has alternatives. If there is no such group within the |
|
subpattern, (*THEN) causes the subroutine match to fail. |
|
. |
|
. |
.SH "SEE ALSO" |
.SH "SEE ALSO" |
.rs |
.rs |
.sp |
.sp |
\fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3), |
\fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3), |
\fBpcresyntax\fP(3), \fBpcre\fP(3), \fBpcre16\fP(3). | \fBpcresyntax\fP(3), \fBpcre\fP(3), \fBpcre16\fP(3), \fBpcre32\fP(3). |
. |
. |
. |
. |
.SH AUTHOR |
.SH AUTHOR |
Line 2913 Cambridge CB2 3QH, England.
|
Line 3255 Cambridge CB2 3QH, England.
|
.rs |
.rs |
.sp |
.sp |
.nf |
.nf |
Last updated: 17 June 2012 | Last updated: 03 December 2013 |
Copyright (c) 1997-2012 University of Cambridge. | Copyright (c) 1997-2013 University of Cambridge. |
.fi |
.fi |