version 1.1.1.1, 2012/02/21 23:05:52
|
version 1.1.1.3, 2012/10/09 09:19:18
|
Line 19 man page, in case the conversion went wrong.
|
Line 19 man page, in case the conversion went wrong.
|
<li><a name="TOC4" href="#SEC4">BACKSLASH</a> |
<li><a name="TOC4" href="#SEC4">BACKSLASH</a> |
<li><a name="TOC5" href="#SEC5">CIRCUMFLEX AND DOLLAR</a> |
<li><a name="TOC5" href="#SEC5">CIRCUMFLEX AND DOLLAR</a> |
<li><a name="TOC6" href="#SEC6">FULL STOP (PERIOD, DOT) AND \N</a> |
<li><a name="TOC6" href="#SEC6">FULL STOP (PERIOD, DOT) AND \N</a> |
<li><a name="TOC7" href="#SEC7">MATCHING A SINGLE BYTE</a> | <li><a name="TOC7" href="#SEC7">MATCHING A SINGLE DATA UNIT</a> |
<li><a name="TOC8" href="#SEC8">SQUARE BRACKETS AND CHARACTER CLASSES</a> |
<li><a name="TOC8" href="#SEC8">SQUARE BRACKETS AND CHARACTER CLASSES</a> |
<li><a name="TOC9" href="#SEC9">POSIX CHARACTER CLASSES</a> |
<li><a name="TOC9" href="#SEC9">POSIX CHARACTER CLASSES</a> |
<li><a name="TOC10" href="#SEC10">VERTICAL BAR</a> |
<li><a name="TOC10" href="#SEC10">VERTICAL BAR</a> |
Line 61 description of PCRE's regular expressions is intended
|
Line 61 description of PCRE's regular expressions is intended
|
</P> |
</P> |
<P> |
<P> |
The original operation of PCRE was on strings of one-byte characters. However, |
The original operation of PCRE was on strings of one-byte characters. However, |
there is now also support for UTF-8 character strings. To use this, | there is now also support for UTF-8 strings in the original library, and a |
PCRE must be built to include UTF-8 support, and you must call | second library that supports 16-bit and UTF-16 character strings. To use these |
<b>pcre_compile()</b> or <b>pcre_compile2()</b> with the PCRE_UTF8 option. There | features, PCRE must be built to include appropriate support. When using UTF |
is also a special sequence that can be given at the start of a pattern: | strings you must either call the compiling function with the PCRE_UTF8 or |
| PCRE_UTF16 option, or the pattern must start with one of these special |
| sequences: |
<pre> |
<pre> |
(*UTF8) |
(*UTF8) |
|
(*UTF16) |
</pre> |
</pre> |
Starting a pattern with this sequence is equivalent to setting the PCRE_UTF8 | Starting a pattern with such a sequence is equivalent to setting the relevant |
option. This feature is not Perl-compatible. How setting UTF-8 mode affects | option. This feature is not Perl-compatible. How setting a UTF mode affects |
pattern matching is mentioned in several places below. There is also a summary |
pattern matching is mentioned in several places below. There is also a summary |
of UTF-8 features in the | of features in the |
<a href="pcreunicode.html"><b>pcreunicode</b></a> |
<a href="pcreunicode.html"><b>pcreunicode</b></a> |
page. |
page. |
</P> |
</P> |
<P> |
<P> |
Another special sequence that may appear at the start of a pattern or in |
Another special sequence that may appear at the start of a pattern or in |
combination with (*UTF8) is: | combination with (*UTF8) or (*UTF16) is: |
<pre> |
<pre> |
(*UCP) |
(*UCP) |
</pre> |
</pre> |
Line 94 of newlines; they are described below.
|
Line 97 of newlines; they are described below.
|
</P> |
</P> |
<P> |
<P> |
The remainder of this document discusses the patterns that are supported by |
The remainder of this document discusses the patterns that are supported by |
PCRE when its main matching function, <b>pcre_exec()</b>, is used. | PCRE when one of its main matching functions, <b>pcre_exec()</b> (8-bit) or |
From release 6.0, PCRE offers a second matching function, | <b>pcre16_exec()</b> (16-bit), is used. PCRE also has alternative matching |
<b>pcre_dfa_exec()</b>, which matches using a different algorithm that is not | functions, <b>pcre_dfa_exec()</b> and <b>pcre16_dfa_exec()</b>, which match using |
Perl-compatible. Some of the features discussed below are not available when | a different algorithm that is not Perl-compatible. Some of the features |
<b>pcre_dfa_exec()</b> is used. The advantages and disadvantages of the | discussed below are not available when DFA matching is used. The advantages and |
alternative function, and how it differs from the normal function, are | disadvantages of the alternative functions, and how they differ from the normal |
discussed in the | functions, are discussed in the |
<a href="pcrematching.html"><b>pcrematching</b></a> |
<a href="pcrematching.html"><b>pcrematching</b></a> |
page. |
page. |
<a name="newlines"></a></P> |
<a name="newlines"></a></P> |
Line 126 string with one of the following five sequences:
|
Line 129 string with one of the following five sequences:
|
(*ANYCRLF) any of the three above |
(*ANYCRLF) any of the three above |
(*ANY) all Unicode newline sequences |
(*ANY) all Unicode newline sequences |
</pre> |
</pre> |
These override the default and the options given to <b>pcre_compile()</b> or | These override the default and the options given to the compiling function. For |
<b>pcre_compile2()</b>. For example, on a Unix system where LF is the default | example, on a Unix system where LF is the default newline sequence, the pattern |
newline sequence, the pattern | |
<pre> |
<pre> |
(*CR)a.b |
(*CR)a.b |
</pre> |
</pre> |
Line 158 corresponding characters in the subject. As a trivial
|
Line 160 corresponding characters in the subject. As a trivial
|
</pre> |
</pre> |
matches a portion of a subject string that is identical to itself. When |
matches a portion of a subject string that is identical to itself. When |
caseless matching is specified (the PCRE_CASELESS option), letters are matched |
caseless matching is specified (the PCRE_CASELESS option), letters are matched |
independently of case. In UTF-8 mode, PCRE always understands the concept of | independently of case. In a UTF mode, PCRE always understands the concept of |
case for characters whose values are less than 128, so caseless matching is |
case for characters whose values are less than 128, so caseless matching is |
always possible. For characters with higher values, the concept of case is |
always possible. For characters with higher values, the concept of case is |
supported if PCRE is compiled with Unicode property support, but not otherwise. |
supported if PCRE is compiled with Unicode property support, but not otherwise. |
If you want to use caseless matching for characters 128 and above, you must |
If you want to use caseless matching for characters 128 and above, you must |
ensure that PCRE is compiled with Unicode property support as well as with |
ensure that PCRE is compiled with Unicode property support as well as with |
UTF-8 support. | UTF support. |
</P> |
</P> |
<P> |
<P> |
The power of regular expressions comes from the ability to include alternatives |
The power of regular expressions comes from the ability to include alternatives |
Line 220 non-alphanumeric with backslash to specify that it sta
|
Line 222 non-alphanumeric with backslash to specify that it sta
|
particular, if you want to match a backslash, you write \\. |
particular, if you want to match a backslash, you write \\. |
</P> |
</P> |
<P> |
<P> |
In UTF-8 mode, only ASCII numbers and letters have any special meaning after a | In a UTF mode, only ASCII numbers and letters have any special meaning after a |
backslash. All other characters (in particular, those whose codepoints are |
backslash. All other characters (in particular, those whose codepoints are |
greater than 127) are treated as literals. |
greater than 127) are treated as literals. |
</P> |
</P> |
<P> |
<P> |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the | If a pattern is compiled with the PCRE_EXTENDED option, white space in the |
pattern (other than in a character class) and characters between a # outside |
pattern (other than in a character class) and characters between a # outside |
a character class and the next newline are ignored. An escaping backslash can |
a character class and the next newline are ignored. An escaping backslash can |
be used to include a whitespace or # character as part of the pattern. | be used to include a white space or # character as part of the pattern. |
</P> |
</P> |
<P> |
<P> |
If you want to remove the special meaning from a sequence of characters, you |
If you want to remove the special meaning from a sequence of characters, you |
Line 262 one of the following escape sequences than the binary
|
Line 264 one of the following escape sequences than the binary
|
\a alarm, that is, the BEL character (hex 07) |
\a alarm, that is, the BEL character (hex 07) |
\cx "control-x", where x is any ASCII character |
\cx "control-x", where x is any ASCII character |
\e escape (hex 1B) |
\e escape (hex 1B) |
\f formfeed (hex 0C) | \f form feed (hex 0C) |
\n linefeed (hex 0A) |
\n linefeed (hex 0A) |
\r carriage return (hex 0D) |
\r carriage return (hex 0D) |
\t tab (hex 09) |
\t tab (hex 09) |
Line 276 is converted to upper case. Then bit 6 of the characte
|
Line 278 is converted to upper case. Then bit 6 of the characte
|
Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({ is 7B), while |
Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({ is 7B), while |
\c; becomes hex 7B (; is 3B). If the byte following \c has a value greater |
\c; becomes hex 7B (; is 3B). If the byte following \c has a value greater |
than 127, a compile-time error occurs. This locks out non-ASCII characters in |
than 127, a compile-time error occurs. This locks out non-ASCII characters in |
both byte mode and UTF-8 mode. (When PCRE is compiled in EBCDIC mode, all byte | all modes. (When PCRE is compiled in EBCDIC mode, all byte values are valid. A |
values are valid. A lower case letter is converted to upper case, and then the | lower case letter is converted to upper case, and then the 0xc0 bits are |
0xc0 bits are flipped.) | flipped.) |
</P> |
</P> |
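The \cx rule described above (upper-case a lower case letter, then invert bit 6) can be sketched in a few lines. This is only an illustration of the arithmetic for ASCII builds, not PCRE's own code; the function name control_escape is hypothetical, and the EBCDIC variant is not covered.

```python
def control_escape(ch: str) -> int:
    """Return the character code produced by \\c followed by ch (ASCII rule only)."""
    code = ord(ch)
    if code > 127:
        # Mirrors the documented compile-time error for non-ASCII characters.
        raise ValueError("character after \\c must be ASCII")
    if "a" <= ch <= "z":
        code -= 32  # convert a lower case letter to upper case
    return code ^ 0x40  # invert bit 6 (hex 40)


if __name__ == "__main__":
    for ch in "z{;":
        print(repr(ch), hex(control_escape(ch)))
```

Running it on "z", "{" and ";" reproduces the three values quoted in the text (hex 1A, 3B and 7B).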
<P> |
<P> |
By default, after \x, from zero to two hexadecimal digits are read (letters |
By default, after \x, from zero to two hexadecimal digits are read (letters |
can be in upper or lower case). Any number of hexadecimal digits may appear |
can be in upper or lower case). Any number of hexadecimal digits may appear |
between \x{ and }, but the value of the character code must be less than 256 | between \x{ and }, but the character code is constrained as follows: |
in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum | <pre> |
value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest | 8-bit non-UTF mode less than 0x100 |
Unicode code point, which is 10FFFF. | 8-bit UTF-8 mode less than 0x10ffff and a valid codepoint |
| 16-bit non-UTF mode less than 0x10000 |
| 16-bit UTF-16 mode less than 0x10ffff and a valid codepoint |
| </pre> |
| Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called |
| "surrogate" codepoints). |
</P> |
</P> |
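The per-mode limits in the newer revision can be read as a small lookup followed by a surrogate check. The sketch below is one way to express that; the names LIMITS and code_allowed are hypothetical, and it assumes that U+10FFFF itself is accepted in the UTF modes (it is the largest Unicode code point), even though the table is worded as "less than 0x10ffff".

```python
# Exclusive upper bounds per (library width, UTF mode enabled).
LIMITS = {
    ("8-bit", False): 0x100,
    ("8-bit", True): 0x110000,   # assumption: 0x10FFFF itself is allowed
    ("16-bit", False): 0x10000,
    ("16-bit", True): 0x110000,  # assumption: 0x10FFFF itself is allowed
}

# The "surrogate" codepoints, which are invalid in UTF modes.
SURROGATES = range(0xD800, 0xE000)


def code_allowed(value: int, width: str, utf: bool) -> bool:
    """Check whether a \\x{...} value fits the documented constraints."""
    if value < 0 or value >= LIMITS[(width, utf)]:
        return False
    if utf and value in SURROGATES:
        return False
    return True
```

For example, code_allowed(0xD800, "16-bit", False) is true because the surrogate restriction applies only in the UTF modes, while the same value is rejected when utf is True.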
<P> |
<P> |
If characters other than hexadecimal digits appear between \x{ and }, or if |
If characters other than hexadecimal digits appear between \x{ and }, or if |
Line 300 as just described only when it is followed by two hexa
|
Line 307 as just described only when it is followed by two hexa
|
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
code points greater than 256 is provided by \u, which must be followed by |
code points greater than 256 is provided by \u, which must be followed by |
four hexadecimal digits; otherwise it matches a literal "u" character. |
four hexadecimal digits; otherwise it matches a literal "u" character. |
|
Character codes specified by \u in JavaScript mode are constrained in the same |
|
way as those specified by \x in non-JavaScript mode.
</P> |
</P> |
<P> |
<P> |
Characters whose value is less than 256 can be defined by either of the two |
Characters whose value is less than 256 can be defined by either of the two |
Line 328 following the discussion of
|
Line 337 following the discussion of
|
Inside a character class, or if the decimal number is greater than 9 and there |
Inside a character class, or if the decimal number is greater than 9 and there |
have not been that many capturing subpatterns, PCRE re-reads up to three octal |
have not been that many capturing subpatterns, PCRE re-reads up to three octal |
digits following the backslash, and uses them to generate a data character. Any |
digits following the backslash, and uses them to generate a data character. Any |
subsequent digits stand for themselves. In non-UTF-8 mode, the value of a | subsequent digits stand for themselves. The value of the character is |
character specified in octal must be less than \400. In UTF-8 mode, values up | constrained in the same way as characters specified in hexadecimal. |
to \777 are permitted. For example: | For example: |
<pre> |
<pre> |
\040 is another way of writing a space |
\040 is another way of writing a space |
\40 is the same, provided there are fewer than 40 previous capturing subpatterns |
\40 is the same, provided there are fewer than 40 previous capturing subpatterns |
Line 339 to \777 are permitted. For example:
|
Line 348 to \777 are permitted. For example:
|
\011 is always a tab |
\011 is always a tab |
\0113 is a tab followed by the character "3" |
\0113 is a tab followed by the character "3" |
\113 might be a back reference, otherwise the character with octal code 113 |
\113 might be a back reference, otherwise the character with octal code 113 |
\377 might be a back reference, otherwise the byte consisting entirely of 1 bits | \377 might be a back reference, otherwise the value 255 (decimal) |
\81 is either a back reference, or a binary zero followed by the two characters "8" and "1" |
\81 is either a back reference, or a binary zero followed by the two characters "8" and "1" |
</pre> |
</pre> |
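The octal rule in the examples above can be sketched for the case where the digits are not taken as a back reference (for instance inside a character class). The helper read_octal_escape is hypothetical and deliberately ignores the back-reference decision, which depends on how many capturing subpatterns precede the escape.

```python
OCTAL_DIGITS = "01234567"


def read_octal_escape(text: str):
    """Given the text after a backslash, read up to three octal digits.

    Returns (character code, remaining text). If no octal digit is present,
    the code is zero and all of the text is left over.
    """
    digits = ""
    for ch in text[:3]:
        if ch not in OCTAL_DIGITS:
            break
        digits += ch
    code = int(digits, 8) if digits else 0
    return code, text[len(digits):]
```

With this reading, "0113" yields code 9 (a tab) with "3" left over, and "81" yields a binary zero followed by the two characters "8" and "1", matching the table.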
Note that octal values of 100 or greater must not be introduced by a leading |
Note that octal values of 100 or greater must not be introduced by a leading |
Line 399 Another use of backslash is for specifying generic cha
|
Line 408 Another use of backslash is for specifying generic cha
|
<pre> |
<pre> |
\d any decimal digit |
\d any decimal digit |
\D any character that is not a decimal digit |
\D any character that is not a decimal digit |
\h any horizontal whitespace character | \h any horizontal white space character |
\H any character that is not a horizontal whitespace character | \H any character that is not a horizontal white space character |
\s any whitespace character | \s any white space character |
\S any character that is not a whitespace character | \S any character that is not a white space character |
\v any vertical whitespace character | \v any vertical white space character |
\V any character that is not a vertical whitespace character | \V any character that is not a vertical white space character |
\w any "word" character |
\w any "word" character |
\W any "non-word" character |
\W any "non-word" character |
</pre> |
</pre> |
Line 443 accented letters, and these are then matched by \w. Th
|
Line 452 accented letters, and these are then matched by \w. Th
|
Unicode is discouraged. |
Unicode is discouraged. |
</P> |
</P> |
<P> |
<P> |
By default, in UTF-8 mode, characters with values greater than 128 never match | By default, in a UTF mode, characters with values greater than 128 never match |
\d, \s, or \w, and always match \D, \S, and \W. These sequences retain |
\d, \s, or \w, and always match \D, \S, and \W. These sequences retain |
their original meanings from before UTF-8 support was available, mainly for | their original meanings from before UTF support was available, mainly for |
efficiency reasons. However, if PCRE is compiled with Unicode property support, |
efficiency reasons. However, if PCRE is compiled with Unicode property support, |
and the PCRE_UCP option is set, the behaviour is changed so that Unicode |
and the PCRE_UCP option is set, the behaviour is changed so that Unicode |
properties are used to determine character types, as follows: |
properties are used to determine character types, as follows: |
Line 463 is noticeably slower when PCRE_UCP is set.
|
Line 472 is noticeably slower when PCRE_UCP is set.
|
<P> |
<P> |
The sequences \h, \H, \v, and \V are features that were added to Perl at |
The sequences \h, \H, \v, and \V are features that were added to Perl at |
release 5.10. In contrast to the other sequences, which match only ASCII |
release 5.10. In contrast to the other sequences, which match only ASCII |
characters by default, these always match certain high-valued codepoints in | characters by default, these always match certain high-valued codepoints, |
UTF-8 mode, whether or not PCRE_UCP is set. The horizontal space characters | whether or not PCRE_UCP is set. The horizontal space characters are: |
are: | |
<pre> |
<pre> |
U+0009 Horizontal tab |
U+0009 Horizontal tab |
U+0020 Space |
U+0020 Space |
Line 491 The vertical space characters are:
|
Line 499 The vertical space characters are:
|
<pre> |
<pre> |
U+000A Linefeed |
U+000A Linefeed |
U+000B Vertical tab |
U+000B Vertical tab |
U+000C Formfeed | U+000C Form feed |
U+000D Carriage return |
U+000D Carriage return |
U+0085 Next line |
U+0085 Next line |
U+2028 Line separator |
U+2028 Line separator |
U+2029 Paragraph separator |
U+2029 Paragraph separator |
<a name="newlineseq"></a></PRE> | </pre> |
</P> | In 8-bit, non-UTF-8 mode, only the characters with codepoints less than 256 are |
| relevant. |
| <a name="newlineseq"></a></P> |
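The vertical space list above amounts to a fixed set of codepoints, with the 8-bit non-UTF-8 case simply never seeing values above 255. A minimal sketch, using the hypothetical names VERTICAL_SPACE and matches_v:

```python
# Codepoints matched by \v, taken from the list above.
VERTICAL_SPACE = frozenset({0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029})


def matches_v(code: int, eight_bit_non_utf: bool = False) -> bool:
    """Return True if the codepoint counts as vertical white space."""
    if eight_bit_non_utf and code > 0xFF:
        # Such values cannot occur in an 8-bit non-UTF subject.
        return False
    return code in VERTICAL_SPACE


def matches_big_v(code: int, eight_bit_non_utf: bool = False) -> bool:
    """\\V is the complement of \\v over the characters that can occur."""
    if eight_bit_non_utf and code > 0xFF:
        return False
    return code not in VERTICAL_SPACE
```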
<br><b> |
<br><b> |
Newline sequences |
Newline sequences |
</b><br> |
</b><br> |
<P> |
<P> |
Outside a character class, by default, the escape sequence \R matches any |
Outside a character class, by default, the escape sequence \R matches any |
Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the following: | Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the |
| following: |
<pre> |
<pre> |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
</pre> |
</pre> |
Line 511 This is an example of an "atomic group", details of wh
|
Line 522 This is an example of an "atomic group", details of wh
|
<a href="#atomicgroup">below.</a> |
<a href="#atomicgroup">below.</a> |
This particular group matches either the two-character sequence CR followed by |
This particular group matches either the two-character sequence CR followed by |
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, |
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, |
U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next | U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next |
line, U+0085). The two-character sequence is treated as a single unit that |
line, U+0085). The two-character sequence is treated as a single unit that |
cannot be split. |
cannot be split. |
</P> |
</P> |
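The alternation shown for \R can be tried out with Python's re module, with two caveats: Python's engine is not PCRE, and the sketch replaces the atomic group (?>...) with an ordinary non-capturing group, since atomic groups are only available in newer Python releases. Without the atomic group, the engine could in principle backtrack and split CR from LF in a larger pattern, so this only approximates the behaviour described above. The name NEWLINE_8BIT is hypothetical.

```python
import re

# Approximation of the 8-bit non-UTF-8 meaning of \R.
# CRLF is listed first so that it is consumed as one unit when possible.
NEWLINE_8BIT = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85)")


def split_lines(text: str):
    """Split text at any of the newline sequences recognised above."""
    return NEWLINE_8BIT.split(text)


if __name__ == "__main__":
    print(split_lines("one\r\ntwo\nthree\x85four"))
```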
<P> |
<P> |
In UTF-8 mode, two additional characters whose codepoints are greater than 255 | In other modes, two additional characters whose codepoints are greater than 255 |
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). |
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). |
Unicode character property support is not needed for these characters to be |
Unicode character property support is not needed for these characters to be |
recognized. |
recognized. |
Line 533 one of the following sequences:
|
Line 544 one of the following sequences:
|
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_UNICODE) any Unicode newline sequence |
(*BSR_UNICODE) any Unicode newline sequence |
</pre> |
</pre> |
These override the default and the options given to <b>pcre_compile()</b> or | These override the default and the options given to the compiling function, but |
<b>pcre_compile2()</b>, but they can be overridden by options given to | they can themselves be overridden by options given to a matching function. Note |
<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. Note that these special settings, | that these special settings, which are not Perl-compatible, are recognized only |
which are not Perl-compatible, are recognized only at the very start of a | at the very start of a pattern, and that they must be in upper case. If more |
pattern, and that they must be in upper case. If more than one of them is | than one of them is present, the last one is used. They can be combined with a |
present, the last one is used. They can be combined with a change of newline | change of newline convention; for example, a pattern can start with: |
convention; for example, a pattern can start with: | |
<pre> |
<pre> |
(*ANY)(*BSR_ANYCRLF) |
(*ANY)(*BSR_ANYCRLF) |
</pre> |
</pre> |
They can also be combined with the (*UTF8) or (*UCP) special sequences. Inside | They can also be combined with the (*UTF8), (*UTF16), or (*UCP) special |
a character class, \R is treated as an unrecognized escape sequence, and so | sequences. Inside a character class, \R is treated as an unrecognized escape |
matches the letter "R" by default, but causes an error if PCRE_EXTRA is set. | sequence, and so matches the letter "R" by default, but causes an error if |
| PCRE_EXTRA is set. |
<a name="uniextseq"></a></P> |
<a name="uniextseq"></a></P> |
<br><b> |
<br><b> |
Unicode character properties |
Unicode character properties |
Line 553 Unicode character properties
|
Line 564 Unicode character properties
|
<P> |
<P> |
When PCRE is built with Unicode character property support, three additional |
When PCRE is built with Unicode character property support, three additional |
escape sequences that match characters with specific properties are available. |
escape sequences that match characters with specific properties are available. |
When not in UTF-8 mode, these sequences are of course limited to testing | When in 8-bit non-UTF-8 mode, these sequences are of course limited to testing |
characters whose codepoints are less than 256, but they do work in this mode. |
characters whose codepoints are less than 256, but they do work in this mode. |
The extra escape sequences are: |
The extra escape sequences are: |
<pre> |
<pre> |
Line 587 Armenian,
|
Line 598 Armenian,
|
Avestan, |
Avestan, |
Balinese, |
Balinese, |
Bamum, |
Bamum, |
|
Batak, |
Bengali, |
Bengali, |
Bopomofo, |
Bopomofo, |
|
Brahmi, |
Braille, |
Braille, |
Buginese, |
Buginese, |
Buhid, |
Buhid, |
Canadian_Aboriginal, |
Canadian_Aboriginal, |
Carian, |
Carian, |
|
Chakma, |
Cham, |
Cham, |
Cherokee, |
Cherokee, |
Common, |
Common, |
Line 636 Lisu,
|
Line 650 Lisu,
|
Lycian, |
Lycian, |
Lydian, |
Lydian, |
Malayalam, |
Malayalam, |
|
Mandaic, |
Meetei_Mayek, |
Meetei_Mayek, |
|
Meroitic_Cursive, |
|
Meroitic_Hieroglyphs, |
|
Miao, |
Mongolian, |
Mongolian, |
Myanmar, |
Myanmar, |
New_Tai_Lue, |
New_Tai_Lue, |
Line 655 Rejang,
|
Line 673 Rejang,
|
Runic, |
Runic, |
Samaritan, |
Samaritan, |
Saurashtra, |
Saurashtra, |
|
Sharada, |
Shavian, |
Shavian, |
Sinhala, |
Sinhala, |
|
Sora_Sompeng, |
Sundanese, |
Sundanese, |
Syloti_Nagri, |
Syloti_Nagri, |
Syriac, |
Syriac, |
Line 665 Tagbanwa,
|
Line 685 Tagbanwa,
|
Tai_Le, |
Tai_Le, |
Tai_Tham, |
Tai_Tham, |
Tai_Viet, |
Tai_Viet, |
|
Takri, |
Tamil, |
Tamil, |
Telugu, |
Telugu, |
Thaana, |
Thaana, |
Line 742 a modifier or "other".
|
Line 763 a modifier or "other".
|
</P> |
</P> |
<P> |
<P> |
The Cs (Surrogate) property applies only to characters in the range U+D800 to |
The Cs (Surrogate) property applies only to characters in the range U+D800 to |
U+DFFF. Such characters are not valid in UTF-8 strings (see RFC 3629) and so | U+DFFF. Such characters are not valid in Unicode strings and so |
cannot be tested by PCRE, unless UTF-8 validity checking has been turned off | cannot be tested by PCRE, unless UTF validity checking has been turned off |
(see the discussion of PCRE_NO_UTF8_CHECK in the | (see the discussion of PCRE_NO_UTF8_CHECK and PCRE_NO_UTF16_CHECK in the |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
page). Perl does not support the Cs property. |
page). Perl does not support the Cs property. |
</P> |
</P> |
Line 774 atomic group
|
Line 795 atomic group
|
<a href="#atomicgroup">(see below).</a> |
<a href="#atomicgroup">(see below).</a> |
Characters with the "mark" property are typically accents that affect the |
Characters with the "mark" property are typically accents that affect the |
preceding character. None of them have codepoints less than 256, so in |
preceding character. None of them have codepoints less than 256, so in |
non-UTF-8 mode \X matches any one character. | 8-bit non-UTF-8 mode \X matches any one character. |
</P> |
</P> |
<P> |
<P> |
Note that recent versions of Perl have changed \X to match what Unicode calls |
Note that recent versions of Perl have changed \X to match what Unicode calls |
Line 785 Matching characters by Unicode property is not fast, b
|
Line 806 Matching characters by Unicode property is not fast, b
|
a structure that contains data for over fifteen thousand characters. That is |
a structure that contains data for over fifteen thousand characters. That is |
why the traditional escape sequences such as \d and \w do not use Unicode |
why the traditional escape sequences such as \d and \w do not use Unicode |
properties in PCRE by default, though you can make them do so by setting the |
properties in PCRE by default, though you can make them do so by setting the |
PCRE_UCP option for <b>pcre_compile()</b> or by starting the pattern with | PCRE_UCP option or by starting the pattern with (*UCP). |
(*UCP). | |
<a name="extraprops"></a></P> |
<a name="extraprops"></a></P> |
<br><b> |
<br><b> |
PCRE's additional properties |
PCRE's additional properties |
Line 804 PCRE_UCP is set. They are:
|
Line 824 PCRE_UCP is set. They are:
|
Xwd Any Perl "word" character |
Xwd Any Perl "word" character |
</pre> |
</pre> |
Xan matches characters that have either the L (letter) or the N (number) |
Xan matches characters that have either the L (letter) or the N (number) |
property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or | property. Xps matches the characters tab, linefeed, vertical tab, form feed, or |
carriage return, and any other character that has the Z (separator) property. |
carriage return, and any other character that has the Z (separator) property. |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
same characters as Xan, plus underscore. |
same characters as Xan, plus underscore. |
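The four definitions above can be approximated with Python's unicodedata module. This is a sketch of the stated rules only: the function names are hypothetical, each function expects a single character, and the Unicode database shipped with Python may differ in version from the tables built into PCRE.

```python
import unicodedata


def is_xan(ch: str) -> bool:
    """Xan: the character has the L (letter) or N (number) property."""
    return unicodedata.category(ch)[0] in ("L", "N")


def is_xps(ch: str) -> bool:
    """Xps: tab, linefeed, vertical tab, form feed, carriage return, or Z."""
    return ch in "\t\n\x0b\x0c\r" or unicodedata.category(ch)[0] == "Z"


def is_xsp(ch: str) -> bool:
    """Xsp: the same as Xps, except that vertical tab is excluded."""
    return ch != "\x0b" and is_xps(ch)


def is_xwd(ch: str) -> bool:
    """Xwd: the same characters as Xan, plus underscore."""
    return ch == "_" or is_xan(ch)
```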
Line 865 escape sequence" error is generated instead.
|
Line 885 escape sequence" error is generated instead.
|
A word boundary is a position in the subject string where the current character |
A word boundary is a position in the subject string where the current character |
and the previous character do not both match \w or \W (i.e. one matches |
and the previous character do not both match \w or \W (i.e. one matches |
\w and the other matches \W), or the start or end of the string if the |
\w and the other matches \W), or the start or end of the string if the |
first or last character matches \w, respectively. In UTF-8 mode, the meanings | first or last character matches \w, respectively. In a UTF mode, the meanings |
of \w and \W can be changed by setting the PCRE_UCP option. When this is |
of \w and \W can be changed by setting the PCRE_UCP option. When this is |
done, it also affects \b and \B. Neither PCRE nor Perl has a separate "start |
done, it also affects \b and \B. Neither PCRE nor Perl has a separate "start |
of word" or "end of word" metasequence. However, whatever follows \b normally |
of word" or "end of word" metasequence. However, whatever follows \b normally |
Line 962 end of the subject in both modes, and if all branches
|
Line 982 end of the subject in both modes, and if all branches
|
<P> |
<P> |
Outside a character class, a dot in the pattern matches any one character in |
Outside a character class, a dot in the pattern matches any one character in |
the subject string except (by default) a character that signifies the end of a |
the subject string except (by default) a character that signifies the end of a |
line. In UTF-8 mode, the matched character may be more than one byte long. | line. |
</P> |
</P> |
<P> |
<P> |
When a line ending is defined as a single character, dot never matches that |
When a line ending is defined as a single character, dot never matches that |
Line 989 the PCRE_DOTALL option. In other words, it matches any
|
Line 1009 the PCRE_DOTALL option. In other words, it matches any
|
that signifies the end of a line. Perl also uses \N to match characters by |
that signifies the end of a line. Perl also uses \N to match characters by |
name; PCRE does not support this. |
name; PCRE does not support this. |
</P> |
</P> |
<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br> | <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE DATA UNIT</a><br> |
<P> |
<P> |
Outside a character class, the escape sequence \C matches any one byte, both | Outside a character class, the escape sequence \C matches any one data unit, |
in and out of UTF-8 mode. Unlike a dot, it always matches line-ending | whether or not a UTF mode is set. In the 8-bit library, one data unit is one |
characters. The feature is provided in Perl in order to match individual bytes | byte; in the 16-bit library it is a 16-bit unit. Unlike a dot, \C always |
in UTF-8 mode, but it is unclear how it can usefully be used. Because \C | matches line-ending characters. The feature is provided in Perl in order to |
breaks up characters into individual bytes, matching one byte with \C in UTF-8 | match individual bytes in UTF-8 mode, but it is unclear how it can usefully be |
mode means that the rest of the string may start with a malformed UTF-8 | used. Because \C breaks up characters into individual data units, matching one |
character. This has undefined results, because PCRE assumes that it is dealing | unit with \C in a UTF mode means that the rest of the string may start with a |
with valid UTF-8 strings (and by default it checks this at the start of | malformed UTF character. This has undefined results, because PCRE assumes that |
processing unless the PCRE_NO_UTF8_CHECK option is used). | it is dealing with valid UTF strings (and by default it checks this at the |
| start of processing unless the PCRE_NO_UTF8_CHECK or PCRE_NO_UTF16_CHECK option |
| is used). |
</P> |
</P> |
<P> |
<P> |
PCRE does not allow \C to appear in lookbehind assertions |
PCRE does not allow \C to appear in lookbehind assertions |
<a href="#lookbehind">(described below)</a> |
<a href="#lookbehind">(described below)</a> |
in UTF-8 mode, because this would make it impossible to calculate the length of | in a UTF mode, because this would make it impossible to calculate the length of |
the lookbehind. |
the lookbehind. |
</P> |
</P> |
<P> |
<P> |
In general, the \C escape sequence is best avoided in UTF-8 mode. However, one | In general, the \C escape sequence is best avoided. However, one |
way of using it that avoids the problem of malformed UTF-8 characters is to | way of using it that avoids the problem of malformed UTF characters is to use a |
use a lookahead to check the length of the next character, as in this pattern | lookahead to check the length of the next character, as in this pattern, which |
(ignore white space and line breaks): | could be used with a UTF-8 string (ignore white space and line breaks): |
<pre> |
<pre> |
(?| (?=[\x00-\x7f])(\C) | |
(?| (?=[\x00-\x7f])(\C) | |
(?=[\x80-\x{7ff}])(\C)(\C) | |
(?=[\x80-\x{7ff}])(\C)(\C) | |
Line 1036 a member of the class, it should be the first data cha
|
Line 1058 a member of the class, it should be the first data cha
|
(after an initial circumflex, if present) or escaped with a backslash. |
(after an initial circumflex, if present) or escaped with a backslash. |
</P> |
</P> |
<P> |
<P> |
A character class matches a single character in the subject. In UTF-8 mode, the | A character class matches a single character in the subject. In a UTF mode, the |
character may be more than one byte long. A matched character must be in the | character may be more than one data unit long. A matched character must be in |
set of characters defined by the class, unless the first character in the class | the set of characters defined by the class, unless the first character in the |
definition is a circumflex, in which case the subject character must not be in | class definition is a circumflex, in which case the subject character must not |
the set defined by the class. If a circumflex is actually required as a member | be in the set defined by the class. If a circumflex is actually required as a |
of the class, ensure it is not the first character, or escape it with a | member of the class, ensure it is not the first character, or escape it with a |
backslash. |
backslash. |
</P> |
</P> |
<P> |
<P> |
Line 1054 string, and therefore it fails if the current pointer
|
Line 1076 string, and therefore it fails if the current pointer
|
string. |
string. |
</P> |
</P> |
<P> |
<P> |
In UTF-8 mode, characters with values greater than 255 can be included in a | In UTF-8 (UTF-16) mode, characters with values greater than 255 (0xffff) can be |
class as a literal string of bytes, or by using the \x{ escaping mechanism. | included in a class as a literal string of data units, or by using the \x{ |
| escaping mechanism. |
</P> |
</P> |
<P> |
<P> |
When caseless matching is set, any letters in a class represent both their |
When caseless matching is set, any letters in a class represent both their |
upper case and lower case versions, so for example, a caseless [aeiou] matches |
upper case and lower case versions, so for example, a caseless [aeiou] matches |
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a |
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a |
caseful version would. In UTF-8 mode, PCRE always understands the concept of | caseful version would. In a UTF mode, PCRE always understands the concept of |
case for characters whose values are less than 128, so caseless matching is |
case for characters whose values are less than 128, so caseless matching is |
always possible. For characters with higher values, the concept of case is |
always possible. For characters with higher values, the concept of case is |
supported if PCRE is compiled with Unicode property support, but not otherwise. |
supported if PCRE is compiled with Unicode property support, but not otherwise. |
If you want to use caseless matching in UTF8-mode for characters 128 and above, | If you want to use caseless matching in a UTF mode for characters 128 and |
you must ensure that PCRE is compiled with Unicode property support as well as | above, you must ensure that PCRE is compiled with Unicode property support as |
with UTF-8 support. | well as with UTF support. |
</P> |
</P> |
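<P>
The caseless class behaviour described above can be checked with a small
sketch. It uses Python's <b>re</b> module rather than PCRE itself, on the
assumption that the two agree for ASCII letters; the helper name
<i>class_matches</i> is hypothetical and introduced only for illustration.
</P>

```python
import re


def class_matches(char_class: str, subject: str, caseless: bool) -> bool:
    """Return True if the whole subject matches the given character class."""
    flags = re.IGNORECASE if caseless else 0
    return re.fullmatch(char_class, subject, flags) is not None


# A caseless [aeiou] matches "A" as well as "a".
print(class_matches("[aeiou]", "A", caseless=True))
# A caseless [^aeiou] does not match "A" ...
print(class_matches("[^aeiou]", "A", caseless=True))
# ... whereas the caseful version does.
print(class_matches("[^aeiou]", "A", caseless=False))
```

<P>
Python's <b>re</b> is a different engine, so this only illustrates the rule for
characters below 128; it says nothing about PCRE builds with or without Unicode
property support.
</P>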
<P> |
<P> |
Characters that might indicate line breaks are never treated in any special way |
Characters that might indicate line breaks are never treated in any special way |
Line 1093 followed by two other characters. The octal or hexadec
|
Line 1116 followed by two other characters. The octal or hexadec
|
</P> |
</P> |
<P> |
<P> |
Ranges operate in the collating sequence of character values. They can also be |
Ranges operate in the collating sequence of character values. They can also be |
used for characters specified numerically, for example [\000-\037]. In UTF-8 | used for characters specified numerically, for example [\000-\037]. Ranges |
mode, ranges can include characters whose values are greater than 255, for | can include any characters that are valid for the current mode. |
example [\x{100}-\x{2ff}]. | |
</P> |
</P> |
<P> |
<P> |
If a range that includes letters is used when caseless matching is set, it |
If a range that includes letters is used when caseless matching is set, it |
matches the letters in either case. For example, [W-c] is equivalent to |
matches the letters in either case. For example, [W-c] is equivalent to |
[][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if character | [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character |
tables for a French locale are in use, [\xc8-\xcb] matches accented E |
tables for a French locale are in use, [\xc8-\xcb] matches accented E |
characters in both cases. In UTF-8 mode, PCRE supports the concept of case for | characters in both cases. In UTF modes, PCRE supports the concept of case for |
characters with values greater than 128 only when it is compiled with Unicode |
characters with values greater than 128 only when it is compiled with Unicode |
property support. |
property support. |
</P> |
</P> |
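<P>
The stated equivalence between a caseless [W-c] and the explicit class can be
checked by brute force over the ASCII range. This is a sketch using Python's
<b>re</b> module as a stand-in for PCRE (an assumption: both treat ranges and
ASCII case the same way here); <i>matched_ascii</i> is a hypothetical helper.
</P>

```python
import re

# The range from the text and its claimed explicit equivalent, both caseless.
RANGE = re.compile(r"[W-c]", re.IGNORECASE)
EXPLICIT = re.compile(r"[][\\^_`wxyzabc]", re.IGNORECASE)


def matched_ascii(pattern: "re.Pattern[str]") -> set:
    """Collect every ASCII character that the one-character class matches."""
    return {chr(c) for c in range(128) if pattern.fullmatch(chr(c))}


range_set = matched_ascii(RANGE)
explicit_set = matched_ascii(EXPLICIT)

print(range_set == explicit_set)
print("".join(sorted(range_set)))
```

<P>
Only values below 128 are compared, because case handling for higher values
depends on how the engine was built.
</P>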
Line 1110 property support.
|
Line 1132 property support.
|
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, |
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, |
\V, \w, and \W may appear in a character class, and add the characters that |
\V, \w, and \W may appear in a character class, and add the characters that |
they match to the class. For example, [\dABCDEF] matches any hexadecimal |
they match to the class. For example, [\dABCDEF] matches any hexadecimal |
digit. In UTF-8 mode, the PCRE_UCP option affects the meanings of \d, \s, \w | digit. In UTF modes, the PCRE_UCP option affects the meanings of \d, \s, \w |
and their upper case partners, just as it does when they appear outside a |
and their upper case partners, just as it does when they appear outside a |
character class, as described in the section entitled |
character class, as described in the section entitled |
<a href="#genericchartypes">"Generic character types"</a> |
<a href="#genericchartypes">"Generic character types"</a> |
Line 1179 syntax [.ch.] and [=ch=] where "ch" is a "collating el
|
Line 1201 syntax [.ch.] and [=ch=] where "ch" is a "collating el
|
supported, and an error is given if they are encountered. |
supported, and an error is given if they are encountered. |
</P> |
</P> |
<P> |
<P> |
By default, in UTF-8 mode, characters with values greater than 128 do not match | By default, in UTF modes, characters with values greater than 128 do not match |
any of the POSIX character classes. However, if the PCRE_UCP option is passed |
any of the POSIX character classes. However, if the PCRE_UCP option is passed |
to <b>pcre_compile()</b>, some of the classes are changed so that Unicode |
to <b>pcre_compile()</b>, some of the classes are changed so that Unicode |
character properties are used. This is achieved by replacing the POSIX classes |
character properties are used. This is achieved by replacing the POSIX classes |
Line 1264 behaviour otherwise.
|
Line 1286 behaviour otherwise.
|
</P> |
</P> |
<P> |
<P> |
<b>Note:</b> There are other PCRE-specific options that can be set by the |
<b>Note:</b> There are other PCRE-specific options that can be set by the |
application when the compile or match functions are called. In some cases the | application when the compiling or matching functions are called. In some cases |
pattern can contain special leading sequences such as (*CRLF) to override what | the pattern can contain special leading sequences such as (*CRLF) to override |
the application has set or what has been defaulted. Details are given in the | what the application has set or what has been defaulted. Details are given in |
section entitled | the section entitled |
<a href="#newlineseq">"Newline sequences"</a> |
<a href="#newlineseq">"Newline sequences"</a> |
above. There are also the (*UTF8) and (*UCP) leading sequences that can be used | above. There are also the (*UTF8), (*UTF16), and (*UCP) leading sequences that |
to set UTF-8 and Unicode property modes; they are equivalent to setting the | can be used to set UTF and Unicode property modes; they are equivalent to |
PCRE_UTF8 and the PCRE_UCP options, respectively. | setting the PCRE_UTF8, PCRE_UTF16, and the PCRE_UCP options, respectively. |
<a name="subpattern"></a></P> |
<a name="subpattern"></a></P> |
<br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br> |
<br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br> |
<P> |
<P> |
Line 1289 match "cataract", "erpillar" or an empty string.
|
Line 1311 match "cataract", "erpillar" or an empty string.
|
<br> |
<br> |
2. It sets up the subpattern as a capturing subpattern. This means that, when |
2. It sets up the subpattern as a capturing subpattern. This means that, when |
the whole pattern matches, that portion of the subject string that matched the |
the whole pattern matches, that portion of the subject string that matched the |
subpattern is passed back to the caller via the <i>ovector</i> argument of | subpattern is passed back to the caller via the <i>ovector</i> argument of the |
<b>pcre_exec()</b>. Opening parentheses are counted from left to right (starting | matching function. (This applies only to the traditional matching functions; |
from 1) to obtain numbers for the capturing subpatterns. For example, if the | the DFA matching functions do not support capturing.) |
string "the red king" is matched against the pattern | </P> |
| <P> |
| Opening parentheses are counted from left to right (starting from 1) to obtain |
| numbers for the capturing subpatterns. For example, if the string "the red |
| king" is matched against the pattern |
<pre> |
<pre> |
the ((red|white) (king|queen)) |
the ((red|white) (king|queen)) |
</pre> |
</pre> |
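<P>
The left-to-right numbering of opening parentheses can be seen by running the
pattern above against "the red king". The sketch below uses Python's <b>re</b>
module, which numbers capturing groups by the same rule; it does not call PCRE
or use the <i>ovector</i> interface.
</P>

```python
import re

PATTERN = re.compile(r"the ((red|white) (king|queen))")

match = PATTERN.fullmatch("the red king")

# Group numbers follow the order of the opening parentheses, starting at 1.
captures = {n: match.group(n) for n in range(1, PATTERN.groups + 1)}

for number, text in captures.items():
    print(number, repr(text))
```

<P>
Group 1 is the outer pair, so it holds the longest substring; groups 2 and 3
are the two inner alternations in the order in which they open.
</P>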
Line 1452 items:
|
Line 1478 items:
|
a literal data character |
a literal data character |
the dot metacharacter |
the dot metacharacter |
the \C escape sequence |
the \C escape sequence |
the \X escape sequence (in UTF-8 mode with Unicode properties) | the \X escape sequence |
the \R escape sequence |
the \R escape sequence |
an escape such as \d or \pL that matches a single character |
an escape such as \d or \pL that matches a single character |
a character class |
a character class |
Line 1484 quantifier, is taken as a literal character. For examp
|
Line 1510 quantifier, is taken as a literal character. For examp
|
quantifier, but a literal string of four characters. |
quantifier, but a literal string of four characters. |
</P> |
</P> |
<P> |
<P> |
In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual | In UTF modes, quantifiers apply to characters rather than to individual data |
bytes. Thus, for example, \x{100}{2} matches two UTF-8 characters, each of | units. Thus, for example, \x{100}{2} matches two characters, each of |
which is represented by a two-byte sequence. Similarly, when Unicode property | which is represented by a two-byte sequence in a UTF-8 string. Similarly, |
support is available, \X{3} matches three Unicode extended sequences, each of | \X{3} matches three Unicode extended sequences, each of which may be several |
which may be several bytes long (and they may be of different lengths). | data units long (and they may be of different lengths). |
</P> |
</P> |
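<P>
The difference between counting characters and counting data units can be made
concrete with U+0100, the character written \x{100} above. The sketch below
uses Python's <b>re</b> module, which always works on whole characters and so
only approximates a PCRE UTF mode; Python spells the character \u0100 rather
than \x{100}.
</P>

```python
import re

A_MACRON = "\u0100"        # U+0100, written \x{100} in a PCRE pattern
subject = A_MACRON * 2

# The {2} quantifier repeats the character, not one of its bytes.
match = re.fullmatch("\u0100{2}", subject)

# The same two characters occupy different numbers of data units.
utf8_units = len(subject.encode("utf-8"))           # 8-bit data units
utf16_units = len(subject.encode("utf-16-le")) // 2  # 16-bit data units

print(len(match.group()), utf8_units, utf16_units)
```

<P>
Two characters are matched in both encodings; only the number of underlying
data units differs (four in UTF-8, two in UTF-16).
</P>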
<P> |
<P> |
The quantifier {0} is permitted, causing the expression to behave as if the |
The quantifier {0} is permitted, causing the expression to behave as if the |
Line 1805 Because there may be many capturing parentheses in a p
|
Line 1831 Because there may be many capturing parentheses in a p
|
following a backslash are taken as part of a potential back reference number. |
following a backslash are taken as part of a potential back reference number. |
If the pattern continues with a digit character, some delimiter must be used to |
If the pattern continues with a digit character, some delimiter must be used to |
terminate the back reference. If the PCRE_EXTENDED option is set, this can be |
terminate the back reference. If the PCRE_EXTENDED option is set, this can be |
whitespace. Otherwise, the \g{ syntax or an empty comment (see | white space. Otherwise, the \g{ syntax or an empty comment (see |
<a href="#comments">"Comments"</a> |
<a href="#comments">"Comments"</a> |
below) can be used. |
below) can be used. |
</P> |
</P> |
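<P>
The two ways of ending a back reference before a literal digit can be sketched
as follows. Python's <b>re</b> module is used as a stand-in: its verbose flag
plays a role comparable to PCRE_EXTENDED, and it accepts the empty comment
(?#), but it does not support the \g{ syntax inside a pattern, so that form is
not shown.
</P>

```python
import re

# In verbose mode, white space after \1 ends the back reference,
# and the following "1" is an ordinary literal digit.
verbose = re.compile(r"(a)\1 1", re.VERBOSE)

# Without verbose mode, an empty comment does the same job.
commented = re.compile(r"(a)\1(?#)1")

for pattern in (verbose, commented):
    print(pattern.fullmatch("aa1") is not None)
```

<P>
In both patterns the subject "aa1" is read as the captured "a", a repeat of
that capture, and then the literal digit.
</P>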
Line 1950 match. If there are insufficient characters before the
|
Line 1976 match. If there are insufficient characters before the
|
assertion fails. |
assertion fails. |
</P> |
</P> |
<P> |
<P> |
In UTF-8 mode, PCRE does not allow the \C escape (which matches a single byte, | In a UTF mode, PCRE does not allow the \C escape (which matches a single data |
even in UTF-8 mode) to appear in lookbehind assertions, because it makes it | unit even in a UTF mode) to appear in lookbehind assertions, because it makes |
impossible to calculate the length of the lookbehind. The \X and \R escapes, | it impossible to calculate the length of the lookbehind. The \X and \R |
which can match different numbers of bytes, are also not permitted. | escapes, which can match different numbers of data units, are also not |
| permitted. |
</P> |
</P> |
<P> |
<P> |
<a href="#subpatternsassubroutines">"Subroutine"</a> |
<a href="#subpatternsassubroutines">"Subroutine"</a> |
Line 2146 point in the pattern; the idea of DEFINE is that it ca
|
Line 2173 point in the pattern; the idea of DEFINE is that it ca
|
subroutines that can be referenced from elsewhere. (The use of |
subroutines that can be referenced from elsewhere. (The use of |
<a href="#subpatternsassubroutines">subroutines</a> |
<a href="#subpatternsassubroutines">subroutines</a> |
is described below.) For example, a pattern to match an IPv4 address such as |
is described below.) For example, a pattern to match an IPv4 address such as |
"192.168.23.245" could be written like this (ignore whitespace and line | "192.168.23.245" could be written like this (ignore white space and line |
breaks): |
breaks): |
<pre> |
<pre> |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
Line 2192 closing parenthesis. Nested parentheses are not permit
|
Line 2219 closing parenthesis. Nested parentheses are not permit
|
option is set, an unescaped # character also introduces a comment, which in |
option is set, an unescaped # character also introduces a comment, which in |
this case continues to immediately after the next newline character or |
this case continues to immediately after the next newline character or |
character sequence in the pattern. Which characters are interpreted as newlines |
character sequence in the pattern. Which characters are interpreted as newlines |
is controlled by the options passed to <b>pcre_compile()</b> or by a special | is controlled by the options passed to a compiling function or by a special |
sequence at the start of the pattern, as described in the section entitled |
sequence at the start of the pattern, as described in the section entitled |
<a href="#newlines">"Newline conventions"</a> |
<a href="#newlines">"Newline conventions"</a> |
above. Note that the end of this type of comment is a literal newline sequence |
above. Note that the end of this type of comment is a literal newline sequence |
Line 2491 same pair of parentheses when there is a repetition.
|
Line 2518 same pair of parentheses when there is a repetition.
|
<P> |
<P> |
PCRE provides a similar feature, but of course it cannot obey arbitrary Perl |
PCRE provides a similar feature, but of course it cannot obey arbitrary Perl |
code. The feature is called "callout". The caller of PCRE provides an external |
code. The feature is called "callout". The caller of PCRE provides an external |
function by putting its entry point in the global variable <i>pcre_callout</i>. | function by putting its entry point in the global variable <i>pcre_callout</i> |
By default, this variable contains NULL, which disables all calling out. | (8-bit library) or <i>pcre16_callout</i> (16-bit library). By default, this |
| variable contains NULL, which disables all calling out. |
</P> |
</P> |
<P> |
<P> |
Within a regular expression, (?C) indicates the points at which the external |
Within a regular expression, (?C) indicates the points at which the external |
Line 2502 For example, this pattern has two callout points:
|
Line 2530 For example, this pattern has two callout points:
|
<pre> |
<pre> |
(?C1)abc(?C2)def |
(?C1)abc(?C2)def |
</pre> |
</pre> |
If the PCRE_AUTO_CALLOUT flag is passed to <b>pcre_compile()</b>, callouts are | If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, callouts are |
automatically installed before each item in the pattern. They are all numbered |
automatically installed before each item in the pattern. They are all numbered |
255. |
255. |
</P> |
</P> |
<P> |
<P> |
During matching, when PCRE reaches a callout point (and <i>pcre_callout</i> is | During matching, when PCRE reaches a callout point, the external function is |
set), the external function is called. It is provided with the number of the | called. It is provided with the number of the callout, the position in the |
callout, the position in the pattern, and, optionally, one item of data | pattern, and, optionally, one item of data originally supplied by the caller of |
originally supplied by the caller of <b>pcre_exec()</b>. The callout function | the matching function. The callout function may cause matching to proceed, to |
may cause matching to proceed, to backtrack, or to fail altogether. A complete | backtrack, or to fail altogether. A complete description of the interface to |
description of the interface to the callout function is given in the | the callout function is given in the |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
documentation. |
documentation. |
<a name="backtrackcontrol"></a></P> |
<a name="backtrackcontrol"></a></P> |
Line 2526 remarks apply to the PCRE features described in this s
|
Line 2554 remarks apply to the PCRE features described in this s
|
</P> |
</P> |
<P> |
<P> |
Since these verbs are specifically related to backtracking, most of them can be |
Since these verbs are specifically related to backtracking, most of them can be |
used only when the pattern is to be matched using <b>pcre_exec()</b>, which uses | used only when the pattern is to be matched using one of the traditional |
a backtracking algorithm. With the exception of (*FAIL), which behaves like a | matching functions, which use a backtracking algorithm. With the exception of |
failing negative assertion, they cause an error if encountered by | (*FAIL), which behaves like a failing negative assertion, they cause an error |
<b>pcre_dfa_exec()</b>. | if encountered by a DFA matching function. |
</P> |
</P> |
<P> |
<P> |
If any of these verbs are used in an assertion or in a subpattern that is |
If any of these verbs are used in an assertion or in a subpattern that is |
Line 2539 exception: the name from a (*MARK), (*PRUNE), or (*THE
|
Line 2567 exception: the name from a (*MARK), (*PRUNE), or (*THE
|
a successful positive assertion <i>is</i> passed back when a match succeeds |
a successful positive assertion <i>is</i> passed back when a match succeeds |
(compare capturing parentheses in assertions). Note that such subpatterns are |
(compare capturing parentheses in assertions). Note that such subpatterns are |
processed as anchored at the point where they are tested. Note also that Perl's |
processed as anchored at the point where they are tested. Note also that Perl's |
treatment of subroutines is different in some cases. | treatment of subroutines and assertions is different in some cases. |
</P> |
</P> |
<P> |
<P> |
The new verbs make use of what was previously invalid syntax: an opening |
The new verbs make use of what was previously invalid syntax: an opening |
parenthesis followed by an asterisk. They are generally of the form |
parenthesis followed by an asterisk. They are generally of the form |
(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, |
(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour, |
depending on whether or not an argument is present. A name is any sequence of |
depending on whether or not an argument is present. A name is any sequence of |
characters that does not include a closing parenthesis. If the name is empty, | characters that does not include a closing parenthesis. The maximum length of |
that is, if the closing parenthesis immediately follows the colon, the effect | a name is 255 in the 8-bit library and 65535 in the 16-bit library. If the name |
is as if the colon were not there. Any number of these verbs may occur in a | is empty, that is, if the closing parenthesis immediately follows the colon, |
pattern. | the effect is as if the colon were not there. Any number of these verbs may |
</P> | occur in a pattern. |
| <a name="nooptimize"></a></P> |
| <br><b> |
| Optimizations that affect backtracking verbs |
| </b><br> |
<P> |
<P> |
PCRE contains some optimizations that are used to speed up matching by running |
PCRE contains some optimizations that are used to speed up matching by running |
some checks at the start of each match attempt. For example, it may know the |
some checks at the start of each match attempt. For example, it may know the |
Line 2559 present. When one of these optimizations suppresses th
|
Line 2591 present. When one of these optimizations suppresses th
|
included backtracking verbs will not, of course, be processed. You can suppress |
included backtracking verbs will not, of course, be processed. You can suppress |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option |
when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the |
when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the |
pattern with (*NO_START_OPT). | pattern with (*NO_START_OPT). There is more discussion of this option in the |
| section entitled |
| <a href="pcreapi.html#execoptions">"Option bits for <b>pcre_exec()</b>"</a> |
| in the |
| <a href="pcreapi.html"><b>pcreapi</b></a> |
| documentation. |
</P> |
</P> |
<P> |
<P> |
Experiments with Perl suggest that it too has similar optimizations, sometimes |
Experiments with Perl suggest that it too has similar optimizations, sometimes |
Line 2613 A name is always required with this verb. There may be
|
Line 2650 A name is always required with this verb. There may be
|
</P> |
</P> |
<P> |
<P> |
When a match succeeds, the name of the last-encountered (*MARK) on the matching |
When a match succeeds, the name of the last-encountered (*MARK) on the matching |
path is passed back to the caller via the <i>pcre_extra</i> data structure, as | path is passed back to the caller as described in the section entitled |
described in the | <a href="pcreapi.html#extradata">"Extra data for <b>pcre_exec()</b>"</a> |
<a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a> | |
in the |
in the |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
documentation. Here is an example of <b>pcretest</b> output, where the /K |
documentation. Here is an example of <b>pcretest</b> output, where the /K |
Line 2648 After a partial match or a failed match, the name of t
|
Line 2684 After a partial match or a failed match, the name of t
|
No match, mark = B |
No match, mark = B |
</pre> |
</pre> |
Note that in this unanchored example the mark is retained from the match |
Note that in this unanchored example the mark is retained from the match |
attempt that started at the letter "X". Subsequent match attempts starting at | attempt that started at the letter "X" in the subject. Subsequent match |
"P" and then with an empty string do not get as far as the (*MARK) item, but | attempts starting at "P" and then with an empty string do not get as far as the |
nevertheless do not reset it. | (*MARK) item, but nevertheless do not reset it. |
</P> |
</P> |
|
<P> |
|
If you are interested in (*MARK) values after failed matches, you should |
|
probably set the PCRE_NO_START_OPTIMIZE option |
|
<a href="#nooptimize">(see above)</a> |
|
to ensure that the match is always attempted. |
|
</P> |
<br><b> |
<br><b> |
Verbs that act after backtracking |
Verbs that act after backtracking |
</b><br> |
</b><br> |
Line 2816 overrides.
|
Line 2858 overrides.
|
<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br> |
<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br> |
<P> |
<P> |
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), |
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), |
<b>pcresyntax</b>(3), <b>pcre</b>(3). | <b>pcresyntax</b>(3), <b>pcre</b>(3), <b>pcre16</b>(3). |
</P> |
</P> |
<br><a name="SEC27" href="#TOC1">AUTHOR</a><br> |
<br><a name="SEC27" href="#TOC1">AUTHOR</a><br> |
<P> |
<P> |
Line 2829 Cambridge CB2 3QH, England.
|
Line 2871 Cambridge CB2 3QH, England.
|
</P> |
</P> |
<br><a name="SEC28" href="#TOC1">REVISION</a><br> |
<br><a name="SEC28" href="#TOC1">REVISION</a><br> |
<P> |
<P> |
Last updated: 29 November 2011 | Last updated: 17 June 2012 |
<br> |
<br> |
Copyright © 1997-2011 University of Cambridge. | Copyright © 1997-2012 University of Cambridge. |
<br> |
<br> |
<p> |
<p> |
Return to the <a href="index.html">PCRE index page</a>. |
Return to the <a href="index.html">PCRE index page</a>. |