version 1.1.1.1, 2012/02/21 23:05:52
|
version 1.1.1.2, 2012/02/21 23:50:25
|
Line 19 man page, in case the conversion went wrong.
|
Line 19 man page, in case the conversion went wrong.
|
<li><a name="TOC4" href="#SEC4">BACKSLASH</a> |
<li><a name="TOC4" href="#SEC4">BACKSLASH</a> |
<li><a name="TOC5" href="#SEC5">CIRCUMFLEX AND DOLLAR</a> |
<li><a name="TOC5" href="#SEC5">CIRCUMFLEX AND DOLLAR</a> |
<li><a name="TOC6" href="#SEC6">FULL STOP (PERIOD, DOT) AND \N</a> |
<li><a name="TOC6" href="#SEC6">FULL STOP (PERIOD, DOT) AND \N</a> |
<li><a name="TOC7" href="#SEC7">MATCHING A SINGLE BYTE</a> | <li><a name="TOC7" href="#SEC7">MATCHING A SINGLE DATA UNIT</a> |
<li><a name="TOC8" href="#SEC8">SQUARE BRACKETS AND CHARACTER CLASSES</a> |
<li><a name="TOC8" href="#SEC8">SQUARE BRACKETS AND CHARACTER CLASSES</a> |
<li><a name="TOC9" href="#SEC9">POSIX CHARACTER CLASSES</a> |
<li><a name="TOC9" href="#SEC9">POSIX CHARACTER CLASSES</a> |
<li><a name="TOC10" href="#SEC10">VERTICAL BAR</a> |
<li><a name="TOC10" href="#SEC10">VERTICAL BAR</a> |
Line 61 description of PCRE's regular expressions is intended
|
Line 61 description of PCRE's regular expressions is intended
|
</P> |
</P> |
<P> |
<P> |
The original operation of PCRE was on strings of one-byte characters. However, |
The original operation of PCRE was on strings of one-byte characters. However, |
there is now also support for UTF-8 character strings. To use this, | there is now also support for UTF-8 strings in the original library, and a |
PCRE must be built to include UTF-8 support, and you must call | second library that supports 16-bit and UTF-16 character strings. To use these |
<b>pcre_compile()</b> or <b>pcre_compile2()</b> with the PCRE_UTF8 option. There | features, PCRE must be built to include appropriate support. When using UTF |
is also a special sequence that can be given at the start of a pattern: | strings you must either call the compiling function with the PCRE_UTF8 or |
| PCRE_UTF16 option, or the pattern must start with one of these special |
| sequences: |
<pre> |
<pre> |
(*UTF8) |
(*UTF8) |
|
(*UTF16) |
</pre> |
</pre> |
Starting a pattern with this sequence is equivalent to setting the PCRE_UTF8 | Starting a pattern with such a sequence is equivalent to setting the relevant |
option. This feature is not Perl-compatible. How setting UTF-8 mode affects | option. This feature is not Perl-compatible. How setting a UTF mode affects |
pattern matching is mentioned in several places below. There is also a summary |
pattern matching is mentioned in several places below. There is also a summary |
of UTF-8 features in the | of features in the |
<a href="pcreunicode.html"><b>pcreunicode</b></a> |
<a href="pcreunicode.html"><b>pcreunicode</b></a> |
page. |
page. |
</P> |
</P> |
<P> |
<P> |
Another special sequence that may appear at the start of a pattern or in |
Another special sequence that may appear at the start of a pattern or in |
combination with (*UTF8) is: | combination with (*UTF8) or (*UTF16) is: |
<pre> |
<pre> |
(*UCP) |
(*UCP) |
</pre> |
</pre> |
Line 94 of newlines; they are described below.
|
Line 97 of newlines; they are described below.
|
</P> |
</P> |
<P> |
<P> |
The remainder of this document discusses the patterns that are supported by |
The remainder of this document discusses the patterns that are supported by |
PCRE when its main matching function, <b>pcre_exec()</b>, is used. | PCRE when one of its main matching functions, <b>pcre_exec()</b> (8-bit) or |
From release 6.0, PCRE offers a second matching function, | <b>pcre16_exec()</b> (16-bit), is used. PCRE also has alternative matching |
<b>pcre_dfa_exec()</b>, which matches using a different algorithm that is not | functions, <b>pcre_dfa_exec()</b> and <b>pcre16_dfa_exec()</b>, which match using |
Perl-compatible. Some of the features discussed below are not available when | a different algorithm that is not Perl-compatible. Some of the features |
<b>pcre_dfa_exec()</b> is used. The advantages and disadvantages of the | discussed below are not available when DFA matching is used. The advantages and |
alternative function, and how it differs from the normal function, are | disadvantages of the alternative functions, and how they differ from the normal |
discussed in the | functions, are discussed in the |
<a href="pcrematching.html"><b>pcrematching</b></a> |
<a href="pcrematching.html"><b>pcrematching</b></a> |
page. |
page. |
<a name="newlines"></a></P> |
<a name="newlines"></a></P> |
Line 126 string with one of the following five sequences:
|
Line 129 string with one of the following five sequences:
|
(*ANYCRLF) any of the three above |
(*ANYCRLF) any of the three above |
(*ANY) all Unicode newline sequences |
(*ANY) all Unicode newline sequences |
</pre> |
</pre> |
These override the default and the options given to <b>pcre_compile()</b> or | These override the default and the options given to the compiling function. For |
<b>pcre_compile2()</b>. For example, on a Unix system where LF is the default | example, on a Unix system where LF is the default newline sequence, the pattern |
newline sequence, the pattern | |
<pre> |
<pre> |
(*CR)a.b |
(*CR)a.b |
</pre> |
</pre> |
Line 158 corresponding characters in the subject. As a trivial
|
Line 160 corresponding characters in the subject. As a trivial
|
</pre> |
</pre> |
matches a portion of a subject string that is identical to itself. When |
matches a portion of a subject string that is identical to itself. When |
caseless matching is specified (the PCRE_CASELESS option), letters are matched |
caseless matching is specified (the PCRE_CASELESS option), letters are matched |
independently of case. In UTF-8 mode, PCRE always understands the concept of | independently of case. In a UTF mode, PCRE always understands the concept of |
case for characters whose values are less than 128, so caseless matching is |
case for characters whose values are less than 128, so caseless matching is |
always possible. For characters with higher values, the concept of case is |
always possible. For characters with higher values, the concept of case is |
supported if PCRE is compiled with Unicode property support, but not otherwise. |
supported if PCRE is compiled with Unicode property support, but not otherwise. |
If you want to use caseless matching for characters 128 and above, you must |
If you want to use caseless matching for characters 128 and above, you must |
ensure that PCRE is compiled with Unicode property support as well as with |
ensure that PCRE is compiled with Unicode property support as well as with |
UTF-8 support. | UTF support. |
</P> |
</P> |
<P> |
<P> |
The power of regular expressions comes from the ability to include alternatives |
The power of regular expressions comes from the ability to include alternatives |
Line 220 non-alphanumeric with backslash to specify that it sta
|
Line 222 non-alphanumeric with backslash to specify that it sta
|
particular, if you want to match a backslash, you write \\. |
particular, if you want to match a backslash, you write \\. |
</P> |
</P> |
<P> |
<P> |
In UTF-8 mode, only ASCII numbers and letters have any special meaning after a | In a UTF mode, only ASCII numbers and letters have any special meaning after a |
backslash. All other characters (in particular, those whose codepoints are |
backslash. All other characters (in particular, those whose codepoints are |
greater than 127) are treated as literals. |
greater than 127) are treated as literals. |
</P> |
</P> |
Line 276 is converted to upper case. Then bit 6 of the characte
|
Line 278 is converted to upper case. Then bit 6 of the characte
|
Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({ is 7B), while |
Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({ is 7B), while |
\c; becomes hex 7B (; is 3B). If the byte following \c has a value greater |
\c; becomes hex 7B (; is 3B). If the byte following \c has a value greater |
than 127, a compile-time error occurs. This locks out non-ASCII characters in |
than 127, a compile-time error occurs. This locks out non-ASCII characters in |
both byte mode and UTF-8 mode. (When PCRE is compiled in EBCDIC mode, all byte | all modes. (When PCRE is compiled in EBCDIC mode, all byte values are valid. A |
values are valid. A lower case letter is converted to upper case, and then the | lower case letter is converted to upper case, and then the 0xc0 bits are |
0xc0 bits are flipped.) | flipped.) |
</P> |
</P> |
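The \c rule above can be checked with a few lines of code. This is a minimal sketch of the ASCII rule only (upper-case a lower case letter, then invert bit 6, hex 40). The function name `control_escape` is an illustrative assumption and is not part of PCRE, and the EBCDIC variant is not modelled.

```python
def control_escape(ch: str) -> int:
    """Return the character code that "\\c" followed by ch would denote.

    Hypothetical helper for illustration; it models the ASCII rule only.
    """
    code = ord(ch)
    if code > 127:
        # The text describes this case as a compile-time error.
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= ch <= "z":
        code = ord(ch.upper())  # lower case letters are upper-cased first
    return code ^ 0x40          # then bit 6 (hex 40) is inverted


for ch in "z{;":
    print(f"\\c{ch} -> hex {control_escape(ch):02X}")
```

The loop reproduces the three cases quoted in the text: \cz gives 1A, \c{ gives 3B, and \c; gives 7B.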
<P> |
<P> |
By default, after \x, from zero to two hexadecimal digits are read (letters |
By default, after \x, from zero to two hexadecimal digits are read (letters |
can be in upper or lower case). Any number of hexadecimal digits may appear |
can be in upper or lower case). Any number of hexadecimal digits may appear |
between \x{ and }, but the value of the character code must be less than 256 | between \x{ and }, but the character code is constrained as follows: |
in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum | <pre> |
value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest | 8-bit non-UTF mode less than 0x100 |
Unicode code point, which is 10FFFF. | 8-bit UTF-8 mode less than 0x10ffff and a valid codepoint |
| 16-bit non-UTF mode less than 0x10000 |
| 16-bit UTF-16 mode less than 0x10ffff and a valid codepoint |
| </pre> |
| Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called |
| "surrogate" codepoints). |
</P> |
</P> |
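The four-row table in the newer revision can be read as a simple range check. The sketch below is an illustration, not PCRE's own code: the name `escape_value_ok` is hypothetical, and it assumes that 0x10ffff itself is accepted in the UTF modes, since that is the largest Unicode code point.

```python
# Surrogate code points, 0xd800 to 0xdfff inclusive, are not valid in UTF modes.
SURROGATES = range(0xD800, 0xE000)


def escape_value_ok(value: int, unit_bits: int, utf: bool) -> bool:
    """Say whether \\x{...} with this value fits the given library and mode.

    unit_bits is 8 for the original library and 16 for the 16-bit library.
    Hypothetical helper; the upper bound in UTF modes is an assumption.
    """
    if unit_bits not in (8, 16):
        raise ValueError("unit_bits must be 8 or 16")
    if value < 0:
        return False
    if utf:
        # Assumption: 0x10ffff, the largest code point, is itself allowed.
        return value <= 0x10FFFF and value not in SURROGATES
    # Non-UTF modes: the value must fit in one data unit.
    return value < (1 << unit_bits)


print(escape_value_ok(0x100, 8, utf=False))   # does not fit in one byte
print(escape_value_ok(0x100, 8, utf=True))    # ordinary code point
print(escape_value_ok(0xD800, 16, utf=True))  # surrogate
```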
<P> |
<P> |
If characters other than hexadecimal digits appear between \x{ and }, or if |
If characters other than hexadecimal digits appear between \x{ and }, or if |
Line 328 following the discussion of
|
Line 335 following the discussion of
|
Inside a character class, or if the decimal number is greater than 9 and there |
Inside a character class, or if the decimal number is greater than 9 and there |
have not been that many capturing subpatterns, PCRE re-reads up to three octal |
have not been that many capturing subpatterns, PCRE re-reads up to three octal |
digits following the backslash, and uses them to generate a data character. Any |
digits following the backslash, and uses them to generate a data character. Any |
subsequent digits stand for themselves. In non-UTF-8 mode, the value of a | subsequent digits stand for themselves. The value of the character is |
character specified in octal must be less than \400. In UTF-8 mode, values up | constrained in the same way as characters specified in hexadecimal. |
to \777 are permitted. For example: | For example: |
<pre> |
<pre> |
\040 is another way of writing a space |
\040 is another way of writing a space |
\40 is the same, provided there are fewer than 40 previous capturing subpatterns |
\40 is the same, provided there are fewer than 40 previous capturing subpatterns |
Line 339 to \777 are permitted. For example:
|
Line 346 to \777 are permitted. For example:
|
\011 is always a tab |
\011 is always a tab |
\0113 is a tab followed by the character "3" |
\0113 is a tab followed by the character "3" |
\113 might be a back reference, otherwise the character with octal code 113 |
\113 might be a back reference, otherwise the character with octal code 113 |
\377 might be a back reference, otherwise the byte consisting entirely of 1 bits | \377 might be a back reference, otherwise the value 255 (decimal) |
\81 is either a back reference, or a binary zero followed by the two characters "8" and "1" |
\81 is either a back reference, or a binary zero followed by the two characters "8" and "1" |
</pre> |
</pre> |
Note that octal values of 100 or greater must not be introduced by a leading |
Note that octal values of 100 or greater must not be introduced by a leading |
Line 443 accented letters, and these are then matched by \w. Th
|
Line 450 accented letters, and these are then matched by \w. Th
|
Unicode is discouraged. |
Unicode is discouraged. |
</P> |
</P> |
<P> |
<P> |
By default, in UTF-8 mode, characters with values greater than 127 never match | By default, in a UTF mode, characters with values greater than 127 never match |
\d, \s, or \w, and always match \D, \S, and \W. These sequences retain |
\d, \s, or \w, and always match \D, \S, and \W. These sequences retain |
their original meanings from before UTF-8 support was available, mainly for | their original meanings from before UTF support was available, mainly for |
efficiency reasons. However, if PCRE is compiled with Unicode property support, |
efficiency reasons. However, if PCRE is compiled with Unicode property support, |
and the PCRE_UCP option is set, the behaviour is changed so that Unicode |
and the PCRE_UCP option is set, the behaviour is changed so that Unicode |
properties are used to determine character types, as follows: |
properties are used to determine character types, as follows: |
Line 463 is noticeably slower when PCRE_UCP is set.
|
Line 470 is noticeably slower when PCRE_UCP is set.
|
<P> |
<P> |
The sequences \h, \H, \v, and \V are features that were added to Perl at |
The sequences \h, \H, \v, and \V are features that were added to Perl at |
release 5.10. In contrast to the other sequences, which match only ASCII |
release 5.10. In contrast to the other sequences, which match only ASCII |
characters by default, these always match certain high-valued codepoints in | characters by default, these always match certain high-valued codepoints, |
UTF-8 mode, whether or not PCRE_UCP is set. The horizontal space characters | whether or not PCRE_UCP is set. The horizontal space characters are: |
are: | |
<pre> |
<pre> |
U+0009 Horizontal tab |
U+0009 Horizontal tab |
U+0020 Space |
U+0020 Space |
Line 496 The vertical space characters are:
|
Line 502 The vertical space characters are:
|
U+0085 Next line |
U+0085 Next line |
U+2028 Line separator |
U+2028 Line separator |
U+2029 Paragraph separator |
U+2029 Paragraph separator |
<a name="newlineseq"></a></PRE> | </pre> |
</P> | In 8-bit, non-UTF-8 mode, only the characters with codepoints less than 256 are |
| relevant. |
| <a name="newlineseq"></a></P> |
<br><b> |
<br><b> |
Newline sequences |
Newline sequences |
</b><br> |
</b><br> |
<P> |
<P> |
Outside a character class, by default, the escape sequence \R matches any |
Outside a character class, by default, the escape sequence \R matches any |
Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the following: | Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent to the |
| following: |
<pre> |
<pre> |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
</pre> |
</pre> |
Line 516 line, U+0085). The two-character sequence is treated a
|
Line 525 line, U+0085). The two-character sequence is treated a
|
cannot be split. |
cannot be split. |
</P> |
</P> |
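To see which sequences the \R equivalent above picks out, here is a rough approximation using Python's `re` module rather than PCRE. It is a sketch under assumptions: the atomic group is replaced by an alternation that lists CRLF first, which gives the same result for a simple left-to-right scan but does not reproduce PCRE's backtracking behaviour. The names `BSR_8BIT`, `BSR_UNICODE`, and `newline_sequences` are illustrative only.

```python
import re

# The 8-bit non-UTF-8 set from the text. CRLF comes first so that the
# two-character sequence is consumed as one unit.
BSR_8BIT = re.compile(r"\r\n|[\n\x0b\f\r\x85]")

# The same set with LS (U+2028) and PS (U+2029), which the text says are
# added in the other modes.
BSR_UNICODE = re.compile(r"\r\n|[\n\x0b\f\r\x85\u2028\u2029]")


def newline_sequences(text, pattern=BSR_8BIT):
    """Return every newline sequence found in text, in order."""
    return pattern.findall(text)


sample = "a\r\nb\nc\x85d\re\u2028f"
print(newline_sequences(sample))
print(newline_sequences(sample, BSR_UNICODE))
```

With the first pattern, CRLF is reported as a single item and U+2028 is ignored; with the second, U+2028 is reported as well.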
<P> |
<P> |
In UTF-8 mode, two additional characters whose codepoints are greater than 255 | In other modes, two additional characters whose codepoints are greater than 255 |
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). |
are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). |
Unicode character property support is not needed for these characters to be |
Unicode character property support is not needed for these characters to be |
recognized. |
recognized. |
Line 533 one of the following sequences:
|
Line 542 one of the following sequences:
|
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_UNICODE) any Unicode newline sequence |
(*BSR_UNICODE) any Unicode newline sequence |
</pre> |
</pre> |
These override the default and the options given to <b>pcre_compile()</b> or | These override the default and the options given to the compiling function, but |
<b>pcre_compile2()</b>, but they can be overridden by options given to | they can themselves be overridden by options given to a matching function. Note |
<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. Note that these special settings, | that these special settings, which are not Perl-compatible, are recognized only |
which are not Perl-compatible, are recognized only at the very start of a | at the very start of a pattern, and that they must be in upper case. If more |
pattern, and that they must be in upper case. If more than one of them is | than one of them is present, the last one is used. They can be combined with a |
present, the last one is used. They can be combined with a change of newline | change of newline convention; for example, a pattern can start with: |
convention; for example, a pattern can start with: | |
<pre> |
<pre> |
(*ANY)(*BSR_ANYCRLF) |
(*ANY)(*BSR_ANYCRLF) |
</pre> |
</pre> |
They can also be combined with the (*UTF8) or (*UCP) special sequences. Inside | They can also be combined with the (*UTF8), (*UTF16), or (*UCP) special |
a character class, \R is treated as an unrecognized escape sequence, and so | sequences. Inside a character class, \R is treated as an unrecognized escape |
matches the letter "R" by default, but causes an error if PCRE_EXTRA is set. | sequence, and so matches the letter "R" by default, but causes an error if |
| PCRE_EXTRA is set. |
<a name="uniextseq"></a></P> |
<a name="uniextseq"></a></P> |
<br><b> |
<br><b> |
Unicode character properties |
Unicode character properties |
Line 553 Unicode character properties
|
Line 562 Unicode character properties
|
<P> |
<P> |
When PCRE is built with Unicode character property support, three additional |
When PCRE is built with Unicode character property support, three additional |
escape sequences that match characters with specific properties are available. |
escape sequences that match characters with specific properties are available. |
When not in UTF-8 mode, these sequences are of course limited to testing | When in 8-bit non-UTF-8 mode, these sequences are of course limited to testing |
characters whose codepoints are less than 256, but they do work in this mode. |
characters whose codepoints are less than 256, but they do work in this mode. |
The extra escape sequences are: |
The extra escape sequences are: |
<pre> |
<pre> |
Line 742 a modifier or "other".
|
Line 751 a modifier or "other".
|
</P> |
</P> |
<P> |
<P> |
The Cs (Surrogate) property applies only to characters in the range U+D800 to |
The Cs (Surrogate) property applies only to characters in the range U+D800 to |
U+DFFF. Such characters are not valid in UTF-8 strings (see RFC 3629) and so | U+DFFF. Such characters are not valid in Unicode strings and so |
cannot be tested by PCRE, unless UTF-8 validity checking has been turned off | cannot be tested by PCRE, unless UTF validity checking has been turned off |
(see the discussion of PCRE_NO_UTF8_CHECK in the | (see the discussion of PCRE_NO_UTF8_CHECK and PCRE_NO_UTF16_CHECK in the |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
page). Perl does not support the Cs property. |
page). Perl does not support the Cs property. |
</P> |
</P> |
Line 774 atomic group
|
Line 783 atomic group
|
<a href="#atomicgroup">(see below).</a> |
<a href="#atomicgroup">(see below).</a> |
Characters with the "mark" property are typically accents that affect the |
Characters with the "mark" property are typically accents that affect the |
preceding character. None of them have codepoints less than 256, so in |
preceding character. None of them have codepoints less than 256, so in |
non-UTF-8 mode \X matches any one character. | 8-bit non-UTF-8 mode \X matches any one character. |
</P> |
</P> |
<P> |
<P> |
Note that recent versions of Perl have changed \X to match what Unicode calls |
Note that recent versions of Perl have changed \X to match what Unicode calls |
Line 785 Matching characters by Unicode property is not fast, b
|
Line 794 Matching characters by Unicode property is not fast, b
|
a structure that contains data for over fifteen thousand characters. That is |
a structure that contains data for over fifteen thousand characters. That is |
why the traditional escape sequences such as \d and \w do not use Unicode |
why the traditional escape sequences such as \d and \w do not use Unicode |
properties in PCRE by default, though you can make them do so by setting the |
properties in PCRE by default, though you can make them do so by setting the |
PCRE_UCP option for <b>pcre_compile()</b> or by starting the pattern with | PCRE_UCP option or by starting the pattern with (*UCP). |
(*UCP). | |
<a name="extraprops"></a></P> |
<a name="extraprops"></a></P> |
<br><b> |
<br><b> |
PCRE's additional properties |
PCRE's additional properties |
Line 865 escape sequence" error is generated instead.
|
Line 873 escape sequence" error is generated instead.
|
A word boundary is a position in the subject string where the current character |
A word boundary is a position in the subject string where the current character |
and the previous character do not both match \w or \W (i.e. one matches |
and the previous character do not both match \w or \W (i.e. one matches |
\w and the other matches \W), or the start or end of the string if the |
\w and the other matches \W), or the start or end of the string if the |
first or last character matches \w, respectively. In UTF-8 mode, the meanings | first or last character matches \w, respectively. In a UTF mode, the meanings |
of \w and \W can be changed by setting the PCRE_UCP option. When this is |
of \w and \W can be changed by setting the PCRE_UCP option. When this is |
done, it also affects \b and \B. Neither PCRE nor Perl has a separate "start |
done, it also affects \b and \B. Neither PCRE nor Perl has a separate "start |
of word" or "end of word" metasequence. However, whatever follows \b normally |
of word" or "end of word" metasequence. However, whatever follows \b normally |
Line 962 end of the subject in both modes, and if all branches
|
Line 970 end of the subject in both modes, and if all branches
|
<P> |
<P> |
Outside a character class, a dot in the pattern matches any one character in |
Outside a character class, a dot in the pattern matches any one character in |
the subject string except (by default) a character that signifies the end of a |
the subject string except (by default) a character that signifies the end of a |
line. In UTF-8 mode, the matched character may be more than one byte long. | line. |
</P> |
</P> |
<P> |
<P> |
When a line ending is defined as a single character, dot never matches that |
When a line ending is defined as a single character, dot never matches that |
Line 989 the PCRE_DOTALL option. In other words, it matches any
|
Line 997 the PCRE_DOTALL option. In other words, it matches any
|
that signifies the end of a line. Perl also uses \N to match characters by |
that signifies the end of a line. Perl also uses \N to match characters by |
name; PCRE does not support this. |
name; PCRE does not support this. |
</P> |
</P> |
<br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br> | <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE DATA UNIT</a><br> |
<P> |
<P> |
Outside a character class, the escape sequence \C matches any one byte, both | Outside a character class, the escape sequence \C matches any one data unit, |
in and out of UTF-8 mode. Unlike a dot, it always matches line-ending | whether or not a UTF mode is set. In the 8-bit library, one data unit is one |
characters. The feature is provided in Perl in order to match individual bytes | byte; in the 16-bit library it is a 16-bit unit. Unlike a dot, \C always |
in UTF-8 mode, but it is unclear how it can usefully be used. Because \C | matches line-ending characters. The feature is provided in Perl in order to |
breaks up characters into individual bytes, matching one byte with \C in UTF-8 | match individual bytes in UTF-8 mode, but it is unclear how it can usefully be |
mode means that the rest of the string may start with a malformed UTF-8 | used. Because \C breaks up characters into individual data units, matching one |
character. This has undefined results, because PCRE assumes that it is dealing | unit with \C in a UTF mode means that the rest of the string may start with a |
with valid UTF-8 strings (and by default it checks this at the start of | malformed UTF character. This has undefined results, because PCRE assumes that |
processing unless the PCRE_NO_UTF8_CHECK option is used). | it is dealing with valid UTF strings (and by default it checks this at the |
| start of processing unless the PCRE_NO_UTF8_CHECK option is used). |
</P> |
</P> |
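The risk described above comes from one character occupying several data units. The following sketch counts the units per character in each encoding using Python's standard codecs; it does not call PCRE, and the name `data_units` is a hypothetical helper.

```python
def data_units(ch):
    """Return (UTF-8 units, UTF-16 units) needed to encode one character.

    A UTF-8 data unit is one byte; a UTF-16 data unit is two bytes.
    """
    utf8_units = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return utf8_units, utf16_units


for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {data_units(ch)}")
```

Any character whose count is greater than 1 would be split by a single \C in the corresponding UTF mode, leaving the rest of the subject starting in the middle of a character.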
<P> |
<P> |
PCRE does not allow \C to appear in lookbehind assertions |
PCRE does not allow \C to appear in lookbehind assertions |
<a href="#lookbehind">(described below)</a> |
<a href="#lookbehind">(described below)</a> |
in UTF-8 mode, because this would make it impossible to calculate the length of | in a UTF mode, because this would make it impossible to calculate the length of |
the lookbehind. |
the lookbehind. |
</P> |
</P> |
<P> |
<P> |
In general, the \C escape sequence is best avoided in UTF-8 mode. However, one | In general, the \C escape sequence is best avoided. However, one |
way of using it that avoids the problem of malformed UTF-8 characters is to | way of using it that avoids the problem of malformed UTF characters is to use a |
use a lookahead to check the length of the next character, as in this pattern | lookahead to check the length of the next character, as in this pattern, which |
(ignore white space and line breaks): | could be used with a UTF-8 string (ignore white space and line breaks): |
<pre> |
<pre> |
(?| (?=[\x00-\x7f])(\C) | |
(?| (?=[\x00-\x7f])(\C) | |
(?=[\x80-\x{7ff}])(\C)(\C) | |
(?=[\x80-\x{7ff}])(\C)(\C) | |
Line 1036 a member of the class, it should be the first data cha
|
Line 1045 a member of the class, it should be the first data cha
|
(after an initial circumflex, if present) or escaped with a backslash. |
(after an initial circumflex, if present) or escaped with a backslash. |
</P> |
</P> |
<P> |
<P> |
A character class matches a single character in the subject. In UTF-8 mode, the | A character class matches a single character in the subject. In a UTF mode, the |
character may be more than one byte long. A matched character must be in the | character may be more than one data unit long. A matched character must be in |
set of characters defined by the class, unless the first character in the class | the set of characters defined by the class, unless the first character in the |
definition is a circumflex, in which case the subject character must not be in | class definition is a circumflex, in which case the subject character must not |
the set defined by the class. If a circumflex is actually required as a member | be in the set defined by the class. If a circumflex is actually required as a |
of the class, ensure it is not the first character, or escape it with a | member of the class, ensure it is not the first character, or escape it with a |
backslash. |
backslash. |
</P> |
</P> |
<P> |
<P> |
Line 1054 string, and therefore it fails if the current pointer
|
Line 1063 string, and therefore it fails if the current pointer
|
string. |
string. |
</P> |
</P> |
<P> |
<P> |
In UTF-8 mode, characters with values greater than 255 can be included in a | In UTF-8 (UTF-16) mode, characters with values greater than 255 (0xffff) can be |
class as a literal string of bytes, or by using the \x{ escaping mechanism. | included in a class as a literal string of data units, or by using the \x{ |
| escaping mechanism. |
</P> |
</P> |
<P> |
<P> |
When caseless matching is set, any letters in a class represent both their |
When caseless matching is set, any letters in a class represent both their |
upper case and lower case versions, so for example, a caseless [aeiou] matches |
upper case and lower case versions, so for example, a caseless [aeiou] matches |
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a |
"A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a |
caseful version would. In UTF-8 mode, PCRE always understands the concept of | caseful version would. In a UTF mode, PCRE always understands the concept of |
case for characters whose values are less than 128, so caseless matching is |
case for characters whose values are less than 128, so caseless matching is |
always possible. For characters with higher values, the concept of case is |
always possible. For characters with higher values, the concept of case is |
supported if PCRE is compiled with Unicode property support, but not otherwise. |
supported if PCRE is compiled with Unicode property support, but not otherwise. |
If you want to use caseless matching in UTF8-mode for characters 128 and above, | If you want to use caseless matching in a UTF mode for characters 128 and |
you must ensure that PCRE is compiled with Unicode property support as well as | above, you must ensure that PCRE is compiled with Unicode property support as |
with UTF-8 support. | well as with UTF support. |
</P> |
</P> |
<P> |
<P> |
Characters that might indicate line breaks are never treated in any special way |
Characters that might indicate line breaks are never treated in any special way |
Line 1093 followed by two other characters. The octal or hexadec
|
Line 1103 followed by two other characters. The octal or hexadec
|
</P> |
</P> |
<P> |
<P> |
Ranges operate in the collating sequence of character values. They can also be |
Ranges operate in the collating sequence of character values. They can also be |
used for characters specified numerically, for example [\000-\037]. In UTF-8 | used for characters specified numerically, for example [\000-\037]. Ranges |
mode, ranges can include characters whose values are greater than 255, for | can include any characters that are valid for the current mode. |
example [\x{100}-\x{2ff}]. | |
</P> |
</P> |
<P> |
<P> |
If a range that includes letters is used when caseless matching is set, it |
If a range that includes letters is used when caseless matching is set, it |
matches the letters in either case. For example, [W-c] is equivalent to |
matches the letters in either case. For example, [W-c] is equivalent to |
[][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if character | [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character |
tables for a French locale are in use, [\xc8-\xcb] matches accented E |
tables for a French locale are in use, [\xc8-\xcb] matches accented E |
characters in both cases. In UTF-8 mode, PCRE supports the concept of case for | characters in both cases. In UTF modes, PCRE supports the concept of case for |
characters with values greater than 127 only when it is compiled with Unicode |
characters with values greater than 127 only when it is compiled with Unicode |
property support. |
property support. |
</P> |
</P> |
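The [W-c] example can be checked by listing the ASCII characters that a caseless range accepts. This sketch uses Python's `re` module as a stand-in, on the assumption that its caseless handling of ASCII ranges agrees with the behaviour described here; it is not a test of PCRE itself.

```python
import re

# Caseless version of the range from the text.
caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

# Every ASCII character that the caseless range accepts.
matched = {chr(c) for c in range(128) if caseless_range.fullmatch(chr(c))}

# The characters of the equivalent class, in both cases.
expected = set("[]\\^_`") | set("wxyzabc") | set("WXYZABC")

print(sorted(matched))
print(matched == expected)
```

The range W to c covers W, X, Y, Z, the six punctuation characters between Z and a, and a, b, c; caseless matching then adds the other case of each letter.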
Line 1110 property support.
|
Line 1119 property support.
|
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, |
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, |
\V, \w, and \W may appear in a character class, and add the characters that |
\V, \w, and \W may appear in a character class, and add the characters that |
they match to the class. For example, [\dABCDEF] matches any hexadecimal |
they match to the class. For example, [\dABCDEF] matches any hexadecimal |
digit. In UTF-8 mode, the PCRE_UCP option affects the meanings of \d, \s, \w | digit. In UTF modes, the PCRE_UCP option affects the meanings of \d, \s, \w |
and their upper case partners, just as it does when they appear outside a |
and their upper case partners, just as it does when they appear outside a |
character class, as described in the section entitled |
character class, as described in the section entitled |
<a href="#genericchartypes">"Generic character types"</a> |
<a href="#genericchartypes">"Generic character types"</a> |
Line 1179 syntax [.ch.] and [=ch=] where "ch" is a "collating el
|
Line 1188 syntax [.ch.] and [=ch=] where "ch" is a "collating el
|
supported, and an error is given if they are encountered. |
supported, and an error is given if they are encountered. |
</P> |
</P> |
<P> |
<P> |
By default, in UTF-8 mode, characters with values greater than 128 do not match | By default, in UTF modes, characters with values greater than 128 do not match |
any of the POSIX character classes. However, if the PCRE_UCP option is passed |
any of the POSIX character classes. However, if the PCRE_UCP option is passed |
to <b>pcre_compile()</b>, some of the classes are changed so that Unicode |
to <b>pcre_compile()</b>, some of the classes are changed so that Unicode |
character properties are used. This is achieved by replacing the POSIX classes |
character properties are used. This is achieved by replacing the POSIX classes |
Line 1264 behaviour otherwise.
|
Line 1273 behaviour otherwise.
|
</P> |
</P> |
<P> |
<P> |
<b>Note:</b> There are other PCRE-specific options that can be set by the |
<b>Note:</b> There are other PCRE-specific options that can be set by the |
application when the compile or match functions are called. In some cases the | application when the compiling or matching functions are called. In some cases |
pattern can contain special leading sequences such as (*CRLF) to override what | the pattern can contain special leading sequences such as (*CRLF) to override |
the application has set or what has been defaulted. Details are given in the | what the application has set or what has been defaulted. Details are given in |
section entitled | the section entitled |
<a href="#newlineseq">"Newline sequences"</a> |
<a href="#newlineseq">"Newline sequences"</a> |
above. There are also the (*UTF8) and (*UCP) leading sequences that can be used | above. There are also the (*UTF8), (*UTF16), and (*UCP) leading sequences that |
to set UTF-8 and Unicode property modes; they are equivalent to setting the | can be used to set UTF and Unicode property modes; they are equivalent to |
PCRE_UTF8 and the PCRE_UCP options, respectively. | setting the PCRE_UTF8, PCRE_UTF16, and the PCRE_UCP options, respectively. |
<a name="subpattern"></a></P> |
<a name="subpattern"></a></P> |
<br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br> |
<br><a name="SEC12" href="#TOC1">SUBPATTERNS</a><br> |
<P> |
<P> |
Line 1289 match "cataract", "erpillar" or an empty string.
|
Line 1298 match "cataract", "erpillar" or an empty string.
|
<br> |
<br> |
2. It sets up the subpattern as a capturing subpattern. This means that, when |
2. It sets up the subpattern as a capturing subpattern. This means that, when |
the whole pattern matches, that portion of the subject string that matched the |
the whole pattern matches, that portion of the subject string that matched the |
subpattern is passed back to the caller via the <i>ovector</i> argument of | subpattern is passed back to the caller via the <i>ovector</i> argument of the |
<b>pcre_exec()</b>. Opening parentheses are counted from left to right (starting | matching function. (This applies only to the traditional matching functions; |
from 1) to obtain numbers for the capturing subpatterns. For example, if the | the DFA matching functions do not support capturing.) |
string "the red king" is matched against the pattern | </P> |
| <P> |
| Opening parentheses are counted from left to right (starting from 1) to obtain |
| numbers for the capturing subpatterns. For example, if the string "the red |
| king" is matched against the pattern |
<pre> |
<pre> |
the ((red|white) (king|queen)) |
the ((red|white) (king|queen)) |
</pre> |
</pre> |
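The left-to-right numbering of opening parentheses can be tried out with the short sketch below. It uses Python's <b>re</b> module rather than PCRE itself (an assumption made for illustration; the two number capturing groups the same way), and the variable names are hypothetical.

```python
import re

# Opening parentheses are numbered 1, 2, 3 from left to right:
#   group 1 = ((red|white) (king|queen))
#   group 2 = (red|white)
#   group 3 = (king|queen)
pattern = re.compile(r"the ((red|white) (king|queen))")

match = pattern.match("the red king")
captures = match.groups()

for number, text in enumerate(captures, start=1):
    print(number, text)
```

Group 1 is the outer pair of parentheses because its opening parenthesis comes first, even though it closes last.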
Line 1452 items:
|
Line 1465 items:
|
a literal data character |
a literal data character |
the dot metacharacter |
the dot metacharacter |
the \C escape sequence |
the \C escape sequence |
the \X escape sequence (in UTF-8 mode with Unicode properties) | the \X escape sequence |
the \R escape sequence |
the \R escape sequence |
an escape such as \d or \pL that matches a single character |
an escape such as \d or \pL that matches a single character |
a character class |
a character class |
Line 1484 quantifier, is taken as a literal character. For examp
|
Line 1497 quantifier, is taken as a literal character. For examp
|
quantifier, but a literal string of four characters. |
quantifier, but a literal string of four characters. |
</P> |
</P> |
<P> |
<P> |
In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual | In UTF modes, quantifiers apply to characters rather than to individual data |
bytes. Thus, for example, \x{100}{2} matches two UTF-8 characters, each of | units. Thus, for example, \x{100}{2} matches two characters, each of |
which is represented by a two-byte sequence. Similarly, when Unicode property | which is represented by a two-byte sequence in a UTF-8 string. Similarly, |
support is available, \X{3} matches three Unicode extended sequences, each of | \X{3} matches three Unicode extended sequences, each of which may be several |
which may be several bytes long (and they may be of different lengths). | data units long (and they may be of different lengths). |
</P> |
</P> |
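The difference between quantifying a character and quantifying a data unit can be sketched as follows. The example uses Python's <b>re</b> module as an analogy, not PCRE itself: a str pattern stands in for a UTF mode, and a bytes pattern stands in for the 8-bit non-UTF mode. That mapping is an assumption made for illustration. Python writes the character as \u0100 where PCRE writes \x{100}.

```python
import re

ch = "\u0100"               # U+0100, written \x{100} in PCRE syntax
utf8 = ch.encode("utf-8")   # two bytes in UTF-8: 0xC4 0x80

# Character-level matching (analogous to a UTF mode):
# the quantifier {2} repeats the whole character.
char_match = re.fullmatch("\u0100{2}", ch * 2)

# Byte-level matching (analogous to the non-UTF 8-bit mode):
# the quantifier {2} binds only to the final byte, 0x80.
byte_pattern = re.compile(b"\xc4\x80{2}")
two_chars = byte_pattern.fullmatch(utf8 * 2)
repeated_last_byte = byte_pattern.fullmatch(b"\xc4\x80\x80")

print(len(utf8))
print(char_match is not None)
print(two_chars is None, repeated_last_byte is not None)
```

In the byte-level case the pattern no longer describes two copies of the character, which is why the quantifier has to apply to characters in a UTF mode.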
<P> |
<P> |
The quantifier {0} is permitted, causing the expression to behave as if the |
The quantifier {0} is permitted, causing the expression to behave as if the |
Line 1950 match. If there are insufficient characters before the
|
Line 1963 match. If there are insufficient characters before the
|
assertion fails. |
assertion fails. |
</P> |
</P> |
<P> |
<P> |
In UTF-8 mode, PCRE does not allow the \C escape (which matches a single byte, | In a UTF mode, PCRE does not allow the \C escape (which matches a single data |
even in UTF-8 mode) to appear in lookbehind assertions, because it makes it | unit even in a UTF mode) to appear in lookbehind assertions, because it makes |
impossible to calculate the length of the lookbehind. The \X and \R escapes, | it impossible to calculate the length of the lookbehind. The \X and \R |
which can match different numbers of bytes, are also not permitted. | escapes, which can match different numbers of data units, are also not |
| permitted. |
</P> |
</P> |
<P> |
<P> |
<a href="#subpatternsassubroutines">"Subroutine"</a> |
<a href="#subpatternsassubroutines">"Subroutine"</a> |
Line 2192 closing parenthesis. Nested parentheses are not permit
|
Line 2206 closing parenthesis. Nested parentheses are not permit
|
option is set, an unescaped # character also introduces a comment, which in |
option is set, an unescaped # character also introduces a comment, which in |
this case continues to immediately after the next newline character or |
this case continues to immediately after the next newline character or |
character sequence in the pattern. Which characters are interpreted as newlines |
character sequence in the pattern. Which characters are interpreted as newlines |
is controlled by the options passed to <b>pcre_compile()</b> or by a special | is controlled by the options passed to a compiling function or by a special |
sequence at the start of the pattern, as described in the section entitled |
sequence at the start of the pattern, as described in the section entitled |
<a href="#newlines">"Newline conventions"</a> |
<a href="#newlines">"Newline conventions"</a> |
above. Note that the end of this type of comment is a literal newline sequence |
above. Note that the end of this type of comment is a literal newline sequence |
Line 2491 same pair of parentheses when there is a repetition.
|
Line 2505 same pair of parentheses when there is a repetition.
|
<P> |
<P> |
PCRE provides a similar feature, but of course it cannot obey arbitrary Perl |
PCRE provides a similar feature, but of course it cannot obey arbitrary Perl |
code. The feature is called "callout". The caller of PCRE provides an external |
code. The feature is called "callout". The caller of PCRE provides an external |
function by putting its entry point in the global variable <i>pcre_callout</i>. | function by putting its entry point in the global variable <i>pcre_callout</i> |
By default, this variable contains NULL, which disables all calling out. | (8-bit library) or <i>pcre16_callout</i> (16-bit library). By default, this |
| variable contains NULL, which disables all calling out. |
</P> |
</P> |
<P> |
<P> |
Within a regular expression, (?C) indicates the points at which the external |
Within a regular expression, (?C) indicates the points at which the external |
Line 2502 For example, this pattern has two callout points:
|
Line 2517 For example, this pattern has two callout points:
|
<pre> |
<pre> |
(?C1)abc(?C2)def |
(?C1)abc(?C2)def |
</pre> |
</pre> |
If the PCRE_AUTO_CALLOUT flag is passed to <b>pcre_compile()</b>, callouts are | If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, callouts are |
automatically installed before each item in the pattern. They are all numbered |
automatically installed before each item in the pattern. They are all numbered |
255. |
255. |
</P> |
</P> |
<P> |
<P> |
During matching, when PCRE reaches a callout point (and <i>pcre_callout</i> is | During matching, when PCRE reaches a callout point, the external function is |
set), the external function is called. It is provided with the number of the | called. It is provided with the number of the callout, the position in the |
callout, the position in the pattern, and, optionally, one item of data | pattern, and, optionally, one item of data originally supplied by the caller of |
originally supplied by the caller of <b>pcre_exec()</b>. The callout function | the matching function. The callout function may cause matching to proceed, to |
may cause matching to proceed, to backtrack, or to fail altogether. A complete | backtrack, or to fail altogether. A complete description of the interface to |
description of the interface to the callout function is given in the | the callout function is given in the |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
documentation. |
documentation. |
<a name="backtrackcontrol"></a></P> |
<a name="backtrackcontrol"></a></P> |
Line 2526 remarks apply to the PCRE features described in this s
|
Line 2541 remarks apply to the PCRE features described in this s
|
</P> |
</P> |
<P> |
<P> |
Since these verbs are specifically related to backtracking, most of them can be |
Since these verbs are specifically related to backtracking, most of them can be |
used only when the pattern is to be matched using <b>pcre_exec()</b>, which uses | used only when the pattern is to be matched using one of the traditional |
a backtracking algorithm. With the exception of (*FAIL), which behaves like a | matching functions, which use a backtracking algorithm. With the exception of |
failing negative assertion, they cause an error if encountered by | (*FAIL), which behaves like a failing negative assertion, they cause an error |
<b>pcre_dfa_exec()</b>. | if encountered by a DFA matching function. |
</P> |
</P> |
<P> |
<P> |
If any of these verbs are used in an assertion or in a subpattern that is |
If any of these verbs are used in an assertion or in a subpattern that is |
Line 2613 A name is always required with this verb. There may be
|
Line 2628 A name is always required with this verb. There may be
|
</P> |
</P> |
<P> |
<P> |
When a match succeeds, the name of the last-encountered (*MARK) on the matching |
When a match succeeds, the name of the last-encountered (*MARK) on the matching |
path is passed back to the caller via the <i>pcre_extra</i> data structure, as | path is passed back to the caller as described in the section entitled |
described in the | <a href="pcreapi.html#extradata">"Extra data for <b>pcre_exec()</b>"</a> |
<a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a> | |
in the |
in the |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
documentation. Here is an example of <b>pcretest</b> output, where the /K |
documentation. Here is an example of <b>pcretest</b> output, where the /K |
Line 2816 overrides.
|
Line 2830 overrides.
|
<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br> |
<br><a name="SEC26" href="#TOC1">SEE ALSO</a><br> |
<P> |
<P> |
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), |
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), |
<b>pcresyntax</b>(3), <b>pcre</b>(3). | <b>pcresyntax</b>(3), <b>pcre</b>(3), <b>pcre16(3)</b>. |
</P> |
</P> |
<br><a name="SEC27" href="#TOC1">AUTHOR</a><br> |
<br><a name="SEC27" href="#TOC1">AUTHOR</a><br> |
<P> |
<P> |
Line 2829 Cambridge CB2 3QH, England.
|
Line 2843 Cambridge CB2 3QH, England.
|
</P> |
</P> |
<br><a name="SEC28" href="#TOC1">REVISION</a><br> |
<br><a name="SEC28" href="#TOC1">REVISION</a><br> |
<P> |
<P> |
Last updated: 29 November 2011 | Last updated: 09 January 2012 |
<br> |
<br> |
Copyright © 1997-2011 University of Cambridge. | Copyright © 1997-2012 University of Cambridge. |
<br> |
<br> |
<p> |
<p> |
Return to the <a href="index.html">PCRE index page</a>. |
Return to the <a href="index.html">PCRE index page</a>. |