version 1.1.1.4, 2013/07/22 08:25:57
|
version 1.1.1.5, 2014/06/15 19:46:05
|
Line 1
|
Line 1
|
.TH PCREPATTERN 3 "26 April 2013" "PCRE 8.33" | .TH PCREPATTERN 3 "03 December 2013" "PCRE 8.34" |
.SH NAME |
.SH NAME |
PCRE - Perl-compatible regular expressions |
PCRE - Perl-compatible regular expressions |
.SH "PCRE REGULAR EXPRESSION DETAILS" |
.SH "PCRE REGULAR EXPRESSION DETAILS" |
Line 80 appearance causes an error.
|
Line 80 appearance causes an error.
|
.SS "Unicode property support" |
.SS "Unicode property support" |
.rs |
.rs |
.sp |
.sp |
Another special sequence that may appear at the start of a pattern is | Another special sequence that may appear at the start of a pattern is (*UCP). |
.sp | |
(*UCP) | |
.sp | |
This has the same effect as setting the PCRE_UCP option: it causes sequences |
This has the same effect as setting the PCRE_UCP option: it causes sequences |
such as \ed and \ew to use Unicode properties to determine character types, |
such as \ed and \ew to use Unicode properties to determine character types, |
instead of recognizing only characters with codes less than 128 via a lookup |
instead of recognizing only characters with codes less than 128 via a lookup |
table. |
table. |
. |
. |
. |
. |
|
.SS "Disabling auto-possessification" |
|
.rs |
|
.sp |
|
If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting |
|
the PCRE_NO_AUTO_POSSESS option at compile time. This stops PCRE from making |
|
quantifiers possessive when what follows cannot match the repeated item. For |
|
example, by default a+b is treated as a++b. For more details, see the |
|
.\" HREF |
|
\fBpcreapi\fP |
|
.\" |
|
documentation. |
|
. |
|
. |
.SS "Disabling start-up optimizations" |
.SS "Disabling start-up optimizations" |
.rs |
.rs |
.sp |
.sp |
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the |
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the |
PCRE_NO_START_OPTIMIZE option either at compile or matching time. | PCRE_NO_START_OPTIMIZE option either at compile or matching time. This disables |
| several optimizations for quickly reaching "no match" results. For more |
| details, see the |
| .\" HREF |
| \fBpcreapi\fP |
| .\" |
| documentation. |
. |
. |
. |
. |
.\" HTML <a name="newlines"></a> |
.\" HTML <a name="newlines"></a> |
Line 164 pattern of the form
|
Line 180 pattern of the form
|
(*LIMIT_RECURSION=d) |
(*LIMIT_RECURSION=d) |
.sp |
.sp |
where d is any number of decimal digits. However, the value of the setting must |
where d is any number of decimal digits. However, the value of the setting must |
be less than the value set by the caller of \fBpcre_exec()\fP for it to have | be less than the value set (or defaulted) by the caller of \fBpcre_exec()\fP |
any effect. In other words, the pattern writer can lower the limit set by the | for it to have any effect. In other words, the pattern writer can lower the |
programmer, but not raise it. If there is more than one setting of one of these | limits set by the programmer, but not raise them. If there is more than one |
limits, the lower value is used. | setting of one of these limits, the lower value is used. |
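The rule described above can be sketched in Python. This is only an illustration of the documented behaviour, not PCRE's own code; the function name effective_limit and its parameters are hypothetical.

```python
def effective_limit(caller_limit, pattern_settings):
    """Return the limit in force after applying in-pattern settings.

    caller_limit: value set (or defaulted) by the caller of pcre_exec().
    pattern_settings: values given by limit items at the start of the pattern.
    """
    limit = caller_limit
    for setting in pattern_settings:
        # A setting only has an effect when it is below the current limit,
        # so the pattern writer can lower the limit but never raise it.
        if setting < limit:
            limit = setting
    return limit
```

For instance, effective_limit(1000, [500, 2000]) gives 500, because the second setting is above the limit already in force and is ignored.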
. |
. |
. |
. |
.SH "EBCDIC CHARACTER CODES" |
.SH "EBCDIC CHARACTER CODES" |
Line 257 In a UTF mode, only ASCII numbers and letters have any
|
Line 273 In a UTF mode, only ASCII numbers and letters have any
|
backslash. All other characters (in particular, those whose codepoints are |
backslash. All other characters (in particular, those whose codepoints are |
greater than 127) are treated as literals. |
greater than 127) are treated as literals. |
.P |
.P |
If a pattern is compiled with the PCRE_EXTENDED option, white space in the | If a pattern is compiled with the PCRE_EXTENDED option, most white space in the |
pattern (other than in a character class) and characters between a # outside | pattern (other than in a character class), and characters between a # outside a |
a character class and the next newline are ignored. An escaping backslash can | character class and the next newline, inclusive, are ignored. An escaping |
be used to include a white space or # character as part of the pattern. | backslash can be used to include a white space or # character as part of the |
| pattern. |
.P |
.P |
If you want to remove the special meaning from a sequence of characters, you |
If you want to remove the special meaning from a sequence of characters, you |
can do so by putting them between \eQ and \eE. This is different from Perl in |
can do so by putting them between \eQ and \eE. This is different from Perl in |
Line 300 one of the following escape sequences than the binary
|
Line 317 one of the following escape sequences than the binary
|
\en linefeed (hex 0A) |
\en linefeed (hex 0A) |
\er carriage return (hex 0D) |
\er carriage return (hex 0D) |
\et tab (hex 09) |
\et tab (hex 09) |
|
\e0dd character with octal code 0dd |
\eddd character with octal code ddd, or back reference |
\eddd character with octal code ddd, or back reference |
|
\eo{ddd..} character with octal code ddd.. |
\exhh character with hex code hh |
\exhh character with hex code hh |
\ex{hhh..} character with hex code hhh.. (non-JavaScript mode) |
\ex{hhh..} character with hex code hhh.. (non-JavaScript mode) |
\euhhhh character with hex code hhhh (JavaScript mode only) |
\euhhhh character with hex code hhhh (JavaScript mode only) |
Line 321 byte are inverted. Thus \ecA becomes hex 01, as in ASC
|
Line 340 byte are inverted. Thus \ecA becomes hex 01, as in ASC
|
the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other |
the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other |
characters also generate different values. |
characters also generate different values. |
.P |
.P |
By default, after \ex, from zero to two hexadecimal digits are read (letters |
|
can be in upper or lower case). Any number of hexadecimal digits may appear |
|
between \ex{ and }, but the character code is constrained as follows: |
|
.sp |
|
8-bit non-UTF mode less than 0x100 |
|
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint |
|
16-bit non-UTF mode less than 0x10000 |
|
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint |
|
32-bit non-UTF mode less than 0x80000000 |
|
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint |
|
.sp |
|
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called |
|
"surrogate" codepoints), and 0xffef. |
|
.P |
|
If characters other than hexadecimal digits appear between \ex{ and }, or if |
|
there is no terminating }, this form of escape is not recognized. Instead, the |
|
initial \ex will be interpreted as a basic hexadecimal escape, with no |
|
following digits, giving a character whose value is zero. |
|
.P |
|
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is |
|
as just described only when it is followed by two hexadecimal digits. |
|
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
|
code points greater than 256 is provided by \eu, which must be followed by |
|
four hexadecimal digits; otherwise it matches a literal "u" character. |
|
Character codes specified by \eu in JavaScript mode are constrained in the same |
|
way as those specified by \ex in non-JavaScript mode. |
|
.P |
|
Characters whose value is less than 256 can be defined by either of the two |
|
syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the |
|
way they are handled. For example, \exdc is exactly the same as \ex{dc} (or |
|
\eu00dc in JavaScript mode). |
|
.P |
|
After \e0 up to two further octal digits are read. If there are fewer than two |
After \e0 up to two further octal digits are read. If there are fewer than two |
digits, just those that are present are used. Thus the sequence \e0\ex\e07 |
digits, just those that are present are used. Thus the sequence \e0\ex\e07 |
specifies two binary zeros followed by a BEL character (code value 7). Make |
specifies two binary zeros followed by a BEL character (code value 7). Make |
sure you supply two digits after the initial zero if the pattern character that |
sure you supply two digits after the initial zero if the pattern character that |
follows is itself an octal digit. |
follows is itself an octal digit. |
.P |
.P |
The handling of a backslash followed by a digit other than 0 is complicated. | The escape \eo must be followed by a sequence of octal digits, enclosed in |
Outside a character class, PCRE reads it and any following digits as a decimal | braces. An error occurs if this is not the case. This escape is a recent |
number. If the number is less than 10, or if there have been at least that many | addition to Perl; it provides a way of specifying character code points as octal |
| numbers greater than 0777, and it also allows octal numbers and back references |
| to be unambiguously specified. |
| .P |
| For greater clarity and unambiguity, it is best to avoid following \e by a |
| digit greater than zero. Instead, use \eo{} or \ex{} to specify character |
| numbers, and \eg{} to specify back references. The following paragraphs |
| describe the old, ambiguous syntax. |
| .P |
| The handling of a backslash followed by a digit other than 0 is complicated, |
| and Perl has changed in recent releases, causing PCRE also to change. Outside a |
| character class, PCRE reads the digit and any following digits as a decimal |
| number. If the number is less than 8, or if there have been at least that many |
previous capturing left parentheses in the expression, the entire sequence is |
previous capturing left parentheses in the expression, the entire sequence is |
taken as a \fIback reference\fP. A description of how this works is given |
taken as a \fIback reference\fP. A description of how this works is given |
.\" HTML <a href="#backreferences"> |
.\" HTML <a href="#backreferences"> |
Line 374 following the discussion of
|
Line 373 following the discussion of
|
parenthesized subpatterns. |
parenthesized subpatterns. |
.\" |
.\" |
.P |
.P |
Inside a character class, or if the decimal number is greater than 9 and there | Inside a character class, or if the decimal number following \e is greater than |
have not been that many capturing subpatterns, PCRE re-reads up to three octal | 7 and there have not been that many capturing subpatterns, PCRE handles \e8 and |
digits following the backslash, and uses them to generate a data character. Any | \e9 as the literal characters "8" and "9", and otherwise re-reads up to three |
subsequent digits stand for themselves. The value of the character is | octal digits following the backslash, using them to generate a data character. |
constrained in the same way as characters specified in hexadecimal. | Any subsequent digits stand for themselves. For example: |
For example: | |
.sp |
.sp |
\e040 is another way of writing an ASCII space |
\e040 is another way of writing an ASCII space |
.\" JOIN |
.\" JOIN |
Line 398 For example:
|
Line 396 For example:
|
\e377 might be a back reference, otherwise |
\e377 might be a back reference, otherwise |
the value 255 (decimal) |
the value 255 (decimal) |
.\" JOIN |
.\" JOIN |
\e81 is either a back reference, or a binary zero | \e81 is either a back reference, or the two |
followed by the two characters "8" and "1" | characters "8" and "1" |
.sp |
.sp |
Note that octal values of 100 or greater must not be introduced by a leading | Note that octal values of 100 or greater that are specified using this syntax |
zero, because no more than three octal digits are ever read. | must not be introduced by a leading zero, because no more than three octal |
| digits are ever read. |
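The old, ambiguous syntax described above (in its newer form, where the threshold is 8 and \e8 and \e9 are literals) can be modelled roughly as follows. This is a sketch in Python, not PCRE's implementation. The function name, its parameters, and the shape of the return value are hypothetical, and the limits on character values for the current mode are not checked.

```python
OCTAL_DIGITS = "01234567"


def interpret_backslash_digits(digits, groups_so_far, in_class=False):
    """Model how a backslash followed by decimal digits is read.

    digits: the run of decimal digits that follows the backslash.
    groups_so_far: number of previous capturing left parentheses.
    Returns (kind, value, rest), where kind is "backref" or "char" and
    rest holds any trailing digits that stand for themselves.
    """
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected one or more decimal digits")

    def octal_prefix(max_len):
        # Length of the leading run of octal digits, capped at max_len.
        n = 0
        while n < max_len and n < len(digits) and digits[n] in OCTAL_DIGITS:
            n += 1
        return n

    if digits[0] == "0":
        # After the leading zero, up to two further octal digits are read.
        n = octal_prefix(3)
        return ("char", int(digits[:n], 8), digits[n:])

    if not in_class:
        number = int(digits)
        if number < 8 or number <= groups_so_far:
            return ("backref", number, "")

    if digits[0] in "89":
        # 8 and 9 are taken as literal characters.
        return ("char", ord(digits[0]), digits[1:])

    # Otherwise re-read up to three octal digits as a data character.
    n = octal_prefix(3)
    return ("char", int(digits[:n], 8), digits[n:])
```

With no capturing subpatterns, interpret_backslash_digits("81", 0) yields the literal "8" with "1" left over, whereas with 81 or more previous groups the same input is a back reference.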
.P |
.P |
|
By default, after \ex that is not followed by {, from zero to two hexadecimal |
|
digits are read (letters can be in upper or lower case). Any number of |
|
hexadecimal digits may appear between \ex{ and }. If a character other than |
|
a hexadecimal digit appears between \ex{ and }, or if there is no terminating |
|
}, an error occurs. |
|
.P |
|
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is |
|
as just described only when it is followed by two hexadecimal digits. |
|
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
|
code points greater than 256 is provided by \eu, which must be followed by |
|
four hexadecimal digits; otherwise it matches a literal "u" character. |
|
.P |
|
Characters whose value is less than 256 can be defined by either of the two |
|
syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the |
|
way they are handled. For example, \exdc is exactly the same as \ex{dc} (or |
|
\eu00dc in JavaScript mode). |
|
. |
|
. |
|
.SS "Constraints on character values" |
|
.rs |
|
.sp |
|
Characters that are specified using octal or hexadecimal numbers are |
|
limited to certain values, as follows: |
|
.sp |
|
8-bit non-UTF mode less than 0x100 |
|
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint |
|
16-bit non-UTF mode less than 0x10000 |
|
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint |
|
32-bit non-UTF mode less than 0x100000000 |
|
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint |
|
.sp |
|
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called |
|
"surrogate" codepoints), and 0xffef. |
|
. |
|
. |
|
.SS "Escape sequences in character classes" |
|
.rs |
|
.sp |
All the sequences that define a single character value can be used both inside |
All the sequences that define a single character value can be used both inside |
and outside character classes. In addition, inside a character class, \eb is |
and outside character classes. In addition, inside a character class, \eb is |
interpreted as the backspace character (hex 08). |
interpreted as the backspace character (hex 08). |
Line 494 classes. They each match one character of the appropri
|
Line 531 classes. They each match one character of the appropri
|
matching point is at the end of the subject string, all of them fail, because |
matching point is at the end of the subject string, all of them fail, because |
there is no character to match. |
there is no character to match. |
.P |
.P |
For compatibility with Perl, \es does not match the VT character (code 11). | For compatibility with Perl, \es formerly did not match the VT character (code |
This makes it different from the POSIX "space" class. The \es characters | 11), which made it different from the POSIX "space" class. However, Perl |
are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is | added VT at release 5.18, and PCRE followed suit at release 8.34. The default |
included in a Perl script, \es may match the VT character. In PCRE, it never | \es characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space |
does. | (32), which are defined as white space in the "C" locale. This list may vary if |
| locale-specific matching is taking place. For example, in some locales the |
| "non-breaking space" character (\exA0) is recognized as white space, and in |
| others the VT character is not. |
.P |
.P |
A "word" character is an underscore or any character that is a letter or digit. |
A "word" character is an underscore or any character that is a letter or digit. |
By default, the definition of letters and digits is controlled by PCRE's |
By default, the definition of letters and digits is controlled by PCRE's |
Line 513 in the
|
Line 553 in the
|
\fBpcreapi\fP |
\fBpcreapi\fP |
.\" |
.\" |
page). For example, in a French locale such as "fr_FR" in Unix-like systems, |
page). For example, in a French locale such as "fr_FR" in Unix-like systems, |
or "french" in Windows, some character codes greater than 128 are used for | or "french" in Windows, some character codes greater than 127 are used for |
accented letters, and these are then matched by \ew. The use of locales with |
accented letters, and these are then matched by \ew. The use of locales with |
Unicode is discouraged. |
Unicode is discouraged. |
.P |
.P |
By default, in a UTF mode, characters with values greater than 128 never match | By default, characters whose code points are greater than 127 never match \ed, |
\ed, \es, or \ew, and always match \eD, \eS, and \eW. These sequences retain | \es, or \ew, and always match \eD, \eS, and \eW, although this may vary for |
their original meanings from before UTF support was available, mainly for | characters in the range 128-255 when locale-specific matching is happening. |
efficiency reasons. However, if PCRE is compiled with Unicode property support, | These escape sequences retain their original meanings from before Unicode |
and the PCRE_UCP option is set, the behaviour is changed so that Unicode | support was available, mainly for efficiency reasons. If PCRE is compiled with |
properties are used to determine character types, as follows: | Unicode property support, and the PCRE_UCP option is set, the behaviour is |
| changed so that Unicode properties are used to determine character types, as |
| follows: |
.sp |
.sp |
\ed any character that \ep{Nd} matches (decimal digit) | \ed any character that matches \ep{Nd} (decimal digit) |
\es any character that \ep{Z} matches, plus HT, LF, FF, CR | \es any character that matches \ep{Z} or \eh or \ev |
\ew any character that \ep{L} or \ep{N} matches, plus underscore | \ew any character that matches \ep{L} or \ep{N}, plus underscore |
.sp |
.sp |
The upper case escapes match the inverse sets of characters. Note that \ed |
The upper case escapes match the inverse sets of characters. Note that \ed |
matches only decimal digits, whereas \ew matches any Unicode digit, as well as |
matches only decimal digits, whereas \ew matches any Unicode digit, as well as |
Line 536 is noticeably slower when PCRE_UCP is set.
|
Line 578 is noticeably slower when PCRE_UCP is set.
|
.P |
.P |
The sequences \eh, \eH, \ev, and \eV are features that were added to Perl at |
The sequences \eh, \eH, \ev, and \eV are features that were added to Perl at |
release 5.10. In contrast to the other sequences, which match only ASCII |
release 5.10. In contrast to the other sequences, which match only ASCII |
characters by default, these always match certain high-valued codepoints, | characters by default, these always match certain high-valued code points, |
whether or not PCRE_UCP is set. The horizontal space characters are: |
whether or not PCRE_UCP is set. The horizontal space characters are: |
.sp |
.sp |
U+0009 Horizontal tab (HT) |
U+0009 Horizontal tab (HT) |
Line 906 the "mark" property always have the "extend" grapheme
|
Line 948 the "mark" property always have the "extend" grapheme
|
.sp |
.sp |
As well as the standard Unicode properties described above, PCRE supports four |
As well as the standard Unicode properties described above, PCRE supports four |
more that make it possible to convert traditional escape sequences such as \ew |
more that make it possible to convert traditional escape sequences such as \ew |
and \es and POSIX character classes to use Unicode properties. PCRE uses these | and \es to use Unicode properties. PCRE uses these non-standard, non-Perl |
non-standard, non-Perl properties internally when PCRE_UCP is set. However, | properties internally when PCRE_UCP is set. However, they may also be used |
they may also be used explicitly. These properties are: | explicitly. These properties are: |
.sp |
.sp |
Xan Any alphanumeric character |
Xan Any alphanumeric character |
Xps Any POSIX space character |
Xps Any POSIX space character |
Line 918 they may also be used explicitly. These properties are
|
Line 960 they may also be used explicitly. These properties are
|
Xan matches characters that have either the L (letter) or the N (number) |
Xan matches characters that have either the L (letter) or the N (number) |
property. Xps matches the characters tab, linefeed, vertical tab, form feed, or |
property. Xps matches the characters tab, linefeed, vertical tab, form feed, or |
carriage return, and any other character that has the Z (separator) property. |
carriage return, and any other character that has the Z (separator) property. |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the | Xsp is the same as Xps; it used to exclude vertical tab, for Perl |
same characters as Xan, plus underscore. | compatibility, but Perl changed, and so PCRE followed at release 8.34. Xwd |
| matches the same characters as Xan, plus underscore. |
.P |
.P |
There is another non-standard property, Xuc, which matches any character that |
There is another non-standard property, Xuc, which matches any character that |
can be represented by a Universal Character Name in C++ and other programming |
can be represented by a Universal Character Name in C++ and other programming |
Line 1215 The minus (hyphen) character can be used to specify a
|
Line 1258 The minus (hyphen) character can be used to specify a
|
character class. For example, [d-m] matches any letter between d and m, |
character class. For example, [d-m] matches any letter between d and m, |
inclusive. If a minus character is required in a class, it must be escaped with |
inclusive. If a minus character is required in a class, it must be escaped with |
a backslash or appear in a position where it cannot be interpreted as |
a backslash or appear in a position where it cannot be interpreted as |
indicating a range, typically as the first or last character in the class. | indicating a range, typically as the first or last character in the class, or |
| immediately after a range. For example, [b-d-z] matches letters in the range b |
| to d, a hyphen character, or z. |
.P |
.P |
It is not possible to have the literal character "]" as the end character of a |
It is not possible to have the literal character "]" as the end character of a |
range. A pattern such as [W-]46] is interpreted as a class of two characters |
range. A pattern such as [W-]46] is interpreted as a class of two characters |
Line 1225 the end of range, so [W-\e]46] is interpreted as a cla
|
Line 1270 the end of range, so [W-\e]46] is interpreted as a cla
|
followed by two other characters. The octal or hexadecimal representation of |
followed by two other characters. The octal or hexadecimal representation of |
"]" can also be used to end a range. |
"]" can also be used to end a range. |
.P |
.P |
|
An error is generated if a POSIX character class (see below) or an escape |
|
sequence other than one that defines a single character appears at a point |
|
where a range ending character is expected. For example, [z-\exff] is valid, |
|
but [A-\ed] and [A-[:digit:]] are not. |
|
.P |
Ranges operate in the collating sequence of character values. They can also be |
Ranges operate in the collating sequence of character values. They can also be |
used for characters specified numerically, for example [\e000-\e037]. Ranges |
used for characters specified numerically, for example [\e000-\e037]. Ranges |
can include any characters that are valid for the current mode. |
can include any characters that are valid for the current mode. |
Line 1263 something AND NOT ...".
|
Line 1313 something AND NOT ...".
|
The only metacharacters that are recognized in character classes are backslash, |
The only metacharacters that are recognized in character classes are backslash, |
hyphen (only where it can be interpreted as specifying a range), circumflex |
hyphen (only where it can be interpreted as specifying a range), circumflex |
(only at the start), opening square bracket (only when it can be interpreted as |
(only at the start), opening square bracket (only when it can be interpreted as |
introducing a POSIX class name - see the next section), and the terminating | introducing a POSIX class name, or for a special compatibility feature - see |
closing square bracket. However, escaping other non-alphanumeric characters | the next two sections), and the terminating closing square bracket. However, |
does no harm. | escaping other non-alphanumeric characters does no harm. |
. |
. |
. |
. |
.SH "POSIX CHARACTER CLASSES" |
.SH "POSIX CHARACTER CLASSES" |
Line 1290 are:
|
Line 1340 are:
|
lower lower case letters |
lower lower case letters |
print printing characters, including space |
print printing characters, including space |
punct printing characters, excluding letters and digits and space |
punct printing characters, excluding letters and digits and space |
space white space (not quite the same as \es) | space white space (the same as \es from PCRE 8.34) |
upper upper case letters |
upper upper case letters |
word "word" characters (same as \ew) |
word "word" characters (same as \ew) |
xdigit hexadecimal digits |
xdigit hexadecimal digits |
.sp |
.sp |
The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and | The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), |
space (32). Notice that this list includes the VT character (code 11). This | and space (32). If locale-specific matching is taking place, the list of space |
makes "space" different to \es, which does not include VT (for Perl | characters may be different; there may be fewer or more of them. "Space" used |
compatibility). | to be different to \es, which did not include VT, for Perl compatibility. |
| However, Perl changed at release 5.18, and PCRE followed at release 8.34. |
| "Space" and \es now match the same set of characters. |
.P |
.P |
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl |
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl |
5.8. Another Perl extension is negation, which is indicated by a ^ character |
5.8. Another Perl extension is negation, which is indicated by a ^ character |
Line 1310 matches "1", "2", or any non-digit. PCRE (and Perl) al
|
Line 1362 matches "1", "2", or any non-digit. PCRE (and Perl) al
|
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not |
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not |
supported, and an error is given if they are encountered. |
supported, and an error is given if they are encountered. |
.P |
.P |
By default, in UTF modes, characters with values greater than 127 do not match | By default, characters with values greater than 127 do not match any of the |
any of the POSIX character classes. However, if the PCRE_UCP option is passed | POSIX character classes. However, if the PCRE_UCP option is passed to |
to \fBpcre_compile()\fP, some of the classes are changed so that Unicode | \fBpcre_compile()\fP, some of the classes are changed so that Unicode character |
character properties are used. This is achieved by replacing the POSIX classes | properties are used. This is achieved by replacing certain POSIX classes by |
by other sequences, as follows: | other sequences, as follows: |
.sp |
.sp |
[:alnum:] becomes \ep{Xan} |
[:alnum:] becomes \ep{Xan} |
[:alpha:] becomes \ep{L} |
[:alpha:] becomes \ep{L} |
Line 1325 by other sequences, as follows:
|
Line 1377 by other sequences, as follows:
|
[:upper:] becomes \ep{Lu} |
[:upper:] becomes \ep{Lu} |
[:word:] becomes \ep{Xwd} |
[:word:] becomes \ep{Xwd} |
.sp |
.sp |
Negated versions, such as [:^alpha:] use \eP instead of \ep. The other POSIX | Negated versions, such as [:^alpha:] use \eP instead of \ep. Three other POSIX |
classes are unchanged, and match only characters with code points less than | classes are handled specially in UCP mode: |
128. | .TP 10 |
| [:graph:] |
| This matches characters that have glyphs that mark the page when printed. In |
| Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf |
| properties, except for: |
| .sp |
| U+061C Arabic Letter Mark |
| U+180E Mongolian Vowel Separator |
| U+2066 - U+2069 Various "isolate"s |
| .sp |
| .TP 10 |
| [:print:] |
| This matches the same characters as [:graph:] plus space characters that are |
| not controls, that is, characters with the Zs property. |
| .TP 10 |
| [:punct:] |
| This matches all characters that have the Unicode P (punctuation) property, |
| plus those characters whose code points are less than 128 that have the S |
| (Symbol) property. |
| .P |
| The other POSIX classes are unchanged, and match only characters with code |
| points less than 128. |
. |
. |
. |
. |
|
.SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES" |
|
.rs |
|
.sp |
|
In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly |
|
syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of |
|
word". PCRE treats these items as follows: |
|
.sp |
|
[[:<:]] is converted to \eb(?=\ew) |
|
[[:>:]] is converted to \eb(?<=\ew) |
|
.sp |
|
Only these exact character sequences are recognized. A sequence such as |
|
[a[:<:]b] provokes an error for an unrecognized POSIX class name. This support is |
|
not compatible with Perl. It is provided to help migrations from other |
|
environments, and is best not used in any new patterns. Note that \eb matches |
|
at the start and the end of a word (see |
|
.\" HTML <a href="#smallassertions"> |
|
.\" </a> |
|
"Simple assertions" |
|
.\" |
|
above), and in a Perl-style pattern the preceding or following character |
|
normally shows which is wanted, without the need for the assertions that are |
|
used above in order to give exactly the POSIX behaviour. |
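The conversion described above can be imitated with a simple textual substitution. The sketch below uses Python's re module as a stand-in for PCRE, on the assumption that its word boundary, word character and lookaround items behave the same way for ASCII subjects. The helper name expand_word_boundaries is hypothetical, and the sketch does not reproduce the error that PCRE gives for near-miss sequences inside a class.

```python
import re

WORD_START = "[[:<:]]"
WORD_END = "[[:>:]]"


def expand_word_boundaries(pattern):
    """Replace the two exact BSD word-boundary items with their expansions."""
    pattern = pattern.replace(WORD_START, r"\b(?=\w)")
    pattern = pattern.replace(WORD_END, r"\b(?<=\w)")
    return pattern


# Only the stand-alone word "cat" is found, not "concat" or "cats".
matches = re.findall(expand_word_boundaries("[[:<:]]cat[[:>:]]"),
                     "cat concat cats")
```

Only the exact seven-character sequences are replaced, which mirrors the statement that no other spelling is recognized.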
|
. |
|
. |
.SH "VERTICAL BAR" |
.SH "VERTICAL BAR" |
.rs |
.rs |
.sp |
.sp |
Line 1547 conditions,
|
Line 1644 conditions,
|
.\" |
.\" |
can be made by name as well as by number. |
can be made by name as well as by number. |
.P |
.P |
Names consist of up to 32 alphanumeric characters and underscores. Named | Names consist of up to 32 alphanumeric characters and underscores, but must |
capturing parentheses are still allocated numbers as well as names, exactly as | start with a non-digit. Named capturing parentheses are still allocated numbers |
if the names were not present. The PCRE API provides function calls for | as well as names, exactly as if the names were not present. The PCRE API |
extracting the name-to-number translation table from a compiled pattern. There | provides function calls for extracting the name-to-number translation table |
is also a convenience function for extracting a captured substring by name. | from a compiled pattern. There is also a convenience function for extracting a |
| captured substring by name. |
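The naming rule in the newer text can be expressed as a small check. This is a sketch in Python with a hypothetical helper name. It assumes that "alphanumeric" means the ASCII letters and digits, which the text above does not state explicitly.

```python
import re

# A non-digit first character, then up to 31 more, for 32 in total.
_GROUP_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")


def is_valid_group_name(name):
    """Return True if name fits the documented rule for subpattern names."""
    return _GROUP_NAME.fullmatch(name) is not None
```

Under this check a name such as "1st" is rejected because it starts with a digit, while "_x1" is accepted.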
.P |
.P |
By default, a name must be unique within a pattern, but it is possible to relax |
By default, a name must be unique within a pattern, but it is possible to relax |
this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate |
this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate |
Line 1577 for the first (and in this example, the only) subpatte
|
Line 1675 for the first (and in this example, the only) subpatte
|
matched. This saves searching to find which numbered subpattern it was. |
matched. This saves searching to find which numbered subpattern it was. |
.P |
.P |
If you make a back reference to a non-unique named subpattern from elsewhere in |
If you make a back reference to a non-unique named subpattern from elsewhere in |
the pattern, the one that corresponds to the first occurrence of the name is | the pattern, the subpatterns to which the name refers are checked in the order |
used. In the absence of duplicate numbers (see the previous section) this is | in which they appear in the overall pattern. The first one that is set is used |
the one with the lowest number. If you use a named reference in a condition | for the reference. For example, this pattern matches both "foofoo" and |
| "barbar" but not "foobar" or "barfoo": |
| .sp |
| (?:(?<n>foo)|(?<n>bar))\ek<n> |
| .sp |
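The lookup order for a back reference to a duplicated name can be sketched as follows. This is an illustration in Python rather than PCRE's code. The function name is hypothetical, and so is the representation of captures as a mapping from group number to captured text, with None for a group that is not set.

```python
def resolve_named_backref(group_numbers, captures):
    """Return the text of the first set group that carries the name.

    group_numbers: numbers of the subpatterns sharing the name, in the
        order in which they appear in the overall pattern.
    captures: mapping of group number to captured text, or None if unset.
    """
    for number in group_numbers:
        text = captures.get(number)
        if text is not None:
            return text
    return None
```

In the example pattern above, groups 1 and 2 share the name n. When the subject starts with "bar", group 1 is unset and group 2 holds "bar", so the reference must match "bar" again.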
| .P |
| If you make a subroutine call to a non-unique named subpattern, the one that |
| corresponds to the first occurrence of the name is used. In the absence of |
| duplicate numbers (see the previous section) this is the one with the lowest |
| number. |
| .P |
| If you use a named reference in a condition |
test (see the |
test (see the |
.\" |
.\" |
.\" HTML <a href="#conditions"> |
.\" HTML <a href="#conditions"> |
Line 1599 documentation.
|
Line 1708 documentation.
|
\fBWarning:\fP You cannot use different names to distinguish between two |
\fBWarning:\fP You cannot use different names to distinguish between two |
subpatterns with the same number because PCRE uses only the numbers when |
subpatterns with the same number because PCRE uses only the numbers when |
matching. For this reason, an error is given at compile time if different names |
matching. For this reason, an error is given at compile time if different names |
are given to subpatterns with the same number. However, you can give the same | are given to subpatterns with the same number. However, you can always give the |
name to subpatterns with the same number, even when PCRE_DUPNAMES is not set. | same name to subpatterns with the same number, even when PCRE_DUPNAMES is not |
| set. |
. |
. |
. |
. |
.SH REPETITION |
.SH REPETITION |
Line 2271 This makes the fragment independent of the parentheses
|
Line 2381 This makes the fragment independent of the parentheses
|
.sp |
.sp |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used |
subpattern by name. For compatibility with earlier versions of PCRE, which had |
subpattern by name. For compatibility with earlier versions of PCRE, which had |
this facility before Perl, the syntax (?(name)...) is also recognized. However, | this facility before Perl, the syntax (?(name)...) is also recognized. |
there is a possible ambiguity with this syntax, because subpattern names may | |
consist entirely of digits. PCRE looks first for a named subpattern; if it | |
cannot find one and the name consists entirely of digits, PCRE looks for a | |
subpattern of that number, which must be greater than zero. Using subpattern | |
names that consist entirely of digits is not recommended. | |
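The bare-name form (?(name)...) can be tried in Python, whose re module accepts this form of the condition but not the angle-bracket or quoted forms. The following is a small sketch under that assumption. The group name OPEN and the pattern are chosen here for illustration: a closing parenthesis is required only if an opening one was matched.

```python
import re

# (?(OPEN)\)) requires a closing parenthesis only when the optional group
# named OPEN took part in the match. The name OPEN is illustrative.
PATTERN = re.compile(r"(?P<OPEN>\()?[^()]+(?(OPEN)\))")


def balanced(subject):
    """Return True if subject is text optionally wrapped in one pair of ()."""
    return PATTERN.fullmatch(subject) is not None


if __name__ == "__main__":
    for subject in ("abc", "(abc)", "(abc", "abc)"):
        print(subject, balanced(subject))
```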
.P |
.P |
Rewriting the above example to use a named subpattern gives this: |
Rewriting the above example to use a named subpattern gives this: |
.sp |
.sp |
Line 2698 During matching, when PCRE reaches a callout point, th
|
Line 2803 During matching, when PCRE reaches a callout point, th
|
called. It is provided with the number of the callout, the position in the |
called. It is provided with the number of the callout, the position in the |
pattern, and, optionally, one item of data originally supplied by the caller of |
pattern, and, optionally, one item of data originally supplied by the caller of |
the matching function. The callout function may cause matching to proceed, to |
the matching function. The callout function may cause matching to proceed, to |
backtrack, or to fail altogether. A complete description of the interface to | backtrack, or to fail altogether. |
the callout function is given in the | .P |
| By default, PCRE implements a number of optimizations at compile time and |
| matching time, and one side-effect is that sometimes callouts are skipped. If |
| you need all possible callouts to happen, you need to set options that disable |
| the relevant optimizations. More details, and a complete description of the |
| interface to the callout function, are given in the |
.\" HREF |
.\" HREF |
\fBpcrecallout\fP |
\fBpcrecallout\fP |
.\" |
.\" |
Line 3060 example:
|
Line 3170 example:
|
.sp |
.sp |
...(*COMMIT)(*PRUNE)... |
...(*COMMIT)(*PRUNE)... |
.sp |
.sp |
If there is a matching failure to the right, backtracking onto (*PRUNE) cases | If there is a matching failure to the right, backtracking onto (*PRUNE) causes |
it to be triggered, and its action is taken. There can never be a backtrack |
it to be triggered, and its action is taken. There can never be a backtrack |
onto (*COMMIT). |
onto (*COMMIT). |
. |
. |
Line 3145 Cambridge CB2 3QH, England.
|
Line 3255 Cambridge CB2 3QH, England.
|
.rs |
.rs |
.sp |
.sp |
.nf |
.nf |
Last updated: 26 April 2013 | Last updated: 03 December 2013 |
Copyright (c) 1997-2013 University of Cambridge. |
Copyright (c) 1997-2013 University of Cambridge. |
.fi |
.fi |