version 1.1.1.4, 2013/07/22 08:25:56
|
version 1.1.1.5, 2014/06/15 19:46:04
|
Line 53 INTRODUCTION
|
Line 53 INTRODUCTION
|
5.12, including support for UTF-8/16/32 encoded strings and Unicode |
5.12, including support for UTF-8/16/32 encoded strings and Unicode |
general category properties. However, UTF-8/16/32 and Unicode support |
general category properties. However, UTF-8/16/32 and Unicode support |
has to be explicitly enabled; it is not the default. The Unicode tables |
has to be explicitly enabled; it is not the default. The Unicode tables |
correspond to Unicode release 6.2.0. | correspond to Unicode release 6.3.0. |
|
|
In addition to the Perl-compatible matching function, PCRE contains an |
In addition to the Perl-compatible matching function, PCRE contains an |
alternative function that matches the same compiled patterns in a dif- |
alternative function that matches the same compiled patterns in a dif- |
Line 532 PCRE 32-BIT API BASIC FUNCTIONS
|
Line 532 PCRE 32-BIT API BASIC FUNCTIONS
|
|
|
pcre32 *pcre32_compile2(PCRE_SPTR32 pattern, int options, |
pcre32 *pcre32_compile2(PCRE_SPTR32 pattern, int options, |
int *errorcodeptr, |
int *errorcodeptr, |
const char **errptr, int *erroffset, |
|
const unsigned char *tableptr); |
const unsigned char *tableptr); |
|
|
pcre32_extra *pcre32_study(const pcre32 *code, int options, |
pcre32_extra *pcre32_study(const pcre32 *code, int options, |
Line 1458 THE ALTERNATIVE MATCHING ALGORITHM
|
Line 1457 THE ALTERNATIVE MATCHING ALGORITHM
|
at the fifth character of the subject. The algorithm does not automati- |
at the fifth character of the subject. The algorithm does not automati- |
cally move on to find matches that start at later positions. |
cally move on to find matches that start at later positions. |
|
|
|
PCRE's "auto-possessification" optimization usually applies to charac- |
|
ter repeats at the end of a pattern (as well as internally). For exam- |
|
ple, the pattern "a\d+" is compiled as if it were "a\d++" because there |
|
is no point even considering the possibility of backtracking into the |
|
repeated digits. For DFA matching, this means that only one possible |
|
match is found. If you really do want multiple matches in such cases, |
|
either use an ungreedy repeat ("a\d+?") or set the PCRE_NO_AUTO_POSSESS |
|
option when compiling. |
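
The effect described in the paragraph above can be illustrated without calling PCRE at all. What follows is a minimal hand-written sketch in C, not PCRE code; the function name digit_run_matches is hypothetical. It lists every match length of a\d+ at the start of a subject, longest first, and shows that treating the repeat as possessive (a\d++) leaves only the longest one.

```c
#include <ctype.h>
#include <stddef.h>

/* Hand-written illustration, not part of PCRE.
 * Lists the lengths of all matches of a\d+ anchored at subject[0],
 * longest first. When possessive is nonzero the repeat behaves like
 * \d++ and only the longest match is reported.
 * Returns the number of lengths stored in out (at most max). */
size_t digit_run_matches(const char *subject, int possessive,
                         size_t *out, size_t max)
{
    size_t digits = 0;   /* length of the digit run after the 'a' */
    size_t count = 0;    /* number of match lengths stored so far */
    size_t n;

    if (subject[0] != 'a')
        return 0;
    while (isdigit((unsigned char)subject[1 + digits]))
        digits++;
    if (digits == 0)
        return 0;        /* \d+ needs at least one digit */

    for (n = digits; n >= 1 && count < max; n--) {
        out[count++] = 1 + n;   /* 'a' plus n digits */
        if (possessive)
            break;              /* no shorter alternatives are kept */
    }
    return count;
}
```

For the subject "a123" the non-possessive form yields the lengths 4, 3 and 2, whereas the possessive form yields only 4.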
|
|
There are a number of features of PCRE regular expressions that are not |
There are a number of features of PCRE regular expressions that are not |
supported by the alternative matching algorithm. They are as follows: |
supported by the alternative matching algorithm. They are as follows: |
|
|
1. Because the algorithm finds all possible matches, the greedy or | 1. Because the algorithm finds all possible matches, the greedy or |
ungreedy nature of repetition quantifiers is not relevant. Greedy and | ungreedy nature of repetition quantifiers is not relevant. Greedy and |
ungreedy quantifiers are treated in exactly the same way. However, pos- |
ungreedy quantifiers are treated in exactly the same way. However, pos- |
sessive quantifiers can make a difference when what follows could also | sessive quantifiers can make a difference when what follows could also |
match what is quantified, for example in a pattern like this: |
match what is quantified, for example in a pattern like this: |
|
|
^a++\w! |
^a++\w! |
|
|
This pattern matches "aaab!" but not "aaa!", which would be matched by | This pattern matches "aaab!" but not "aaa!", which would be matched by |
a non-possessive quantifier. Similarly, if an atomic group is present, | a non-possessive quantifier. Similarly, if an atomic group is present, |
it is matched as if it were a standalone pattern at the current point, | it is matched as if it were a standalone pattern at the current point, |
and the longest match is then "locked in" for the rest of the overall | and the longest match is then "locked in" for the rest of the overall |
pattern. |
pattern. |
|
|
2. When dealing with multiple paths through the tree simultaneously, it |
2. When dealing with multiple paths through the tree simultaneously, it |
is not straightforward to keep track of captured substrings for the | is not straightforward to keep track of captured substrings for the |
different matching possibilities, and PCRE's implementation of this | different matching possibilities, and PCRE's implementation of this |
algorithm does not attempt to do this. This means that no captured sub- |
algorithm does not attempt to do this. This means that no captured sub- |
strings are available. |
strings are available. |
|
|
3. Because no substrings are captured, back references within the pat- | 3. Because no substrings are captured, back references within the pat- |
tern are not supported, and cause errors if encountered. |
tern are not supported, and cause errors if encountered. |
|
|
4. For the same reason, conditional expressions that use a backrefer- | 4. For the same reason, conditional expressions that use a backrefer- |
ence as the condition or test for a specific group recursion are not | ence as the condition or test for a specific group recursion are not |
supported. |
supported. |
|
|
5. Because many paths through the tree may be active, the \K escape | 5. Because many paths through the tree may be active, the \K escape |
sequence, which resets the start of the match when encountered (but may |
sequence, which resets the start of the match when encountered (but may |
be on some paths and not on others), is not supported. It causes an | be on some paths and not on others), is not supported. It causes an |
error if encountered. |
error if encountered. |
|
|
6. Callouts are supported, but the value of the capture_top field is | 6. Callouts are supported, but the value of the capture_top field is |
always 1, and the value of the capture_last field is always -1. |
always 1, and the value of the capture_last field is always -1. |
|
|
7. The \C escape sequence, which (in the standard algorithm) always | 7. The \C escape sequence, which (in the standard algorithm) always |
matches a single data unit, even in UTF-8, UTF-16 or UTF-32 modes, is | matches a single data unit, even in UTF-8, UTF-16 or UTF-32 modes, is |
not supported in these modes, because the alternative algorithm moves | not supported in these modes, because the alternative algorithm moves |
through the subject string one character (not data unit) at a time, for |
through the subject string one character (not data unit) at a time, for |
all active paths through the tree. |
all active paths through the tree. |
|
|
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) | 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) |
are not supported. (*FAIL) is supported, and behaves like a failing | are not supported. (*FAIL) is supported, and behaves like a failing |
negative assertion. |
negative assertion. |
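
The remark about possessive quantifiers in item 1 of the list above can be made concrete with a small hand-written matcher. This is only a sketch under stated assumptions: it does not use PCRE, the names is_word_char and match_a_run are hypothetical, and \w is taken to mean an ASCII letter, digit or underscore. It matches either ^a+\w! or ^a++\w! against the start of a subject.

```c
#include <ctype.h>

/* Hand-written illustration, not part of PCRE.
 * \w is assumed to be an ASCII letter, digit, or underscore. */
static int is_word_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Returns 1 if the subject matches ^a+\w! (possessive == 0) or
 * ^a++\w! (possessive != 0) at its start, and 0 otherwise. */
int match_a_run(const char *subject, int possessive)
{
    int run = 0;   /* number of leading 'a' characters */
    int n;

    while (subject[run] == 'a')
        run++;
    if (run == 0)
        return 0;

    /* Try the longest run first. A possessive repeat never gives
     * characters back, so it gets exactly one attempt. */
    for (n = run; n >= 1; n--) {
        if (is_word_char(subject[n]) && subject[n + 1] == '!')
            return 1;
        if (possessive)
            break;
    }
    return 0;
}
```

With possessive set, "aaab!" matches and "aaa!" does not. With possessive clear, both match, because the repeat can give back its final "a" to satisfy \w.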
|
|
|
|
ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
|
|
Using the alternative matching algorithm provides the following advan- | Using the alternative matching algorithm provides the following advan- |
tages: |
tages: |
|
|
1. All possible matches (at a single point in the subject) are automat- |
1. All possible matches (at a single point in the subject) are automat- |
ically found, and in particular, the longest match is found. To find | ically found, and in particular, the longest match is found. To find |
more than one match using the standard algorithm, you have to do kludgy |
more than one match using the standard algorithm, you have to do kludgy |
things with callouts. |
things with callouts. |
|
|
2. Because the alternative algorithm scans the subject string just | 2. Because the alternative algorithm scans the subject string just |
once, and never needs to backtrack (except for lookbehinds), it is pos- |
once, and never needs to backtrack (except for lookbehinds), it is pos- |
sible to pass very long subject strings to the matching function in | sible to pass very long subject strings to the matching function in |
several pieces, checking for partial matching each time. Although it is |
several pieces, checking for partial matching each time. Although it is |
possible to do multi-segment matching using the standard algorithm by | possible to do multi-segment matching using the standard algorithm by |
retaining partially matched substrings, it is more complicated. The | retaining partially matched substrings, it is more complicated. The |
pcrepartial documentation gives details of partial matching and dis- | pcrepartial documentation gives details of partial matching and dis- |
cusses multi-segment matching. |
cusses multi-segment matching. |
|
|
|
|
Line 1531 DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
|
Line 1539 DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
|
|
|
The alternative algorithm suffers from a number of disadvantages: |
The alternative algorithm suffers from a number of disadvantages: |
|
|
1. It is substantially slower than the standard algorithm. This is | 1. It is substantially slower than the standard algorithm. This is |
partly because it has to search for all possible matches, but is also | partly because it has to search for all possible matches, but is also |
because it is less susceptible to optimization. |
because it is less susceptible to optimization. |
|
|
2. Capturing parentheses and back references are not supported. |
2. Capturing parentheses and back references are not supported. |
Line 1550 AUTHOR
|
Line 1558 AUTHOR
|
|
|
REVISION |
REVISION |
|
|
Last updated: 08 January 2012 | Last updated: 12 November 2013 |
Copyright (c) 1997-2012 University of Cambridge. |
Copyright (c) 1997-2012 University of Cambridge. |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
|
|
Line 1958 CHECKING BUILD-TIME OPTIONS
|
Line 1966 CHECKING BUILD-TIME OPTIONS
|
POSIX interface uses malloc() for output vectors. Further details are |
POSIX interface uses malloc() for output vectors. Further details are |
given in the pcreposix documentation. |
given in the pcreposix documentation. |
|
|
|
PCRE_CONFIG_PARENS_LIMIT |
|
|
|
The output is a long integer that gives the maximum depth of nesting of |
|
parentheses (of any kind) in a pattern. This limit is imposed to cap |
|
the amount of system stack used when a pattern is compiled. It is spec- |
|
ified when PCRE is built; the default is 250. |
|
|
PCRE_CONFIG_MATCH_LIMIT |
PCRE_CONFIG_MATCH_LIMIT |
|
|
The output is a long integer that gives the default limit for the num- | The output is a long integer that gives the default limit for the num- |
ber of internal matching function calls in a pcre_exec() execution. | ber of internal matching function calls in a pcre_exec() execution. |
Further details are given with pcre_exec() below. |
Further details are given with pcre_exec() below. |
|
|
PCRE_CONFIG_MATCH_LIMIT_RECURSION |
PCRE_CONFIG_MATCH_LIMIT_RECURSION |
|
|
The output is a long integer that gives the default limit for the depth |
The output is a long integer that gives the default limit for the depth |
of recursion when calling the internal matching function in a | of recursion when calling the internal matching function in a |
pcre_exec() execution. Further details are given with pcre_exec() | pcre_exec() execution. Further details are given with pcre_exec() |
below. |
below. |
|
|
PCRE_CONFIG_STACKRECURSE |
PCRE_CONFIG_STACKRECURSE |
|
|
The output is an integer that is set to one if internal recursion when | The output is an integer that is set to one if internal recursion when |
running pcre_exec() is implemented by recursive function calls that use |
running pcre_exec() is implemented by recursive function calls that use |
the stack to remember their state. This is the usual way that PCRE is | the stack to remember their state. This is the usual way that PCRE is |
compiled. The output is zero if PCRE was compiled to use blocks of data |
compiled. The output is zero if PCRE was compiled to use blocks of data |
on the heap instead of recursive function calls. In this case, | on the heap instead of recursive function calls. In this case, |
pcre_stack_malloc and pcre_stack_free are called to manage memory | pcre_stack_malloc and pcre_stack_free are called to manage memory |
blocks on the heap, thus avoiding the use of the stack. |
blocks on the heap, thus avoiding the use of the stack. |
|
|
|
|
Line 1995 COMPILING A PATTERN
|
Line 2010 COMPILING A PATTERN
|
|
|
Either of the functions pcre_compile() or pcre_compile2() can be called |
Either of the functions pcre_compile() or pcre_compile2() can be called |
to compile a pattern into an internal form. The only difference between |
to compile a pattern into an internal form. The only difference between |
the two interfaces is that pcre_compile2() has an additional argument, | the two interfaces is that pcre_compile2() has an additional argument, |
errorcodeptr, via which a numerical error code can be returned. To | errorcodeptr, via which a numerical error code can be returned. To |
avoid too much repetition, we refer just to pcre_compile() below, but | avoid too much repetition, we refer just to pcre_compile() below, but |
the information applies equally to pcre_compile2(). |
the information applies equally to pcre_compile2(). |
|
|
The pattern is a C string terminated by a binary zero, and is passed in |
The pattern is a C string terminated by a binary zero, and is passed in |
the pattern argument. A pointer to a single block of memory that is | the pattern argument. A pointer to a single block of memory that is |
obtained via pcre_malloc is returned. This contains the compiled code | obtained via pcre_malloc is returned. This contains the compiled code |
and related data. The pcre type is defined for the returned block; this |
and related data. The pcre type is defined for the returned block; this |
is a typedef for a structure whose contents are not externally defined. |
is a typedef for a structure whose contents are not externally defined. |
It is up to the caller to free the memory (via pcre_free) when it is no |
It is up to the caller to free the memory (via pcre_free) when it is no |
longer required. |
longer required. |
|
|
Although the compiled code of a PCRE regex is relocatable, that is, it | Although the compiled code of a PCRE regex is relocatable, that is, it |
does not depend on memory location, the complete pcre data block is not |
does not depend on memory location, the complete pcre data block is not |
fully relocatable, because it may contain a copy of the tableptr argu- | fully relocatable, because it may contain a copy of the tableptr argu- |
ment, which is an address (see below). |
ment, which is an address (see below). |
|
|
The options argument contains various bit settings that affect the com- |
The options argument contains various bit settings that affect the com- |
pilation. It should be zero if no options are required. The available | pilation. It should be zero if no options are required. The available |
options are described below. Some of them (in particular, those that | options are described below. Some of them (in particular, those that |
are compatible with Perl, but some others as well) can also be set and | are compatible with Perl, but some others as well) can also be set and |
unset from within the pattern (see the detailed description in the | unset from within the pattern (see the detailed description in the |
pcrepattern documentation). For those options that can be different in | pcrepattern documentation). For those options that can be different in |
different parts of the pattern, the contents of the options argument | different parts of the pattern, the contents of the options argument |
specifies their settings at the start of compilation and execution. The |
specifies their settings at the start of compilation and execution. The |
PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and | PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and |
PCRE_NO_START_OPTIMIZE options can be set at the time of matching as | PCRE_NO_START_OPTIMIZE options can be set at the time of matching as |
well as at compile time. |
well as at compile time. |
|
|
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
if compilation of a pattern fails, pcre_compile() returns NULL, and | if compilation of a pattern fails, pcre_compile() returns NULL, and |
sets the variable pointed to by errptr to point to a textual error mes- |
sets the variable pointed to by errptr to point to a textual error mes- |
sage. This is a static string that is part of the library. You must not |
sage. This is a static string that is part of the library. You must not |
try to free it. Normally, the offset from the start of the pattern to | try to free it. Normally, the offset from the start of the pattern to |
the data unit that was being processed when the error was discovered is |
the data unit that was being processed when the error was discovered is |
placed in the variable pointed to by erroffset, which must not be NULL | placed in the variable pointed to by erroffset, which must not be NULL |
(if it is, an immediate error is given). However, for an invalid UTF-8 | (if it is, an immediate error is given). However, for an invalid UTF-8 |
or UTF-16 string, the offset is that of the first data unit of the | or UTF-16 string, the offset is that of the first data unit of the |
failing character. |
failing character. |
|
|
Some errors are not detected until the whole pattern has been scanned; | Some errors are not detected until the whole pattern has been scanned; |
in these cases, the offset passed back is the length of the pattern. | in these cases, the offset passed back is the length of the pattern. |
Note that the offset is in data units, not characters, even in a UTF | Note that the offset is in data units, not characters, even in a UTF |
mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char- |
mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char- |
acter. |
acter. |
|
|
If pcre_compile2() is used instead of pcre_compile(), and the error- | If pcre_compile2() is used instead of pcre_compile(), and the error- |
codeptr argument is not NULL, a non-zero error code number is returned | codeptr argument is not NULL, a non-zero error code number is returned |
via this argument in the event of an error. This is in addition to the | via this argument in the event of an error. This is in addition to the |
textual error message. Error codes and messages are listed below. |
textual error message. Error codes and messages are listed below. |
|
|
If the final argument, tableptr, is NULL, PCRE uses a default set of | If the final argument, tableptr, is NULL, PCRE uses a default set of |
character tables that are built when PCRE is compiled, using the | character tables that are built when PCRE is compiled, using the |
default C locale. Otherwise, tableptr must be an address that is the | default C locale. Otherwise, tableptr must be an address that is the |
result of a call to pcre_maketables(). This value is stored with the | result of a call to pcre_maketables(). This value is stored with the |
compiled pattern, and used again by pcre_exec(), unless another table | compiled pattern, and used again by pcre_exec() and pcre_dfa_exec() |
pointer is passed to it. For more discussion, see the section on locale | when the pattern is matched. For more discussion, see the section on |
support below. | locale support below. |
|
|
This code fragment shows a typical straightforward call to pcre_com- | This code fragment shows a typical straightforward call to pcre_com- |
pile(): |
pile(): |
|
|
pcre *re; |
pcre *re; |
Line 2068 COMPILING A PATTERN
|
Line 2083 COMPILING A PATTERN
|
&erroffset, /* for error offset */ |
&erroffset, /* for error offset */ |
NULL); /* use default character tables */ |
NULL); /* use default character tables */ |
|
|
The following names for option bits are defined in the pcre.h header | The following names for option bits are defined in the pcre.h header |
file: |
file: |
|
|
PCRE_ANCHORED |
PCRE_ANCHORED |
|
|
If this bit is set, the pattern is forced to be "anchored", that is, it |
If this bit is set, the pattern is forced to be "anchored", that is, it |
is constrained to match only at the first matching point in the string | is constrained to match only at the first matching point in the string |
that is being searched (the "subject string"). This effect can also be | that is being searched (the "subject string"). This effect can also be |
achieved by appropriate constructs in the pattern itself, which is the | achieved by appropriate constructs in the pattern itself, which is the |
only way to do it in Perl. |
only way to do it in Perl. |
|
|
PCRE_AUTO_CALLOUT |
PCRE_AUTO_CALLOUT |
|
|
If this bit is set, pcre_compile() automatically inserts callout items, |
If this bit is set, pcre_compile() automatically inserts callout items, |
all with number 255, before each pattern item. For discussion of the | all with number 255, before each pattern item. For discussion of the |
callout facility, see the pcrecallout documentation. |
callout facility, see the pcrecallout documentation. |
|
|
PCRE_BSR_ANYCRLF |
PCRE_BSR_ANYCRLF |
PCRE_BSR_UNICODE |
PCRE_BSR_UNICODE |
|
|
These options (which are mutually exclusive) control what the \R escape |
These options (which are mutually exclusive) control what the \R escape |
sequence matches. The choice is either to match only CR, LF, or CRLF, | sequence matches. The choice is either to match only CR, LF, or CRLF, |
or to match any Unicode newline sequence. The default is specified when |
or to match any Unicode newline sequence. The default is specified when |
PCRE is built. It can be overridden from within the pattern, or by set- |
PCRE is built. It can be overridden from within the pattern, or by set- |
ting an option when a compiled pattern is matched. |
ting an option when a compiled pattern is matched. |
|
|
PCRE_CASELESS |
PCRE_CASELESS |
|
|
If this bit is set, letters in the pattern match both upper and lower | If this bit is set, letters in the pattern match both upper and lower |
case letters. It is equivalent to Perl's /i option, and it can be | case letters. It is equivalent to Perl's /i option, and it can be |
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE | changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE |
always understands the concept of case for characters whose values are | always understands the concept of case for characters whose values are |
less than 128, so caseless matching is always possible. For characters | less than 128, so caseless matching is always possible. For characters |
with higher values, the concept of case is supported if PCRE is com- | with higher values, the concept of case is supported if PCRE is com- |
piled with Unicode property support, but not otherwise. If you want to | piled with Unicode property support, but not otherwise. If you want to |
use caseless matching for characters 128 and above, you must ensure | use caseless matching for characters 128 and above, you must ensure |
that PCRE is compiled with Unicode property support as well as with | that PCRE is compiled with Unicode property support as well as with |
UTF-8 support. |
UTF-8 support. |
|
|
PCRE_DOLLAR_ENDONLY |
PCRE_DOLLAR_ENDONLY |
|
|
If this bit is set, a dollar metacharacter in the pattern matches only | If this bit is set, a dollar metacharacter in the pattern matches only |
at the end of the subject string. Without this option, a dollar also | at the end of the subject string. Without this option, a dollar also |
matches immediately before a newline at the end of the string (but not | matches immediately before a newline at the end of the string (but not |
before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored | before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored |
if PCRE_MULTILINE is set. There is no equivalent to this option in | if PCRE_MULTILINE is set. There is no equivalent to this option in |
Perl, and no way to set it within a pattern. |
Perl, and no way to set it within a pattern. |
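
The rule for where a dollar may match can be written down as a small predicate. This is a hand-written sketch, not PCRE code. The function name dollar_can_match is hypothetical, PCRE_MULTILINE is assumed to be unset, and the newline convention is assumed to be a single LF.

```c
#include <stddef.h>

/* Hand-written illustration, not part of PCRE.
 * Reports whether a dollar could match at offset pos in a subject of
 * the given length, with PCRE_MULTILINE unset. Assumes the newline
 * convention is a single LF character. */
int dollar_can_match(const char *subject, size_t length, size_t pos,
                     int dollar_endonly)
{
    if (pos == length)
        return 1;                       /* very end of the subject */
    if (!dollar_endonly && pos + 1 == length && subject[pos] == '\n')
        return 1;                       /* just before a final newline */
    return 0;
}
```

For the four-character subject "abc\n", offset 3 is accepted only when dollar_endonly is zero, while offset 4 is accepted in both cases.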
|
|
PCRE_DOTALL |
PCRE_DOTALL |
|
|
If this bit is set, a dot metacharacter in the pattern matches a char- | If this bit is set, a dot metacharacter in the pattern matches a char- |
acter of any value, including one that indicates a newline. However, it |
acter of any value, including one that indicates a newline. However, it |
only ever matches one character, even if newlines are coded as CRLF. | only ever matches one character, even if newlines are coded as CRLF. |
Without this option, a dot does not match when the current position is | Without this option, a dot does not match when the current position is |
at a newline. This option is equivalent to Perl's /s option, and it can |
at a newline. This option is equivalent to Perl's /s option, and it can |
be changed within a pattern by a (?s) option setting. A negative class | be changed within a pattern by a (?s) option setting. A negative class |
such as [^a] always matches newline characters, independent of the set- |
such as [^a] always matches newline characters, independent of the set- |
ting of this option. |
ting of this option. |
|
|
PCRE_DUPNAMES |
PCRE_DUPNAMES |
|
|
If this bit is set, names used to identify capturing subpatterns need | If this bit is set, names used to identify capturing subpatterns need |
not be unique. This can be helpful for certain types of pattern when it |
not be unique. This can be helpful for certain types of pattern when it |
is known that only one instance of the named subpattern can ever be | is known that only one instance of the named subpattern can ever be |
matched. There are more details of named subpatterns below; see also | matched. There are more details of named subpatterns below; see also |
the pcrepattern documentation. |
the pcrepattern documentation. |
|
|
PCRE_EXTENDED |
PCRE_EXTENDED |
|
|
If this bit is set, white space data characters in the pattern are | If this bit is set, most white space characters in the pattern are |
totally ignored except when escaped or inside a character class. White | totally ignored except when escaped or inside a character class. How- |
space does not include the VT character (code 11). In addition, charac- | ever, white space is not allowed within sequences such as (?> that |
ters between an unescaped # outside a character class and the next new- | introduce various parenthesized subpatterns, nor within a numerical |
line, inclusive, are also ignored. This is equivalent to Perl's /x | quantifier such as {1,3}. However, ignorable white space is permitted |
option, and it can be changed within a pattern by a (?x) option set- | between an item and a following quantifier and between a quantifier and |
ting. | a following + that indicates possessiveness. |
|
|
Which characters are interpreted as newlines is controlled by the | White space formerly did not include the VT character (code 11), because |
options passed to pcre_compile() or by a special sequence at the start | Perl did not treat this character as white space. However, Perl changed |
of the pattern, as described in the section entitled "Newline conven- | at release 5.18, so PCRE followed at release 8.34, and VT is now |
| treated as white space. |
| |
| PCRE_EXTENDED also causes characters between an unescaped # outside a |
| character class and the next newline, inclusive, to be ignored. |
| PCRE_EXTENDED is equivalent to Perl's /x option, and it can be changed |
| within a pattern by a (?x) option setting. |
| |
| Which characters are interpreted as newlines is controlled by the |
| options passed to pcre_compile() or by a special sequence at the start |
| of the pattern, as described in the section entitled "Newline conven- |
tions" in the pcrepattern documentation. Note that the end of this type |
tions" in the pcrepattern documentation. Note that the end of this type |
of comment is a literal newline sequence in the pattern; escape | of comment is a literal newline sequence in the pattern; escape |
sequences that happen to represent a newline do not count. |
sequences that happen to represent a newline do not count. |
|
|
This option makes it possible to include comments inside complicated | This option makes it possible to include comments inside complicated |
patterns. Note, however, that this applies only to data characters. | patterns. Note, however, that this applies only to data characters. |
White space characters may never appear within special character | White space characters may never appear within special character |
sequences in a pattern, for example within the sequence (?( that intro- |
sequences in a pattern, for example within the sequence (?( that intro- |
duces a conditional subpattern. |
duces a conditional subpattern. |
|
|
PCRE_EXTRA |
PCRE_EXTRA |
|
|
This option was invented in order to turn on additional functionality | This option was invented in order to turn on additional functionality |
of PCRE that is incompatible with Perl, but it is currently of very | of PCRE that is incompatible with Perl, but it is currently of very |
little use. When set, any backslash in a pattern that is followed by a | little use. When set, any backslash in a pattern that is followed by a |
letter that has no special meaning causes an error, thus reserving | letter that has no special meaning causes an error, thus reserving |
these combinations for future expansion. By default, as in Perl, a | these combinations for future expansion. By default, as in Perl, a |
backslash followed by a letter with no special meaning is treated as a | backslash followed by a letter with no special meaning is treated as a |
literal. (Perl can, however, be persuaded to give an error for this, by |
literal. (Perl can, however, be persuaded to give an error for this, by |
running it with the -w option.) There are at present no other features | running it with the -w option.) There are at present no other features |
controlled by this option. It can also be set by a (?X) option setting | controlled by this option. It can also be set by a (?X) option setting |
within a pattern. |
within a pattern. |
|
|
PCRE_FIRSTLINE |
PCRE_FIRSTLINE |
|
|
If this option is set, an unanchored pattern is required to match | If this option is set, an unanchored pattern is required to match |
before or at the first newline in the subject string, though the | before or at the first newline in the subject string, though the |
matched text may continue over the newline. |
matched text may continue over the newline. |
|
|
PCRE_JAVASCRIPT_COMPAT |
PCRE_JAVASCRIPT_COMPAT |
|
|
If this option is set, PCRE's behaviour is changed in some ways so that |
If this option is set, PCRE's behaviour is changed in some ways so that |
it is compatible with JavaScript rather than Perl. The changes are as | it is compatible with JavaScript rather than Perl. The changes are as |
follows: |
follows: |
|
|
(1) A lone closing square bracket in a pattern causes a compile-time | (1) A lone closing square bracket in a pattern causes a compile-time |
error, because this is illegal in JavaScript (by default it is treated | error, because this is illegal in JavaScript (by default it is treated |
as a data character). Thus, the pattern AB]CD becomes illegal when this |
as a data character). Thus, the pattern AB]CD becomes illegal when this |
option is set. |
option is set. |
|
|
(2) At run time, a back reference to an unset subpattern group matches | (2) At run time, a back reference to an unset subpattern group matches |
an empty string (by default this causes the current matching alterna- | an empty string (by default this causes the current matching alterna- |
tive to fail). A pattern such as (\1)(a) succeeds when this option is | tive to fail). A pattern such as (\1)(a) succeeds when this option is |
set (assuming it can find an "a" in the subject), whereas it fails by | set (assuming it can find an "a" in the subject), whereas it fails by |
default, for Perl compatibility. |
default, for Perl compatibility. |
|
|
(3) \U matches an upper case "U" character; by default \U causes a com- |
(3) \U matches an upper case "U" character; by default \U causes a com- |
pile time error (Perl uses \U to upper case subsequent characters). |
pile time error (Perl uses \U to upper case subsequent characters). |
|
|
(4) \u matches a lower case "u" character unless it is followed by four |
(4) \u matches a lower case "u" character unless it is followed by four |
hexadecimal digits, in which case the hexadecimal number defines the | hexadecimal digits, in which case the hexadecimal number defines the |
code point to match. By default, \u causes a compile time error (Perl | code point to match. By default, \u causes a compile time error (Perl |
uses it to upper case the following character). |
uses it to upper case the following character). |
|
|
(5) \x matches a lower case "x" character unless it is followed by two | (5) \x matches a lower case "x" character unless it is followed by two |
hexadecimal digits, in which case the hexadecimal number defines the | hexadecimal digits, in which case the hexadecimal number defines the |
code point to match. By default, as in Perl, a hexadecimal number is | code point to match. By default, as in Perl, a hexadecimal number is |
always expected after \x, but it may have zero, one, or two digits (so, |
always expected after \x, but it may have zero, one, or two digits (so, |
for example, \xz matches a binary zero character followed by z). |
for example, \xz matches a binary zero character followed by z). |
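
The two \x rules in item (5) can be sketched in C as follows. This is an
illustration of the rule as stated above, not PCRE's own code: the names
hex_val() and decode_backslash_x() are hypothetical, and the \x{...}
brace form is deliberately left out.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: value of one hex digit. The caller must already
   have checked the character with isxdigit(). */
static int hex_val(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return c - 'A' + 10;
}

/* Hypothetical helper, not part of the PCRE API. `s` points just past
   the "x" of a "\x" escape. Returns the character value to match and
   stores in *used how many characters after the "x" were consumed. */
int decode_backslash_x(const char *s, int javascript_compat, size_t *used)
{
    if (javascript_compat) {
        /* JavaScript rule: exactly two hex digits, otherwise a literal "x". */
        if (isxdigit((unsigned char)s[0]) && isxdigit((unsigned char)s[1])) {
            *used = 2;
            return hex_val(s[0]) * 16 + hex_val(s[1]);
        }
        *used = 0;
        return 'x';
    }

    /* Default (Perl-like) rule: zero, one, or two hex digits. With zero
       digits the value is binary zero. */
    int value = 0;
    size_t n = 0;
    while (n < 2 && isxdigit((unsigned char)s[n])) {
        value = value * 16 + hex_val((unsigned char)s[n]);
        n++;
    }
    *used = n;
    return value;
}
```

For the text "z" following \x, the default rule yields value 0 with no
characters consumed (so "z" is matched next as a literal), while the
JavaScript rule yields a literal "x".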
|
|
PCRE_MULTILINE |
PCRE_MULTILINE |
|
|
By default, for the purposes of matching "start of line" and "end of | By default, for the purposes of matching "start of line" and "end of |
line", PCRE treats the subject string as consisting of a single line of |
line", PCRE treats the subject string as consisting of a single line of |
characters, even if it actually contains newlines. The "start of line" | characters, even if it actually contains newlines. The "start of line" |
metacharacter (^) matches only at the start of the string, and the "end |
metacharacter (^) matches only at the start of the string, and the "end |
of line" metacharacter ($) matches only at the end of the string, or | of line" metacharacter ($) matches only at the end of the string, or |
before a terminating newline (except when PCRE_DOLLAR_ENDONLY is set). | before a terminating newline (except when PCRE_DOLLAR_ENDONLY is set). |
Note, however, that unless PCRE_DOTALL is set, the "any character" | Note, however, that unless PCRE_DOTALL is set, the "any character" |
metacharacter (.) does not match at a newline. This behaviour (for ^, | metacharacter (.) does not match at a newline. This behaviour (for ^, |
$, and dot) is the same as Perl. |
$, and dot) is the same as Perl. |
|
|
When PCRE_MULTILINE is set, the "start of line" and "end of line" | When PCRE_MULTILINE is set, the "start of line" and "end of line" |
constructs match immediately following or immediately before internal | constructs match immediately following or immediately before internal |
newlines in the subject string, respectively, as well as at the very | newlines in the subject string, respectively, as well as at the very |
start and end. This is equivalent to Perl's /m option, and it can be | start and end. This is equivalent to Perl's /m option, and it can be |
changed within a pattern by a (?m) option setting. If there are no new- |
changed within a pattern by a (?m) option setting. If there are no new- |
lines in a subject string, or no occurrences of ^ or $ in a pattern, | lines in a subject string, or no occurrences of ^ or $ in a pattern, |
setting PCRE_MULTILINE has no effect. |
setting PCRE_MULTILINE has no effect. |
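
As a rough sketch of the positions described above, the following C
functions report whether ^ or $ could match at a given offset. They are
hypothetical helpers, not PCRE functions, and they assume that newline
is a single LF and that PCRE_DOLLAR_ENDONLY is not set.

```c
#include <stddef.h>

/* Hypothetical helper: can ^ match at offset pos (0 <= pos <= len)?
   Assumes LF is the only newline sequence. */
int caret_matches_at(const char *subject, size_t len, size_t pos,
                     int multiline)
{
    if (pos == 0) return 1;                 /* very start of the subject */
    /* In multiline mode, also immediately after an internal newline.
       A newline that ends the subject is not internal. */
    return multiline && pos < len && subject[pos - 1] == '\n';
}

/* Hypothetical helper: can $ match at offset pos (0 <= pos <= len)?
   Assumes LF newlines and that PCRE_DOLLAR_ENDONLY is not set. */
int dollar_matches_at(const char *subject, size_t len, size_t pos,
                      int multiline)
{
    if (pos == len) return 1;               /* very end of the subject */
    if (pos + 1 == len && subject[pos] == '\n')
        return 1;                           /* before a terminating newline */
    /* In multiline mode, also immediately before an internal newline. */
    return multiline && subject[pos] == '\n';
}
```

With the subject "ab\ncd", offset 3 (just after the newline) is a
"start of line" only when the multiline flag is set, and offset 2 (just
before it) is an "end of line" only in that case too.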
|
|
PCRE_NEVER_UTF |
PCRE_NEVER_UTF |
|
|
This option locks out interpretation of the pattern as UTF-8 (or UTF-16 |
This option locks out interpretation of the pattern as UTF-8 (or UTF-16 |
or UTF-32 in the 16-bit and 32-bit libraries). In particular, it pre- | or UTF-32 in the 16-bit and 32-bit libraries). In particular, it pre- |
vents the creator of the pattern from switching to UTF interpretation | vents the creator of the pattern from switching to UTF interpretation |
by starting the pattern with (*UTF). This may be useful in applications |
by starting the pattern with (*UTF). This may be useful in applications |
that process patterns from external sources. The combination of |
that process patterns from external sources. The combination of |
PCRE_UTF8 and PCRE_NEVER_UTF also causes an error. |
PCRE_UTF8 and PCRE_NEVER_UTF also causes an error. |
Line 2243 COMPILING A PATTERN
|
Line 2268 COMPILING A PATTERN
|
PCRE_NEWLINE_ANYCRLF |
PCRE_NEWLINE_ANYCRLF |
PCRE_NEWLINE_ANY |
PCRE_NEWLINE_ANY |
|
|
These options override the default newline definition that was chosen | These options override the default newline definition that was chosen |
when PCRE was built. Setting the first or the second specifies that a | when PCRE was built. Setting the first or the second specifies that a |
newline is indicated by a single character (CR or LF, respectively). | newline is indicated by a single character (CR or LF, respectively). |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the | Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies | two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies |
that any of the three preceding sequences should be recognized. Setting |
that any of the three preceding sequences should be recognized. Setting |
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be | PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be |
recognized. |
recognized. |
|
|
In an ASCII/Unicode environment, the Unicode newline sequences are the | In an ASCII/Unicode environment, the Unicode newline sequences are the |
three just mentioned, plus the single characters VT (vertical tab, | three just mentioned, plus the single characters VT (vertical tab, |
U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line sep- |
U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line sep- |
arator, U+2028), and PS (paragraph separator, U+2029). For the 8-bit | arator, U+2028), and PS (paragraph separator, U+2029). For the 8-bit |
library, the last two are recognized only in UTF-8 mode. |
library, the last two are recognized only in UTF-8 mode. |
|
|
When PCRE is compiled to run in an EBCDIC (mainframe) environment, the | When PCRE is compiled to run in an EBCDIC (mainframe) environment, the |
code for CR is 0x0d, the same as ASCII. However, the character code for |
code for CR is 0x0d, the same as ASCII. However, the character code for |
LF is normally 0x15, though in some EBCDIC environments 0x25 is used. | LF is normally 0x15, though in some EBCDIC environments 0x25 is used. |
Whichever of these is not LF is made to correspond to Unicode's NEL | Whichever of these is not LF is made to correspond to Unicode's NEL |
character. EBCDIC codes are all less than 256. For more details, see | character. EBCDIC codes are all less than 256. For more details, see |
the pcrebuild documentation. |
the pcrebuild documentation. |
|
|
The newline setting in the options word uses three bits that are | The newline setting in the options word uses three bits that are |
treated as a number, giving eight possibilities. Currently only six are |
treated as a number, giving eight possibilities. Currently only six are |
used (default plus the five values above). This means that if you set | used (default plus the five values above). This means that if you set |
more than one newline option, the combination may or may not be sensi- | more than one newline option, the combination may or may not be sensi- |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and | PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and |
cause an error. |
cause an error. |
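
The way the three newline bits combine can be sketched as below. The
SKETCH_ constants are local stand-ins whose values are assumed to mirror
the PCRE_NEWLINE_xxx definitions in the pcre.h of PCRE 8.x; check your
own header before relying on them. newline_setting_number() is a
hypothetical helper, not a PCRE function.

```c
/* Assumed values, copied by hand to keep the sketch self-contained.
   Compare them with PCRE_NEWLINE_xxx in your pcre.h. */
#define SKETCH_NEWLINE_CR      0x00100000ul
#define SKETCH_NEWLINE_LF      0x00200000ul
#define SKETCH_NEWLINE_CRLF    0x00300000ul
#define SKETCH_NEWLINE_ANY     0x00400000ul
#define SKETCH_NEWLINE_ANYCRLF 0x00500000ul
#define SKETCH_NEWLINE_MASK    0x00700000ul   /* the three newline bits */

/* Hypothetical helper: treat the three newline bits as a number.
   Returns 0 (build-time default) to 5 for the six settings in use,
   or -1 for the two unused combinations. */
int newline_setting_number(unsigned long options)
{
    int n = (int)((options & SKETCH_NEWLINE_MASK) >> 20);
    return (n <= 5) ? n : -1;
}
```

Under these assumed values, CR combined with LF gives 3, the same number
as CRLF, whereas LF combined with ANY gives 6, which is one of the unused
numbers and is reported here as -1.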
|
|
The only time that a line break in a pattern is specially recognized | The only time that a line break in a pattern is specially recognized |
when compiling is when PCRE_EXTENDED is set. CR and LF are white space | when compiling is when PCRE_EXTENDED is set. CR and LF are white space |
characters, and so are ignored in this mode. Also, an unescaped # out- | characters, and so are ignored in this mode. Also, an unescaped # out- |
side a character class indicates a comment that lasts until after the | side a character class indicates a comment that lasts until after the |
next line break sequence. In other circumstances, line break sequences | next line break sequence. In other circumstances, line break sequences |
in patterns are treated as literal data. |
in patterns are treated as literal data. |
|
|
The newline option that is set at compile time becomes the default that |
The newline option that is set at compile time becomes the default that |
Line 2286 COMPILING A PATTERN
|
Line 2311 COMPILING A PATTERN
|
PCRE_NO_AUTO_CAPTURE |
PCRE_NO_AUTO_CAPTURE |
|
|
If this option is set, it disables the use of numbered capturing paren- |
If this option is set, it disables the use of numbered capturing paren- |
theses in the pattern. Any opening parenthesis that is not followed by | theses in the pattern. Any opening parenthesis that is not followed by |
? behaves as if it were followed by ?: but named parentheses can still | ? behaves as if it were followed by ?: but named parentheses can still |
be used for capturing (and they acquire numbers in the usual way). | be used for capturing (and they acquire numbers in the usual way). |
There is no equivalent of this option in Perl. |
There is no equivalent of this option in Perl. |
|
|
|
PCRE_NO_AUTO_POSSESS |
|
|
|
If this option is set, it disables "auto-possessification". This is an |
|
optimization that, for example, turns a+b into a++b in order to avoid |
|
backtracks into a+ that can never be successful. However, if callouts |
|
are in use, auto-possessification means that some of them are never |
|
taken. You can set this option if you want the matching functions to do |
|
a full unoptimized search and run all the callouts, but it is mainly |
|
provided for testing purposes. |
|
|
PCRE_NO_START_OPTIMIZE |
PCRE_NO_START_OPTIMIZE |
|
|
This is an option that acts at matching time; that is, it is really an | This is an option that acts at matching time; that is, it is really an |
option for pcre_exec() or pcre_dfa_exec(). If it is set at compile | option for pcre_exec() or pcre_dfa_exec(). If it is set at compile |
time, it is remembered with the compiled pattern and assumed at match- | time, it is remembered with the compiled pattern and assumed at match- |
ing time. This is necessary if you want to use JIT execution, because | ing time. This is necessary if you want to use JIT execution, because |
the JIT compiler needs to know whether or not this option is set. For | the JIT compiler needs to know whether or not this option is set. For |
details see the discussion of PCRE_NO_START_OPTIMIZE below. |
details see the discussion of PCRE_NO_START_OPTIMIZE below. |
|
|
PCRE_UCP |
PCRE_UCP |
|
|
This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, | This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, |
\w, and some of the POSIX character classes. By default, only ASCII | \w, and some of the POSIX character classes. By default, only ASCII |
characters are recognized, but if PCRE_UCP is set, Unicode properties | characters are recognized, but if PCRE_UCP is set, Unicode properties |
are used instead to classify characters. More details are given in the | are used instead to classify characters. More details are given in the |
section on generic character types in the pcrepattern page. If you set | section on generic character types in the pcrepattern page. If you set |
PCRE_UCP, matching one of the items it affects takes much longer. The | PCRE_UCP, matching one of the items it affects takes much longer. The |
option is available only if PCRE has been compiled with Unicode prop- | option is available only if PCRE has been compiled with Unicode prop- |
erty support. |
erty support. |
|
|
PCRE_UNGREEDY |
PCRE_UNGREEDY |
|
|
This option inverts the "greediness" of the quantifiers so that they | This option inverts the "greediness" of the quantifiers so that they |
are not greedy by default, but become greedy if followed by "?". It is | are not greedy by default, but become greedy if followed by "?". It is |
not compatible with Perl. It can also be set by a (?U) option setting | not compatible with Perl. It can also be set by a (?U) option setting |
within the pattern. |
within the pattern. |
|
|
PCRE_UTF8 |
PCRE_UTF8 |
|
|
This option causes PCRE to regard both the pattern and the subject as | This option causes PCRE to regard both the pattern and the subject as |
strings of UTF-8 characters instead of single-byte strings. However, it |
strings of UTF-8 characters instead of single-byte strings. However, it |
is available only when PCRE is built to include UTF support. If not, | is available only when PCRE is built to include UTF support. If not, |
the use of this option provokes an error. Details of how this option | the use of this option provokes an error. Details of how this option |
changes the behaviour of PCRE are given in the pcreunicode page. |
changes the behaviour of PCRE are given in the pcreunicode page. |
|
|
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
|
|
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
automatically checked. There is a discussion about the validity of | automatically checked. There is a discussion about the validity of |
UTF-8 strings in the pcreunicode page. If an invalid UTF-8 sequence is | UTF-8 strings in the pcreunicode page. If an invalid UTF-8 sequence is |
found, pcre_compile() returns an error. If you already know that your | found, pcre_compile() returns an error. If you already know that your |
pattern is valid, and you want to skip this check for performance rea- | pattern is valid, and you want to skip this check for performance rea- |
sons, you can set the PCRE_NO_UTF8_CHECK option. When it is set, the | sons, you can set the PCRE_NO_UTF8_CHECK option. When it is set, the |
effect of passing an invalid UTF-8 string as a pattern is undefined. It |
effect of passing an invalid UTF-8 string as a pattern is undefined. It |
may cause your program to crash. Note that this option can also be | may cause your program to crash or loop. Note that this option can also |
passed to pcre_exec() and pcre_dfa_exec(), to suppress the validity | be passed to pcre_exec() and pcre_dfa_exec(), to suppress the validity |
checking of subject strings only. If the same string is being matched | checking of subject strings only. If the same string is being matched |
many times, the option can be safely set for the second and subsequent | many times, the option can be safely set for the second and subsequent |
matchings to improve performance. |
matchings to improve performance. |
|
|
|
|
COMPILATION ERROR CODES |
COMPILATION ERROR CODES |
|
|
The following table lists the error codes that may be returned by | The following table lists the error codes that may be returned by |
pcre_compile2(), along with the error messages that may be returned by | pcre_compile2(), along with the error messages that may be returned by |
both compiling functions. Note that error messages are always 8-bit | both compiling functions. Note that error messages are always 8-bit |
ASCII strings, even in 16-bit or 32-bit mode. As PCRE has developed, | ASCII strings, even in 16-bit or 32-bit mode. As PCRE has developed, |
some error codes have fallen out of use. To avoid confusion, they have | some error codes have fallen out of use. To avoid confusion, they have |
not been re-used. |
not been re-used. |
|
|
0 no error |
0 no error |
Line 2385 COMPILATION ERROR CODES
|
Line 2420 COMPILATION ERROR CODES
|
31 POSIX collating elements are not supported |
31 POSIX collating elements are not supported |
32 this version of PCRE is compiled without UTF support |
32 this version of PCRE is compiled without UTF support |
33 [this code is not in use] |
33 [this code is not in use] |
34 character value in \x{...} sequence is too large | 34 character value in \x{} or \o{} is too large |
35 invalid condition (?(0) |
35 invalid condition (?(0) |
36 \C not allowed in lookbehind assertion |
36 \C not allowed in lookbehind assertion |
37 PCRE does not support \L, \l, \N{name}, \U, or \u |
37 PCRE does not support \L, \l, \N{name}, \U, or \u |
Line 2433 COMPILATION ERROR CODES
|
Line 2468 COMPILATION ERROR CODES
|
75 name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN) |
75 name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN) |
76 character value in \u.... sequence is too large |
76 character value in \u.... sequence is too large |
77 invalid UTF-32 string (specifically UTF-32) |
77 invalid UTF-32 string (specifically UTF-32) |
|
78 setting UTF is disabled by the application |
|
79 non-hex character in \x{} (closing brace missing?) |
|
80 non-octal character in \o{} (closing brace missing?) |
|
81 missing opening brace after \o |
|
82 parentheses are too deeply nested |
|
83 invalid range in character class |
|
|
The numbers 32 and 10000 in errors 48 and 49 are defaults; different | The numbers 32 and 10000 in errors 48 and 49 are defaults; different |
values may be used if the limits were changed when PCRE was built. |
values may be used if the limits were changed when PCRE was built. |
|
|
|
|
STUDYING A PATTERN |
STUDYING A PATTERN |
|
|
pcre_extra *pcre_study(const pcre *code, int options | pcre_extra *pcre_study(const pcre *code, int options, |
const char **errptr); |
const char **errptr); |
|
|
If a compiled pattern is going to be used several times, it is worth | If a compiled pattern is going to be used several times, it is worth |
spending more time analyzing it in order to speed up the time taken for |
spending more time analyzing it in order to speed up the time taken for |
matching. The function pcre_study() takes a pointer to a compiled pat- | matching. The function pcre_study() takes a pointer to a compiled pat- |
tern as its first argument. If studying the pattern produces additional |
tern as its first argument. If studying the pattern produces additional |
information that will help speed up matching, pcre_study() returns a | information that will help speed up matching, pcre_study() returns a |
pointer to a pcre_extra block, in which the study_data field points to | pointer to a pcre_extra block, in which the study_data field points to |
the results of the study. |
the results of the study. |
|
|
The returned value from pcre_study() can be passed directly to |
The returned value from pcre_study() can be passed directly to |
pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also con- | pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also con- |
tains other fields that can be set by the caller before the block is | tains other fields that can be set by the caller before the block is |
passed; these are described below in the section on matching a pattern. |
passed; these are described below in the section on matching a pattern. |
|
|
If studying the pattern does not produce any useful information, | If studying the pattern does not produce any useful information, |
pcre_study() returns NULL by default. In that circumstance, if the | pcre_study() returns NULL by default. In that circumstance, if the |
calling program wants to pass any of the other fields to pcre_exec() or |
calling program wants to pass any of the other fields to pcre_exec() or |
pcre_dfa_exec(), it must set up its own pcre_extra block. However, if | pcre_dfa_exec(), it must set up its own pcre_extra block. However, if |
pcre_study() is called with the PCRE_STUDY_EXTRA_NEEDED option, it | pcre_study() is called with the PCRE_STUDY_EXTRA_NEEDED option, it |
returns a pcre_extra block even if studying did not find any additional |
returns a pcre_extra block even if studying did not find any additional |
information. It may still return NULL, however, if an error occurs in | information. It may still return NULL, however, if an error occurs in |
pcre_study(). |
pcre_study(). |
|
|
The second argument of pcre_study() contains option bits. There are | The second argument of pcre_study() contains option bits. There are |
three further options in addition to PCRE_STUDY_EXTRA_NEEDED: |
three further options in addition to PCRE_STUDY_EXTRA_NEEDED: |
|
|
PCRE_STUDY_JIT_COMPILE |
PCRE_STUDY_JIT_COMPILE |
PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE |
PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE |
PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE |
PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE |
|
|
If any of these are set, and the just-in-time compiler is available, | If any of these are set, and the just-in-time compiler is available, |
the pattern is further compiled into machine code that executes much | the pattern is further compiled into machine code that executes much |
faster than the pcre_exec() interpretive matching function. If the | faster than the pcre_exec() interpretive matching function. If the |
just-in-time compiler is not available, these options are ignored. All | just-in-time compiler is not available, these options are ignored. All |
undefined bits in the options argument must be zero. |
undefined bits in the options argument must be zero. |
|
|
JIT compilation is a heavyweight optimization. It can take some time | JIT compilation is a heavyweight optimization. It can take some time |
for patterns to be analyzed, and for one-off matches and simple pat- | for patterns to be analyzed, and for one-off matches and simple pat- |
terns the benefit of faster execution might be offset by a much slower | terns the benefit of faster execution might be offset by a much slower |
study time. Not all patterns can be optimized by the JIT compiler. For |
study time. Not all patterns can be optimized by the JIT compiler. For |
those that cannot be handled, matching automatically falls back to the | those that cannot be handled, matching automatically falls back to the |
pcre_exec() interpreter. For more details, see the pcrejit documenta- | pcre_exec() interpreter. For more details, see the pcrejit documenta- |
tion. |
tion. |
|
|
The third argument for pcre_study() is a pointer for an error message. | The third argument for pcre_study() is a pointer for an error message. |
If studying succeeds (even if no data is returned), the variable it | If studying succeeds (even if no data is returned), the variable it |
points to is set to NULL. Otherwise it is set to point to a textual | points to is set to NULL. Otherwise it is set to point to a textual |
error message. This is a static string that is part of the library. You |
error message. This is a static string that is part of the library. You |
must not try to free it. You should test the error pointer for NULL | must not try to free it. You should test the error pointer for NULL |
after calling pcre_study(), to be sure that it has run successfully. |
after calling pcre_study(), to be sure that it has run successfully. |
|
|
When you are finished with a pattern, you can free the memory used for | When you are finished with a pattern, you can free the memory used for |
the study data by calling pcre_free_study(). This function was added to |
the study data by calling pcre_free_study(). This function was added to |
the API for release 8.20. For earlier versions, the memory could be | the API for release 8.20. For earlier versions, the memory could be |
freed with pcre_free(), just like the pattern itself. This will still | freed with pcre_free(), just like the pattern itself. This will still |
work in cases where JIT optimization is not used, but it is advisable | work in cases where JIT optimization is not used, but it is advisable |
to change to the new function when convenient. |
to change to the new function when convenient. |
|
|
This is a typical way in which pcre_study() is used (except that in a | This is a typical way in which pcre_study() is used (except that in a |
real application there should be tests for errors): |
real application there should be tests for errors): |
|
|
int rc; |
int rc; |
Line 2520 STUDYING A PATTERN
|
Line 2561 STUDYING A PATTERN
|
Studying a pattern does two things: first, a lower bound for the length |
Studying a pattern does two things: first, a lower bound for the length |
of subject string that is needed to match the pattern is computed. This |
of subject string that is needed to match the pattern is computed. This |
does not mean that there are any strings of that length that match, but |
does not mean that there are any strings of that length that match, but |
it does guarantee that no shorter strings match. The value is used to | it does guarantee that no shorter strings match. The value is used to |
avoid wasting time by trying to match strings that are shorter than the |
avoid wasting time by trying to match strings that are shorter than the |
lower bound. You can find out the value in a calling program via the | lower bound. You can find out the value in a calling program via the |
pcre_fullinfo() function. |
pcre_fullinfo() function. |
|
|
Studying a pattern is also useful for non-anchored patterns that do not |
Studying a pattern is also useful for non-anchored patterns that do not |
have a single fixed starting character. A bitmap of possible starting | have a single fixed starting character. A bitmap of possible starting |
bytes is created. This speeds up finding a position in the subject at | bytes is created. This speeds up finding a position in the subject at |
which to start matching. (In 16-bit mode, the bitmap is used for 16-bit |
which to start matching. (In 16-bit mode, the bitmap is used for 16-bit |
values less than 256. In 32-bit mode, the bitmap is used for 32-bit | values less than 256. In 32-bit mode, the bitmap is used for 32-bit |
values less than 256.) |
values less than 256.) |
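
The idea of a starting-byte bitmap can be sketched as follows. This only
illustrates the technique; the start_bitmap type and the functions below
are hypothetical and do not reflect the layout of PCRE's study data.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type: one bit for each of the 256 possible byte values. */
typedef struct {
    unsigned char bits[32];
} start_bitmap;

void bitmap_clear(start_bitmap *m)
{
    memset(m->bits, 0, sizeof m->bits);
}

/* Record that a match may begin with byte c. */
void bitmap_set(start_bitmap *m, unsigned char c)
{
    m->bits[c / 8] |= (unsigned char)(1u << (c % 8));
}

int bitmap_test(const start_bitmap *m, unsigned char c)
{
    return (m->bits[c / 8] >> (c % 8)) & 1;
}

/* Return the first offset >= start whose byte is a possible starting
   byte, or len if there is none. A matcher would attempt a full match
   only at the offsets this scan returns. */
size_t first_candidate(const start_bitmap *m, const char *subject,
                       size_t len, size_t start)
{
    size_t i = start;
    while (i < len && !bitmap_test(m, (unsigned char)subject[i]))
        i++;
    return i;
}
```

For a pattern such as cat|dog, only "c" and "d" would be set in the
bitmap, so a scan of "a big dog" skips directly to offset 6.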
|
|
These two optimizations apply to both pcre_exec() and pcre_dfa_exec(), | These two optimizations apply to both pcre_exec() and pcre_dfa_exec(), |
and the information is also used by the JIT compiler. The optimiza- | and the information is also used by the JIT compiler. The optimiza- |
tions can be disabled by setting the PCRE_NO_START_OPTIMIZE option. | tions can be disabled by setting the PCRE_NO_START_OPTIMIZE option. |
You might want to do this if your pattern contains callouts or (*MARK) | You might want to do this if your pattern contains callouts or (*MARK) |
and you want to make use of these facilities in cases where matching | and you want to make use of these facilities in cases where matching |
fails. |
fails. |
|
|
PCRE_NO_START_OPTIMIZE can be specified at either compile time or exe- | PCRE_NO_START_OPTIMIZE can be specified at either compile time or exe- |
cution time. However, if PCRE_NO_START_OPTIMIZE is passed to | cution time. However, if PCRE_NO_START_OPTIMIZE is passed to |
pcre_exec(), (that is, after any JIT compilation has happened) JIT exe- |
pcre_exec(), (that is, after any JIT compilation has happened) JIT exe- |
cution is disabled. For JIT execution to work with PCRE_NO_START_OPTI- | cution is disabled. For JIT execution to work with PCRE_NO_START_OPTI- |
MIZE, the option must be set at compile time. |
MIZE, the option must be set at compile time. |
|
|
There is a longer discussion of PCRE_NO_START_OPTIMIZE below. |
There is a longer discussion of PCRE_NO_START_OPTIMIZE below. |
Line 2550 STUDYING A PATTERN
|
Line 2591 STUDYING A PATTERN
|
|
|
LOCALE SUPPORT |
LOCALE SUPPORT |
|
|
PCRE handles caseless matching, and determines whether characters are | PCRE handles caseless matching, and determines whether characters are |
letters, digits, or whatever, by reference to a set of tables, indexed | letters, digits, or whatever, by reference to a set of tables, indexed |
by character value. When running in UTF-8 mode, this applies only to | by character code point. When running in UTF-8 mode, or in the 16- or |
characters with codes less than 128. By default, higher-valued codes | 32-bit libraries, this applies only to characters with code points less |
never match escapes such as \w or \d, but they can be tested with \p if | than 256. By default, higher-valued code points never match escapes |
PCRE is built with Unicode character property support. Alternatively, | such as \w or \d. However, if PCRE is built with Unicode property sup- |
the PCRE_UCP option can be set at compile time; this causes \w and | port, all characters can be tested with \p and \P, or, alternatively, |
friends to use Unicode property support instead of built-in tables. The | the PCRE_UCP option can be set when a pattern is compiled; this causes |
use of locales with Unicode is discouraged. If you are handling charac- | \w and friends to use Unicode property support instead of the built-in |
ters with codes greater than 128, you should either use UTF-8 and Uni- | tables. |
code, or use locales, but not try to mix the two. | |
|
|
|
The use of locales with Unicode is discouraged. If you are handling |
|
characters with code points greater than 128, you should either use |
|
Unicode support, or use locales, but not try to mix the two. |
|
|
PCRE contains an internal set of tables that are used when the final |
PCRE contains an internal set of tables that are used when the final |
argument of pcre_compile() is NULL. These are sufficient for many |
argument of pcre_compile() is NULL. These are sufficient for many |
applications. Normally, the internal tables recognize only ASCII char- |
applications. Normally, the internal tables recognize only ASCII char- |
Line 2576 LOCALE SUPPORT
|
Line 2620 LOCALE SUPPORT
|
|
|
External tables are built by calling the pcre_maketables() function, |
External tables are built by calling the pcre_maketables() function, |
which has no arguments, in the relevant locale. The result can then be |
which has no arguments, in the relevant locale. The result can then be |
passed to pcre_compile() or pcre_exec() as often as necessary. For | passed to pcre_compile() as often as necessary. For example, to build |
example, to build and use tables that are appropriate for the French | and use tables that are appropriate for the French locale (where |
locale (where accented characters with values greater than 128 are | accented characters with values greater than 128 are treated as let- |
treated as letters), the following code could be used: | ters), the following code could be used: |
|
|
setlocale(LC_CTYPE, "fr_FR"); |
setlocale(LC_CTYPE, "fr_FR"); |
tables = pcre_maketables(); |
tables = pcre_maketables(); |
Line 2595 LOCALE SUPPORT
|
Line 2639 LOCALE SUPPORT
|
|
|
The pointer that is passed to pcre_compile() is saved with the compiled |
The pointer that is passed to pcre_compile() is saved with the compiled |
pattern, and the same tables are used via this pointer by pcre_study() |
pattern, and the same tables are used via this pointer by pcre_study() |
and normally also by pcre_exec(). Thus, by default, for any single pat- | and also by pcre_exec() and pcre_dfa_exec(). Thus, for any single pat- |
tern, compilation, studying and matching all happen in the same locale, |
tern, compilation, studying and matching all happen in the same locale, |
but different patterns can be compiled in different locales. | but different patterns can be processed in different locales. |
|
|
It is possible to pass a table pointer or NULL (indicating the use of |
It is possible to pass a table pointer or NULL (indicating the use of |
the internal tables) to pcre_exec(). Although not intended for this | the internal tables) to pcre_exec() or pcre_dfa_exec() (see the discus- |
purpose, this facility could be used to match a pattern in a different | sion below in the section on matching a pattern). This facility is pro- |
locale from the one in which it was compiled. Passing table pointers at | vided for use with pre-compiled patterns that have been saved and |
run time is discussed below in the section on matching a pattern. | reloaded. Character tables are not saved with patterns, so if a non- |
| standard table was used at compile time, it must be provided again when |
| the reloaded pattern is matched. Attempting to use this facility to |
| match a pattern in a different locale from the one in which it was com- |
| piled is likely to lead to anomalous (usually incorrect) results. |
|
|
|
|
INFORMATION ABOUT A PATTERN |
INFORMATION ABOUT A PATTERN |
Line 2744 INFORMATION ABOUT A PATTERN
|
Line 2792 INFORMATION ABOUT A PATTERN
|
/^a\dz\d/ the returned value is -1. |
/^a\dz\d/ the returned value is -1. |
|
|
Since for the 32-bit library using the non-UTF-32 mode, this function |
Since for the 32-bit library using the non-UTF-32 mode, this function |
is unable to return the full 32-bit range of the character, this value | is unable to return the full 32-bit range of characters, this value is |
is deprecated; instead the PCRE_INFO_REQUIREDCHARFLAGS and | deprecated; instead the PCRE_INFO_REQUIREDCHARFLAGS and |
PCRE_INFO_REQUIREDCHAR values should be used. |
PCRE_INFO_REQUIREDCHAR values should be used. |
|
|
|
PCRE_INFO_MATCH_EMPTY |
|
|
|
Return 1 if the pattern can match an empty string, otherwise 0. The |
|
fourth argument should point to an int variable. |
|
|
PCRE_INFO_MATCHLIMIT |
PCRE_INFO_MATCHLIMIT |
|
|
If the pattern set a match limit by including an item of the form | If the pattern set a match limit by including an item of the form |
(*LIMIT_MATCH=nnnn) at the start, the value is returned. The fourth | (*LIMIT_MATCH=nnnn) at the start, the value is returned. The fourth |
argument should point to an unsigned 32-bit integer. If no such value | argument should point to an unsigned 32-bit integer. If no such value |
has been set, the call to pcre_fullinfo() returns the error | has been set, the call to pcre_fullinfo() returns the error |
PCRE_ERROR_UNSET. |
PCRE_ERROR_UNSET. |
|
|
PCRE_INFO_MAXLOOKBEHIND |
PCRE_INFO_MAXLOOKBEHIND |
|
|
Return the number of characters (NB not data units) in the longest | Return the number of characters (NB not data units) in the longest |
lookbehind assertion in the pattern. This information is useful when | lookbehind assertion in the pattern. This information is useful when |
doing multi-segment matching using the partial matching facilities. | doing multi-segment matching using the partial matching facilities. |
Note that the simple assertions \b and \B require a one-character look- |
Note that the simple assertions \b and \B require a one-character look- |
behind. \A also registers a one-character lookbehind, though it does | behind. \A also registers a one-character lookbehind, though it does |
not actually inspect the previous character. This is to ensure that at | not actually inspect the previous character. This is to ensure that at |
least one character from the old segment is retained when a new segment |
least one character from the old segment is retained when a new segment |
is processed. Otherwise, if there are no lookbehinds in the pattern, \A |
is processed. Otherwise, if there are no lookbehinds in the pattern, \A |
might match incorrectly at the start of a new segment. |
might match incorrectly at the start of a new segment. |
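
The lookbehind length is counted in characters, so in UTF-8 mode the number of bytes to keep from the old segment can be larger. The following is a minimal sketch of that conversion, assuming the segment is valid UTF-8; utf8_tail_bytes() is a hypothetical helper and not part of PCRE, and nchars stands for the value obtained via PCRE_INFO_MAXLOOKBEHIND.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return how many bytes at the end of a UTF-8 segment hold the last
 * nchars characters. If the segment has fewer characters than that,
 * the whole length is returned. Assumes the segment is valid UTF-8.
 * Hypothetical helper for illustration; not a PCRE function.
 */
size_t utf8_tail_bytes(const unsigned char *segment, size_t length, int nchars)
{
    size_t pos = length;

    assert(segment != NULL || length == 0);
    while (nchars > 0 && pos > 0) {
        pos--;                                   /* step onto the previous byte */
        while (pos > 0 && (segment[pos] & 0xC0) == 0x80)
            pos--;                               /* skip UTF-8 continuation bytes */
        nchars--;
    }
    return length - pos;
}
```

A caller doing multi-segment matching might copy that many bytes from the end of the old segment to the front of the new one before matching again. In non-UTF 8-bit mode one character is one byte, so no conversion is needed.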
|
|
PCRE_INFO_MINLENGTH |
PCRE_INFO_MINLENGTH |
|
|
If the pattern was studied and a minimum length for matching subject | If the pattern was studied and a minimum length for matching subject |
strings was computed, its value is returned. Otherwise the returned | strings was computed, its value is returned. Otherwise the returned |
value is -1. The value is a number of characters, which in UTF mode may |
value is -1. The value is a number of characters, which in UTF mode may |
be different from the number of data units. The fourth argument should | be different from the number of data units. The fourth argument should |
point to an int variable. A non-negative value is a lower bound to the | point to an int variable. A non-negative value is a lower bound to the |
length of any matching string. There may not be any strings of that | length of any matching string. There may not be any strings of that |
length that do actually match, but every string that does match is at | length that do actually match, but every string that does match is at |
least that long. |
least that long. |
|
|
PCRE_INFO_NAMECOUNT |
PCRE_INFO_NAMECOUNT |
PCRE_INFO_NAMEENTRYSIZE |
PCRE_INFO_NAMEENTRYSIZE |
PCRE_INFO_NAMETABLE |
PCRE_INFO_NAMETABLE |
|
|
PCRE supports the use of named as well as numbered capturing parenthe- | PCRE supports the use of named as well as numbered capturing parenthe- |
ses. The names are just an additional way of identifying the parenthe- | ses. The names are just an additional way of identifying the parenthe- |
ses, which still acquire numbers. Several convenience functions such as |
ses, which still acquire numbers. Several convenience functions such as |
pcre_get_named_substring() are provided for extracting captured sub- | pcre_get_named_substring() are provided for extracting captured sub- |
strings by name. It is also possible to extract the data directly, by | strings by name. It is also possible to extract the data directly, by |
first converting the name to a number in order to access the correct | first converting the name to a number in order to access the correct |
pointers in the output vector (described with pcre_exec() below). To do |
pointers in the output vector (described with pcre_exec() below). To do |
the conversion, you need to use the name-to-number map, which is | the conversion, you need to use the name-to-number map, which is |
described by these three values. |
described by these three values. |
|
|
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT |
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT |
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size |
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size |
of each entry; both of these return an int value. The entry size | of each entry; both of these return an int value. The entry size |
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns | depends on the length of the longest name. PCRE_INFO_NAMETABLE returns |
a pointer to the first entry of the table. This is a pointer to char in |
a pointer to the first entry of the table. This is a pointer to char in |
the 8-bit library, where the first two bytes of each entry are the num- |
the 8-bit library, where the first two bytes of each entry are the num- |
ber of the capturing parenthesis, most significant byte first. In the | ber of the capturing parenthesis, most significant byte first. In the |
16-bit library, the pointer points to 16-bit data units, the first of | 16-bit library, the pointer points to 16-bit data units, the first of |
which contains the parenthesis number. In the 32-bit library, the | which contains the parenthesis number. In the 32-bit library, the |
pointer points to 32-bit data units, the first of which contains the | pointer points to 32-bit data units, the first of which contains the |
parenthesis number. The rest of the entry is the corresponding name, | parenthesis number. The rest of the entry is the corresponding name, |
zero terminated. |
zero terminated. |
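
The 8-bit entry layout just described can be walked with a few lines of C. This is only a sketch: name_to_number() and demo_table are hypothetical names invented here, and the table is built by hand so that the example does not depend on the PCRE headers. In real code the table pointer, entry count and entry size would come from PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up a group name in an 8-bit name table. Each entry is entry_size
 * bytes: two bytes of group number (most significant byte first), then
 * the zero-terminated name. Returns the group number of the first entry
 * with that name, or -1 if the name is absent.
 * Hypothetical helper for illustration; not a PCRE function.
 */
int name_to_number(const unsigned char *table, int count,
                   int entry_size, const char *name)
{
    int i;

    assert(table != NULL && name != NULL && entry_size > 2);
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* Hand-built table for illustration: four entries of eight bytes each. */
const unsigned char demo_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0, 0,
    0, 5, 'd', 'a', 'y', 0, 0, 0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0, 0
};
```

With this table, name_to_number(demo_table, 4, 8, "month") yields 4. A linear scan keeps the sketch short; because the names are stored in alphabetical order, a binary search would also work when there are many names.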
|
|
The names are in alphabetical order. Duplicate names may appear if (?| | The names are in alphabetical order. If (?| is used to create multiple |
is used to create multiple groups with the same number, as described in | groups with the same number, as described in the section on duplicate |
the section on duplicate subpattern numbers in the pcrepattern page. | subpattern numbers in the pcrepattern page, the groups may be given the |
Duplicate names for subpatterns with different numbers are permitted | same name, but there is only one entry in the table. Different names |
only if PCRE_DUPNAMES is set. In all cases of duplicate names, they | for groups of the same number are not permitted. Duplicate names for |
appear in the table in the order in which they were found in the pat- | subpatterns with different numbers are permitted, but only if PCRE_DUP- |
tern. In the absence of (?| this is the order of increasing number; | NAMES is set. They appear in the table in the order in which they were |
when (?| is used this is not necessarily the case because later subpat- | found in the pattern. In the absence of (?| this is the order of |
terns may have lower numbers. | increasing number; when (?| is used this is not necessarily the case |
| because later subpatterns may have lower numbers. |
|
|
As a simple example of the name/number table, consider the following |
As a simple example of the name/number table, consider the following |
pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is |
pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is |
Line 2924 INFORMATION ABOUT A PATTERN
|
Line 2978 INFORMATION ABOUT A PATTERN
|
|
|
PCRE_INFO_FIRSTCHARACTER |
PCRE_INFO_FIRSTCHARACTER |
|
|
Return the fixed first character value, if PCRE_INFO_FIRSTCHARACTER- | Return the fixed first character value in the situation where |
FLAGS returned 1; otherwise returns 0. The fourth argument should point | PCRE_INFO_FIRSTCHARACTERFLAGS returns 1; otherwise return 0. The fourth |
to an uint_t variable. | argument should point to an uint_t variable. |
|
|
In the 8-bit library, the value is always less than 256. In the 16-bit |
In the 8-bit library, the value is always less than 256. In the 16-bit |
library the value can be up to 0xffff. In the 32-bit library in UTF-32 |
library the value can be up to 0xffff. In the 32-bit library in UTF-32 |
mode the value can be up to 0x10ffff, and up to 0xffffffff when not |
mode the value can be up to 0x10ffff, and up to 0xffffffff when not |
using UTF-32 mode. |
using UTF-32 mode. |
|
|
If there is no fixed first value, and if either |
|
|
|
(a) the pattern was compiled with the PCRE_MULTILINE option, and every |
|
branch starts with "^", or |
|
|
|
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not |
|
set (if it were set, the pattern would be anchored), |
|
|
|
-1 is returned, indicating that the pattern matches only at the start |
|
of a subject string or after any newline within the string. Otherwise |
|
-2 is returned. For anchored patterns, -2 is returned. |
|
|
|
PCRE_INFO_REQUIREDCHARFLAGS |
PCRE_INFO_REQUIREDCHARFLAGS |
|
|
Returns 1 if there is a rightmost literal data unit that must exist in |
Returns 1 if there is a rightmost literal data unit that must exist in |
Line 3133 MATCHING A PATTERN: THE TRADITIONAL FUNCTION
|
Line 3175 MATCHING A PATTERN: THE TRADITIONAL FUNCTION
|
The callout_data field is used in conjunction with the "callout" fea- |
The callout_data field is used in conjunction with the "callout" fea- |
ture, and is described in the pcrecallout documentation. |
ture, and is described in the pcrecallout documentation. |
|
|
The tables field is used to pass a character tables pointer to | The tables field is provided for use with patterns that have been pre- |
pcre_exec(); this overrides the value that is stored with the compiled | compiled using custom character tables, saved to disc or elsewhere, and |
pattern. A non-NULL value is stored with the compiled pattern only if | then reloaded, because the tables that were used to compile a pattern |
custom tables were supplied to pcre_compile() via its tableptr argu- | are not saved with it. See the pcreprecompile documentation for a dis- |
ment. If NULL is passed to pcre_exec() using this mechanism, it forces | cussion of saving compiled patterns for later use. If NULL is passed |
PCRE's internal tables to be used. This facility is helpful when re- | using this mechanism, it forces PCRE's internal tables to be used. |
using patterns that have been saved after compiling with an external | |
set of tables, because the external tables might be at a different | |
address when pcre_exec() is called. See the pcreprecompile documenta- | |
tion for a discussion of saving compiled patterns for later use. | |
|
|
|
Warning: The tables that pcre_exec() uses must be the same as those |
|
that were used when the pattern was compiled. If this is not the case, |
|
the behaviour of pcre_exec() is undefined. Therefore, when a pattern is |
|
compiled and matched in the same process, this field should never be |
|
set. In this (the most common) case, the correct table pointer is auto- |
|
matically passed with the compiled pattern from pcre_compile() to |
|
pcre_exec(). |
|
|
If PCRE_EXTRA_MARK is set in the flags field, the mark field must be |
If PCRE_EXTRA_MARK is set in the flags field, the mark field must be |
set to point to a suitable variable. If the pattern contains any back- |
set to point to a suitable variable. If the pattern contains any back- |
tracking control verbs such as (*MARK:NAME), and the execution ends up |
tracking control verbs such as (*MARK:NAME), and the execution ends up |
Line 3351 MATCHING A PATTERN: THE TRADITIONAL FUNCTION
|
Line 3397 MATCHING A PATTERN: THE TRADITIONAL FUNCTION
|
points to the start of a character (or the end of the subject). When |
points to the start of a character (or the end of the subject). When |
PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a |
PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a |
subject or an invalid value of startoffset is undefined. Your program |
subject or an invalid value of startoffset is undefined. Your program |
may crash. | may crash or loop. |
|
|
PCRE_PARTIAL_HARD |
PCRE_PARTIAL_HARD |
PCRE_PARTIAL_SOFT |
PCRE_PARTIAL_SOFT |
Line 4131 MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
|
Line 4177 MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
|
filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec() |
filled with the longest matches. Unlike pcre_exec(), pcre_dfa_exec() |
can use the entire ovector for returning matched strings. |
can use the entire ovector for returning matched strings. |
|
|
|
NOTE: PCRE's "auto-possessification" optimization usually applies to |
|
character repeats at the end of a pattern (as well as internally). For |
|
example, the pattern "a\d+" is compiled as if it were "a\d++" because |
|
there is no point even considering the possibility of backtracking into |
|
the repeated digits. For DFA matching, this means that only one possi- |
|
ble match is found. If you really do want multiple matches in such |
|
cases, either use an ungreedy repeat ("a\d+?") or set the |
|
PCRE_NO_AUTO_POSSESS option when compiling. |
|
|
Error returns from pcre_dfa_exec() |
Error returns from pcre_dfa_exec() |
|
|
The pcre_dfa_exec() function returns a negative number when it fails. | The pcre_dfa_exec() function returns a negative number when it fails. |
Many of the errors are the same as for pcre_exec(), and these are | Many of the errors are the same as for pcre_exec(), and these are |
described above. There are in addition the following errors that are | described above. There are in addition the following errors that are |
specific to pcre_dfa_exec(): |
specific to pcre_dfa_exec(): |
|
|
PCRE_ERROR_DFA_UITEM (-16) |
PCRE_ERROR_DFA_UITEM (-16) |
|
|
This return is given if pcre_dfa_exec() encounters an item in the pat- | This return is given if pcre_dfa_exec() encounters an item in the pat- |
tern that it does not support, for instance, the use of \C or a back | tern that it does not support, for instance, the use of \C or a back |
reference. |
reference. |
|
|
PCRE_ERROR_DFA_UCOND (-17) |
PCRE_ERROR_DFA_UCOND (-17) |
|
|
This return is given if pcre_dfa_exec() encounters a condition item | This return is given if pcre_dfa_exec() encounters a condition item |
that uses a back reference for the condition, or a test for recursion | that uses a back reference for the condition, or a test for recursion |
in a specific group. These are not supported. |
in a specific group. These are not supported. |
|
|
PCRE_ERROR_DFA_UMLIMIT (-18) |
PCRE_ERROR_DFA_UMLIMIT (-18) |
|
|
This return is given if pcre_dfa_exec() is called with an extra block | This return is given if pcre_dfa_exec() is called with an extra block |
that contains a setting of the match_limit or match_limit_recursion | that contains a setting of the match_limit or match_limit_recursion |
fields. This is not supported (these fields are meaningless for DFA | fields. This is not supported (these fields are meaningless for DFA |
matching). |
matching). |
|
|
PCRE_ERROR_DFA_WSSIZE (-19) |
PCRE_ERROR_DFA_WSSIZE (-19) |
|
|
This return is given if pcre_dfa_exec() runs out of space in the | This return is given if pcre_dfa_exec() runs out of space in the |
workspace vector. |
workspace vector. |
|
|
PCRE_ERROR_DFA_RECURSE (-20) |
PCRE_ERROR_DFA_RECURSE (-20) |
|
|
When a recursive subpattern is processed, the matching function calls | When a recursive subpattern is processed, the matching function calls |
itself recursively, using private vectors for ovector and workspace. | itself recursively, using private vectors for ovector and workspace. |
This error is given if the output vector is not large enough. This | This error is given if the output vector is not large enough. This |
should be extremely rare, as a vector of size 1000 is used. |
should be extremely rare, as a vector of size 1000 is used. |
|
|
PCRE_ERROR_DFA_BADRESTART (-30) |
PCRE_ERROR_DFA_BADRESTART (-30) |
|
|
When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some | When pcre_dfa_exec() is called with the PCRE_DFA_RESTART option, some |
plausibility checks are made on the contents of the workspace, which | plausibility checks are made on the contents of the workspace, which |
should contain data about the previous partial match. If any of these | should contain data about the previous partial match. If any of these |
checks fail, this error is given. |
checks fail, this error is given. |
|
|
|
|
SEE ALSO |
SEE ALSO |
|
|
pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3), | pcre16(3), pcre32(3), pcrebuild(3), pcrecallout(3), pcrecpp(3), |
pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre- |
pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre- |
sample(3), pcrestack(3). |
sample(3), pcrestack(3). |
|
|
Line 4193 AUTHOR
|
Line 4248 AUTHOR
|
|
|
REVISION |
REVISION |
|
|
Last updated: 12 May 2013 | Last updated: 12 November 2013 |
Copyright (c) 1997-2013 University of Cambridge. |
Copyright (c) 1997-2013 University of Cambridge. |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
|
|
Line 4256 DESCRIPTION
|
Line 4311 DESCRIPTION
|
independent groups). |
independent groups). |
|
|
Automatic callouts can be used for tracking the progress of pattern |
Automatic callouts can be used for tracking the progress of pattern |
matching. The pcretest command has an option that sets automatic call- | matching. The pcretest program has a pattern qualifier (/C) that sets |
outs; when it is used, the output indicates how the pattern is matched. | automatic callouts; when it is used, the output indicates how the pat- |
This is useful information when you are trying to optimize the perfor- | tern is being matched. This is useful information when you are trying |
mance of a particular pattern. | to optimize the performance of a particular pattern. |
|
|
|
|
MISSING CALLOUTS |
MISSING CALLOUTS |
|
|
You should be aware that, because of optimizations in the way PCRE | You should be aware that, because of optimizations in the way PCRE com- |
matches patterns by default, callouts sometimes do not happen. For | piles and matches patterns, callouts sometimes do not happen exactly as |
example, if the pattern is | you might expect. |
|
|
|
At compile time, PCRE "auto-possessifies" repeated items when it knows |
|
that what follows cannot be part of the repeat. For example, a+[bc] is |
|
compiled as if it were a++[bc]. The pcretest output when this pattern |
|
is anchored and then applied with automatic callouts to the string |
|
"aaaa" is: |
|
|
|
--->aaaa |
|
+0 ^ ^ |
|
+1 ^ a+ |
|
+3 ^ ^ [bc] |
|
No match |
|
|
|
This indicates that when matching [bc] fails, there is no backtracking |
|
into a+ and therefore the callouts that would be taken for the back- |
|
tracks do not occur. You can disable the auto-possessify feature by |
|
passing PCRE_NO_AUTO_POSSESS to pcre_compile(), or starting the pattern |
|
with (*NO_AUTO_POSSESS). If this is done in pcretest (using the /O |
|
qualifier), the output changes to this: |
|
|
|
--->aaaa |
|
+0 ^ ^ |
|
+1 ^ a+ |
|
+3 ^ ^ [bc] |
|
+3 ^ ^ [bc] |
|
+3 ^ ^ [bc] |
|
+3 ^^ [bc] |
|
No match |
|
|
|
This time, when matching [bc] fails, the matcher backtracks into a+ and |
|
tries again, repeatedly, until a+ itself fails. |
|
|
|
Other optimizations that provide fast "no match" results also affect |
|
callouts. For example, if the pattern is |
|
|
ab(?C4)cd |
ab(?C4)cd |
|
|
PCRE knows that any matching string must contain the letter "d". If the |
PCRE knows that any matching string must contain the letter "d". If the |
subject string is "abyz", the lack of "d" means that matching doesn't | subject string is "abyz", the lack of "d" means that matching doesn't |
ever start, and the callout is never reached. However, with "abyd", | ever start, and the callout is never reached. However, with "abyd", |
though the result is still no match, the callout is obeyed. |
though the result is still no match, the callout is obeyed. |
|
|
If the pattern is studied, PCRE knows the minimum length of a matching | If the pattern is studied, PCRE knows the minimum length of a matching |
string, and will immediately give a "no match" return without actually | string, and will immediately give a "no match" return without actually |
running a match if the subject is not long enough, or, for unanchored | running a match if the subject is not long enough, or, for unanchored |
patterns, if it has been scanned far enough. |
patterns, if it has been scanned far enough. |
|
|
You can disable these optimizations by passing the PCRE_NO_START_OPTI- | You can disable these optimizations by passing the PCRE_NO_START_OPTI- |
MIZE option to the matching function, or by starting the pattern with | MIZE option to the matching function, or by starting the pattern with |
(*NO_START_OPT). This slows down the matching process, but does ensure | (*NO_START_OPT). This slows down the matching process, but does ensure |
that callouts such as the example above are obeyed. |
that callouts such as the example above are obeyed. |
|
|
|
|
THE CALLOUT INTERFACE |
THE CALLOUT INTERFACE |
|
|
During matching, when PCRE reaches a callout point, the external func- | During matching, when PCRE reaches a callout point, the external func- |
tion defined by pcre_callout or pcre[16|32]_callout is called (if it is |
tion defined by pcre_callout or pcre[16|32]_callout is called (if it is |
set). This applies to both normal and DFA matching. The only argument | set). This applies to both normal and DFA matching. The only argument |
to the callout function is a pointer to a pcre_callout or | to the callout function is a pointer to a pcre_callout or |
pcre[16|32]_callout block. These structures contain the following | pcre[16|32]_callout block. These structures contain the following |
fields: |
fields: |
|
|
int version; |
int version; |
Line 4313 THE CALLOUT INTERFACE
|
Line 4402 THE CALLOUT INTERFACE
|
const PCRE_UCHAR16 *mark; (16-bit version) |
const PCRE_UCHAR16 *mark; (16-bit version) |
const PCRE_UCHAR32 *mark; (32-bit version) |
const PCRE_UCHAR32 *mark; (32-bit version) |
|
|
The version field is an integer containing the version number of the | The version field is an integer containing the version number of the |
block format. The initial version was 0; the current version is 2. The | block format. The initial version was 0; the current version is 2. The |
version number will change again in future if additional fields are | version number will change again in future if additional fields are |
added, but the intention is never to remove any of the existing fields. |
added, but the intention is never to remove any of the existing fields. |
|
|
The callout_number field contains the number of the callout, as com- | The callout_number field contains the number of the callout, as com- |
piled into the pattern (that is, the number after ?C for manual call- | piled into the pattern (that is, the number after ?C for manual call- |
outs, and 255 for automatically generated callouts). |
outs, and 255 for automatically generated callouts). |
|
|
The offset_vector field is a pointer to the vector of offsets that was | The offset_vector field is a pointer to the vector of offsets that was |
passed by the caller to the matching function. When pcre_exec() or | passed by the caller to the matching function. When pcre_exec() or |
pcre[16|32]_exec() is used, the contents can be inspected, in order to | pcre[16|32]_exec() is used, the contents can be inspected, in order to |
extract substrings that have been matched so far, in the same way as | extract substrings that have been matched so far, in the same way as |
for extracting substrings after a match has completed. For the DFA | for extracting substrings after a match has completed. For the DFA |
matching functions, this field is not useful. |
matching functions, this field is not useful. |
|
|
The subject and subject_length fields contain copies of the values that |
The subject and subject_length fields contain copies of the values that |
were passed to the matching function. |
were passed to the matching function. |
|
|
The start_match field normally contains the offset within the subject | The start_match field normally contains the offset within the subject |
at which the current match attempt started. However, if the escape | at which the current match attempt started. However, if the escape |
sequence \K has been encountered, this value is changed to reflect the | sequence \K has been encountered, this value is changed to reflect the |
modified starting point. If the pattern is not anchored, the callout | modified starting point. If the pattern is not anchored, the callout |
function may be called several times from the same point in the pattern |
function may be called several times from the same point in the pattern |
for different starting points in the subject. |
for different starting points in the subject. |
|
|
The current_position field contains the offset within the subject of | The current_position field contains the offset within the subject of |
the current match pointer. |
the current match pointer. |
|
|
When pcre_exec() or pcre[16|32]_exec() is used, the capture_top | When pcre_exec() or pcre[16|32]_exec() is used, the capture_top |
field contains one more than the number of the highest numbered cap- | field contains one more than the number of the highest numbered cap- |
tured substring so far. If no substrings have been captured, the value | tured substring so far. If no substrings have been captured, the value |
of capture_top is one. This is always the case when the DFA functions | of capture_top is one. This is always the case when the DFA functions |
are used, because they do not support captured substrings. |
are used, because they do not support captured substrings. |
|
|
The capture_last field contains the number of the most recently cap- | The capture_last field contains the number of the most recently cap- |
tured substring. However, when a recursion exits, the value reverts to | tured substring. However, when a recursion exits, the value reverts to |
what it was outside the recursion, as do the values of all captured | what it was outside the recursion, as do the values of all captured |
substrings. If no substrings have been captured, the value of cap- | substrings. If no substrings have been captured, the value of cap- |
ture_last is -1. This is always the case for the DFA matching func- | ture_last is -1. This is always the case for the DFA matching func- |
tions. |
tions. |
|
|
The callout_data field contains a value that is passed to a matching | The callout_data field contains a value that is passed to a matching |
function specifically so that it can be passed back in callouts. It is | function specifically so that it can be passed back in callouts. It is |
passed in the callout_data field of a pcre_extra or pcre[16|32]_extra | passed in the callout_data field of a pcre_extra or pcre[16|32]_extra |
data structure. If no such data was passed, the value of callout_data | data structure. If no such data was passed, the value of callout_data |
in a callout block is NULL. There is a description of the pcre_extra | in a callout block is NULL. There is a description of the pcre_extra |
structure in the pcreapi documentation. |
structure in the pcreapi documentation. |
|
|
The pattern_position field is present from version 1 of the callout | The pattern_position field is present from version 1 of the callout |
structure. It contains the offset to the next item to be matched in the |
structure. It contains the offset to the next item to be matched in the |
pattern string. |
pattern string. |
|
|
The next_item_length field is present from version 1 of the callout | The next_item_length field is present from version 1 of the callout |
structure. It contains the length of the next item to be matched in the |
structure. It contains the length of the next item to be matched in the |
pattern string. When the callout immediately precedes an alternation | pattern string. When the callout immediately precedes an alternation |
bar, a closing parenthesis, or the end of the pattern, the length is | bar, a closing parenthesis, or the end of the pattern, the length is |
zero. When the callout precedes an opening parenthesis, the length is | zero. When the callout precedes an opening parenthesis, the length is |
that of the entire subpattern. |
that of the entire subpattern. |
|
|
The pattern_position and next_item_length fields are intended to help | The pattern_position and next_item_length fields are intended to help |
in distinguishing between different automatic callouts, which all have | in distinguishing between different automatic callouts, which all have |
the same callout number. However, they are set for all callouts. |
the same callout number. However, they are set for all callouts. |
|
|
The mark field is present from version 2 of the callout structure. In | The mark field is present from version 2 of the callout structure. In |
callouts from pcre_exec() or pcre[16|32]_exec() it contains a pointer | callouts from pcre_exec() or pcre[16|32]_exec() it contains a pointer |
to the zero-terminated name of the most recently passed (*MARK), | to the zero-terminated name of the most recently passed (*MARK), |
(*PRUNE), or (*THEN) item in the match, or NULL if no such items have | (*PRUNE), or (*THEN) item in the match, or NULL if no such items have |
been passed. Instances of (*PRUNE) or (*THEN) without a name do not | been passed. Instances of (*PRUNE) or (*THEN) without a name do not |
obliterate a previous (*MARK). In callouts from the DFA matching func- | obliterate a previous (*MARK). In callouts from the DFA matching func- |
tions this field always contains NULL. |
tions this field always contains NULL. |
|
|
|
|
RETURN VALUES |
RETURN VALUES |
|
|
The external callout function returns an integer to PCRE. If the value | The external callout function returns an integer to PCRE. If the value |
is zero, matching proceeds as normal. If the value is greater than | is zero, matching proceeds as normal. If the value is greater than |
zero, matching fails at the current point, but the testing of other | zero, matching fails at the current point, but the testing of other |
matching possibilities goes ahead, just as if a lookahead assertion had |
matching possibilities goes ahead, just as if a lookahead assertion had |
failed. If the value is less than zero, the match is abandoned, and the | failed. If the value is less than zero, the match is abandoned, and the |
matching function returns the negative value. |
matching function returns the negative value. |
|
|
Negative values should normally be chosen from the set of | Negative values should normally be chosen from the set of |
PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan- |
PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan- |
dard "no match" failure. The error number PCRE_ERROR_CALLOUT is | dard "no match" failure. The error number PCRE_ERROR_CALLOUT is |
reserved for use by callout functions; it will never be used by PCRE | reserved for use by callout functions; it will never be used by PCRE |
itself. |
itself. |
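
The three cases can be summarized in a small Python model. This is a sketch of the rules described above, not PCRE code; the function name callout_outcome and the result strings are assumptions made for this example. The two constants carry the values defined in pcre.h.

```python
PCRE_ERROR_NOMATCH = -1   # forces a standard "no match" failure
PCRE_ERROR_CALLOUT = -9   # reserved for use by callout functions


def callout_outcome(return_value):
    """Describe how the matcher reacts to a callout's return value."""
    if return_value == 0:
        return ("continue", None)
    if return_value > 0:
        # Like a failed lookahead: only the current path fails.
        return ("backtrack", None)
    # Negative: the match is abandoned and this value is returned.
    return ("abandon", return_value)


print(callout_outcome(0))                   # ('continue', None)
print(callout_outcome(1))                   # ('backtrack', None)
print(callout_outcome(PCRE_ERROR_CALLOUT))  # ('abandon', -9)
```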
|
|
|
|
Line 4411 AUTHOR
|
Line 4500 AUTHOR
|
|
|
REVISION |
REVISION |
|
|
Last updated: 03 March 2013 | Last updated: 12 November 2013 |
Copyright (c) 1997-2013 University of Cambridge. |
Copyright (c) 1997-2013 University of Cambridge. |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
|
|
Line 4533 DIFFERENCES BETWEEN PCRE AND PERL
|
Line 4622 DIFFERENCES BETWEEN PCRE AND PERL
|
|
|
15. Perl recognizes comments in some places that PCRE does not, for |
15. Perl recognizes comments in some places that PCRE does not, for |
example, between the ( and ? at the start of a subpattern. If the /x |
example, between the ( and ? at the start of a subpattern. If the /x |
modifier is set, Perl allows white space between ( and ? but PCRE never | modifier is set, Perl allows white space between ( and ? (though cur- |
does, even if the PCRE_EXTENDED option is set. | rent Perls warn that this is deprecated) but PCRE never does, even if |
| the PCRE_EXTENDED option is set. |
|
|
16. In PCRE, the upper/lower case character properties Lu and Ll are | 16. Perl, when in warning mode, gives warnings for character classes |
| such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter- |
| als. PCRE has no warning features, so it gives an error in these cases |
| because they are almost certainly user mistakes. |
| |
| 17. In PCRE, the upper/lower case character properties Lu and Ll are |
not affected when case-independent matching is specified. For example, |
not affected when case-independent matching is specified. For example, |
\p{Lu} always matches an upper case letter. I think Perl has changed in |
\p{Lu} always matches an upper case letter. I think Perl has changed in |
this respect; in the release at the time of writing (5.16), \p{Lu} and |
this respect; in the release at the time of writing (5.16), \p{Lu} and |
\p{Ll} match all letters, regardless of case, when case independence is |
\p{Ll} match all letters, regardless of case, when case independence is |
specified. |
specified. |
|
|
17. PCRE provides some extensions to the Perl regular expression facil- | 18. PCRE provides some extensions to the Perl regular expression facil- |
ities. Perl 5.10 includes new features that are not in earlier ver- |
ities. Perl 5.10 includes new features that are not in earlier ver- |
sions of Perl, some of which (such as named parentheses) have been in |
sions of Perl, some of which (such as named parentheses) have been in |
PCRE for some time. This list is with respect to Perl 5.10: |
PCRE for some time. This list is with respect to Perl 5.10: |
Line 4600 AUTHOR
|
Line 4695 AUTHOR
|
|
|
REVISION |
REVISION |
|
|
Last updated: 19 March 2013 | Last updated: 10 November 2013 |
Copyright (c) 1997-2013 University of Cambridge. |
Copyright (c) 1997-2013 University of Cambridge. |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
|
|
Line 4679 SPECIAL START-OF-PATTERN ITEMS
|
Line 4774 SPECIAL START-OF-PATTERN ITEMS
|
|
|
Unicode property support |
Unicode property support |
|
|
Another special sequence that may appear at the start of a pattern is | Another special sequence that may appear at the start of a pattern is |
| (*UCP). This has the same effect as setting the PCRE_UCP option: it |
| causes sequences such as \d and \w to use Unicode properties to deter- |
| mine character types, instead of recognizing only characters with codes |
| less than 128 via a lookup table. |
|
|
(*UCP) | Disabling auto-possessification |
|
|
This has the same effect as setting the PCRE_UCP option: it causes | If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as |
sequences such as \d and \w to use Unicode properties to determine | setting the PCRE_NO_AUTO_POSSESS option at compile time. This stops |
character types, instead of recognizing only characters with codes less | PCRE from making quantifiers possessive when what follows cannot match |
than 128 via a lookup table. | the repeated item. For example, by default a+b is treated as a++b. For |
| more details, see the pcreapi documentation. |
|
|
Disabling start-up optimizations |
Disabling start-up optimizations |
|
|
If a pattern starts with (*NO_START_OPT), it has the same effect as | If a pattern starts with (*NO_START_OPT), it has the same effect as |
setting the PCRE_NO_START_OPTIMIZE option either at compile or matching |
setting the PCRE_NO_START_OPTIMIZE option either at compile or matching |
time. | time. This disables several optimizations for quickly reaching "no |
| match" results. For more details, see the pcreapi documentation. |
|
|
Newline conventions |
Newline conventions |
|
|
Line 4746 SPECIAL START-OF-PATTERN ITEMS
|
Line 4847 SPECIAL START-OF-PATTERN ITEMS
|
(*LIMIT_RECURSION=d) |
(*LIMIT_RECURSION=d) |
|
|
where d is any number of decimal digits. However, the value of the set- |
where d is any number of decimal digits. However, the value of the set- |
ting must be less than the value set by the caller of pcre_exec() for | ting must be less than the value set (or defaulted) by the caller of |
it to have any effect. In other words, the pattern writer can lower the | pcre_exec() for it to have any effect. In other words, the pattern |
limit set by the programmer, but not raise it. If there is more than | writer can lower the limits set by the programmer, but not raise them. |
one setting of one of these limits, the lower value is used. | If there is more than one setting of one of these limits, the lower |
| value is used. |
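
Put another way, the limit that applies is the smallest of the caller's (or default) value and any values given in the pattern. A minimal Python sketch of that rule follows; the function name effective_limit is an assumption made for this example.

```python
def effective_limit(caller_limit, pattern_settings):
    """Return the limit that actually applies.

    caller_limit is the value set (or defaulted) by the caller of the
    matching function; pattern_settings holds the values of any limit
    items of one kind that appear at the start of the pattern.
    """
    return min([caller_limit] + list(pattern_settings))


print(effective_limit(1000, [500, 200]))  # 200
print(effective_limit(1000, [5000]))      # 1000
```

The second call shows a pattern setting that is larger than the caller's limit and therefore has no effect.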
|
|
|
|
EBCDIC CHARACTER CODES |
EBCDIC CHARACTER CODES |
|
|
PCRE can be compiled to run in an environment that uses EBCDIC as its | PCRE can be compiled to run in an environment that uses EBCDIC as its |
character code rather than ASCII or Unicode (typically a mainframe sys- |
character code rather than ASCII or Unicode (typically a mainframe sys- |
tem). In the sections below, character code values are ASCII or Uni- | tem). In the sections below, character code values are ASCII or Uni- |
code; in an EBCDIC environment these characters may have different code |
code; in an EBCDIC environment these characters may have different code |
values, and there are no code points greater than 255. |
values, and there are no code points greater than 255. |
|
|
|
|
CHARACTERS AND METACHARACTERS |
CHARACTERS AND METACHARACTERS |
|
|
A regular expression is a pattern that is matched against a subject | A regular expression is a pattern that is matched against a subject |
string from left to right. Most characters stand for themselves in a | string from left to right. Most characters stand for themselves in a |
pattern, and match the corresponding characters in the subject. As a | pattern, and match the corresponding characters in the subject. As a |
trivial example, the pattern |
trivial example, the pattern |
|
|
The quick brown fox |
The quick brown fox |
|
|
matches a portion of a subject string that is identical to itself. When |
matches a portion of a subject string that is identical to itself. When |
caseless matching is specified (the PCRE_CASELESS option), letters are | caseless matching is specified (the PCRE_CASELESS option), letters are |
matched independently of case. In a UTF mode, PCRE always understands | matched independently of case. In a UTF mode, PCRE always understands |
the concept of case for characters whose values are less than 128, so | the concept of case for characters whose values are less than 128, so |
caseless matching is always possible. For characters with higher val- | caseless matching is always possible. For characters with higher val- |
ues, the concept of case is supported if PCRE is compiled with Unicode | ues, the concept of case is supported if PCRE is compiled with Unicode |
property support, but not otherwise. If you want to use caseless | property support, but not otherwise. If you want to use caseless |
matching for characters 128 and above, you must ensure that PCRE is | matching for characters 128 and above, you must ensure that PCRE is |
compiled with Unicode property support as well as with UTF support. |
compiled with Unicode property support as well as with UTF support. |
|
|
The power of regular expressions comes from the ability to include | The power of regular expressions comes from the ability to include |
alternatives and repetitions in the pattern. These are encoded in the | alternatives and repetitions in the pattern. These are encoded in the |
pattern by the use of metacharacters, which do not stand for themselves |
pattern by the use of metacharacters, which do not stand for themselves |
but instead are interpreted in some special way. |
but instead are interpreted in some special way. |
|
|
There are two different sets of metacharacters: those that are recog- | There are two different sets of metacharacters: those that are recog- |
nized anywhere in the pattern except within square brackets, and those | nized anywhere in the pattern except within square brackets, and those |
that are recognized within square brackets. Outside square brackets, | that are recognized within square brackets. Outside square brackets, |
the metacharacters are as follows: |
the metacharacters are as follows: |
|
|
\ general escape character with several uses |
\ general escape character with several uses |
Line 4806 CHARACTERS AND METACHARACTERS
|
Line 4908 CHARACTERS AND METACHARACTERS
|
also "possessive quantifier" |
also "possessive quantifier" |
{ start min/max quantifier |
{ start min/max quantifier |
|
|
Part of a pattern that is in square brackets is called a "character | Part of a pattern that is in square brackets is called a "character |
class". In a character class the only metacharacters are: |
class". In a character class the only metacharacters are: |
|
|
\ general escape character |
\ general escape character |
Line 4823 BACKSLASH
|
Line 4925 BACKSLASH
|
|
|
The backslash character has several uses. Firstly, if it is followed by |
The backslash character has several uses. Firstly, if it is followed by |
a character that is not a number or a letter, it takes away any special |
a character that is not a number or a letter, it takes away any special |
meaning that character may have. This use of backslash as an escape | meaning that character may have. This use of backslash as an escape |
character applies both inside and outside character classes. |
character applies both inside and outside character classes. |
|
|
For example, if you want to match a * character, you write \* in the | For example, if you want to match a * character, you write \* in the |
pattern. This escaping action applies whether or not the following | pattern. This escaping action applies whether or not the following |
character would otherwise be interpreted as a metacharacter, so it is | character would otherwise be interpreted as a metacharacter, so it is |
always safe to precede a non-alphanumeric with backslash to specify | always safe to precede a non-alphanumeric with backslash to specify |
that it stands for itself. In particular, if you want to match a back- | that it stands for itself. In particular, if you want to match a back- |
slash, you write \\. |
slash, you write \\. |
|
|
In a UTF mode, only ASCII numbers and letters have any special meaning | In a UTF mode, only ASCII numbers and letters have any special meaning |
after a backslash. All other characters (in particular, those whose | after a backslash. All other characters (in particular, those whose |
codepoints are greater than 127) are treated as literals. |
codepoints are greater than 127) are treated as literals. |
|
|
If a pattern is compiled with the PCRE_EXTENDED option, white space in | If a pattern is compiled with the PCRE_EXTENDED option, most white |
the pattern (other than in a character class) and characters between a | space in the pattern (other than in a character class), and characters |
# outside a character class and the next newline are ignored. An escap- | between a # outside a character class and the next newline, inclusive, |
ing backslash can be used to include a white space or # character as | are ignored. An escaping backslash can be used to include a white space |
part of the pattern. | or # character as part of the pattern. |
|
|
If you want to remove the special meaning from a sequence of charac- | If you want to remove the special meaning from a sequence of charac- |
ters, you can do so by putting them between \Q and \E. This is differ- | ters, you can do so by putting them between \Q and \E. This is differ- |
ent from Perl in that $ and @ are handled as literals in \Q...\E | ent from Perl in that $ and @ are handled as literals in \Q...\E |
sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- | sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- |
tion. Note the following examples: |
tion. Note the following examples: |
|
|
Pattern PCRE matches Perl matches |
Pattern PCRE matches Perl matches |
Line 4856 BACKSLASH
|
Line 4958 BACKSLASH
|
\Qabc\$xyz\E abc\$xyz abc\$xyz |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
|
|
The \Q...\E sequence is recognized both inside and outside character | The \Q...\E sequence is recognized both inside and outside character |
classes. An isolated \E that is not preceded by \Q is ignored. If \Q | classes. An isolated \E that is not preceded by \Q is ignored. If \Q |
is not followed by \E later in the pattern, the literal interpretation | is not followed by \E later in the pattern, the literal interpretation |
continues to the end of the pattern (that is, \E is assumed at the | continues to the end of the pattern (that is, \E is assumed at the |
end). If the isolated \Q is inside a character class, this causes an | end). If the isolated \Q is inside a character class, this causes an |
error, because the character class is not terminated. |
error, because the character class is not terminated. |
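
The \Q...\E rules can be modelled by rewriting a pattern so that every quoted character is individually escaped. The Python sketch below does this under simplifying assumptions: expand_quoted is a hypothetical helper, it is not how PCRE itself processes the sequence, and it pays no attention to character classes.

```python
def expand_quoted(pattern):
    """Rewrite \\Q...\\E sections as backslash-escaped literals."""
    out = []
    quoting = False
    i = 0
    while i < len(pattern):
        pair = pattern[i:i + 2]
        if quoting:
            if pair == "\\E":
                quoting = False
                i += 2
            else:
                ch = pattern[i]
                # Escape ASCII non-alphanumerics; letters, digits and
                # non-ASCII characters are already literal.
                if ch.isascii() and not ch.isalnum():
                    out.append("\\")
                out.append(ch)
                i += 1
        elif pair == "\\Q":
            quoting = True
            i += 2
        elif pair == "\\E":
            i += 2                      # isolated \E is ignored
        elif pattern[i] == "\\" and len(pair) == 2:
            out.append(pair)            # keep other escapes intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)                 # \E is assumed at the end


print(expand_quoted(r"\Qabc$xyz\E"))       # abc\$xyz
print(expand_quoted(r"\Qabc\E\$\Qxyz\E"))  # abc\$xyz
print(expand_quoted(r"\Qa.b"))             # a\.b
```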
|
|
Non-printing characters |
Non-printing characters |
|
|
A second use of backslash provides a way of encoding non-printing char- |
A second use of backslash provides a way of encoding non-printing char- |
acters in patterns in a visible manner. There is no restriction on the | acters in patterns in a visible manner. There is no restriction on the |
appearance of non-printing characters, apart from the binary zero that | appearance of non-printing characters, apart from the binary zero that |
terminates a pattern, but when a pattern is being prepared by text | terminates a pattern, but when a pattern is being prepared by text |
editing, it is often easier to use one of the following escape | editing, it is often easier to use one of the following escape |
sequences than the binary character it represents: |
sequences than the binary character it represents: |
|
|
\a alarm, that is, the BEL character (hex 07) |
\a alarm, that is, the BEL character (hex 07) |
Line 4879 BACKSLASH
|
Line 4981 BACKSLASH
|
\n linefeed (hex 0A) |
\n linefeed (hex 0A) |
\r carriage return (hex 0D) |
\r carriage return (hex 0D) |
\t tab (hex 09) |
\t tab (hex 09) |
|
\0dd character with octal code 0dd |
\ddd character with octal code ddd, or back reference |
\ddd character with octal code ddd, or back reference |
|
\o{ddd..} character with octal code ddd.. |
\xhh character with hex code hh |
\xhh character with hex code hh |
\x{hhh..} character with hex code hhh.. (non-JavaScript mode) |
\x{hhh..} character with hex code hhh.. (non-JavaScript mode) |
\uhhhh character with hex code hhhh (JavaScript mode only) |
\uhhhh character with hex code hhhh (JavaScript mode only) |
|
|
The precise effect of \cx on ASCII characters is as follows: if x is a | The precise effect of \cx on ASCII characters is as follows: if x is a |
lower case letter, it is converted to upper case. Then bit 6 of the | lower case letter, it is converted to upper case. Then bit 6 of the |
character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A |
character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A |
(A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes | (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes |
hex 7B (; is 3B). If the data item (byte or 16-bit value) following \c | hex 7B (; is 3B). If the data item (byte or 16-bit value) following \c |
has a value greater than 127, a compile-time error occurs. This locks | has a value greater than 127, a compile-time error occurs. This locks |
out non-ASCII characters in all modes. |
out non-ASCII characters in all modes. |
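
The ASCII rule is simple arithmetic, as the following Python sketch shows. The function name ctrl_escape_ascii is an assumption made for this example, and ValueError stands in for the compile-time error.

```python
def ctrl_escape_ascii(x):
    """Return the code of the character that \\cx denotes in ASCII."""
    code = ord(x)
    if code > 127:
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= x <= "z":
        code = ord(x.upper())   # lower case is converted to upper case
    return code ^ 0x40          # invert bit 6


print(hex(ctrl_escape_ascii("A")))  # 0x1
print(hex(ctrl_escape_ascii("z")))  # 0x1a
print(hex(ctrl_escape_ascii("{")))  # 0x3b
print(hex(ctrl_escape_ascii(";")))  # 0x7b
```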
|
|
The \c facility was designed for use with ASCII characters, but with | The \c facility was designed for use with ASCII characters, but with |
the extension to Unicode it is even less useful than it once was. It | the extension to Unicode it is even less useful than it once was. It |
is, however, recognized when PCRE is compiled in EBCDIC mode, where | is, however, recognized when PCRE is compiled in EBCDIC mode, where |
data items are always bytes. In this mode, all values are valid after | data items are always bytes. In this mode, all values are valid after |
\c. If the next character is a lower case letter, it is converted to | \c. If the next character is a lower case letter, it is converted to |
upper case. Then the 0xc0 bits of the byte are inverted. Thus \cA | upper case. Then the 0xc0 bits of the byte are inverted. Thus \cA |
becomes hex 01, as in ASCII (A is C1), but because the EBCDIC letters | becomes hex 01, as in ASCII (A is C1), but because the EBCDIC letters |
are disjoint, \cZ becomes hex 29 (Z is E9), and other characters also | are disjoint, \cZ becomes hex 29 (Z is E9), and other characters also |
generate different values. |
generate different values. |
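
A corresponding sketch for the EBCDIC rule works directly on byte values. The function name ctrl_escape_ebcdic is an assumption made for this example, and the lower case ranges are those of the usual EBCDIC letter layout, in which each upper case letter is 0x40 above its lower case form.

```python
def ctrl_escape_ebcdic(byte):
    """Return the byte that \\c followed by this EBCDIC byte denotes."""
    # EBCDIC lower case letters: a-i, j-r and s-z are three separate runs.
    if 0x81 <= byte <= 0x89 or 0x91 <= byte <= 0x99 or 0xA2 <= byte <= 0xA9:
        byte += 0x40            # convert to upper case
    return byte ^ 0xC0          # invert the 0xc0 bits


print(hex(ctrl_escape_ebcdic(0xC1)))  # 0x1   (A)
print(hex(ctrl_escape_ebcdic(0xE9)))  # 0x29  (Z)
```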
|
|
By default, after \x, from zero to two hexadecimal digits are read | After \0 up to two further octal digits are read. If there are fewer |
(letters can be in upper or lower case). Any number of hexadecimal dig- | than two digits, just those that are present are used. Thus the |
its may appear between \x{ and }, but the character code is constrained | |
as follows: | |
| |
8-bit non-UTF mode less than 0x100 | |
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint | |
16-bit non-UTF mode less than 0x10000 | |
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint | |
32-bit non-UTF mode less than 0x80000000 | |
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint | |
| |
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so- | |
called "surrogate" codepoints), and 0xffef. | |
| |
If characters other than hexadecimal digits appear between \x{ and }, | |
or if there is no terminating }, this form of escape is not recognized. | |
Instead, the initial \x will be interpreted as a basic hexadecimal | |
escape, with no following digits, giving a character whose value is | |
zero. | |
| |
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x | |
is as just described only when it is followed by two hexadecimal dig- | |
its. Otherwise, it matches a literal "x" character. In JavaScript | |
mode, support for code points greater than 256 is provided by \u, which | |
must be followed by four hexadecimal digits; otherwise it matches a | |
literal "u" character. Character codes specified by \u in JavaScript | |
mode are constrained in the same way as those specified by \x in | |
non-JavaScript mode. | |
| |
Characters whose value is less than 256 can be defined by either of the | |
two syntaxes for \x (or by \u in JavaScript mode). There is no differ- | |
ence in the way they are handled. For example, \xdc is exactly the same | |
as \x{dc} (or \u00dc in JavaScript mode). | |
| |
After \0 up to two further octal digits are read. If there are fewer | |
than two digits, just those that are present are used. Thus the | |
sequence \0\x\07 specifies two binary zeros followed by a BEL character |
sequence \0\x\07 specifies two binary zeros followed by a BEL character |
(code value 7). Make sure you supply two digits after the initial zero | (code value 7). Make sure you supply two digits after the initial zero |
if the pattern character that follows is itself an octal digit. |
if the pattern character that follows is itself an octal digit. |
|
|
|
The escape \o must be followed by a sequence of octal digits, enclosed |
|
in braces. An error occurs if this is not the case. This escape is a |
|
recent addition to Perl; it provides a way of specifying character code |
|
points as octal numbers greater than 0777, and it also allows octal |
|
numbers and back references to be unambiguously specified. |
|
|
|
For greater clarity and unambiguity, it is best to avoid following \ by |
|
a digit greater than zero. Instead, use \o{} or \x{} to specify charac- |
|
ter numbers, and \g{} to specify back references. The following para- |
|
graphs describe the old, ambiguous syntax. |
|
|
The handling of a backslash followed by a digit other than 0 is compli- |
The handling of a backslash followed by a digit other than 0 is compli- |
cated. Outside a character class, PCRE reads it and any following dig- | cated, and Perl has changed in recent releases, causing PCRE also to |
its as a decimal number. If the number is less than 10, or if there | change. Outside a character class, PCRE reads the digit and any follow- |
have been at least that many previous capturing left parentheses in the | ing digits as a decimal number. If the number is less than 8, or if |
expression, the entire sequence is taken as a back reference. A | there have been at least that many previous capturing left parentheses |
description of how this works is given later, following the discussion | in the expression, the entire sequence is taken as a back reference. A |
| description of how this works is given later, following the discussion |
of parenthesized subpatterns. |
of parenthesized subpatterns. |
|
|
Inside a character class, or if the decimal number is greater than 9 | Inside a character class, or if the decimal number following \ is |
and there have not been that many capturing subpatterns, PCRE re-reads | greater than 7 and there have not been that many capturing subpatterns, |
up to three octal digits following the backslash, and uses them to gen- | PCRE handles \8 and \9 as the literal characters "8" and "9", and oth- |
erate a data character. Any subsequent digits stand for themselves. The | erwise re-reads up to three octal digits following the backslash, using |
value of the character is constrained in the same way as characters | them to generate a data character. Any subsequent digits stand for |
specified in hexadecimal. For example: | themselves. For example: |
|
|
\040 is another way of writing an ASCII space |
\040 is another way of writing an ASCII space |
\40 is the same, provided there are fewer than 40 |
\40 is the same, provided there are fewer than 40 |
Line 4970 BACKSLASH
|
Line 5051 BACKSLASH
|
character with octal code 113 |
character with octal code 113 |
\377 might be a back reference, otherwise |
\377 might be a back reference, otherwise |
the value 255 (decimal) |
the value 255 (decimal) |
\81 is either a back reference, or a binary zero | \81 is either a back reference, or the two |
followed by the two characters "8" and "1" | characters "8" and "1" |
|
|
Note that octal values of 100 or greater must not be introduced by a | Note that octal values of 100 or greater that are specified using this |
leading zero, because no more than three octal digits are ever read. | syntax must not be introduced by a leading zero, because no more than |
| three octal digits are ever read. |
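
The revised rules for a backslash followed by a non-zero digit can be modelled as a short decision procedure. The Python sketch below follows the wording of the newer text only; the function name interpret_backslash_digits and its return values are assumptions made for this example, and limits on character values are not checked.

```python
def interpret_backslash_digits(digits, groups_so_far, in_class=False):
    """Classify the digits that follow a backslash.

    digits starts with 1-9. The result is (kind, value, rest), where
    rest holds any trailing digits that stand for themselves.
    """
    if not in_class:
        number = int(digits)            # read as a decimal number
        if number < 8 or number <= groups_so_far:
            return ("backref", number, "")
    if digits[0] in "89":
        return ("literal", digits[0], digits[1:])
    # Otherwise re-read up to three octal digits.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in "01234567":
        n += 1
    return ("char", int(digits[:n], 8), digits[n:])


print(interpret_backslash_digits("7", 0))    # ('backref', 7, '')
print(interpret_backslash_digits("40", 0))   # ('char', 32, '')
print(interpret_backslash_digits("11", 11))  # ('backref', 11, '')
print(interpret_backslash_digits("81", 0))   # ('literal', '8', '1')
```

Because the outcome depends on how many capturing groups precede the escape, the same text can change meaning when a pattern is edited.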
|
|
|
By default, after \x that is not followed by {, from zero to two hexa- |
|
decimal digits are read (letters can be in upper or lower case). Any |
|
number of hexadecimal digits may appear between \x{ and }. If a charac- |
|
ter other than a hexadecimal digit appears between \x{ and }, or if |
|
there is no terminating }, an error occurs. |
|
|
|
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x |
|
is as just described only when it is followed by two hexadecimal dig- |
|
its. Otherwise, it matches a literal "x" character. In JavaScript |
|
mode, support for code points greater than 256 is provided by \u, which |
|
must be followed by four hexadecimal digits; otherwise it matches a |
|
literal "u" character. |
|
|
|
Characters whose value is less than 256 can be defined by either of the |
|
two syntaxes for \x (or by \u in JavaScript mode). There is no differ- |
|
ence in the way they are handled. For example, \xdc is exactly the same |
|
as \x{dc} (or \u00dc in JavaScript mode). |
|
|
|
Constraints on character values |
|
|
|
Characters that are specified using octal or hexadecimal numbers are |
|
limited to certain values, as follows: |
|
|
|
8-bit non-UTF mode less than 0x100 |
|
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint |
|
16-bit non-UTF mode less than 0x10000 |
|
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint |
|
32-bit non-UTF mode less than 0x100000000 |
|
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint |
|
|
|
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so- |
|
called "surrogate" codepoints), and 0xffef. |
|
|
|
Escape sequences in character classes |
|
|
All the sequences that define a single character value can be used both |
All the sequences that define a single character value can be used both |
inside and outside character classes. In addition, inside a character |
inside and outside character classes. In addition, inside a character |
class, \b is interpreted as the backspace character (hex 08). |
class, \b is interpreted as the backspace character (hex 08). |
Line 5039 BACKSLASH
|
Line 5156 BACKSLASH
|
the subject string, all of them fail, because there is no character to |
the subject string, all of them fail, because there is no character to |
match. |
match. |
|
|
For compatibility with Perl, \s does not match the VT character (code | For compatibility with Perl, \s formerly did not match the VT character |
11). This makes it different from the POSIX "space" class. The \s | (code 11), which made it different from the POSIX "space" class. |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If | However, Perl added VT at release 5.18, and PCRE followed suit at |
"use locale;" is included in a Perl script, \s may match the VT charac- | release 8.34. The default \s characters are now HT (9), LF (10), VT |
ter. In PCRE, it never does. | (11), FF (12), CR (13), and space (32), which are defined as white |
| space in the "C" locale. This list may vary if locale-specific matching |
| is taking place. For example, in some locales the "non-breaking space" |
| character (\xA0) is recognized as white space, and in others the VT |
| character is not. |
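
The change to the default \s set can be stated as two sets of character codes. The Python names below are assumptions made for this example; the sets apply only when no locale-specific tables are in use and PCRE_UCP is not set.

```python
# Default \s characters before PCRE 8.34: HT, LF, FF, CR, space.
S_BEFORE_8_34 = {9, 10, 12, 13, 32}

# Default \s characters from PCRE 8.34 onwards: VT (11) is added.
S_FROM_8_34 = {9, 10, 11, 12, 13, 32}

print(sorted(S_FROM_8_34 - S_BEFORE_8_34))  # [11]
```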
|
|
A "word" character is an underscore or any character that is a letter |
A "word" character is an underscore or any character that is a letter |
or digit. By default, the definition of letters and digits is con- |
or digit. By default, the definition of letters and digits is con- |
trolled by PCRE's low-valued character tables, and may vary if locale- |
trolled by PCRE's low-valued character tables, and may vary if locale- |
specific matching is taking place (see "Locale support" in the pcreapi |
specific matching is taking place (see "Locale support" in the pcreapi |
page). For example, in a French locale such as "fr_FR" in Unix-like |
page). For example, in a French locale such as "fr_FR" in Unix-like |
systems, or "french" in Windows, some character codes greater than 128 | systems, or "french" in Windows, some character codes greater than 127 |
are used for accented letters, and these are then matched by \w. The |
are used for accented letters, and these are then matched by \w. The |
use of locales with Unicode is discouraged. |
use of locales with Unicode is discouraged. |
|
|
By default, in a UTF mode, characters with values greater than 128 | By default, characters whose code points are greater than 127 never |
never match \d, \s, or \w, and always match \D, \S, and \W. These | match \d, \s, or \w, and always match \D, \S, and \W, although this may |
sequences retain their original meanings from before UTF support was | vary for characters in the range 128-255 when locale-specific matching |
available, mainly for efficiency reasons. However, if PCRE is compiled | is happening. These escape sequences retain their original meanings |
with Unicode property support, and the PCRE_UCP option is set, the be- | from before Unicode support was available, mainly for efficiency rea- |
haviour is changed so that Unicode properties are used to determine | sons. If PCRE is compiled with Unicode property support, and the |
character types, as follows: | PCRE_UCP option is set, the behaviour is changed so that Unicode prop- |
| erties are used to determine character types, as follows: |
|
|
\d any character that \p{Nd} matches (decimal digit) | \d any character that matches \p{Nd} (decimal digit) |
\s any character that \p{Z} matches, plus HT, LF, FF, CR | \s any character that matches \p{Z} or \h or \v |
\w any character that \p{L} or \p{N} matches, plus underscore | \w any character that matches \p{L} or \p{N}, plus underscore |
|
|
The upper case escapes match the inverse sets of characters. Note that | The upper case escapes match the inverse sets of characters. Note that |
\d matches only decimal digits, whereas \w matches any Unicode digit, | \d matches only decimal digits, whereas \w matches any Unicode digit, |
as well as any Unicode letter, and underscore. Note also that PCRE_UCP | as well as any Unicode letter, and underscore. Note also that PCRE_UCP |
affects \b and \B because they are defined in terms of \w and \W. | affects \b and \B because they are defined in terms of \w and \W. |
Matching these sequences is noticeably slower when PCRE_UCP is set. |
Matching these sequences is noticeably slower when PCRE_UCP is set. |
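
The difference between \d and \w under PCRE_UCP can be modelled outside PCRE. The following Python sketch is not PCRE code: it relies on the standard unicodedata module, whose tables may correspond to a different Unicode release from the one PCRE was built with, and the names ucp_digit and ucp_word are invented for illustration.

```python
import unicodedata

def ucp_digit(ch):
    """Model of \\d with PCRE_UCP set: general category Nd only."""
    return unicodedata.category(ch) == "Nd"

def ucp_word(ch):
    """Model of \\w with PCRE_UCP set: any letter (L*) or number (N*), plus underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"

# U+00B2 SUPERSCRIPT TWO has category No, so it is a "word" character
# under this model but not a decimal digit.
print(ucp_digit("\u00b2"), ucp_word("\u00b2"))
```

The example character shows why the text stresses that \d is narrower than the digit part of \w.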
|
|
The sequences \h, \H, \v, and \V are features that were added to Perl | The sequences \h, \H, \v, and \V are features that were added to Perl |
at release 5.10. In contrast to the other sequences, which match only | at release 5.10. In contrast to the other sequences, which match only |
ASCII characters by default, these always match certain high-valued | ASCII characters by default, these always match certain high-valued |
codepoints, whether or not PCRE_UCP is set. The horizontal space char- | code points, whether or not PCRE_UCP is set. The horizontal space char- |
acters are: |
acters are: |
|
|
U+0009 Horizontal tab (HT) |
U+0009 Horizontal tab (HT) |
Line 5113 BACKSLASH
|
Line 5235 BACKSLASH
|
|
|
Newline sequences |
Newline sequences |
|
|
Outside a character class, by default, the escape sequence \R matches | Outside a character class, by default, the escape sequence \R matches |
any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent | any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent |
to the following: |
to the following: |
|
|
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
|
|
This is an example of an "atomic group", details of which are given | This is an example of an "atomic group", details of which are given |
below. This particular group matches either the two-character sequence |
below. This particular group matches either the two-character sequence |
CR followed by LF, or one of the single characters LF (linefeed, | CR followed by LF, or one of the single characters LF (linefeed, |
U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car- | U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car- |
riage return, U+000D), or NEL (next line, U+0085). The two-character | riage return, U+000D), or NEL (next line, U+0085). The two-character |
sequence is treated as a single unit that cannot be split. |
sequence is treated as a single unit that cannot be split. |
|
|
In other modes, two additional characters whose codepoints are greater | In other modes, two additional characters whose codepoints are greater |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
rator, U+2029). Unicode character property support is not needed for | rator, U+2029). Unicode character property support is not needed for |
these characters to be recognized. |
these characters to be recognized. |
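
As a rough model of the two character sets just described, the Python sketch below reports how many characters \R would consume at a given position. It is an illustration only, not PCRE's implementation; the function name newline_length and the wide flag (standing in for "any mode other than 8-bit non-UTF-8") are assumptions made for this example.

```python
# Single characters that \R accepts in every mode.
NARROW = {"\n", "\x0b", "\x0c", "\r", "\x85"}
# Extra characters accepted outside 8-bit non-UTF-8 mode.
WIDE_ONLY = {"\u2028", "\u2029"}

def newline_length(subject, pos, wide=True):
    """Return how many characters \\R would consume at pos, or 0 for no match."""
    if subject.startswith("\r\n", pos):
        return 2  # CR followed by LF is one unit that is never split
    if pos < len(subject):
        ch = subject[pos]
        if ch in NARROW or (wide and ch in WIDE_ONLY):
            return 1
    return 0
```

Checking for CRLF before the single characters mirrors the order of alternatives in the atomic group shown above.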
|
|
It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
the complete set of Unicode line endings) by setting the option | the complete set of Unicode line endings) by setting the option |
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. |
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. |
(BSR is an abbreviation for "backslash R".) This can be made the default |
(BSR is an abbreviation for "backslash R".) This can be made the default |
when PCRE is built; if this is the case, the other behaviour can be | when PCRE is built; if this is the case, the other behaviour can be |
requested via the PCRE_BSR_UNICODE option. It is also possible to | requested via the PCRE_BSR_UNICODE option. It is also possible to |
specify these settings by starting a pattern string with one of the | specify these settings by starting a pattern string with one of the |
following sequences: |
following sequences: |
|
|
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_UNICODE) any Unicode newline sequence |
(*BSR_UNICODE) any Unicode newline sequence |
|
|
These override the default and the options given to the compiling func- |
These override the default and the options given to the compiling func- |
tion, but they can themselves be overridden by options given to a | tion, but they can themselves be overridden by options given to a |
matching function. Note that these special settings, which are not | matching function. Note that these special settings, which are not |
Perl-compatible, are recognized only at the very start of a pattern, | Perl-compatible, are recognized only at the very start of a pattern, |
and that they must be in upper case. If more than one of them is | and that they must be in upper case. If more than one of them is |
present, the last one is used. They can be combined with a change of | present, the last one is used. They can be combined with a change of |
newline convention; for example, a pattern can start with: |
newline convention; for example, a pattern can start with: |
|
|
(*ANY)(*BSR_ANYCRLF) |
(*ANY)(*BSR_ANYCRLF) |
|
|
They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) | They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) |
or (*UCP) special sequences. Inside a character class, \R is treated as |
or (*UCP) special sequences. Inside a character class, \R is treated as |
an unrecognized escape sequence, and so matches the letter "R" by | an unrecognized escape sequence, and so matches the letter "R" by |
default, but causes an error if PCRE_EXTRA is set. |
default, but causes an error if PCRE_EXTRA is set. |
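
The order of precedence for the \R setting (build default, then compile-time option, then leading pattern items, then match-time option) can be summarised in a small Python sketch. This is a model of the rules as stated above, not PCRE code; effective_bsr and its string arguments are hypothetical, and the scanner only understands simple (*NAME) items.

```python
import re

# A simple upper-case (*NAME) item; lower-case names are deliberately not matched.
_LEADING_ITEM = re.compile(r"\(\*([A-Z0-9_]+)\)")

def effective_bsr(build_default, compile_option, pattern, match_option):
    """Resolve the \\R setting; each option is "ANYCRLF", "UNICODE", or None."""
    setting = build_default
    if compile_option is not None:
        setting = compile_option
    # Only items at the very start of the pattern count; the last BSR item wins.
    pos = 0
    while True:
        m = _LEADING_ITEM.match(pattern, pos)
        if m is None:
            break
        if m.group(1) in ("BSR_ANYCRLF", "BSR_UNICODE"):
            setting = m.group(1)[len("BSR_"):]
        pos = m.end()
    # Options given to a matching function override everything else.
    if match_option is not None:
        setting = match_option
    return setting
```

For instance, under this model a pattern starting with (*ANY)(*BSR_ANYCRLF) yields "ANYCRLF" unless a match-time option says otherwise.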
|
|
Unicode character properties |
Unicode character properties |
|
|
When PCRE is built with Unicode character property support, three addi- |
When PCRE is built with Unicode character property support, three addi- |
tional escape sequences that match characters with specific properties | tional escape sequences that match characters with specific properties |
are available. When in 8-bit non-UTF-8 mode, these sequences are of | are available. When in 8-bit non-UTF-8 mode, these sequences are of |
course limited to testing characters whose codepoints are less than | course limited to testing characters whose codepoints are less than |
256, but they do work in this mode. The extra escape sequences are: |
256, but they do work in this mode. The extra escape sequences are: |
|
|
\p{xx} a character with the xx property |
\p{xx} a character with the xx property |
\P{xx} a character without the xx property |
\P{xx} a character without the xx property |
\X a Unicode extended grapheme cluster |
\X a Unicode extended grapheme cluster |
|
|
The property names represented by xx above are limited to the Unicode | The property names represented by xx above are limited to the Unicode |
script names, the general category properties, "Any", which matches any |
script names, the general category properties, "Any", which matches any |
character (including newline), and some special PCRE properties | character (including newline), and some special PCRE properties |
(described in the next section). Other Perl properties such as "InMu- | (described in the next section). Other Perl properties such as "InMu- |
sicalSymbols" are not currently supported by PCRE. Note that \P{Any} | sicalSymbols" are not currently supported by PCRE. Note that \P{Any} |
does not match any characters, so always causes a match failure. |
does not match any characters, so always causes a match failure. |
|
|
Sets of Unicode characters are defined as belonging to certain scripts. |
Sets of Unicode characters are defined as belonging to certain scripts. |
A character from one of these sets can be matched using a script name. | A character from one of these sets can be matched using a script name. |
For example: |
For example: |
|
|
\p{Greek} |
\p{Greek} |
\P{Han} |
\P{Han} |
|
|
Those that are not part of an identified script are lumped together as | Those that are not part of an identified script are lumped together as |
"Common". The current list of scripts is: |
"Common". The current list of scripts is: |
|
|
Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo, | Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo, |
Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma, | Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma, |
Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, | Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, |
Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, | Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, |
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- | Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- |
gana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip- | gana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip- |
tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, | tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, |
Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, | Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, |
Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive, |
Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive, |
Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko, | Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko, |
Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic, | Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic, |
Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samari- | Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samari- |
tan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng, Sundanese, | tan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng, Sundanese, |
Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, | Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, |
Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, | Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, |
Yi. |
Yi. |
|
|
Each character has exactly one Unicode general category property, spec- |
Each character has exactly one Unicode general category property, spec- |
ified by a two-letter abbreviation. For compatibility with Perl, nega- | ified by a two-letter abbreviation. For compatibility with Perl, nega- |
tion can be specified by including a circumflex between the opening | tion can be specified by including a circumflex between the opening |
brace and the property name. For example, \p{^Lu} is the same as | brace and the property name. For example, \p{^Lu} is the same as |
\P{Lu}. |
\P{Lu}. |
|
|
If only one letter is specified with \p or \P, it includes all the gen- |
If only one letter is specified with \p or \P, it includes all the gen- |
eral category properties that start with that letter. In this case, in | eral category properties that start with that letter. In this case, in |
the absence of negation, the curly brackets in the escape sequence are | the absence of negation, the curly brackets in the escape sequence are |
optional; these two examples have the same effect: |
optional; these two examples have the same effect: |
|
|
\p{L} |
\p{L} |
Line 5264 BACKSLASH
|
Line 5386 BACKSLASH
|
Zp Paragraph separator |
Zp Paragraph separator |
Zs Space separator |
Zs Space separator |
|
|
The special property L& is also supported: it matches a character that | The special property L& is also supported: it matches a character that |
has the Lu, Ll, or Lt property, in other words, a letter that is not | has the Lu, Ll, or Lt property, in other words, a letter that is not |
classified as a modifier or "other". |
classified as a modifier or "other". |
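
The general category rules in this section (two-letter names, single-letter groups, circumflex negation, and the special L& property) can be modelled with Python's unicodedata module. This is a sketch for illustration, not PCRE's table lookup; has_property is a made-up name, and Python's Unicode tables may differ in version from PCRE's.

```python
import unicodedata

def has_property(ch, prop):
    """Model \\p{prop} for general categories; a leading ^ negates, as in \\p{^Lu}."""
    negate = prop.startswith("^")
    name = prop.lstrip("^")
    cat = unicodedata.category(ch)
    if name == "L&":
        hit = cat in ("Lu", "Ll", "Lt")   # cased letters only
    elif len(name) == 1:
        hit = cat[0] == name              # one letter covers every category starting with it
    else:
        hit = cat == name
    return hit != negate
```

Under this model U+02B0 (a modifier letter, category Lm) satisfies L but not L&, which is the distinction the text draws.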
|
|
The Cs (Surrogate) property applies only to characters in the range | The Cs (Surrogate) property applies only to characters in the range |
U+D800 to U+DFFF. Such characters are not valid in Unicode strings and | U+D800 to U+DFFF. Such characters are not valid in Unicode strings and |
so cannot be tested by PCRE, unless UTF validity checking has been | so cannot be tested by PCRE, unless UTF validity checking has been |
turned off (see the discussion of PCRE_NO_UTF8_CHECK, |
turned off (see the discussion of PCRE_NO_UTF8_CHECK, |
PCRE_NO_UTF16_CHECK and PCRE_NO_UTF32_CHECK in the pcreapi page). Perl | PCRE_NO_UTF16_CHECK and PCRE_NO_UTF32_CHECK in the pcreapi page). Perl |
does not support the Cs property. |
does not support the Cs property. |
|
|
The long synonyms for property names that Perl supports (such as | The long synonyms for property names that Perl supports (such as |
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix | \p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
any of these properties with "Is". |
any of these properties with "Is". |
|
|
No character that is in the Unicode table has the Cn (unassigned) prop- |
No character that is in the Unicode table has the Cn (unassigned) prop- |
erty. Instead, this property is assumed for any code point that is not |
erty. Instead, this property is assumed for any code point that is not |
in the Unicode table. |
in the Unicode table. |
|
|
Specifying caseless matching does not affect these escape sequences. | Specifying caseless matching does not affect these escape sequences. |
For example, \p{Lu} always matches only upper case letters. This is | For example, \p{Lu} always matches only upper case letters. This is |
different from the behaviour of current versions of Perl. |
different from the behaviour of current versions of Perl. |
|
|
Matching characters by Unicode property is not fast, because PCRE has | Matching characters by Unicode property is not fast, because PCRE has |
to do a multistage table lookup in order to find a character's prop- | to do a multistage table lookup in order to find a character's prop- |
erty. That is why the traditional escape sequences such as \d and \w do |
erty. That is why the traditional escape sequences such as \d and \w do |
not use Unicode properties in PCRE by default, though you can make them |
not use Unicode properties in PCRE by default, though you can make them |
do so by setting the PCRE_UCP option or by starting the pattern with | do so by setting the PCRE_UCP option or by starting the pattern with |
(*UCP). |
(*UCP). |
|
|
Extended grapheme clusters |
Extended grapheme clusters |
|
|
The \X escape matches any number of Unicode characters that form an | The \X escape matches any number of Unicode characters that form an |
"extended grapheme cluster", and treats the sequence as an atomic group |
"extended grapheme cluster", and treats the sequence as an atomic group |
(see below). Up to and including release 8.31, PCRE matched an ear- | (see below). Up to and including release 8.31, PCRE matched an ear- |
lier, simpler definition that was equivalent to |
lier, simpler definition that was equivalent to |
|
|
(?>\PM\pM*) |
(?>\PM\pM*) |
|
|
That is, it matched a character without the "mark" property, followed | That is, it matched a character without the "mark" property, followed |
by zero or more characters with the "mark" property. Characters with | by zero or more characters with the "mark" property. Characters with |
the "mark" property are typically non-spacing accents that affect the | the "mark" property are typically non-spacing accents that affect the |
preceding character. |
preceding character. |
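
The earlier, simpler definition can be written out by hand. The Python sketch below models (?>\PM\pM*) at a given position using unicodedata to detect the "mark" categories; it is an illustration under that assumption, and the names is_mark and simple_cluster_end are hypothetical.

```python
import unicodedata

def is_mark(ch):
    """True for characters whose general category starts with M."""
    return unicodedata.category(ch).startswith("M")

def simple_cluster_end(subject, pos):
    """Model of (?>\\PM\\pM*) at pos: return the end index, or None if no match."""
    if pos >= len(subject) or is_mark(subject[pos]):
        return None                 # \PM needs exactly one non-mark character
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1                    # \pM* absorbs every following mark
    return end
```

With the subject "e" followed by U+0301 (combining acute accent) and then "a", the first cluster covers two characters and the second covers one.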
|
|
This simple definition was extended in Unicode to include more compli- | This simple definition was extended in Unicode to include more compli- |
cated kinds of composite character by giving each character a grapheme | cated kinds of composite character by giving each character a grapheme |
breaking property, and creating rules that use these properties to | breaking property, and creating rules that use these properties to |
define the boundaries of extended grapheme clusters. In releases of | define the boundaries of extended grapheme clusters. In releases of |
PCRE later than 8.31, \X matches one of these clusters. |
PCRE later than 8.31, \X matches one of these clusters. |
|
|
\X always matches at least one character. Then it decides whether to | \X always matches at least one character. Then it decides whether to |
add additional characters according to the following rules for ending a |
add additional characters according to the following rules for ending a |
cluster: |
cluster: |
|
|
1. End at the end of the subject string. |
1. End at the end of the subject string. |
|
|
2. Do not end between CR and LF; otherwise end after any control char- | 2. Do not end between CR and LF; otherwise end after any control char- |
acter. |
acter. |
|
|
3. Do not break Hangul (a Korean script) syllable sequences. Hangul | 3. Do not break Hangul (a Korean script) syllable sequences. Hangul |
characters are of five types: L, V, T, LV, and LVT. An L character may | characters are of five types: L, V, T, LV, and LVT. An L character may |
be followed by an L, V, LV, or LVT character; an LV or V character may | be followed by an L, V, LV, or LVT character; an LV or V character may |
be followed by a V or T character; an LVT or T character may be followed |
be followed by a V or T character; an LVT or T character may be followed |
only by a T character. |
only by a T character. |
|
|
4. Do not end before extending characters or spacing marks. Characters | 4. Do not end before extending characters or spacing marks. Characters |
with the "mark" property always have the "extend" grapheme breaking | with the "mark" property always have the "extend" grapheme breaking |
property. |
property. |
|
|
5. Do not end after prepend characters. |
5. Do not end after prepend characters. |
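
Rule 3 above amounts to a small transition table. The Python sketch below encodes it directly; it assumes the caller has already classified each Hangul character as one of the five types, and the names MAY_FOLLOW and hangul_continues are invented for this example.

```python
# For each Hangul syllable type, the types that may follow it
# without ending the cluster (rule 3).
MAY_FOLLOW = {
    "L":   {"L", "V", "LV", "LVT"},
    "LV":  {"V", "T"},
    "V":   {"V", "T"},
    "LVT": {"T"},
    "T":   {"T"},
}

def hangul_continues(prev_type, next_type):
    """True if next_type may extend a cluster whose last character has prev_type."""
    return next_type in MAY_FOLLOW.get(prev_type, set())
```

For example, hangul_continues("L", "V") is true, while hangul_continues("LVT", "V") is false, so a cluster would end between those two characters.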
Line 5339 BACKSLASH
|
Line 5461 BACKSLASH
|
|
|
PCRE's additional properties |
PCRE's additional properties |
|
|
As well as the standard Unicode properties described above, PCRE sup- | As well as the standard Unicode properties described above, PCRE sup- |
ports four more that make it possible to convert traditional escape | ports four more that make it possible to convert traditional escape |
sequences such as \w and \s and POSIX character classes to use Unicode | sequences such as \w and \s to use Unicode properties. PCRE uses these |
properties. PCRE uses these non-standard, non-Perl properties inter- | non-standard, non-Perl properties internally when PCRE_UCP is set. How- |
nally when PCRE_UCP is set. However, they may also be used explicitly. | |
These properties are: | |
|
|
Xan Any alphanumeric character |
Xan Any alphanumeric character |
Xps Any POSIX space character |
Xps Any POSIX space character |
Line 5354 BACKSLASH
|
Line 5475 BACKSLASH
|
Xan matches characters that have either the L (letter) or the N (num- |
Xan matches characters that have either the L (letter) or the N (num- |
ber) property. Xps matches the characters tab, linefeed, vertical tab, |
ber) property. Xps matches the characters tab, linefeed, vertical tab, |
form feed, or carriage return, and any other character that has the Z |
form feed, or carriage return, and any other character that has the Z |
(separator) property. Xsp is the same as Xps, except that vertical tab | (separator) property. Xsp is the same as Xps; it used to exclude ver- |
is excluded. Xwd matches the same characters as Xan, plus underscore. | tical tab, for Perl compatibility, but Perl changed, and so PCRE fol- |
| lowed at release 8.34. Xwd matches the same characters as Xan, plus |
| underscore. |
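
The two space properties can be modelled as follows. This Python sketch is an approximation built on unicodedata, not PCRE's own tables; the function names and the before_8_34 flag (which reproduces the older behaviour in which Xsp excluded vertical tab) are assumptions for illustration.

```python
import unicodedata

# Tab, linefeed, vertical tab, form feed, carriage return.
_POSIX_SPACE_CONTROLS = "\t\n\x0b\x0c\r"

def xps(ch):
    """Model of \\p{Xps}: the five controls above, or any Z (separator) character."""
    return ch in _POSIX_SPACE_CONTROLS or unicodedata.category(ch)[0] == "Z"

def xsp(ch, before_8_34=False):
    """Model of \\p{Xsp}: same as Xps, except that older releases excluded VT."""
    if before_8_34 and ch == "\x0b":
        return False
    return xps(ch)
```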
|
|
There is another non-standard property, Xuc, which matches any charac- |
There is another non-standard property, Xuc, which matches any charac- |
ter that can be represented by a Universal Character Name in C++ and |
ter that can be represented by a Universal Character Name in C++ and |
Line 5628 SQUARE BRACKETS AND CHARACTER CLASSES
|
Line 5751 SQUARE BRACKETS AND CHARACTER CLASSES
|
between d and m, inclusive. If a minus character is required in a |
between d and m, inclusive. If a minus character is required in a |
class, it must be escaped with a backslash or appear in a position |
class, it must be escaped with a backslash or appear in a position |
where it cannot be interpreted as indicating a range, typically as the |
where it cannot be interpreted as indicating a range, typically as the |
first or last character in the class. | first or last character in the class, or immediately after a range. For |
| example, [b-d-z] matches letters in the range b to d, a hyphen charac- |
| ter, or z. |
|
|
It is not possible to have the literal character "]" as the end charac- |
It is not possible to have the literal character "]" as the end charac- |
ter of a range. A pattern such as [W-]46] is interpreted as a class of |
ter of a range. A pattern such as [W-]46] is interpreted as a class of |
Line 5639 SQUARE BRACKETS AND CHARACTER CLASSES
|
Line 5764 SQUARE BRACKETS AND CHARACTER CLASSES
|
The octal or hexadecimal representation of "]" can also be used to end |
The octal or hexadecimal representation of "]" can also be used to end |
a range. |
a range. |
|
|
Ranges operate in the collating sequence of character values. They can | An error is generated if a POSIX character class (see below) or an |
also be used for characters specified numerically, for example | escape sequence other than one that defines a single character appears |
[\000-\037]. Ranges can include any characters that are valid for the | at a point where a range ending character is expected. For example, |
| [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not. |
| |
| Ranges operate in the collating sequence of character values. They can |
| also be used for characters specified numerically, for example |
| [\000-\037]. Ranges can include any characters that are valid for the |
current mode. |
current mode. |
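
The range rules can be tried out with Python's re module, with the caveat that it is a different engine from PCRE. For the specific classes used here it happens to behave the same way: a hyphen immediately after a range is literal, and a class escape such as \d is rejected as a range end. The helper name compiles is made up for this sketch.

```python
import re

# After the range b-d, the hyphen cannot start another range, so it is literal.
hyphen_after_range = re.compile(r"[b-d-z]")

def compiles(pattern):
    """Return True if Python's re accepts the pattern, False on a syntax error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"[z-\xff]"))   # a single-character escape may end a range
print(compiles(r"[A-\d]"))     # a character-type escape may not
```

Python does not support POSIX class names, so the [A-[:digit:]] case cannot be checked this way.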
|
|
If a range that includes letters is used when caseless matching is set, |
If a range that includes letters is used when caseless matching is set, |
it matches the letters in either case. For example, [W-c] is equivalent |
it matches the letters in either case. For example, [W-c] is equivalent |
to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if | to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if |
character tables for a French locale are in use, [\xc8-\xcb] matches | character tables for a French locale are in use, [\xc8-\xcb] matches |
accented E characters in both cases. In UTF modes, PCRE supports the | accented E characters in both cases. In UTF modes, PCRE supports the |
concept of case for characters with values greater than 128 only when | concept of case for characters with values greater than 128 only when |
it is compiled with Unicode property support. |
it is compiled with Unicode property support. |
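
The [W-c] equivalence can be checked by expanding the range by hand. The Python sketch below does this for ASCII only; it does not model locale tables or Unicode case folding, and the function names are hypothetical.

```python
def range_members(first, last):
    """All characters from first to last inclusive, in code point order."""
    return {chr(cp) for cp in range(ord(first), ord(last) + 1)}

def caseless_members(chars):
    """Add the other-case partner of every ASCII letter in chars."""
    return set(chars) | {c.swapcase() for c in chars if c.isascii() and c.isalpha()}

expanded = caseless_members(range_members("W", "c"))
listed = caseless_members("][\\^_`wxyzabc")
print(expanded == listed)
```

Both sets contain the six punctuation characters between "Z" and "a" plus the letters a, b, c, w, x, y, and z in either case.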
|
|
The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, | The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, |
\w, and \W may appear in a character class, and add the characters that |
\w, and \W may appear in a character class, and add the characters that |
they match to the class. For example, [\dABCDEF] matches any hexadeci- | they match to the class. For example, [\dABCDEF] matches any hexadeci- |
mal digit. In UTF modes, the PCRE_UCP option affects the meanings of | mal digit. In UTF modes, the PCRE_UCP option affects the meanings of |
\d, \s, \w and their upper case partners, just as it does when they | \d, \s, \w and their upper case partners, just as it does when they |
appear outside a character class, as described in the section entitled | appear outside a character class, as described in the section entitled |
"Generic character types" above. The escape sequence \b has a different |
"Generic character types" above. The escape sequence \b has a different |
meaning inside a character class; it matches the backspace character. | meaning inside a character class; it matches the backspace character. |
The sequences \B, \N, \R, and \X are not special inside a character | The sequences \B, \N, \R, and \X are not special inside a character |
class. Like any other unrecognized escape sequences, they are treated | class. Like any other unrecognized escape sequences, they are treated |
as the literal characters "B", "N", "R", and "X" by default, but cause | as the literal characters "B", "N", "R", and "X" by default, but cause |
an error if the PCRE_EXTRA option is set. |
an error if the PCRE_EXTRA option is set. |
|
|
A circumflex can conveniently be used with the upper case character | A circumflex can conveniently be used with the upper case character |
types to specify a more restricted set of characters than the matching | types to specify a more restricted set of characters than the matching |
lower case type. For example, the class [^\W_] matches any letter or | lower case type. For example, the class [^\W_] matches any letter or |
digit, but not underscore, whereas [\w] includes underscore. A positive |
digit, but not underscore, whereas [\w] includes underscore. A positive |
character class should be read as "something OR something OR ..." and a |
character class should be read as "something OR something OR ..." and a |
negative class as "NOT something AND NOT something AND NOT ...". |
negative class as "NOT something AND NOT something AND NOT ...". |
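
The [^\W_] idiom also works in Python's re module, which is used below purely as a stand-in for PCRE. The re.ASCII flag is an assumption made to mimic PCRE's default (non-UCP) meaning of \w.

```python
import re

# NOT a non-word character AND NOT underscore: letters and digits only.
alnum_only = re.compile(r"[^\W_]", re.ASCII)
# Letter OR digit OR underscore.
word = re.compile(r"[\w]", re.ASCII)

for ch in "a7_-":
    print(repr(ch), bool(alnum_only.fullmatch(ch)), bool(word.fullmatch(ch)))
```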
|
|
The only metacharacters that are recognized in character classes are | The only metacharacters that are recognized in character classes are |
backslash, hyphen (only where it can be interpreted as specifying a | backslash, hyphen (only where it can be interpreted as specifying a |
range), circumflex (only at the start), opening square bracket (only | range), circumflex (only at the start), opening square bracket (only |
when it can be interpreted as introducing a POSIX class name - see the | when it can be interpreted as introducing a POSIX class name, or for a |
next section), and the terminating closing square bracket. However, | special compatibility feature - see the next two sections), and the |
escaping other non-alphanumeric characters does no harm. | terminating closing square bracket. However, escaping other non- |
| alphanumeric characters does no harm. |
|
|
|
|
POSIX CHARACTER CLASSES |
POSIX CHARACTER CLASSES |
Line 5701 POSIX CHARACTER CLASSES
|
Line 5832 POSIX CHARACTER CLASSES
|
lower lower case letters |
lower lower case letters |
print printing characters, including space |
print printing characters, including space |
punct printing characters, excluding letters and digits and space |
punct printing characters, excluding letters and digits and space |
space white space (not quite the same as \s) | space white space (the same as \s from PCRE 8.34) |
upper upper case letters |
upper upper case letters |
word "word" characters (same as \w) |
word "word" characters (same as \w) |
xdigit hexadecimal digits |
xdigit hexadecimal digits |
|
|
The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), | The default "space" characters are HT (9), LF (10), VT (11), FF (12), |
and space (32). Notice that this list includes the VT character (code | CR (13), and space (32). If locale-specific matching is taking place, |
11). This makes "space" different to \s, which does not include VT (for | the list of space characters may be different; there may be fewer or |
Perl compatibility). | more of them. "Space" used to be different to \s, which did not include |
| VT, for Perl compatibility. However, Perl changed at release 5.18, and |
| PCRE followed at release 8.34. "Space" and \s now match the same set |
| of characters. |
|
|
The name "word" is a Perl extension, and "blank" is a GNU extension | The name "word" is a Perl extension, and "blank" is a GNU extension |
from Perl 5.8. Another Perl extension is negation, which is indicated | from Perl 5.8. Another Perl extension is negation, which is indicated |
by a ^ character after the colon. For example, |
by a ^ character after the colon. For example, |
|
|
[12[:^digit:]] |
[12[:^digit:]] |
|
|
matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the | matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the |
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but |
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but |
these are not supported, and an error is given if they are encountered. |
these are not supported, and an error is given if they are encountered. |
|
|
By default, in UTF modes, characters with values greater than 128 do | By default, characters with values greater than 128 do not match any of |
not match any of the POSIX character classes. However, if the PCRE_UCP | the POSIX character classes. However, if the PCRE_UCP option is passed |
option is passed to pcre_compile(), some of the classes are changed so | to pcre_compile(), some of the classes are changed so that Unicode |
that Unicode character properties are used. This is achieved by replac- | character properties are used. This is achieved by replacing certain |
ing the POSIX classes by other sequences, as follows: | POSIX classes by other sequences, as follows: |
|
|
[:alnum:] becomes \p{Xan} |
[:alnum:] becomes \p{Xan} |
[:alpha:] becomes \p{L} |
[:alpha:] becomes \p{L} |
Line 5736 POSIX CHARACTER CLASSES
|
Line 5870 POSIX CHARACTER CLASSES
|
[:upper:] becomes \p{Lu} |
[:upper:] becomes \p{Lu} |
[:word:] becomes \p{Xwd} |
[:word:] becomes \p{Xwd} |
|
|
Negated versions, such as [:^alpha:] use \P instead of \p. The other | Negated versions, such as [:^alpha:] use \P instead of \p. Three other |
POSIX classes are unchanged, and match only characters with code points | POSIX classes are handled specially in UCP mode: |
less than 128. | |
|
|
| [:graph:] This matches characters that have glyphs that mark the page |
| when printed. In Unicode property terms, it matches all char- |
| acters with the L, M, N, P, S, or Cf properties, except for: |
| |
| U+061C Arabic Letter Mark |
| U+180E Mongolian Vowel Separator |
| U+2066 - U+2069 Various "isolate"s |
| |
| |
| [:print:] This matches the same characters as [:graph:] plus space |
| characters that are not controls, that is, characters with |
| the Zs property. |
| |
| [:punct:] This matches all characters that have the Unicode P (punctua- |
| tion) property, plus those characters whose code points are |
| less than 128 that have the S (Symbol) property. |
| |
| The other POSIX classes are unchanged, and match only characters with |
| code points less than 128. |
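
The UCP-mode definition of [:punct:] given above is easy to state in code. The following Python sketch models it with unicodedata; it is an illustration rather than PCRE's implementation, and ucp_punct is a made-up name.

```python
import unicodedata

def ucp_punct(ch):
    """Model of [:punct:] in UCP mode: any P character, or an S character below 128."""
    major = unicodedata.category(ch)[0]
    return major == "P" or (major == "S" and ord(ch) < 128)
```

Under this model "$" (a symbol below 128) counts as punctuation, while the euro sign U+20AC (a symbol above 127) does not.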
| |
| |
| COMPATIBILITY FEATURE FOR WORD BOUNDARIES |
| |
| In the POSIX.2 compliant library that was included in 4.4BSD Unix, the |
| ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word" |
| and "end of word". PCRE treats these items as follows: |
| |
| [[:<:]] is converted to \b(?=\w) |
| [[:>:]] is converted to \b(?<=\w) |
| |
| Only these exact character sequences are recognized. A sequence such as |
| [a[:<:]b] provokes an error for an unrecognized POSIX class name. This |
| support is not compatible with Perl. It is provided to help migrations |
| from other environments, and is best not used in any new patterns. Note |
| that \b matches at the start and the end of a word (see "Simple asser- |
| tions" above), and in a Perl-style pattern the preceding or following |
| character normally shows which is wanted, without the need for the |
| assertions that are used above in order to give exactly the POSIX be- |
| haviour. |
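
The two conversions can be imitated with a plain textual substitution. The Python sketch below is deliberately naive (it does not recognise escaped or nested occurrences) and uses Python's re module, not PCRE, to run the converted pattern; convert_bsd_word_boundaries is a hypothetical helper.

```python
import re

def convert_bsd_word_boundaries(pattern):
    """Rewrite the exact sequences [[:<:]] and [[:>:]] as stated in the text."""
    return (pattern.replace("[[:<:]]", r"\b(?=\w)")
                   .replace("[[:>:]]", r"\b(?<=\w)"))

converted = convert_bsd_word_boundaries("[[:<:]]cat[[:>:]]")
print(converted)
# "concat" is skipped because no word boundary precedes its "cat".
print(re.findall(converted, "cat concat cat."))
```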


VERTICAL BAR

Vertical bar characters are used to separate alternative patterns. For
example, the pattern

  gilbert|sullivan

matches either "gilbert" or "sullivan". Any number of alternatives may
appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from left
to right, and the first one that succeeds is used. If the alternatives
are within a subpattern (defined below), "succeeds" means matching the
rest of the main pattern as well as the alternative in the subpattern.
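
The left-to-right rule can be seen in a short sketch. It uses Python's re module rather than PCRE (an assumption for the sake of a runnable example; the two behave alike for plain alternation). The "gil|gilbert" patterns are made up for illustration.

```python
import re

# Alternatives are tried from left to right and the first one that
# succeeds is used, even when a later one would match more text.
first = re.match(r"gil|gilbert", "gilbert").group()

# Inside a subpattern, "succeeds" includes matching the rest of the
# pattern: "gil" is tried first, but " and sullivan" cannot follow it,
# so the second alternative is used instead.
chosen = re.match(r"(gil|gilbert) and sullivan",
                  "gilbert and sullivan").group(1)

# An empty alternative matches the empty string.
empty_ok = re.fullmatch(r"cat(aract|erpillar|)", "cat") is not None
```

Here first is "gil", chosen is "gilbert", and empty_ok is True.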


INTERNAL OPTION SETTING

The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options (which are Perl-compatible) can be changed from
within the pattern by a sequence of Perl option letters enclosed
between "(?" and ")". The option letters are

  i  for PCRE_CASELESS
Line 5770 INTERNAL OPTION SETTING
|
Line 5943 INTERNAL OPTION SETTING
|
|
|
For example, (?im) sets caseless, multiline matching. It is also possi-
ble to unset these options by preceding the letter with a hyphen, and a
combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
is also permitted. If a letter appears both before and after the
hyphen, the option is unset.

The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively.

When one of these option changes occurs at top level (that is, not
inside subpattern parentheses), the change applies to the remainder of
the pattern that follows. If the change is placed right at the start of
a pattern, PCRE extracts it into the global options (and it will there-
fore show up in data extracted by the pcre_fullinfo() function).

An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the subpattern that follows it,
so

  (a(?i)b)c

matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
used). By this means, options can be made to have different settings
in different parts of the pattern. Any changes made in one alternative
do carry on into subsequent branches within the same subpattern. For
example,

  (a(?i)b|c)

matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.

Note: There are other PCRE-specific options that can be set by the
application when the compiling or matching functions are called. In
some cases the pattern can contain special leading sequences such as
(*CRLF) to override what the application has set or what has been
defaulted. Details are given in the section entitled "Newline
sequences" above. There are also the (*UTF8), (*UTF16), (*UTF32), and
(*UCP) leading sequences that can be used to set UTF and Unicode prop-
erty modes; they are equivalent to setting the PCRE_UTF8, PCRE_UTF16,
PCRE_UTF32 and the PCRE_UCP options, respectively. The (*UTF) sequence
is a generic version that can be used with any of the libraries. How-
ever, the application can set the PCRE_NEVER_UTF option, which locks
out the use of the (*UTF) sequences.
|
|
|
|
Line 5827 SUBPATTERNS
|
Line 6000 SUBPATTERNS
|
|
|
  cat(aract|erpillar|)

matches "cataract", "caterpillar", or "cat". Without the parentheses,
it would match "cataract", "erpillar" or an empty string.

2. It sets up the subpattern as a capturing subpattern. This means
that, when the whole pattern matches, that portion of the subject
string that matched the subpattern is passed back to the caller via the
ovector argument of the matching function. (This applies only to the
traditional matching functions; the DFA matching functions do not sup-
port capturing.)

Opening parentheses are counted from left to right (starting from 1) to
obtain numbers for the capturing subpatterns. For example, if the
string "the red king" is matched against the pattern

  the ((red|white) (king|queen))
Line 5846 SUBPATTERNS
|
Line 6019 SUBPATTERNS
|
the captured substrings are "red king", "red", and "king", and are num-
bered 1, 2, and 3, respectively.
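
The numbering can be checked with a short sketch. It uses Python's re module as a stand-in for PCRE's ovector (an assumption for illustration; both count opening parentheses from left to right).

```python
import re

m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# groups() returns captures 1, 2, 3 in order of their opening parenthesis.
captured = m.groups()
print(captured)  # ('red king', 'red', 'king')
```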

The fact that plain parentheses fulfil two functions is not always
helpful. There are often times when a grouping subpattern is required
without a capturing requirement. If an opening parenthesis is followed
by a question mark and a colon, the subpattern does not do any captur-
ing, and is not counted when computing the number of any subsequent
capturing subpatterns. For example, if the string "the white queen" is
matched against the pattern

  the ((?:red|white) (king|queen))
Line 5859 SUBPATTERNS
|
Line 6032 SUBPATTERNS
|
the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535.
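
The effect of (?: on the numbering can be sketched as follows, again using Python's re module in place of PCRE (an assumption; the non-capturing syntax is the same in both).

```python
import re

pattern = re.compile(r"the ((?:red|white) (king|queen))")
m = pattern.fullmatch("the white queen")

# The (?:red|white) group is not counted, so only two captures exist.
captured = m.groups()
group_count = pattern.groups
print(captured)     # ('white queen', 'queen')
print(group_count)  # 2
```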

As a convenient shorthand, if any option settings are required at the
start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns

  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)

match exactly the same set of strings. Because alternative branches are
tried from left to right, and options are not reset until the end of
the subpattern is reached, an option setting in one branch does affect
subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".
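
The shorthand form can be tried with Python's re module, used here as a stand-in for PCRE (an assumption). Only the (?i:...) form is shown, because recent Python versions reject an option setting such as (?i) that is not at the very start of the pattern. The second pattern is a made-up example showing that the option ends with the group.

```python
import re

# The option applies to every alternative inside the group.
days = re.compile(r"(?i:saturday|sunday)")
results = {s: bool(days.fullmatch(s))
           for s in ("Saturday", "SUNDAY", "sunday", "monday")}

# Outside the group the option no longer applies.
scoped = re.compile(r"(?i:sat)urday")
inside_only = (bool(scoped.fullmatch("SATurday")),
               bool(scoped.fullmatch("SATURDAY")))
```

Here results is True for the first three strings and False for "monday", and inside_only is (True, False).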


DUPLICATE SUBPATTERN NUMBERS

Perl 5.10 introduced a feature whereby each alternative in a subpattern
uses the same numbers for its capturing parentheses. Such a subpattern
starts with (?| and is itself a non-capturing subpattern. For example,
consider this pattern:

  (?|(Sat)ur|(Sun))day

Because the two alternatives are inside a (?| group, both sets of cap-
turing parentheses are numbered one. Thus, when the pattern matches,
you can look at captured substring number one, whichever alternative
matched. This construct is useful when you want to capture part, but
not all, of one of a number of alternatives. Inside a (?| group, paren-
theses are numbered as usual, but the number is reset at the start of
each branch. The numbers of any capturing parentheses that follow the
subpattern start after the highest number used in any branch. The fol-
lowing example is taken from the Perl documentation. The numbers under-
neath show in which buffer the captured content will be stored.
|
|
Line 5897 DUPLICATE SUBPATTERN NUMBERS
|
Line 6070 DUPLICATE SUBPATTERN NUMBERS
|
  / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
  # 1           2         2  3        2     3     4

A back reference to a numbered subpattern uses the most recent value
that is set for that number by any subpattern. The following pattern
matches "abcabc" or "defdef":

  /(?|(abc)|(def))\1/

In contrast, a subroutine call to a numbered subpattern always refers
to the first one in the pattern with the given number. The following
pattern matches "abcabc" or "defabc":

  /(?|(abc)|(def))(?1)/

If a condition test for a subpattern's having matched refers to a non-
unique number, the test is true if any of the subpatterns of that num-
ber have matched.

An alternative approach to using this "branch reset" feature is to use
duplicate named subpatterns, as described in the next section.


NAMED SUBPATTERNS

Identifying capturing parentheses by number is simple, but it can be
very hard to keep track of the numbers in complicated regular expres-
sions. Furthermore, if an expression is modified, the numbers may
change. To help with this difficulty, PCRE supports the naming of sub-
patterns. This feature was not added to Perl until release 5.10. Python
had the feature earlier, and PCRE introduced it at release 4.0, using
the Python syntax. PCRE now supports both the Perl and the Python syn-
tax. Perl allows identically numbered subpatterns to have different
names, but PCRE does not.

In PCRE, a subpattern can be named in one of three ways: (?<name>...)
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
to capturing parentheses from other parts of the pattern, such as back
references, recursion, and conditions, can be made by name as well as
by number.

Names consist of up to 32 alphanumeric characters and underscores, but
must start with a non-digit. Named capturing parentheses are still
allocated numbers as well as names, exactly as if the names were not
present. The PCRE API provides function calls for extracting the name-
to-number translation table from a compiled pattern. There is also a
convenience function for extracting a captured substring by name.
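
The relationship between names and numbers can be sketched with Python's re module, whose (?P<name>...) syntax is one of the three forms listed above. This is an analogue, not the PCRE API itself: groupindex and group() merely play the roles of the name-to-number table and the extract-by-name convenience function. The pattern and subject are made up.

```python
import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = pattern.fullmatch("2014-06")

# A named group is still numbered exactly as if the name were absent.
by_name = m.group("year")
by_number = m.group(1)

# Name-to-number translation table of the compiled pattern.
name_to_number = dict(pattern.groupindex)
print(name_to_number)  # {'year': 1, 'month': 2}
```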

By default, a name must be unique within a pattern, but it is possible
to relax this constraint by setting the PCRE_DUPNAMES option at compile
time. (Duplicate names are also always permitted for subpatterns with
the same number, set up as described in the previous section.) Dupli-
cate names can be useful for patterns where only one instance of the
named parentheses can match. Suppose you want to match the name of a
weekday, either as a 3-letter abbreviation or as the full name, and in
both cases you want to extract the abbreviation. This pattern (ignoring
the line breaks) does the job:
|
|
Line 5958 NAMED SUBPATTERNS
|
Line 6131 NAMED SUBPATTERNS
|
  (?<DN>Thu)(?:rsday)?|
  (?<DN>Sat)(?:urday)?

There are five capturing substrings, but only one is ever set after a
match. (An alternative way of solving this problem is to use a "branch
reset" subpattern, as described in the previous section.)

The convenience function for extracting the data by name returns the
substring for the first (and in this example, the only) subpattern of
that name that matched. This saves searching to find which numbered
subpattern it was.

If you make a back reference to a non-unique named subpattern from
elsewhere in the pattern, the subpatterns to which the name refers are
checked in the order in which they appear in the overall pattern. The
first one that is set is used for the reference. For example, this pat-
tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":

  (?:(?<n>foo)|(?<n>bar))\k<n>

If you make a subroutine call to a non-unique named subpattern, the one
that corresponds to the first occurrence of the name is used. In the
absence of duplicate numbers (see the previous section) this is the one
with the lowest number.

If you use a named reference in a condition test (see the section about
conditions below), either to check whether a subpattern has matched, or
to check for recursion, all subpatterns with the same name are tested.
If the condition is true for any one of them, the overall condition is
true. This is the same behaviour as testing by number. For further
details of the interfaces for handling named subpatterns, see the
pcreapi documentation.

Warning: You cannot use different names to distinguish between two sub-
patterns with the same number because PCRE uses only the numbers when
matching. For this reason, an error is given at compile time if differ-
ent names are given to subpatterns with the same number. However, you
can always give the same name to subpatterns with the same number, even
when PCRE_DUPNAMES is not set.


REPETITION
Line 6619 CONDITIONAL SUBPATTERNS
|
Line 6802 CONDITIONAL SUBPATTERNS
|
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
used subpattern by name. For compatibility with earlier versions of
PCRE, which had this facility before Perl, the syntax (?(name)...) is
also recognized.

Rewriting the above example to use a named subpattern gives this:

  (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )

If the name used in a condition of this kind is a duplicate, the test
is applied to all subpatterns of the same name, and is true if any one
of them has matched.
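
The named condition can be exercised with a small sketch in Python's re module, used as a stand-in for PCRE (an assumption). Python accepts only the (?(name)...) spelling of the condition and the (?P<name>...) spelling of the group, so the pattern is rewritten accordingly, without the insignificant white space.

```python
import re

# An optional "(" is captured as OPEN; a ")" is required only if OPEN
# took part in the match.
pattern = re.compile(r"(?P<OPEN>\()?[^()]+(?(OPEN)\))")

results = {s: bool(pattern.fullmatch(s))
           for s in ("(abc)", "abc", "(abc", "abc)")}
```

The balanced and the unparenthesized strings are accepted; "(abc" and "abc)" are rejected.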

Checking for pattern recursion

If the condition is the string (R), and there is no subpattern with the
name R, the condition is true if a recursive call to the whole pattern
or any subpattern has been made. If digits or a name preceded by amper-
sand follow the letter R, for example:
|
|
Line 6645 CONDITIONAL SUBPATTERNS
|
Line 6823 CONDITIONAL SUBPATTERNS
|
|
|
the condition is true if the most recent recursion is into a subpattern
whose number or name is given. This condition does not check the entire
recursion stack. If the name used in a condition of this kind is a
duplicate, the test is applied to all subpatterns of the same name, and
is true if any one of them is the most recent recursion.

At "top level", all these recursion test conditions are false. The
syntax for recursive patterns is described below.

Defining subpatterns for use by reference only

If the condition is the string (DEFINE), and there is no subpattern
with the name DEFINE, the condition is always false. In this case,
there may be only one alternative in the subpattern. It is always
skipped if control reaches this point in the pattern; the idea of
DEFINE is that it can be used to define subroutines that can be refer-
enced from elsewhere. (The use of subroutines is described below.) For
example, a pattern to match an IPv4 address such as "192.168.23.245"
could be written like this (ignore white space and line breaks):

  (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
  \b (?&byte) (\.(?&byte)){3} \b

The first part of the pattern is a DEFINE group inside which another
group named "byte" is defined. This matches an individual component of
an IPv4 address (a number less than 256). When matching takes place,
this part of the pattern is skipped because DEFINE acts like a false
condition. The rest of the pattern uses references to the named group
to match the four dot-separated components of an IPv4 address, insist-
ing on a word boundary at each end.
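
Python's re module has neither (DEFINE) nor subroutine calls, so the sketch below swaps in a different technique: the "byte" fragment is kept in a string and interpolated four times. It only approximates the idea of defining a piece once and reusing it. The names BYTE and ipv4 are hypothetical.

```python
import re

# One component of an IPv4 address, using the same alternatives as the
# "byte" group in the pattern above.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four dot-separated components with a word boundary at each end.
ipv4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

valid = ipv4.fullmatch("192.168.23.245") is not None
too_big = ipv4.fullmatch("256.1.1.1") is not None
found = ipv4.search("host 10.0.0.1 up").group()
```

Here valid is True, too_big is False, and found is "10.0.0.1". Unlike a true subroutine call, the interpolation repeats the fragment's text in the compiled pattern.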

Assertion conditions

If the condition is not in any of the above formats, it must be an
assertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:

  (?(?=[^a-z]*[a-z])
  \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

The condition is a positive lookahead assertion that matches an
optional sequence of non-letters followed by a letter. In other words,
it tests for the presence of at least one letter in the subject. If a
letter is found, the subject is matched against the first alternative;
otherwise it is matched against the second. This pattern matches
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
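
Python's re module accepts only a group number or name as a condition, so an assertion condition has to be emulated there. The sketch below does this with a different technique: the first branch is guarded by the positive lookahead and the second by its negation. That gives the same yes/no selection for this pattern, but it is not how PCRE evaluates the condition. HAS_LETTER is a hypothetical helper name.

```python
import re

HAS_LETTER = r"[^a-z]*[a-z]"

pattern = re.compile(
    rf"(?:(?={HAS_LETTER})\d{{2}}-[a-z]{{3}}-\d{{2}}"   # a letter lies ahead
    rf"|(?!{HAS_LETTER})\d{{2}}-\d{{2}}-\d{{2}})"       # no letter ahead
)

results = {s: bool(pattern.fullmatch(s))
           for s in ("12-abc-34", "12-34-56", "12-ab-34", "12-34-5a")}
```

The first two strings are accepted, in the forms dd-aaa-dd and dd-dd-dd; the last two are rejected.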
|
|
|
|
Line 6698 COMMENTS
|
Line 6876 COMMENTS
|
There are two ways of including comments in patterns that are processed |
There are two ways of including comments in patterns that are processed |
by PCRE. In both cases, the start of the comment must not be in a char- |
by PCRE. In both cases, the start of the comment must not be in a char- |
acter class, nor in the middle of any other sequence of related charac- |
acter class, nor in the middle of any other sequence of related charac- |
ters such as (?: or a subpattern name or number. The characters that | ters such as (?: or a subpattern name or number. The characters that |
make up a comment play no part in the pattern matching. |
make up a comment play no part in the pattern matching. |
|
|
The sequence (?# marks the start of a comment that continues up to the | The sequence (?# marks the start of a comment that continues up to the |
next closing parenthesis. Nested parentheses are not permitted. If the | next closing parenthesis. Nested parentheses are not permitted. If the |
PCRE_EXTENDED option is set, an unescaped # character also introduces a |
PCRE_EXTENDED option is set, an unescaped # character also introduces a |
comment, which in this case continues to immediately after the next | comment, which in this case continues to immediately after the next |
newline character or character sequence in the pattern. Which charac- | newline character or character sequence in the pattern. Which charac- |
ters are interpreted as newlines is controlled by the options passed to |
ters are interpreted as newlines is controlled by the options passed to |
a compiling function or by a special sequence at the start of the pat- | a compiling function or by a special sequence at the start of the pat- |
tern, as described in the section entitled "Newline conventions" above. |
tern, as described in the section entitled "Newline conventions" above. |
Note that the end of this type of comment is a literal newline sequence |
Note that the end of this type of comment is a literal newline sequence |
in the pattern; escape sequences that happen to represent a newline do | in the pattern; escape sequences that happen to represent a newline do |
not count. For example, consider this pattern when PCRE_EXTENDED is | not count. For example, consider this pattern when PCRE_EXTENDED is |
set, and the default newline convention is in force: |
set, and the default newline convention is in force: |
|
|
abc #comment \n still comment |
abc #comment \n still comment |
|
|
On encountering the # character, pcre_compile() skips along, looking | On encountering the # character, pcre_compile() skips along, looking |
for a newline in the pattern. The sequence \n is still literal at this | for a newline in the pattern. The sequence \n is still literal at this |
stage, so it does not terminate the comment. Only an actual character | stage, so it does not terminate the comment. Only an actual character |
with the code value 0x0a (the default newline) does so. |
with the code value 0x0a (the default newline) does so. |
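
The same rule can be observed outside PCRE. The sketch below uses Python's standard re module, which is a different engine but treats # comments in its verbose mode in the same way; the variable names are illustrative only. In the first pattern \n is the two-character escape and does not end the comment; in the second it is a real 0x0a character and does.

```python
import re

# Raw string: "\n" is a backslash followed by "n", so the comment
# runs to the end of the pattern and only "abc" is compiled.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is an actual 0x0a character, so the comment
# stops there and "still" becomes part of the pattern.
literal = re.compile("abc #comment \n still", re.VERBOSE)

print(bool(escaped.fullmatch("abc")))
print(bool(literal.fullmatch("abc")))
print(bool(literal.fullmatch("abcstill")))
```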
|
|
|
|
RECURSIVE PATTERNS |
RECURSIVE PATTERNS |
|
|
Consider the problem of matching a string in parentheses, allowing for | Consider the problem of matching a string in parentheses, allowing for |
unlimited nested parentheses. Without the use of recursion, the best | unlimited nested parentheses. Without the use of recursion, the best |
that can be done is to use a pattern that matches up to some fixed | that can be done is to use a pattern that matches up to some fixed |
depth of nesting. It is not possible to handle an arbitrary nesting | depth of nesting. It is not possible to handle an arbitrary nesting |
depth. |
depth. |
|
|
For some time, Perl has provided a facility that allows regular expres- |
For some time, Perl has provided a facility that allows regular expres- |
sions to recurse (amongst other things). It does this by interpolating | sions to recurse (amongst other things). It does this by interpolating |
Perl code in the expression at run time, and the code can refer to the | Perl code in the expression at run time, and the code can refer to the |
expression itself. A Perl pattern using code interpolation to solve the |
expression itself. A Perl pattern using code interpolation to solve the |
parentheses problem can be created like this: |
parentheses problem can be created like this: |
|
|
Line 6742 RECURSIVE PATTERNS
|
Line 6920 RECURSIVE PATTERNS
|
refers recursively to the pattern in which it appears. |
refers recursively to the pattern in which it appears. |
|
|
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
it supports special syntax for recursion of the entire pattern, and | it supports special syntax for recursion of the entire pattern, and |
also for individual subpattern recursion. After its introduction in | also for individual subpattern recursion. After its introduction in |
PCRE and Python, this kind of recursion was subsequently introduced | PCRE and Python, this kind of recursion was subsequently introduced |
into Perl at release 5.10. |
into Perl at release 5.10. |
|
|
A special item that consists of (? followed by a number greater than | A special item that consists of (? followed by a number greater than |
zero and a closing parenthesis is a recursive subroutine call of the | zero and a closing parenthesis is a recursive subroutine call of the |
subpattern of the given number, provided that it occurs inside that | subpattern of the given number, provided that it occurs inside that |
subpattern. (If not, it is a non-recursive subroutine call, which is | subpattern. (If not, it is a non-recursive subroutine call, which is |
described in the next section.) The special item (?R) or (?0) is a | described in the next section.) The special item (?R) or (?0) is a |
recursive call of the entire regular expression. |
recursive call of the entire regular expression. |
|
|
This PCRE pattern solves the nested parentheses problem (assume the | This PCRE pattern solves the nested parentheses problem (assume the |
PCRE_EXTENDED option is set so that white space is ignored): |
PCRE_EXTENDED option is set so that white space is ignored): |
|
|
\( ( [^()]++ | (?R) )* \) |
\( ( [^()]++ | (?R) )* \) |
|
|
First it matches an opening parenthesis. Then it matches any number of | First it matches an opening parenthesis. Then it matches any number of |
substrings which can either be a sequence of non-parentheses, or a | substrings which can either be a sequence of non-parentheses, or a |
recursive match of the pattern itself (that is, a correctly parenthe- | recursive match of the pattern itself (that is, a correctly parenthe- |
sized substring). Finally there is a closing parenthesis. Note the use |
sized substring). Finally there is a closing parenthesis. Note the use |
of a possessive quantifier to avoid backtracking into sequences of non- |
of a possessive quantifier to avoid backtracking into sequences of non- |
parentheses. |
parentheses. |
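
To make the recursion concrete, here is a hand-written Python function that follows the same steps as the pattern above. It is only a sketch of the idea, not how PCRE executes the pattern, and the name match_parens is hypothetical. Python is used because its standard re module has no (?R) item.

```python
def match_parens(s, i=0):
    """Return the offset just past a balanced group starting at s[i], or None."""
    if i >= len(s) or s[i] != "(":
        return None                      # \(  must be present
    i += 1
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            i = match_parens(s, i)       # (?R): a nested balanced group
            if i is None:
                return None
        else:
            while i < len(s) and s[i] not in "()":
                i += 1                   # [^()]++ : consumed once, never given back
    return i + 1 if i < len(s) else None  # \)  must close the group


print(match_parens("(ab(cd)ef)"))
print(match_parens("(a(b)"))
```

Because the run of non-parentheses is consumed once and never revisited, the function fails in a single pass on unbalanced input, which is the effect the possessive quantifier has in the pattern.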
|
|
If this were part of a larger pattern, you would not want to recurse | If this were part of a larger pattern, you would not want to recurse |
the entire pattern, so instead you could use this: |
the entire pattern, so instead you could use this: |
|
|
( \( ( [^()]++ | (?1) )* \) ) |
( \( ( [^()]++ | (?1) )* \) ) |
|
|
We have put the pattern into parentheses, and caused the recursion to | We have put the pattern into parentheses, and caused the recursion to |
refer to them instead of the whole pattern. |
refer to them instead of the whole pattern. |
|
|
In a larger pattern, keeping track of parenthesis numbers can be | In a larger pattern, keeping track of parenthesis numbers can be |
tricky. This is made easier by the use of relative references. Instead | tricky. This is made easier by the use of relative references. Instead |
of (?1) in the pattern above you can write (?-2) to refer to the second |
of (?1) in the pattern above you can write (?-2) to refer to the second |
most recently opened parentheses preceding the recursion. In other | most recently opened parentheses preceding the recursion. In other |
words, a negative number counts capturing parentheses leftwards from | words, a negative number counts capturing parentheses leftwards from |
the point at which it is encountered. |
the point at which it is encountered. |
|
|
It is also possible to refer to subsequently opened parentheses, by | It is also possible to refer to subsequently opened parentheses, by |
writing references such as (?+2). However, these cannot be recursive | writing references such as (?+2). However, these cannot be recursive |
because the reference is not inside the parentheses that are refer- | because the reference is not inside the parentheses that are refer- |
enced. They are always non-recursive subroutine calls, as described in | enced. They are always non-recursive subroutine calls, as described in |
the next section. |
the next section. |
|
|
An alternative approach is to use named parentheses instead. The Perl | An alternative approach is to use named parentheses instead. The Perl |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also | syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
supported. We could rewrite the above example as follows: |
supported. We could rewrite the above example as follows: |
|
|
(?<pn> \( ( [^()]++ | (?&pn) )* \) ) |
(?<pn> \( ( [^()]++ | (?&pn) )* \) ) |
|
|
If there is more than one subpattern with the same name, the earliest | If there is more than one subpattern with the same name, the earliest |
one is used. |
one is used. |
|
|
This particular example pattern that we have been looking at contains | This particular example pattern that we have been looking at contains |
nested unlimited repeats, and so the use of a possessive quantifier for |
nested unlimited repeats, and so the use of a possessive quantifier for |
matching strings of non-parentheses is important when applying the pat- |
matching strings of non-parentheses is important when applying the pat- |
tern to strings that do not match. For example, when this pattern is | tern to strings that do not match. For example, when this pattern is |
applied to |
applied to |
|
|
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
|
|
it yields "no match" quickly. However, if a possessive quantifier is | it yields "no match" quickly. However, if a possessive quantifier is |
not used, the match runs for a very long time indeed because there are | not used, the match runs for a very long time indeed because there are |
so many different ways the + and * repeats can carve up the subject, | so many different ways the + and * repeats can carve up the subject, |
and all have to be tested before failure can be reported. |
and all have to be tested before failure can be reported. |
|
|
At the end of a match, the values of capturing parentheses are those | At the end of a match, the values of capturing parentheses are those |
from the outermost level. If you want to obtain intermediate values, a | from the outermost level. If you want to obtain intermediate values, a |
callout function can be used (see below and the pcrecallout documenta- | callout function can be used (see below and the pcrecallout documenta- |
tion). If the pattern above is matched against |
tion). If the pattern above is matched against |
|
|
(ab(cd)ef) |
(ab(cd)ef) |
|
|
the value for the inner capturing parentheses (numbered 2) is "ef", | the value for the inner capturing parentheses (numbered 2) is "ef", |
which is the last value taken on at the top level. If a capturing sub- | which is the last value taken on at the top level. If a capturing sub- |
pattern is not matched at the top level, its final captured value is | pattern is not matched at the top level, its final captured value is |
unset, even if it was (temporarily) set at a deeper level during the | unset, even if it was (temporarily) set at a deeper level during the |
matching process. |
matching process. |
|
|
If there are more than 15 capturing parentheses in a pattern, PCRE has | If there are more than 15 capturing parentheses in a pattern, PCRE has |
to obtain extra memory to store data during a recursion, which it does | to obtain extra memory to store data during a recursion, which it does |
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory |
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory |
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error. |
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error. |
|
|
Do not confuse the (?R) item with the condition (R), which tests for | Do not confuse the (?R) item with the condition (R), which tests for |
recursion. Consider this pattern, which matches text in angle brack- | recursion. Consider this pattern, which matches text in angle brack- |
ets, allowing for arbitrary nesting. Only digits are allowed in nested | ets, allowing for arbitrary nesting. Only digits are allowed in nested |
brackets (that is, when recursing), whereas any characters are permit- | brackets (that is, when recursing), whereas any characters are permit- |
ted at the outer level. |
ted at the outer level. |
|
|
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
|
|
In this pattern, (?(R) is the start of a conditional subpattern, with | In this pattern, (?(R) is the start of a conditional subpattern, with |
two different alternatives for the recursive and non-recursive cases. | two different alternatives for the recursive and non-recursive cases. |
The (?R) item is the actual recursive call. |
The (?R) item is the actual recursive call. |
|
|
Differences in recursion processing between PCRE and Perl |
Differences in recursion processing between PCRE and Perl |
|
|
Recursion processing in PCRE differs from Perl in two important ways. | Recursion processing in PCRE differs from Perl in two important ways. |
In PCRE (like Python, but unlike Perl), a recursive subpattern call is | In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
always treated as an atomic group. That is, once it has matched some of |
always treated as an atomic group. That is, once it has matched some of |
the subject string, it is never re-entered, even if it contains untried |
the subject string, it is never re-entered, even if it contains untried |
alternatives and there is a subsequent matching failure. This can be | alternatives and there is a subsequent matching failure. This can be |
illustrated by the following pattern, which purports to match a palin- | illustrated by the following pattern, which purports to match a palin- |
dromic string that contains an odd number of characters (for example, | dromic string that contains an odd number of characters (for example, |
"a", "aba", "abcba", "abcdcba"): |
"a", "aba", "abcba", "abcdcba"): |
|
|
^(.|(.)(?1)\2)$ |
^(.|(.)(?1)\2)$ |
|
|
The idea is that it either matches a single character, or two identical |
The idea is that it either matches a single character, or two identical |
characters surrounding a sub-palindrome. In Perl, this pattern works; | characters surrounding a sub-palindrome. In Perl, this pattern works; |
in PCRE it does not if the subject string is longer than three characters. | in PCRE it does not if the subject string is longer than three characters. |
Consider the subject string "abcba": |
Consider the subject string "abcba": |
|
|
At the top level, the first character is matched, but as it is not at | At the top level, the first character is matched, but as it is not at |
the end of the string, the first alternative fails; the second alterna- |
the end of the string, the first alternative fails; the second alterna- |
tive is taken and the recursion kicks in. The recursive call to subpat- |
tive is taken and the recursion kicks in. The recursive call to subpat- |
tern 1 successfully matches the next character ("b"). (Note that the | tern 1 successfully matches the next character ("b"). (Note that the |
beginning and end of line tests are not part of the recursion). |
beginning and end of line tests are not part of the recursion). |
|
|
Back at the top level, the next character ("c") is compared with what | Back at the top level, the next character ("c") is compared with what |
subpattern 2 matched, which was "a". This fails. Because the recursion | subpattern 2 matched, which was "a". This fails. Because the recursion |
is treated as an atomic group, there are now no backtracking points, | is treated as an atomic group, there are now no backtracking points, |
and so the entire match fails. (Perl is able, at this point, to re- | and so the entire match fails. (Perl is able, at this point, to re- |
enter the recursion and try the second alternative.) However, if the | enter the recursion and try the second alternative.) However, if the |
pattern is written with the alternatives in the other order, things are |
pattern is written with the alternatives in the other order, things are |
different: |
different: |
|
|
^((.)(?1)\2|.)$ |
^((.)(?1)\2|.)$ |
|
|
This time, the recursing alternative is tried first, and continues to | This time, the recursing alternative is tried first, and continues to |
recurse until it runs out of characters, at which point the recursion | recurse until it runs out of characters, at which point the recursion |
fails. But this time we do have another alternative to try at the | fails. But this time we do have another alternative to try at the |
higher level. That is the big difference: in the previous case the | higher level. That is the big difference: in the previous case the |
remaining alternative is at a deeper recursion level, which PCRE cannot |
remaining alternative is at a deeper recursion level, which PCRE cannot |
use. |
use. |
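
The effect of treating a recursive call as an atomic group can be modelled with Python generators. This is a toy model of the behaviour described above, not PCRE's implementation; group1, matches, and the atomic and recurse_first flags are hypothetical names. Each generator yields, in priority order, the offsets at which group 1 could end. An atomic call keeps only the first such offset, whereas a backtracking call (as in Perl) can be asked for the others.

```python
from itertools import islice


def group1(s, i, atomic, recurse_first):
    """Yield each offset at which group 1 can end when started at s[i]."""

    def single():                        # alternative:  .
        if i < len(s):
            yield i + 1

    def wrapped():                       # alternative:  (.)(?1)\2
        if i < len(s):
            inner = group1(s, i + 1, atomic, recurse_first)
            if atomic:
                inner = islice(inner, 1)  # the call is never re-entered
            for j in inner:
                if j < len(s) and s[j] == s[i]:   # \2 must equal what (.) took
                    yield j + 1

    order = (wrapped, single) if recurse_first else (single, wrapped)
    for alternative in order:
        yield from alternative()


def matches(s, atomic, recurse_first=False):
    # The top-level group is not a subroutine call, so it backtracks normally;
    # the $ anchor is modelled by requiring the end offset to be len(s).
    return any(j == len(s) for j in group1(s, 0, atomic, recurse_first))


print(matches("abcba", atomic=True))                      # ^(.|(.)(?1)\2)$
print(matches("abcba", atomic=False))                     # same, with backtracking
print(matches("abcba", atomic=True, recurse_first=True))  # ^((.)(?1)\2|.)$
```

With recurse_first left false the model follows the first ordering of the alternatives; setting it true follows the second ordering, where the remaining alternative is at the higher level and is therefore still available.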
|
|
To change the pattern so that it matches all palindromic strings, not | To change the pattern so that it matches all palindromic strings, not |
just those with an odd number of characters, it is tempting to change | just those with an odd number of characters, it is tempting to change |
the pattern to this: |
the pattern to this: |
|
|
^((.)(?1)\2|.?)$ |
^((.)(?1)\2|.?)$ |
|
|
Again, this works in Perl, but not in PCRE, and for the same reason. | Again, this works in Perl, but not in PCRE, and for the same reason. |
When a deeper recursion has matched a single character, it cannot be | When a deeper recursion has matched a single character, it cannot be |
entered again in order to match an empty string. The solution is to | entered again in order to match an empty string. The solution is to |
separate the two cases, and write out the odd and even cases as alter- | separate the two cases, and write out the odd and even cases as alter- |
natives at the higher level: |
natives at the higher level: |
|
|
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) |
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) |
|
|
If you want to match typical palindromic phrases, the pattern has to | If you want to match typical palindromic phrases, the pattern has to |
ignore all non-word characters, which can be done like this: |
ignore all non-word characters, which can be done like this: |
|
|
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ |
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ |
|
|
If run with the PCRE_CASELESS option, this pattern matches phrases such |
If run with the PCRE_CASELESS option, this pattern matches phrases such |
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and |
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and |
Perl. Note the use of the possessive quantifier *+ to avoid backtrack- | Perl. Note the use of the possessive quantifier *+ to avoid backtrack- |
ing into sequences of non-word characters. Without this, PCRE takes a | ing into sequences of non-word characters. Without this, PCRE takes a |
great deal longer (ten times or more) to match typical phrases, and | great deal longer (ten times or more) to match typical phrases, and |
Perl takes so long that you think it has gone into a loop. |
Perl takes so long that you think it has gone into a loop. |
|
|
WARNING: The palindrome-matching patterns above work only if the sub- | WARNING: The palindrome-matching patterns above work only if the sub- |
ject string does not start with a palindrome that is shorter than the | ject string does not start with a palindrome that is shorter than the |
entire string. For example, although "abcba" is correctly matched, if | entire string. For example, although "abcba" is correctly matched, if |
the subject is "ababa", PCRE finds the palindrome "aba" at the start, | the subject is "ababa", PCRE finds the palindrome "aba" at the start, |
then fails at top level because the end of the string does not follow. | then fails at top level because the end of the string does not follow. |
Once again, it cannot jump back into the recursion to try other alter- | Once again, it cannot jump back into the recursion to try other alter- |
natives, so the entire match fails. |
natives, so the entire match fails. |
|
|
The second way in which PCRE and Perl differ in their recursion pro- | The second way in which PCRE and Perl differ in their recursion pro- |
cessing is in the handling of captured values. In Perl, when a subpat- | cessing is in the handling of captured values. In Perl, when a subpat- |
tern is called recursively or as a subpattern (see the next section), | tern is called recursively or as a subpattern (see the next section), |
it has no access to any values that were captured outside the recur- | it has no access to any values that were captured outside the recur- |
sion, whereas in PCRE these values can be referenced. Consider this | sion, whereas in PCRE these values can be referenced. Consider this |
pattern: |
pattern: |
|
|
^(.)(\1|a(?2)) |
^(.)(\1|a(?2)) |
|
|
In PCRE, this pattern matches "bab". The first capturing parentheses | In PCRE, this pattern matches "bab". The first capturing parentheses |
match "b", then in the second group, when the back reference \1 fails | match "b", then in the second group, when the back reference \1 fails |
to match "b", the second alternative matches "a" and then recurses. In | to match "b", the second alternative matches "a" and then recurses. In |
the recursion, \1 does now match "b" and so the whole match succeeds. | the recursion, \1 does now match "b" and so the whole match succeeds. |
In Perl, the pattern fails to match because inside the recursive call | In Perl, the pattern fails to match because inside the recursive call |
\1 cannot access the externally set value. |
\1 cannot access the externally set value. |
|
|
|
|
SUBPATTERNS AS SUBROUTINES |
SUBPATTERNS AS SUBROUTINES |
|
|
If the syntax for a recursive subpattern call (either by number or by | If the syntax for a recursive subpattern call (either by number or by |
name) is used outside the parentheses to which it refers, it operates | name) is used outside the parentheses to which it refers, it operates |
like a subroutine in a programming language. The called subpattern may | like a subroutine in a programming language. The called subpattern may |
be defined before or after the reference. A numbered reference can be | be defined before or after the reference. A numbered reference can be |
absolute or relative, as in these examples: |
absolute or relative, as in these examples: |
|
|
(...(absolute)...)...(?2)... |
(...(absolute)...)...(?2)... |
Line 6947 SUBPATTERNS AS SUBROUTINES
|
Line 7125 SUBPATTERNS AS SUBROUTINES
|
|
|
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
|
|
matches "sense and sensibility" and "response and responsibility", but | matches "sense and sensibility" and "response and responsibility", but |
not "sense and responsibility". If instead the pattern |
not "sense and responsibility". If instead the pattern |
|
|
(sens|respons)e and (?1)ibility |
(sens|respons)e and (?1)ibility |
|
|
is used, it does match "sense and responsibility" as well as the other | is used, it does match "sense and responsibility" as well as the other |
two strings. Another example is given in the discussion of DEFINE | two strings. Another example is given in the discussion of DEFINE |
above. |
above. |
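
The difference between the two patterns can be tried out with Python's standard re module, under one stated assumption: that module supports the back reference \1 but has no (?1) item, so the subroutine call is approximated here by repeating the text of the subpattern. The names GROUP, backref and subroutine_like are illustrative only.

```python
import re

GROUP = r"sens|respons"

# Back reference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(rf"({GROUP})e and \1ibility")

# Stand-in for (?1): the subpattern itself is applied again, so either
# alternative may match the second time.
subroutine_like = re.compile(rf"({GROUP})e and (?:{GROUP})ibility")

for subject in ("sense and sensibility", "sense and responsibility"):
    print(subject,
          bool(backref.fullmatch(subject)),
          bool(subroutine_like.fullmatch(subject)))
```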
|
|
All subroutine calls, whether recursive or not, are always treated as | All subroutine calls, whether recursive or not, are always treated as |
atomic groups. That is, once a subroutine has matched some of the sub- | atomic groups. That is, once a subroutine has matched some of the sub- |
ject string, it is never re-entered, even if it contains untried alter- |
ject string, it is never re-entered, even if it contains untried alter- |
natives and there is a subsequent matching failure. Any capturing | natives and there is a subsequent matching failure. Any capturing |
parentheses that are set during the subroutine call revert to their | parentheses that are set during the subroutine call revert to their |
previous values afterwards. |
previous values afterwards. |
|
|
Processing options such as case-independence are fixed when a subpat- | Processing options such as case-independence are fixed when a subpat- |
tern is defined, so if it is used as a subroutine, such options cannot | tern is defined, so if it is used as a subroutine, such options cannot |
be changed for different calls. For example, consider this pattern: |
be changed for different calls. For example, consider this pattern: |
|
|
(abc)(?i:(?-1)) |
(abc)(?i:(?-1)) |
|
|
It matches "abcabc". It does not match "abcABC" because the change of | It matches "abcabc". It does not match "abcABC" because the change of |
processing option does not affect the called subpattern. |
processing option does not affect the called subpattern. |
|
|
|
|
ONIGURUMA SUBROUTINE SYNTAX |
ONIGURUMA SUBROUTINE SYNTAX |
|
|
For compatibility with Oniguruma, the non-Perl syntax \g followed by a | For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
name or a number enclosed either in angle brackets or single quotes, is |
name or a number enclosed either in angle brackets or single quotes, is |
an alternative syntax for referencing a subpattern as a subroutine, | an alternative syntax for referencing a subpattern as a subroutine, |
possibly recursively. Here are two of the examples used above, rewrit- | possibly recursively. Here are two of the examples used above, rewrit- |
ten using this syntax: |
ten using this syntax: |
|
|
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
(sens|respons)e and \g'1'ibility |
(sens|respons)e and \g'1'ibility |
|
|
PCRE supports an extension to Oniguruma: if a number is preceded by a | PCRE supports an extension to Oniguruma: if a number is preceded by a |
plus or a minus sign it is taken as a relative reference. For example: |
plus or a minus sign it is taken as a relative reference. For example: |
|
|
(abc)(?i:\g<-1>) |
(abc)(?i:\g<-1>) |
|
|
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not | Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
synonymous. The former is a back reference; the latter is a subroutine | synonymous. The former is a back reference; the latter is a subroutine |
call. |
call. |
|
|
|
|
CALLOUTS |
CALLOUTS |
|
|
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
Perl code to be obeyed in the middle of matching a regular expression. | Perl code to be obeyed in the middle of matching a regular expression. |
This makes it possible, amongst other things, to extract different sub- |
This makes it possible, amongst other things, to extract different sub- |
strings that match the same pair of parentheses when there is a repeti- |
strings that match the same pair of parentheses when there is a repeti- |
tion. |
tion. |
|
|
PCRE provides a similar feature, but of course it cannot obey arbitrary |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
Perl code. The feature is called "callout". The caller of PCRE provides |
Perl code. The feature is called "callout". The caller of PCRE provides |
an external function by putting its entry point in the global variable | an external function by putting its entry point in the global variable |
pcre_callout (8-bit library) or pcre[16|32]_callout (16-bit or 32-bit | pcre_callout (8-bit library) or pcre[16|32]_callout (16-bit or 32-bit |
library). By default, this variable contains NULL, which disables all | library). By default, this variable contains NULL, which disables all |
calling out. |
calling out. |
|
|
Within a regular expression, (?C) indicates the points at which the | Within a regular expression, (?C) indicates the points at which the |
external function is to be called. If you want to identify different | external function is to be called. If you want to identify different |
callout points, you can put a number less than 256 after the letter C. | callout points, you can put a number less than 256 after the letter C. |
The default value is zero. For example, this pattern has two callout | The default value is zero. For example, this pattern has two callout |
points: |
points: |
|
|
(?C1)abc(?C2)def |
(?C1)abc(?C2)def |
|
|
If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, call- | If the PCRE_AUTO_CALLOUT flag is passed to a compiling function, call- |
outs are automatically installed before each item in the pattern. They | outs are automatically installed before each item in the pattern. They |
are all numbered 255. If there is a conditional group in the pattern | are all numbered 255. If there is a conditional group in the pattern |
whose condition is an assertion, an additional callout is inserted just |
whose condition is an assertion, an additional callout is inserted just |
before the condition. An explicit callout may also be set at this posi- |
before the condition. An explicit callout may also be set at this posi- |
tion, as in this example: |
tion, as in this example: |
Line 7029 CALLOUTS
|
Line 7207 CALLOUTS
|
Note that this applies only to assertion conditions, not to other types |
Note that this applies only to assertion conditions, not to other types |
of condition. |
of condition. |
|
|
During matching, when PCRE reaches a callout point, the external func- | During matching, when PCRE reaches a callout point, the external func- |
tion is called. It is provided with the number of the callout, the | tion is called. It is provided with the number of the callout, the |
position in the pattern, and, optionally, one item of data originally | position in the pattern, and, optionally, one item of data originally |
supplied by the caller of the matching function. The callout function | supplied by the caller of the matching function. The callout function |
may cause matching to proceed, to backtrack, or to fail altogether. A | may cause matching to proceed, to backtrack, or to fail altogether. |
complete description of the interface to the callout function is given | |
in the pcrecallout documentation. | |
|
|
|
By default, PCRE implements a number of optimizations at compile time |
|
and matching time, and one side-effect is that sometimes callouts are |
|
skipped. If you need all possible callouts to happen, you need to set |
|
options that disable the relevant optimizations. More details, and a |
|
complete description of the interface to the callout function, are |
|
given in the pcrecallout documentation. |
|
|
|
|
BACKTRACKING CONTROL |
BACKTRACKING CONTROL |
|
|
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
Line 7364 BACKTRACKING CONTROL
|
Line 7547 BACKTRACKING CONTROL
|
...(*COMMIT)(*PRUNE)... |
...(*COMMIT)(*PRUNE)... |
|
|
If there is a matching failure to the right, backtracking onto (*PRUNE) |
If there is a matching failure to the right, backtracking onto (*PRUNE) |
cases it to be triggered, and its action is taken. There can never be a | causes it to be triggered, and its action is taken. There can never be |
backtrack onto (*COMMIT). | a backtrack onto (*COMMIT). |
|
|
Backtracking verbs in repeated groups |
Backtracking verbs in repeated groups |
|
|
Line 7435 AUTHOR
|
Line 7618 AUTHOR
|
|
|
REVISION |
REVISION |
|
|
Last updated: 26 April 2013 | Last updated: 03 December 2013 |
Copyright (c) 1997-2013 University of Cambridge. |
Copyright (c) 1997-2013 University of Cambridge. |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
|
|
Line 7469 CHARACTERS
|
Line 7652 CHARACTERS
|
\n newline (hex 0A) |
\n newline (hex 0A) |
\r carriage return (hex 0D) |
\r carriage return (hex 0D) |
\t tab (hex 09) |
\t tab (hex 09) |
|
\0dd character with octal code 0dd |
\ddd character with octal code ddd, or backreference |
\ddd character with octal code ddd, or backreference |
|
\o{ddd..} character with octal code ddd.. |
\xhh character with hex code hh |
\xhh character with hex code hh |
\x{hhh..} character with hex code hhh.. |
\x{hhh..} character with hex code hhh.. |
|
|
|
Note that \0dd is always an octal code, and that \8 and \9 are the lit- |
|
eral characters "8" and "9". |
|
|
|
|
CHARACTER TYPES |
CHARACTER TYPES |
|
|
. any character except newline; |
. any character except newline; |
Line 7495 CHARACTER TYPES
|
Line 7683 CHARACTER TYPES
|
\W a "non-word" character |
\W a "non-word" character |
\X a Unicode extended grapheme cluster |
\X a Unicode extended grapheme cluster |
|
|
In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII | By default, \d, \s, and \w match only ASCII characters, even in UTF-8 |
characters, even in a UTF mode. However, this can be changed by setting | mode or in the 16-bit and 32-bit libraries. However, if locale-spe- |
the PCRE_UCP option. | cific matching is happening, \s and \w may also match characters with |
| code points in the range 128-255. If the PCRE_UCP option is set, the |
| behaviour of these escape sequences is changed to use Unicode proper- |
| ties and they match many more characters. |
|
|
|
|
GENERAL CATEGORY PROPERTIES FOR \p and \P |
GENERAL CATEGORY PROPERTIES FOR \p and \P |
Line 7552 PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P
|
Line 7743 PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P
|
|
|
Xan Alphanumeric: union of properties L and N |
Xan Alphanumeric: union of properties L and N |
Xps POSIX space: property Z or tab, NL, VT, FF, CR |
Xps POSIX space: property Z or tab, NL, VT, FF, CR |
Xsp Perl space: property Z or tab, NL, FF, CR | Xsp Perl space: property Z or tab, NL, VT, FF, CR |
Xuc Univerally-named character: one that can be |
Xuc Univerally-named character: one that can be |
represented by a Universal Character Name |
represented by a Universal Character Name |
Xwd Perl word: property Xan or underscore |
Xwd Perl word: property Xan or underscore |
|
|
|
Perl and POSIX space are now the same. Perl added VT to its space char- |
|
acter set at release 5.18 and PCRE changed at release 8.34. |
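
The ASCII part of the space set described above can be sketched in C. This is an illustration only: is_ascii_perl_space is a hypothetical helper, not a PCRE function, and it ignores the non-ASCII characters that carry property Z.

```c
#include <stdbool.h>

/* Illustrative only, not part of PCRE: the ASCII members of the Xps/Xsp
   set as listed above (tab, NL, VT, FF, CR) plus the ASCII space, which
   has property Z (Zs). Non-ASCII characters with property Z are outside
   the scope of this sketch. */
bool is_ascii_perl_space(unsigned char c)
{
    /* 0x09 tab, 0x0A NL, 0x0B VT, 0x0C FF, 0x0D CR */
    return c == ' ' || (c >= 0x09 && c <= 0x0D);
}
```

Before release 8.34 the Perl space set (Xsp) omitted VT (0x0B), so the same check for the older behaviour would need to exclude that one value.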
|
|
|
|
SCRIPT NAMES FOR \p AND \P |
SCRIPT NAMES FOR \p AND \P |
|
|
Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo, | Arabic, Armenian, Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo, |
Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma, | Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Chakma, |
Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, | Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, |
Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, | Devanagari, Egyptian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, |
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- | Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- |
gana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip- | gana, Imperial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscrip- |
tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, | tional_Parthian, Javanese, Kaithi, Kannada, Katakana, Kayah_Li, |
Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, | Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, |
Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive, |
Lydian, Malayalam, Mandaic, Meetei_Mayek, Meroitic_Cursive, |
Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko, | Meroitic_Hieroglyphs, Miao, Mongolian, Myanmar, New_Tai_Lue, Nko, |
Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic, | Ogham, Old_Italic, Old_Persian, Old_South_Arabian, Old_Turkic, |
Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samari- | Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic, Samari- |
tan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng, Sundanese, | tan, Saurashtra, Sharada, Shavian, Sinhala, Sora_Sompeng, Sundanese, |
Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, | Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham, Tai_Viet, |
Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, | Takri, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai, |
Yi. |
Yi. |
|
|
|
|
Line 7601 CHARACTER CLASSES
|
Line 7795 CHARACTER CLASSES
|
word same as \w |
word same as \w |
xdigit hexadecimal digit |
xdigit hexadecimal digit |
|
|
In PCRE, POSIX character set names recognize only ASCII characters by | In PCRE, POSIX character set names recognize only ASCII characters by |
default, but some of them use Unicode properties if PCRE_UCP is set. | default, but some of them use Unicode properties if PCRE_UCP is set. |
You can use \Q...\E inside a character class. |
You can use \Q...\E inside a character class. |
|
|
|
|
Line 7683 OPTION SETTING
|
Line 7877 OPTION SETTING
|
(?x) extended (ignore white space) |
(?x) extended (ignore white space) |
(?-...) unset option(s) |
(?-...) unset option(s) |
|
|
The following are recognized only at the start of a pattern or after | The following are recognized only at the start of a pattern or after |
one of the newline-setting options with similar syntax: |
one of the newline-setting options with similar syntax: |
|
|
(*LIMIT_MATCH=d) set the match limit to d (decimal number) |
(*LIMIT_MATCH=d) set the match limit to d (decimal number) |
Line 7695 OPTION SETTING
|
Line 7889 OPTION SETTING
|
(*UTF) set appropriate UTF mode for the library in use |
(*UTF) set appropriate UTF mode for the library in use |
(*UCP) set PCRE_UCP (use Unicode properties for \d etc) |
(*UCP) set PCRE_UCP (use Unicode properties for \d etc) |
|
|
|
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of |
|
the limits set by the caller of pcre_exec(), not increase them. |
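
The rule in the note above amounts to taking the smaller of the two values. A minimal sketch in C, assuming both limits are plain unsigned numbers; effective_limit is a hypothetical helper for illustration and does not reflect PCRE's internal code.

```c
/* Hypothetical helper, not part of the PCRE API: models the rule that a
   (*LIMIT_MATCH=d) or (*LIMIT_RECURSION=d) setting inside a pattern can
   lower the limit supplied by the caller of pcre_exec(), but can never
   raise it. */
unsigned long effective_limit(unsigned long caller_limit,
                              unsigned long pattern_limit)
{
    return pattern_limit < caller_limit ? pattern_limit : caller_limit;
}
```

For example, with a caller limit of 1000, a pattern that asks for 5000 still runs under a limit of 1000, while a pattern that asks for 500 runs under 500.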
|
|
|
|
LOOKAHEAD AND LOOKBEHIND ASSERTIONS |
LOOKAHEAD AND LOOKBEHIND ASSERTIONS |
|
|
(?=...) positive look ahead |
(?=...) positive look ahead |
Line 7819 AUTHOR
|
Line 8016 AUTHOR
|
|
|
REVISION |
REVISION |
|
|
Last updated: 26 April 2013 | Last updated: 12 November 2013 |
Copyright (c) 1997-2013 University of Cambridge. |
Copyright (c) 1997-2013 University of Cambridge. |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
|
|
Line 8743 MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre[16
|
Line 8940 MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre[16
|
matched string. It is up to the calling program to do that if it needs |
matched string. It is up to the calling program to do that if it needs |
to. |
to. |
|
|
|
That means that, for an unanchored pattern, if a continued match fails, |
|
it is not possible to try again at a new starting point. All this |
|
facility is capable of doing is continuing with the previous match |
|
attempt. In the previous example, if the second set of data is "ug23" |
|
the result is no match, even though there would be a match for "aug23" |
|
if the entire string were given at once. Depending on the application, |
|
this may or may not be what you want. The only way to allow for start- |
|
ing again at the next character is to retain the matched part of the |
|
subject and try a new complete match. |
|
|
You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with |
You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with |
PCRE_DFA_RESTART to continue partial matching over multiple segments. |
PCRE_DFA_RESTART to continue partial matching over multiple segments. |
This facility can be used to pass very long subject strings to the DFA |
This facility can be used to pass very long subject strings to the DFA |
Line 8926 AUTHOR
|
Line 9133 AUTHOR
|
|
|
REVISION |
REVISION |
|
|
Last updated: 20 February 2013 | Last updated: 02 July 2013 |
Copyright (c) 1997-2013 University of Cambridge. |
Copyright (c) 1997-2013 University of Cambridge. |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
|
|
Line 9030 RE-USING A PRECOMPILED PATTERN
|
Line 9237 RE-USING A PRECOMPILED PATTERN
|
is used to pass this data, as described in the section on matching a |
is used to pass this data, as described in the section on matching a |
pattern in the pcreapi documentation. |
pattern in the pcreapi documentation. |
|
|
|
Warning: The tables that pcre_exec() and pcre_dfa_exec() use must be |
|
the same as those that were used when the pattern was compiled. If this |
|
is not the case, the behaviour is undefined. |
|
|
If you did not provide custom character tables when the pattern was |
If you did not provide custom character tables when the pattern was |
compiled, the pointer in the compiled pattern is NULL, which causes the |
compiled, the pointer in the compiled pattern is NULL, which causes the |
matching functions to use PCRE's internal tables. Thus, you do not need |
matching functions to use PCRE's internal tables. Thus, you do not need |
Line 9061 AUTHOR
|
Line 9272 AUTHOR
|
|
|
REVISION |
REVISION |
|
|
Last updated: 24 June 2012 | Last updated: 12 November 2013 |
Copyright (c) 1997-2012 University of Cambridge. | Copyright (c) 1997-2013 University of Cambridge. |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
|
|
|
|
Line 9243 PCREPOSIX(3) Library Functions Manual
|
Line 9454 PCREPOSIX(3) Library Functions Manual
|
NAME |
NAME |
PCRE - Perl-compatible regular expressions. |
PCRE - Perl-compatible regular expressions. |
|
|
SYNOPSIS OF POSIX API | SYNOPSIS |
|
|
#include <pcreposix.h> |
#include <pcreposix.h> |
|
|
Line 9252 SYNOPSIS OF POSIX API
|
Line 9463 SYNOPSIS OF POSIX API
|
|
|
int regexec(regex_t *preg, const char *string, |
int regexec(regex_t *preg, const char *string, |
size_t nmatch, regmatch_t pmatch[], int eflags); |
size_t nmatch, regmatch_t pmatch[], int eflags); |
| size_t regerror(int errcode, const regex_t *preg, |
size_t regerror(int errcode, const regex_t *preg, | |
char *errbuf, size_t errbuf_size); |
char *errbuf, size_t errbuf_size); |
|
|
void regfree(regex_t *preg); |
void regfree(regex_t *preg); |
Line 9943 SIZE AND OTHER LIMITATIONS
|
Line 10153 SIZE AND OTHER LIMITATIONS
|
never in practice be relevant. |
never in practice be relevant. |
|
|
The maximum length of a compiled pattern is approximately 64K data |
The maximum length of a compiled pattern is approximately 64K data |
units (bytes for the 8-bit library, 32-bit units for the 32-bit | units (bytes for the 8-bit library, 16-bit units for the 16-bit |
library, and 32-bit units for the 32-bit library) if PCRE is compiled |
library, and 32-bit units for the 32-bit library) if PCRE is compiled |
with the default internal linkage size of 2 bytes. If you want to | with the default internal linkage size, which is 2 bytes for the 8-bit |
process regular expressions that are truly enormous, you can compile | and 16-bit libraries, and 4 bytes for the 32-bit library. If you want |
PCRE with an internal linkage size of 3 or 4 (when building the 16-bit | to process regular expressions that are truly enormous, you can compile |
or 32-bit library, 3 is rounded up to 4). See the README file in the | PCRE with an internal linkage size of 3 or 4 (when building the 16-bit |
source distribution and the pcrebuild documentation for details. In | or 32-bit library, 3 is rounded up to 4). See the README file in the |
these cases the limit is substantially larger. However, the speed of | source distribution and the pcrebuild documentation for details. In |
| these cases the limit is substantially larger. However, the speed of |
execution is slower. |
execution is slower. |
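
The "approximately 64K" figure for a 2-byte link size can be checked with a little arithmetic. The sketch below is a back-of-envelope bound only: it assumes the 8-bit library and that the limit comes from offsets stored in the given number of bytes. max_offset_bytes is a hypothetical name, and the limit PCRE actually enforces may be lower.

```c
#include <stdint.h>

/* Back-of-envelope bound, not PCRE code. Assumes the 8-bit library and
   that pattern offsets are stored in link_size bytes, so at most
   2^(8 * link_size) distinct offsets can be represented. Returns 0 for
   link sizes outside the documented range 2..4. */
uint64_t max_offset_bytes(int link_size)
{
    if (link_size < 2 || link_size > 4)
        return 0;
    return (uint64_t)1 << (8 * link_size);
}
```

With a link size of 2 this gives 65536, which is where the 64K figure comes from; sizes 3 and 4 give much larger bounds, consistent with "substantially larger" in the text.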
|
|
All values in repeating quantifiers must be less than 65536. |
All values in repeating quantifiers must be less than 65536. |
|
|
There is no limit to the number of parenthesized subpatterns, but there |
There is no limit to the number of parenthesized subpatterns, but there |
can be no more than 65535 capturing subpatterns. | can be no more than 65535 capturing subpatterns. There is, however, a |
| limit to the depth of nesting of parenthesized subpatterns of all |
| kinds. This is imposed in order to limit the amount of system stack |
| used at compile time. The limit can be specified when PCRE is built; |
| the default is 250. |
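
A caller that builds patterns programmatically may want to estimate nesting depth before compiling. The following is a simplified sketch, not PCRE's own check: paren_nest_depth and PAREN_NEST_DEFAULT are hypothetical names, and the value 250 is the default stated above, which a particular build may have changed.

```c
#include <stddef.h>

/* Default nesting limit quoted in the text; a given build may differ. */
#define PAREN_NEST_DEFAULT 250

/* Simplified sketch, not PCRE code: returns the deepest nesting of
   '(' ... ')' in a pattern. It skips backslash-escaped characters and the
   contents of [...] character classes, but does not handle \Q...\E,
   (?x) comments, or a literal ']' placed first in a class. */
int paren_nest_depth(const char *pattern)
{
    int depth = 0, max_depth = 0, in_class = 0;

    for (size_t i = 0; pattern[i] != '\0'; i++) {
        char c = pattern[i];
        if (c == '\\') {
            if (pattern[i + 1] != '\0')
                i++;                    /* skip the escaped character */
        } else if (in_class) {
            if (c == ']')
                in_class = 0;           /* end of character class */
        } else if (c == '[') {
            in_class = 1;
        } else if (c == '(') {
            if (++depth > max_depth)
                max_depth = depth;
        } else if (c == ')' && depth > 0) {
            depth--;
        }
    }
    return max_depth;
}
```

A pattern whose estimated depth exceeds PAREN_NEST_DEFAULT is likely to be rejected at compile time by a default build, so it could be flagged before it is passed to the compile function.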
|
|
There is a limit to the number of forward references to subsequent sub- |
There is a limit to the number of forward references to subsequent sub- |
patterns of around 200,000. Repeated forward references with fixed | patterns of around 200,000. Repeated forward references with fixed |
upper limits, for example, (?2){0,100} when subpattern number 2 is to | upper limits, for example, (?2){0,100} when subpattern number 2 is to |
the right, are included in the count. There is no limit to the number | the right, are included in the count. There is no limit to the number |
of backward references. |
of backward references. |
|
|
The maximum length of name for a named subpattern is 32 characters, and |
The maximum length of name for a named subpattern is 32 characters, and |
the maximum number of named subpatterns is 10000. |
the maximum number of named subpatterns is 10000. |
|
|
The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or | The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or |
(*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit and | (*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit and |
32-bit library. | 32-bit libraries. |
|
|
The maximum length of a subject string is the largest positive number | The maximum length of a subject string is the largest positive number |
that an integer variable can hold. However, when using the traditional | that an integer variable can hold. However, when using the traditional |
matching function, PCRE uses recursion to handle subpatterns and indef- |
matching function, PCRE uses recursion to handle subpatterns and indef- |
inite repetition. This means that the available stack space may limit | inite repetition. This means that the available stack space may limit |
the size of a subject string that can be processed by certain patterns. |
the size of a subject string that can be processed by certain patterns. |
For a discussion of stack issues, see the pcrestack documentation. |
For a discussion of stack issues, see the pcrestack documentation. |
|
|
Line 9988 AUTHOR
|
Line 10203 AUTHOR
|
|
|
REVISION |
REVISION |
|
|
Last updated: 04 May 2012 | Last updated: 05 November 2013 |
Copyright (c) 1997-2012 University of Cambridge. | Copyright (c) 1997-2013 University of Cambridge. |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
|
|
|
|