|
version 1.1.1.1, 2012/02/21 23:05:52
|
version 1.1.1.4, 2013/07/22 08:25:57
|
|
Line 46 man page, in case the conversion went wrong.
|
Line 46 man page, in case the conversion went wrong.
|
| The full syntax and semantics of the regular expressions that are supported by |
The full syntax and semantics of the regular expressions that are supported by |
| PCRE are described in the |
PCRE are described in the |
| <a href="pcrepattern.html"><b>pcrepattern</b></a> |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
| documentation. This document contains just a quick-reference summary of the | documentation. This document contains a quick-reference summary of the syntax. |
| syntax. | |
| </P> |
</P> |
| <br><a name="SEC2" href="#TOC1">QUOTING</a><br> |
<br><a name="SEC2" href="#TOC1">QUOTING</a><br> |
| <P> |
<P> |
|
Line 62 syntax.
|
Line 61 syntax.
|
| \a alarm, that is, the BEL character (hex 07) |
\a alarm, that is, the BEL character (hex 07) |
| \cx "control-x", where x is any ASCII character |
\cx "control-x", where x is any ASCII character |
| \e escape (hex 1B) |
\e escape (hex 1B) |
| \f formfeed (hex 0C) | \f form feed (hex 0C) |
| \n newline (hex 0A) |
\n newline (hex 0A) |
| \r carriage return (hex 0D) |
\r carriage return (hex 0D) |
| \t tab (hex 09) |
\t tab (hex 09) |
|
Line 76 syntax.
|
Line 75 syntax.
|
| <pre> |
<pre> |
| . any character except newline; |
. any character except newline; |
| in dotall mode, any character whatsoever |
in dotall mode, any character whatsoever |
| \C one byte, even in UTF-8 mode (best avoided) | \C one data unit, even in UTF mode (best avoided) |
| \d a decimal digit |
\d a decimal digit |
| \D a character that is not a decimal digit |
\D a character that is not a decimal digit |
| \h a horizontal whitespace character | \h a horizontal white space character |
| \H a character that is not a horizontal whitespace character | \H a character that is not a horizontal white space character |
| \N a character that is not a newline |
\N a character that is not a newline |
| \p{<i>xx</i>} a character with the <i>xx</i> property |
\p{<i>xx</i>} a character with the <i>xx</i> property |
| \P{<i>xx</i>} a character without the <i>xx</i> property |
\P{<i>xx</i>} a character without the <i>xx</i> property |
| \R a newline sequence |
\R a newline sequence |
| \s a whitespace character | \s a white space character |
| \S a character that is not a whitespace character | \S a character that is not a white space character |
| \v a vertical whitespace character | \v a vertical white space character |
| \V a character that is not a vertical whitespace character | \V a character that is not a vertical white space character |
| \w a "word" character |
\w a "word" character |
| \W a "non-word" character |
\W a "non-word" character |
| \X an extended Unicode sequence | \X a Unicode extended grapheme cluster |
| </pre> |
</pre> |
| In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII |
In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII |
| characters, even in UTF-8 mode. However, this can be changed by setting the | characters, even in a UTF mode. However, this can be changed by setting the |
| PCRE_UCP option. |
PCRE_UCP option. |
| </P> |
</P> |
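The difference between ASCII-only and Unicode-aware character types can be illustrated with a small sketch. It uses Python's standard `re` module rather than PCRE, so it is only an analogy. In Python, `str` patterns are Unicode-aware by default and `re.ASCII` restricts \d, \s, and \w to ASCII. This is the reverse of PCRE, where ASCII-only is the default and PCRE_UCP enables Unicode properties. The sample strings are assumptions chosen for illustration.

```python
import re

# Sample text: ASCII digits followed by Arabic-Indic digits (U+0664, U+0662).
digits_text = "42 \u0664\u0662"

# ASCII-only matching, analogous to PCRE's default behaviour for \d.
ascii_digits = re.findall(r"\d+", digits_text, flags=re.ASCII)

# Unicode-aware matching, analogous to PCRE with PCRE_UCP set.
unicode_digits = re.findall(r"\d+", digits_text)

# The same contrast for \w, using a word that contains U+00EF.
word_text = "na\u00efve"
ascii_words = re.findall(r"\w+", word_text, flags=re.ASCII)
unicode_words = re.findall(r"\w+", word_text)

print(ascii_digits, unicode_digits)
print(ascii_words, unicode_words)
```

In the ASCII-only case the non-ASCII digits and the accented letter are not treated as \d or \w characters, which mirrors what PCRE does when PCRE_UCP is not set.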
| <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br> |
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br> |
|
Line 152 PCRE_UCP option.
|
Line 151 PCRE_UCP option.
|
| Xan Alphanumeric: union of properties L and N |
Xan Alphanumeric: union of properties L and N |
| Xps POSIX space: property Z or tab, NL, VT, FF, CR |
Xps POSIX space: property Z or tab, NL, VT, FF, CR |
| Xsp Perl space: property Z or tab, NL, FF, CR |
Xsp Perl space: property Z or tab, NL, FF, CR |
| |
Xuc Universally-named character: one that can be |
| |
represented by a Universal Character Name |
| Xwd Perl word: property Xan or underscore |
Xwd Perl word: property Xan or underscore |
| </PRE> |
</PRE> |
| </P> |
</P> |
|
Line 162 Armenian,
|
Line 163 Armenian,
|
| Avestan, |
Avestan, |
| Balinese, |
Balinese, |
| Bamum, |
Bamum, |
| |
Batak, |
| Bengali, |
Bengali, |
| Bopomofo, |
Bopomofo, |
| |
Brahmi, |
| Braille, |
Braille, |
| Buginese, |
Buginese, |
| Buhid, |
Buhid, |
| Canadian_Aboriginal, |
Canadian_Aboriginal, |
| Carian, |
Carian, |
| |
Chakma, |
| Cham, |
Cham, |
| Cherokee, |
Cherokee, |
| Common, |
Common, |
|
Line 211 Lisu,
|
Line 215 Lisu,
|
| Lycian, |
Lycian, |
| Lydian, |
Lydian, |
| Malayalam, |
Malayalam, |
| |
Mandaic, |
| Meetei_Mayek, |
Meetei_Mayek, |
| |
Meroitic_Cursive, |
| |
Meroitic_Hieroglyphs, |
| |
Miao, |
| Mongolian, |
Mongolian, |
| Myanmar, |
Myanmar, |
| New_Tai_Lue, |
New_Tai_Lue, |
|
Line 230 Rejang,
|
Line 238 Rejang,
|
| Runic, |
Runic, |
| Samaritan, |
Samaritan, |
| Saurashtra, |
Saurashtra, |
| |
Sharada, |
| Shavian, |
Shavian, |
| Sinhala, |
Sinhala, |
| |
Sora_Sompeng, |
| Sundanese, |
Sundanese, |
| Syloti_Nagri, |
Syloti_Nagri, |
| Syriac, |
Syriac, |
|
Line 240 Tagbanwa,
|
Line 250 Tagbanwa,
|
| Tai_Le, |
Tai_Le, |
| Tai_Tham, |
Tai_Tham, |
| Tai_Viet, |
Tai_Viet, |
| |
Takri, |
| Tamil, |
Tamil, |
| Telugu, |
Telugu, |
| Thaana, |
Thaana, |
|
Line 269 Yi.
|
Line 280 Yi.
|
| lower lower case letter |
lower lower case letter |
| print printing, including space |
print printing, including space |
| punct printing, excluding alphanumeric |
punct printing, excluding alphanumeric |
| space whitespace | space white space |
| upper upper case letter |
upper upper case letter |
| word same as \w |
word same as \w |
| xdigit hexadecimal digit |
xdigit hexadecimal digit |
|
Line 366 but some of them use Unicode properties if PCRE_UCP is
|
Line 377 but some of them use Unicode properties if PCRE_UCP is
|
| The following are recognized only at the start of a pattern or after one of the |
The following are recognized only at the start of a pattern or after one of the |
| newline-setting options with similar syntax: |
newline-setting options with similar syntax: |
| <pre> |
<pre> |
| |
(*LIMIT_MATCH=d) set the match limit to d (decimal number) |
| |
(*LIMIT_RECURSION=d) set the recursion limit to d (decimal number) |
| (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE) |
(*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE) |
| (*UTF8) set UTF-8 mode (PCRE_UTF8) | (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8) |
| | (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16) |
| | (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32) |
| | (*UTF) set appropriate UTF mode for the library in use |
| (*UCP) set PCRE_UCP (use Unicode properties for \d etc) |
(*UCP) set PCRE_UCP (use Unicode properties for \d etc) |
| </PRE> |
</PRE> |
| </P> |
</P> |
|
Line 439 The following act immediately they are reached:
|
Line 455 The following act immediately they are reached:
|
| <pre> |
<pre> |
| (*ACCEPT) force successful match |
(*ACCEPT) force successful match |
| (*FAIL) force backtrack; synonym (*F) |
(*FAIL) force backtrack; synonym (*F) |
| |
(*MARK:NAME) set name to be passed back; synonym (*:NAME) |
| </pre> |
</pre> |
| The following act only when a subsequent match failure causes a backtrack to |
The following act only when a subsequent match failure causes a backtrack to |
| reach them. They all force a match failure, but they differ in what happens |
reach them. They all force a match failure, but they differ in what happens |
|
Line 447 pattern is not anchored.
|
Line 464 pattern is not anchored.
|
| <pre> |
<pre> |
| (*COMMIT) overall failure, no advance of starting point |
(*COMMIT) overall failure, no advance of starting point |
| (*PRUNE) advance to next starting character |
(*PRUNE) advance to next starting character |
| (*SKIP) advance start to current matching position | (*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE) |
| | (*SKIP) advance to current matching position |
| | (*SKIP:NAME) advance to position corresponding to an earlier |
| | (*MARK:NAME); if not found, the (*SKIP) is ignored |
| (*THEN) local failure, backtrack to next alternation |
(*THEN) local failure, backtrack to next alternation |
| |
(*THEN:NAME) equivalent to (*MARK:NAME)(*THEN) |
| </PRE> |
</PRE> |
| </P> |
</P> |
| <br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br> |
<br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br> |
| <P> |
<P> |
| These are recognized only at the very start of the pattern or after a |
These are recognized only at the very start of the pattern or after a |
| (*BSR_...) or (*UTF8) or (*UCP) option. | (*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option. |
| <pre> |
<pre> |
| (*CR) carriage return only |
(*CR) carriage return only |
| (*LF) linefeed only |
(*LF) linefeed only |
|
Line 466 These are recognized only at the very start of the pat
|
Line 487 These are recognized only at the very start of the pat
|
| <br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br> |
<br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br> |
| <P> |
<P> |
| These are recognized only at the very start of the pattern or after a |
These are recognized only at the very start of the pattern or after a |
| (*...) option that sets the newline convention or UTF-8 or UCP mode. | (*...) option that sets the newline convention or a UTF or UCP mode. |
| <pre> |
<pre> |
| (*BSR_ANYCRLF) CR, LF, or CRLF |
(*BSR_ANYCRLF) CR, LF, or CRLF |
| (*BSR_UNICODE) any Unicode newline sequence |
(*BSR_UNICODE) any Unicode newline sequence |
|
Line 495 Cambridge CB2 3QH, England.
|
Line 516 Cambridge CB2 3QH, England.
|
| </P> |
</P> |
| <br><a name="SEC27" href="#TOC1">REVISION</a><br> |
<br><a name="SEC27" href="#TOC1">REVISION</a><br> |
| <P> |
<P> |
| Last updated: 21 November 2010 | Last updated: 26 April 2013 |
| <br> |
<br> |
| Copyright © 1997-2010 University of Cambridge. | Copyright © 1997-2013 University of Cambridge. |
| <br> |
<br> |
| <p> |
<p> |
| Return to the <a href="index.html">PCRE index page</a>. |
Return to the <a href="index.html">PCRE index page</a>. |