version 1.1.1.1, 2012/02/21 23:05:52
|
version 1.1.1.5, 2014/06/15 19:46:05
|
Line 1
|
Line 1
|
.TH PCRESYNTAX 3 | .TH PCRESYNTAX 3 "12 November 2013" "PCRE 8.34" |
.SH NAME |
.SH NAME |
PCRE - Perl-compatible regular expressions |
PCRE - Perl-compatible regular expressions |
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY" |
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY" |
Line 9 PCRE are described in the
|
Line 9 PCRE are described in the
|
.\" HREF |
.\" HREF |
\fBpcrepattern\fP |
\fBpcrepattern\fP |
.\" |
.\" |
documentation. This document contains just a quick-reference summary of the | documentation. This document contains a quick-reference summary of the syntax. |
syntax. | |
. |
. |
. |
. |
.SH "QUOTING" |
.SH "QUOTING" |
Line 26 syntax.
|
Line 25 syntax.
|
\ea alarm, that is, the BEL character (hex 07) |
\ea alarm, that is, the BEL character (hex 07) |
\ecx "control-x", where x is any ASCII character |
\ecx "control-x", where x is any ASCII character |
\ee escape (hex 1B) |
\ee escape (hex 1B) |
\ef formfeed (hex 0C) | \ef form feed (hex 0C) |
\en newline (hex 0A) |
\en newline (hex 0A) |
\er carriage return (hex 0D) |
\er carriage return (hex 0D) |
\et tab (hex 09) |
\et tab (hex 09) |
|
\e0dd character with octal code 0dd |
\eddd character with octal code ddd, or backreference |
\eddd character with octal code ddd, or backreference |
|
\eo{ddd..} character with octal code ddd.. |
\exhh character with hex code hh |
\exhh character with hex code hh |
\ex{hhh..} character with hex code hhh.. |
\ex{hhh..} character with hex code hhh.. |
|
.sp |
|
Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal |
|
characters "8" and "9". |
. |
. |
. |
. |
.SH "CHARACTER TYPES" |
.SH "CHARACTER TYPES" |
Line 40 syntax.
|
Line 44 syntax.
|
.sp |
.sp |
. any character except newline; |
. any character except newline; |
in dotall mode, any character whatsoever |
in dotall mode, any character whatsoever |
\eC one byte, even in UTF-8 mode (best avoided) | \eC one data unit, even in UTF mode (best avoided) |
\ed a decimal digit |
\ed a decimal digit |
\eD a character that is not a decimal digit |
\eD a character that is not a decimal digit |
\eh a horizontal whitespace character | \eh a horizontal white space character |
\eH a character that is not a horizontal whitespace character | \eH a character that is not a horizontal white space character |
\eN a character that is not a newline |
\eN a character that is not a newline |
\ep{\fIxx\fP} a character with the \fIxx\fP property |
\ep{\fIxx\fP} a character with the \fIxx\fP property |
\eP{\fIxx\fP} a character without the \fIxx\fP property |
\eP{\fIxx\fP} a character without the \fIxx\fP property |
\eR a newline sequence |
\eR a newline sequence |
\es a whitespace character | \es a white space character |
\eS a character that is not a whitespace character | \eS a character that is not a white space character |
\ev a vertical whitespace character | \ev a vertical white space character |
\eV a character that is not a vertical whitespace character | \eV a character that is not a vertical white space character |
\ew a "word" character |
\ew a "word" character |
\eW a "non-word" character |
\eW a "non-word" character |
\eX an extended Unicode sequence | \eX a Unicode extended grapheme cluster |
.sp |
.sp |
In PCRE, by default, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII | By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode |
characters, even in UTF-8 mode. However, this can be changed by setting the | or in the 16-bit and 32-bit libraries. However, if locale-specific matching is |
PCRE_UCP option. | happening, \es and \ew may also match characters with code points in the range |
| 128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences |
| is changed to use Unicode properties and they match many more characters. |
. |
. |
. |
. |
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP" |
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP" |
Line 116 PCRE_UCP option.
|
Line 122 PCRE_UCP option.
|
.sp |
.sp |
Xan Alphanumeric: union of properties L and N |
Xan Alphanumeric: union of properties L and N |
Xps POSIX space: property Z or tab, NL, VT, FF, CR |
Xps POSIX space: property Z or tab, NL, VT, FF, CR |
Xsp Perl space: property Z or tab, NL, FF, CR | Xsp Perl space: property Z or tab, NL, VT, FF, CR |
| Xuc Universally-named character: one that can be |
| represented by a Universal Character Name |
Xwd Perl word: property Xan or underscore |
Xwd Perl word: property Xan or underscore |
|
.sp |
|
Perl and POSIX space are now the same. Perl added VT to its space character set |
|
at release 5.18 and PCRE changed at release 8.34. |
. |
. |
. |
. |
.SH "SCRIPT NAMES FOR \ep AND \eP" |
.SH "SCRIPT NAMES FOR \ep AND \eP" |
Line 128 Armenian,
|
Line 139 Armenian,
|
Avestan, |
Avestan, |
Balinese, |
Balinese, |
Bamum, |
Bamum, |
|
Batak, |
Bengali, |
Bengali, |
Bopomofo, |
Bopomofo, |
|
Brahmi, |
Braille, |
Braille, |
Buginese, |
Buginese, |
Buhid, |
Buhid, |
Canadian_Aboriginal, |
Canadian_Aboriginal, |
Carian, |
Carian, |
|
Chakma, |
Cham, |
Cham, |
Cherokee, |
Cherokee, |
Common, |
Common, |
Line 177 Lisu,
|
Line 191 Lisu,
|
Lycian, |
Lycian, |
Lydian, |
Lydian, |
Malayalam, |
Malayalam, |
|
Mandaic, |
Meetei_Mayek, |
Meetei_Mayek, |
|
Meroitic_Cursive, |
|
Meroitic_Hieroglyphs, |
|
Miao, |
Mongolian, |
Mongolian, |
Myanmar, |
Myanmar, |
New_Tai_Lue, |
New_Tai_Lue, |
Line 196 Rejang,
|
Line 214 Rejang,
|
Runic, |
Runic, |
Samaritan, |
Samaritan, |
Saurashtra, |
Saurashtra, |
|
Sharada, |
Shavian, |
Shavian, |
Sinhala, |
Sinhala, |
|
Sora_Sompeng, |
Sundanese, |
Sundanese, |
Syloti_Nagri, |
Syloti_Nagri, |
Syriac, |
Syriac, |
Line 206 Tagbanwa,
|
Line 226 Tagbanwa,
|
Tai_Le, |
Tai_Le, |
Tai_Tham, |
Tai_Tham, |
Tai_Viet, |
Tai_Viet, |
|
Takri, |
Tamil, |
Tamil, |
Telugu, |
Telugu, |
Thaana, |
Thaana, |
Line 236 Yi.
|
Line 257 Yi.
|
lower lower case letter |
lower lower case letter |
print printing, including space |
print printing, including space |
punct printing, excluding alphanumeric |
punct printing, excluding alphanumeric |
space whitespace | space white space |
upper upper case letter |
upper upper case letter |
word same as \ew |
word same as \ew |
xdigit hexadecimal digit |
xdigit hexadecimal digit |
Line 336 but some of them use Unicode properties if PCRE_UCP is
|
Line 357 but some of them use Unicode properties if PCRE_UCP is
|
The following are recognized only at the start of a pattern or after one of the |
The following are recognized only at the start of a pattern or after one of the |
newline-setting options with similar syntax: |
newline-setting options with similar syntax: |
.sp |
.sp |
|
(*LIMIT_MATCH=d) set the match limit to d (decimal number) |
|
(*LIMIT_RECURSION=d) set the recursion limit to d (decimal number) |
(*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE) |
(*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE) |
(*UTF8) set UTF-8 mode (PCRE_UTF8) | (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8) |
| (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16) |
| (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32) |
| (*UTF) set appropriate UTF mode for the library in use |
(*UCP) set PCRE_UCP (use Unicode properties for \ed etc) |
(*UCP) set PCRE_UCP (use Unicode properties for \ed etc) |
|
.sp |
|
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the |
|
limits set by the caller of pcre_exec(), not increase them. |
. |
. |
. |
. |
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS" |
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS" |
Line 411 The following act immediately they are reached:
|
Line 440 The following act immediately they are reached:
|
.sp |
.sp |
(*ACCEPT) force successful match |
(*ACCEPT) force successful match |
(*FAIL) force backtrack; synonym (*F) |
(*FAIL) force backtrack; synonym (*F) |
|
(*MARK:NAME) set name to be passed back; synonym (*:NAME) |
.sp |
.sp |
The following act only when a subsequent match failure causes a backtrack to |
The following act only when a subsequent match failure causes a backtrack to |
reach them. They all force a match failure, but they differ in what happens |
reach them. They all force a match failure, but they differ in what happens |
Line 419 pattern is not anchored.
|
Line 449 pattern is not anchored.
|
.sp |
.sp |
(*COMMIT) overall failure, no advance of starting point |
(*COMMIT) overall failure, no advance of starting point |
(*PRUNE) advance to next starting character |
(*PRUNE) advance to next starting character |
(*SKIP) advance start to current matching position | (*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE) |
| (*SKIP) advance to current matching position |
| (*SKIP:NAME) advance to position corresponding to an earlier |
| (*MARK:NAME); if not found, the (*SKIP) is ignored |
(*THEN) local failure, backtrack to next alternation |
(*THEN) local failure, backtrack to next alternation |
|
(*THEN:NAME) equivalent to (*MARK:NAME)(*THEN) |
. |
. |
. |
. |
.SH "NEWLINE CONVENTIONS" |
.SH "NEWLINE CONVENTIONS" |
.rs |
.rs |
.sp |
.sp |
These are recognized only at the very start of the pattern or after a |
These are recognized only at the very start of the pattern or after a |
(*BSR_...) or (*UTF8) or (*UCP) option. | (*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option. |
.sp |
.sp |
(*CR) carriage return only |
(*CR) carriage return only |
(*LF) linefeed only |
(*LF) linefeed only |
Line 440 These are recognized only at the very start of the pat
|
Line 474 These are recognized only at the very start of the pat
|
.rs |
.rs |
.sp |
.sp |
These are recognized only at the very start of the pattern or after a |
These are recognized only at the very start of the pattern or after a |
(*...) option that sets the newline convention or UTF-8 or UCP mode. | (*...) option that sets the newline convention or a UTF or UCP mode. |
.sp |
.sp |
(*BSR_ANYCRLF) CR, LF, or CRLF |
(*BSR_ANYCRLF) CR, LF, or CRLF |
(*BSR_UNICODE) any Unicode newline sequence |
(*BSR_UNICODE) any Unicode newline sequence |
Line 474 Cambridge CB2 3QH, England.
|
Line 508 Cambridge CB2 3QH, England.
|
.rs |
.rs |
.sp |
.sp |
.nf |
.nf |
Last updated: 21 November 2010 | Last updated: 12 November 2013 |
Copyright (c) 1997-2010 University of Cambridge. | Copyright (c) 1997-2013 University of Cambridge. |
.fi |
.fi |