--- embedaddon/pcre/doc/html/pcresyntax.html 2012/02/21 23:05:52 1.1
+++ embedaddon/pcre/doc/html/pcresyntax.html 2013/07/22 08:25:57 1.1.1.4
@@ -46,8 +46,7 @@ man page, in case the conversion went wrong.
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the pcrepattern
-documentation. This document contains just a quick-reference summary of the
-syntax.
+documentation. This document contains a quick-reference summary of the syntax.


QUOTING

@@ -62,7 +61,7 @@ syntax.
   \a         alarm, that is, the BEL character (hex 07)
   \cx        "control-x", where x is any ASCII character
   \e         escape (hex 1B)
-  \f         formfeed (hex 0C)
+  \f         form feed (hex 0C)
   \n         newline (hex 0A)
   \r         carriage return (hex 0D)
   \t         tab (hex 09)
@@ -76,25 +75,25 @@ syntax.

   .          any character except newline;
                in dotall mode, any character whatsoever
-  \C         one byte, even in UTF-8 mode (best avoided)
+  \C         one data unit, even in UTF mode (best avoided)
   \d         a decimal digit
   \D         a character that is not a decimal digit
-  \h         a horizontal whitespace character
-  \H         a character that is not a horizontal whitespace character
+  \h         a horizontal white space character
+  \H         a character that is not a horizontal white space character
   \N         a character that is not a newline
   \p{xx}     a character with the xx property
   \P{xx}     a character without the xx property
   \R         a newline sequence
-  \s         a whitespace character
-  \S         a character that is not a whitespace character
-  \v         a vertical whitespace character
-  \V         a character that is not a vertical whitespace character
+  \s         a white space character
+  \S         a character that is not a white space character
+  \v         a vertical white space character
+  \V         a character that is not a vertical white space character
   \w         a "word" character
   \W         a "non-word" character
-  \X         an extended Unicode sequence
+  \X         a Unicode extended grapheme cluster
 
In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
-characters, even in UTF-8 mode. However, this can be changed by setting the
+characters, even in a UTF mode. However, this can be changed by setting the
PCRE_UCP option.


GENERAL CATEGORY PROPERTIES FOR \p and \P
@@ -152,6 +151,8 @@ PCRE_UCP option.
   Xan        Alphanumeric: union of properties L and N
   Xps        POSIX space: property Z or tab, NL, VT, FF, CR
   Xsp        Perl space: property Z or tab, NL, FF, CR
+  Xuc        Universally-named character: one that can be
+             represented by a Universal Character Name
   Xwd        Perl word: property Xan or underscore

@@ -162,13 +163,16 @@ Armenian,
Avestan,
Balinese,
Bamum,
+Batak,
Bengali,
Bopomofo,
+Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
+Chakma,
Cham,
Cherokee,
Common,
@@ -211,7 +215,11 @@ Lisu,
Lycian,
Lydian,
Malayalam,
+Mandaic,
Meetei_Mayek,
+Meroitic_Cursive,
+Meroitic_Hieroglyphs,
+Miao,
Mongolian,
Myanmar,
New_Tai_Lue,
@@ -230,8 +238,10 @@ Rejang,
Runic,
Samaritan,
Saurashtra,
+Sharada,
Shavian,
Sinhala,
+Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
@@ -240,6 +250,7 @@ Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
+Takri,
Tamil,
Telugu,
Thaana,
@@ -269,7 +280,7 @@ Yi.
   lower    lower case letter
   print    printing, including space
   punct    printing, excluding alphanumeric
-  space    whitespace
+  space    white space
   upper    upper case letter
   word     same as \w
   xdigit   hexadecimal digit
@@ -366,8 +377,13 @@ but some of them use Unicode properties if PCRE_UCP is
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
+  (*LIMIT_MATCH=d) set the match limit to d (decimal number)
+  (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
   (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
-  (*UTF8)         set UTF-8 mode (PCRE_UTF8)
+  (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
+  (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
+  (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
+  (*UTF)          set appropriate UTF mode for the library in use
   (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
 

@@ -439,6 +455,7 @@ The following act immediately they are reached:
   (*ACCEPT)       force successful match
   (*FAIL)         force backtrack; synonym (*F)
+  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
 
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
@@ -447,14 +464,18 @@ pattern is not anchored.
   (*COMMIT)       overall failure, no advance of starting point
   (*PRUNE)        advance to next starting character
-  (*SKIP)         advance start to current matching position
+  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
+  (*SKIP)         advance to current matching position
+  (*SKIP:NAME)    advance to position corresponding to an earlier
+                  (*MARK:NAME); if not found, the (*SKIP) is ignored
   (*THEN)         local failure, backtrack to next alternation
+  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
 


NEWLINE CONVENTIONS

These are recognized only at the very start of the pattern or after a
-(*BSR_...) or (*UTF8) or (*UCP) option.
+(*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.

   (*CR)           carriage return only
   (*LF)           linefeed only
@@ -466,7 +487,7 @@ These are recognized only at the very start of the pat
 
WHAT \R MATCHES

These are recognized only at the very start of the pattern or after a
-(*...) option that sets the newline convention or UTF-8 or UCP mode.
+(*...) option that sets the newline convention or a UTF or UCP mode.

   (*BSR_ANYCRLF)  CR, LF, or CRLF
   (*BSR_UNICODE)  any Unicode newline sequence
@@ -495,9 +516,9 @@ Cambridge CB2 3QH, England.
 


REVISION

-Last updated: 21 November 2010
+Last updated: 26 April 2013
-Copyright © 1997-2010 University of Cambridge.
+Copyright © 1997-2013 University of Cambridge.

Return to the PCRE index page.
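
Note on the WHAT \R MATCHES hunk: the two (*BSR_...) settings differ only in which newline sequences \R accepts. The sketch below emulates both sets with Python's standard `re` module, purely as an illustration. It is not PCRE syntax and not part of this change; the names `BSR_ANYCRLF`, `BSR_UNICODE` and `split_lines` are hypothetical, and the Unicode set (CRLF, LF, VT, FF, CR, NEL, LS, PS) is taken from the pcrepattern description of \R rather than from this diff.

```python
import re

# Assumption: these patterns only emulate what PCRE's \R matches.
# Python's re module has no (*BSR_...) options of its own.

# Emulates (*BSR_ANYCRLF): CR, LF, or CRLF. CRLF comes first so the
# two-character pair is consumed as a single newline sequence.
BSR_ANYCRLF = re.compile(r"\r\n|\r|\n")

# Emulates (*BSR_UNICODE): CRLF, or any single one of LF, VT, FF, CR,
# NEL (U+0085), LS (U+2028), PS (U+2029).
BSR_UNICODE = re.compile("\r\n|[\n\x0b\x0c\r\x85\u2028\u2029]")


def split_lines(text, newline):
    """Split text at every newline sequence matched by the given pattern."""
    return newline.split(text)


# One CRLF, one VT, one LS and one LF separate the five letters.
sample = "a\r\nb\x0bc\u2028d\ne"
anycrlf_parts = split_lines(sample, BSR_ANYCRLF)
unicode_parts = split_lines(sample, BSR_UNICODE)
```

With the ANYCRLF set only the CRLF and the LF count as line breaks, so the VT and LS characters stay inside the middle piece; with the Unicode set every separator in the sample splits the text.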