--- embedaddon/pcre/doc/html/pcresyntax.html 2012/02/21 23:50:25 1.1.1.2
+++ embedaddon/pcre/doc/html/pcresyntax.html 2014/06/15 19:46:05 1.1.1.5
@@ -61,14 +61,18 @@ documentation. This document contains a quick-referenc
   \a         alarm, that is, the BEL character (hex 07)
   \cx        "control-x", where x is any ASCII character
   \e         escape (hex 1B)
-  \f         formfeed (hex 0C)
+  \f         form feed (hex 0C)
   \n         newline (hex 0A)
   \r         carriage return (hex 0D)
   \t         tab (hex 09)
+  \0dd       character with octal code 0dd
   \ddd       character with octal code ddd, or backreference
+  \o{ddd..}  character with octal code ddd..
   \xhh       character with hex code hh
   \x{hhh..}  character with hex code hhh..
-
+
+Note that \0dd is always an octal code, and that \8 and \9 are the literal
+characters "8" and "9".


CHARACTER TYPES

@@ -78,23 +82,25 @@ documentation. This document contains a quick-referenc
   \C         one data unit, even in UTF mode (best avoided)
   \d         a decimal digit
   \D         a character that is not a decimal digit
-  \h         a horizontal whitespace character
-  \H         a character that is not a horizontal whitespace character
+  \h         a horizontal white space character
+  \H         a character that is not a horizontal white space character
   \N         a character that is not a newline
   \p{xx}     a character with the xx property
   \P{xx}     a character without the xx property
   \R         a newline sequence
-  \s         a whitespace character
-  \S         a character that is not a whitespace character
-  \v         a vertical whitespace character
-  \V         a character that is not a vertical whitespace character
+  \s         a white space character
+  \S         a character that is not a white space character
+  \v         a vertical white space character
+  \V         a character that is not a vertical white space character
   \w         a "word" character
   \W         a "non-word" character
-  \X         an extended Unicode sequence
+  \X         a Unicode extended grapheme cluster
-In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
-characters, even in a UTF mode. However, this can be changed by setting the
-PCRE_UCP option.
+By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
+or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
+happening, \s and \w may also match characters with code points in the range
+128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
+is changed to use Unicode properties and they match many more characters.


GENERAL CATEGORY PROPERTIES FOR \p and \P

@@ -150,9 +156,13 @@ PCRE_UCP option.

   Xan        Alphanumeric: union of properties L and N
   Xps        POSIX space: property Z or tab, NL, VT, FF, CR
-  Xsp        Perl space: property Z or tab, NL, FF, CR
+  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
+  Xuc        Universally-named character: one that can be
+               represented by a Universal Character Name
   Xwd        Perl word: property Xan or underscore
-
+
+Perl and POSIX space are now the same. Perl added VT to its space character set
+at release 5.18 and PCRE changed at release 8.34.


SCRIPT NAMES FOR \p AND \P

@@ -161,13 +171,16 @@ Armenian,
 Avestan,
 Balinese,
 Bamum,
+Batak,
 Bengali,
 Bopomofo,
+Brahmi,
 Braille,
 Buginese,
 Buhid,
 Canadian_Aboriginal,
 Carian,
+Chakma,
 Cham,
 Cherokee,
 Common,
@@ -210,7 +223,11 @@ Lisu,
 Lycian,
 Lydian,
 Malayalam,
+Mandaic,
 Meetei_Mayek,
+Meroitic_Cursive,
+Meroitic_Hieroglyphs,
+Miao,
 Mongolian,
 Myanmar,
 New_Tai_Lue,
@@ -229,8 +246,10 @@ Rejang,
 Runic,
 Samaritan,
 Saurashtra,
+Sharada,
 Shavian,
 Sinhala,
+Sora_Sompeng,
 Sundanese,
 Syloti_Nagri,
 Syriac,
@@ -239,6 +258,7 @@ Tagbanwa,
 Tai_Le,
 Tai_Tham,
 Tai_Viet,
+Takri,
 Tamil,
 Telugu,
 Thaana,
@@ -268,7 +288,7 @@ Yi.
   lower    lower case letter
   print    printing, including space
   punct    printing, excluding alphanumeric
-  space    whitespace
+  space    white space
   upper    upper case letter
   word     same as \w
   xdigit   hexadecimal digit
@@ -365,11 +385,17 @@ but some of them use Unicode properties if PCRE_UCP is
 The following are recognized only at the start of a pattern or after one of the
 newline-setting options with similar syntax:

+  (*LIMIT_MATCH=d) set the match limit to d (decimal number)
+  (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
   (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
   (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
   (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
+  (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
+  (*UTF)          set appropriate UTF mode for the library in use
   (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
-
+
+Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
+limits set by the caller of pcre_exec(), not increase them.


LOOKAHEAD AND LOOKBEHIND ASSERTIONS

@@ -459,7 +485,7 @@ pattern is not anchored.
NEWLINE CONVENTIONS

 These are recognized only at the very start of the pattern or after a
-(*BSR_...), (*UTF8), (*UTF16) or (*UCP) option.
+(*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.

   (*CR)           carriage return only
   (*LF)           linefeed only
@@ -500,9 +526,9 @@ Cambridge CB2 3QH, England.
 


REVISION

-Last updated: 10 January 2012
+Last updated: 12 November 2013
-Copyright © 1997-2012 University of Cambridge.
+Copyright © 1997-2013 University of Cambridge.

Return to the PCRE index page.
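
Illustrative note on the Xps/Xsp change in this diff: the updated table defines both properties as "property Z or tab, NL, VT, FF, CR". The sketch below models that rule in Python using the standard unicodedata module. It is only an approximation for illustration, not PCRE's implementation; the helper name is_xsp_like is hypothetical, and Python's Unicode tables may differ in version from those bundled with a given PCRE release.

```python
import unicodedata

# Control characters that the Xps/Xsp definitions list explicitly:
# tab, NL, VT, FF, CR.
_EXPLICIT_SPACES = {"\t", "\n", "\x0b", "\x0c", "\r"}


def is_xsp_like(ch: str) -> bool:
    """Approximate the documented Xsp/Xps rule for one character.

    Hypothetical helper, not a PCRE function: a character counts as
    space if its Unicode general category starts with "Z" (Zs, Zl, Zp)
    or if it is one of the five listed control characters.
    """
    if len(ch) != 1:
        raise ValueError("expected a single character")
    return ch in _EXPLICIT_SPACES or unicodedata.category(ch).startswith("Z")
```

Under this model VT (hex 0B) counts as space, which is the behaviour the diff describes for Perl 5.18 and PCRE 8.34 onwards; before the change, the Xsp line omitted VT.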