File: [ELWIX - Embedded LightWeight unIX -] / embedaddon / pcre / doc / html / pcresyntax.html
Revision 1.1.1.5 (vendor branch)
Sun Jun 15 19:46:05 2014 UTC by misho
Branches: pcre, MAIN
CVS tags: v8_34, HEAD

pcre 8.34

<html>
<head>
<title>pcresyntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcresyntax man page</h1>

Return to the <a href="index.html">PCRE index page</a>.

This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.

<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
<li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
<li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
<li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
<li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>

The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
<a href="pcrepattern.html">pcrepattern</a>
documentation. This document contains a quick-reference summary of the syntax.

<a name="SEC2" href="#TOC1">QUOTING</a>

<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>

<a name="SEC3" href="#TOC1">CHARACTERS</a>

<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII character
  \e         escape (hex 1B)
  \f         form feed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \0dd       character with octal code 0dd
  \ddd       character with octal code ddd, or backreference
  \o{ddd..}  character with octal code ddd..
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
</pre>
Note that \0dd is always an octal code, and that \8 and \9 are the literal
characters "8" and "9".

<a name="SEC4" href="#TOC1">CHARACTER TYPES</a>

<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one data unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
  \H         a character that is not a horizontal white space character
  \N         a character that is not a newline
  \p{xx}     a character with the xx property
  \P{xx}     a character without the xx property
  \R         a newline sequence
  \s         a white space character
  \S         a character that is not a white space character
  \v         a vertical white space character
  \V         a character that is not a vertical white space character
  \w         a "word" character
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster
</pre>
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \s and \w may also match characters with code points in the range
128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
is changed to use Unicode properties and they match many more characters.
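
The contrast between ASCII-only and Unicode-property matching for \d, \s, and \w
can be illustrated with a small sketch. It uses Python's built-in re module,
which is not PCRE; the assumption here is only that its re.ASCII flag behaves
much like PCRE's default, and that its default mode for text patterns behaves
roughly like PCRE_UCP. The helper name matches() is hypothetical.

```python
import re

# NOTE: Python's `re` is a different engine from PCRE. It is used here only
# as an analogy: re.ASCII ~ PCRE default, no flag ~ PCRE_UCP (assumption).

ARABIC_INDIC_THREE = "\u0663"  # decimal digit (category Nd) outside ASCII
E_ACUTE = "\u00e9"             # letter outside ASCII
NO_BREAK_SPACE = "\u00a0"      # space separator (category Zs) outside ASCII


def matches(pattern, text, ascii_only):
    """Return True if `pattern` matches all of `text` in the chosen mode."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(pattern, text, flags) is not None


for name, pattern, ch in [
    ("\\d", r"\d", ARABIC_INDIC_THREE),
    ("\\w", r"\w", E_ACUTE),
    ("\\s", r"\s", NO_BREAK_SPACE),
]:
    print(name, "ASCII-only:", matches(pattern, ch, True),
          "Unicode:", matches(pattern, ch, False))
```

In ASCII-only mode none of the three non-ASCII characters match; in Unicode mode
all three do, which mirrors the "many more characters" remark above.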

<a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>

<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&amp;         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>

<a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>

<pre>
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
</pre>
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18 and PCRE changed at release 8.34.
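
The definitions of Xan, Xps, Xsp, and Xwd above are simple unions, so they can
be written out directly. The following is a minimal sketch in Python, assuming
that the general categories reported by the standard unicodedata module are an
acceptable stand-in for PCRE's own Unicode tables (the two may be built from
different Unicode versions). The function names are hypothetical and are not
part of any PCRE API.

```python
import unicodedata

# tab, NL, VT, FF, CR, as listed in the Xps/Xsp definitions
SPACE_CONTROLS = "\t\n\v\f\r"


def is_xan(ch):
    """Xan: union of general categories L (Letter) and N (Number)."""
    return unicodedata.category(ch)[0] in "LN"


def is_xps(ch):
    """Xps: general category Z (Separator) or tab, NL, VT, FF, CR."""
    return unicodedata.category(ch)[0] == "Z" or ch in SPACE_CONTROLS


# Perl space and POSIX space are defined identically in the table above.
is_xsp = is_xps


def is_xwd(ch):
    """Xwd: Xan or underscore."""
    return is_xan(ch) or ch == "_"


for ch in ["a", "7", "_", " ", "\v", "\u2028", "-"]:
    print(repr(ch), unicodedata.category(ch),
          is_xan(ch), is_xps(ch), is_xwd(ch))
```

Each function expects a single character; Xuc is omitted because its definition
depends on which characters can be written as a Universal Character Name.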

<a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a>

Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Mandaic,
Meetei_Mayek,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.

<a name="SEC8" href="#TOC1">CHARACTER CLASSES</a>

<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\Q...\E inside a character class.

<a name="SEC9" href="#TOC1">QUANTIFIERS</a>

<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
</PRE>

<a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a>

<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>

<a name="SEC11" href="#TOC1">MATCH POINT RESET</a>

<pre>
  \K          reset start of match
</PRE>

<a name="SEC12" href="#TOC1">ALTERNATION</a>

<pre>
  expr|expr|expr...
</PRE>

<a name="SEC13" href="#TOC1">CAPTURING</a>

<pre>
  (...)           capturing group
  (?&#60;name&#62;...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P&#60;name&#62;...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>

<a name="SEC14" href="#TOC1">ATOMIC GROUPS</a>

<pre>
  (?&#62;...)         atomic, non-capturing group
</PRE>

<a name="SEC15" href="#TOC1">COMMENT</a>

<pre>
  (?#....)        comment (not nestable)
</PRE>

<a name="SEC16" href="#TOC1">OPTION SETTING</a>

<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
</pre>
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
<pre>
  (*LIMIT_MATCH=d) set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
  (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
  (*UTF)          set appropriate UTF mode for the library in use
  (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre_exec(), not increase them.

<a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>

<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?&#60;=...)        positive look behind
  (?&#60;!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
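
The look behind forms and the fixed-length rule can be tried out in a short
sketch. It uses Python's built-in re module rather than PCRE, on the assumption
that the (?&#60;=...) and (?&#60;!...) syntax behaves the same way for these simple
cases. The sample text and the helper compiles() are hypothetical.

```python
import re

# NOTE: Python's `re` is not PCRE; this only illustrates the shared syntax.

text = "cost $42, qty 7"

# Positive look behind: digits that are immediately preceded by '$'.
price = re.compile(r"(?<=\$)\d+")

# Negative look behind: digits preceded by neither '$' nor another digit.
not_price = re.compile(r"(?<![\$\d])\d+")

print(price.findall(text))      # ['42']
print(not_price.findall(text))  # ['7']


def compiles(pattern):
    """Return True if the engine accepts `pattern`, False on a syntax error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Branches of the same fixed length are accepted.
print(compiles(r"(?<=ab|cd)x"))  # True

# A look behind of unbounded length is rejected.
print(compiles(r"(?<=a+)b"))     # False
```

Python's engine is stricter than the rule stated above: it requires the whole
look behind to have a single fixed width, whereas PCRE only requires each
top-level branch to be of fixed length, so branches may differ from one another.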

<a name="SEC18" href="#TOC1">BACKREFERENCES</a>

<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k&#60;name&#62;        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>

<a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>

<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&amp;name)        call subpattern by name (Perl)
  (?P&#62;name)       call subpattern by name (Python)
  \g&#60;name&#62;        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
  \g'+n'          call subpattern by relative number (PCRE extension)
  \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
  \g'-n'          call subpattern by relative number (PCRE extension)
</PRE>

<a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a>

<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(&#60;name&#62;)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&amp;name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
</PRE>

<a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a>

The following act immediately they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
</PRE>

<a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a>

These are recognized only at the very start of the pattern or after a
(*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
</PRE>

<a name="SEC23" href="#TOC1">WHAT \R MATCHES</a>

These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or a UTF or UCP mode.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>

<a name="SEC24" href="#TOC1">CALLOUTS</a>

<pre>
  (?C)      callout
  (?Cn)     callout with data n
</PRE>

<a name="SEC25" href="#TOC1">SEE ALSO</a>

pcrepattern(3), pcreapi(3), pcrecallout(3),
pcrematching(3), pcre(3).

<a name="SEC26" href="#TOC1">AUTHOR</a>

Philip Hazel

University Computing Service

Cambridge CB2 3QH, England.

<a name="SEC27" href="#TOC1">REVISION</a>

Last updated: 12 November 2013

Copyright &copy; 1997-2013 University of Cambridge.

Return to the <a href="index.html">PCRE index page</a>.