embedaddon/pcre/doc/pcresyntax.3

Revision 1.1.1.5 (vendor branch)
Sun Jun 15 19:46:05 2014 UTC by misho
Branches: pcre, MAIN
CVS tags: v8_34, HEAD

pcre 8.34

.TH PCRESYNTAX 3 "12 November 2013" "PCRE 8.34"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
.rs
.sp
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
.\" HREF
\fBpcrepattern\fP
.\"
documentation. This document contains a quick-reference summary of the syntax.
.
.
.SH "QUOTING"
.rs
.sp
  \ex         where x is non-alphanumeric is a literal x
  \eQ...\eE    treat enclosed characters as literal
.
.
.SH "CHARACTERS"
.rs
.sp
  \ea         alarm, that is, the BEL character (hex 07)
  \ecx        "control-x", where x is any ASCII character
  \ee         escape (hex 1B)
  \ef         form feed (hex 0C)
  \en         newline (hex 0A)
  \er         carriage return (hex 0D)
  \et         tab (hex 09)
  \e0dd       character with octal code 0dd
  \eddd       character with octal code ddd, or backreference
  \eo{ddd..}  character with octal code ddd..
  \exhh       character with hex code hh
  \ex{hhh..}  character with hex code hhh..
.sp
Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
characters "8" and "9".
.
.
.SH "CHARACTER TYPES"
.rs
.sp
  .          any character except newline;
               in dotall mode, any character whatsoever
  \eC         one data unit, even in UTF mode (best avoided)
  \ed         a decimal digit
  \eD         a character that is not a decimal digit
  \eh         a horizontal white space character
  \eH         a character that is not a horizontal white space character
  \eN         a character that is not a newline
  \ep{\fIxx\fP}     a character with the \fIxx\fP property
  \eP{\fIxx\fP}     a character without the \fIxx\fP property
  \eR         a newline sequence
  \es         a white space character
  \eS         a character that is not a white space character
  \ev         a vertical white space character
  \eV         a character that is not a vertical white space character
  \ew         a "word" character
  \eW         a "non-word" character
  \eX         a Unicode extended grapheme cluster
.sp
By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \es and \ew may also match characters with code points in the range
128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
is changed to use Unicode properties and they match many more characters.
.
.
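.sp
As a rough illustration of the ASCII-versus-Unicode distinction described
above, the following sketch uses Python's re module, which is not PCRE. Its
re.ASCII flag serves here only as an analogue of PCRE's default behaviour
(PCRE_UCP not set), and the flagless form as an analogue of PCRE_UCP. The
helper name and sample characters are hypothetical.
.sp
```python
import re

# Python's re is NOT PCRE. re.ASCII stands in for PCRE's default
# (ASCII-only \d, \s, \w); omitting it stands in for PCRE_UCP.
ARABIC_THREE = "\u0663"  # ARABIC-INDIC DIGIT THREE, general category Nd
E_ACUTE = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE, category Ll


def matches(pattern, text, ascii_only):
    """Return True if the whole of text matches pattern."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(pattern, text, flags) is not None


ascii_digit = matches(r"\d", ARABIC_THREE, ascii_only=True)
unicode_digit = matches(r"\d", ARABIC_THREE, ascii_only=False)
ascii_word = matches(r"\w", E_ACUTE, ascii_only=True)
unicode_word = matches(r"\w", E_ACUTE, ascii_only=False)
```
.sp
With the ASCII restriction both characters are rejected; without it both are
accepted, which mirrors the "many more characters" effect noted above.
.
.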
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate
.sp
  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt
.sp
  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark
.sp
  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number
.sp
  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation
.sp
  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol
.sp
  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
.
.
.SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
.sp
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18 and PCRE changed at release 8.34.
.
.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Mandaic,
Meetei_Mayek,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
.
.
.SH "CHARACTER CLASSES"
.rs
.sp
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set
.sp
  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \ew
  xdigit      hexadecimal digit
.sp
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\eQ...\eE inside a character class.
.
.
.SH "QUANTIFIERS"
.rs
.sp
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
.
.
.SH "ANCHORS AND SIMPLE ASSERTIONS"
.rs
.sp
  \eb          word boundary
  \eB          not a word boundary
  ^           start of subject
                also after internal newline in multiline mode
  \eA          start of subject
  $           end of subject
                also before newline at end of subject
                also before internal newline in multiline mode
  \eZ          end of subject
                also before newline at end of subject
  \ez          end of subject
  \eG          first matching position in subject
.
.
.SH "MATCH POINT RESET"
.rs
.sp
  \eK          reset start of match
.
.
.SH "ALTERNATION"
.rs
.sp
  expr|expr|expr...
.
.
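.sp
The difference between the greedy and lazy quantifiers, and the effect of
multiline mode on the ^ anchor, can be sketched as follows. The sketch uses
Python's re module as a stand-in, not PCRE itself; only constructs that both
engines share are used, and the subject strings are hypothetical.
.sp
```python
import re

# Python's re is used only as a stand-in for syntax it shares with PCRE.
subject = "<a><b>"

# Greedy: .+ takes as much as it can, so the match spans both tags.
greedy = re.search(r"<.+>", subject).group(0)

# Lazy: .+? takes as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.+?>", subject).group(0)

text = "one\ntwo\nthree"

# Without multiline mode, ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", text)

# In multiline mode, ^ also matches after each internal newline.
starts_multiline = re.findall(r"^\w+", text, re.MULTILINE)
```
.
.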
.SH "CAPTURING"
.rs
.sp
  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
.
.
.SH "ATOMIC GROUPS"
.rs
.sp
  (?>...)         atomic, non-capturing group
.
.
.
.
.SH "COMMENT"
.rs
.sp
  (?#....)        comment (not nestable)
.
.
.SH "OPTION SETTING"
.rs
.sp
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
.sp
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
.sp
  (*LIMIT_MATCH=d)     set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
  (*NO_START_OPT)      no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)              set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)             set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UTF32)             set UTF-32 mode: 32-bit library (PCRE_UTF32)
  (*UTF)               set appropriate UTF mode for the library in use
  (*UCP)               set PCRE_UCP (use Unicode properties for \ed etc)
.sp
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre_exec(), not increase them.
.
.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
.
.
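.sp
The Python-style named groups, the non-capturing group, the (?i) option and
the look ahead and look behind assertions listed above can be tried out with
a short sketch. It uses Python's re module as a stand-in for PCRE; only
syntax common to both is shown, and the sample subjects are hypothetical.
.sp
```python
import re

# Python's re is used only as a stand-in for syntax it shares with PCRE.

# Two named capturing groups separated by a non-capturing group.
m = re.match(r"(?P<key>\w+)(?:\s*=\s*)(?P<value>\w+)", "colour = red")
pair = (m.group("key"), m.group("value"))
group_count = m.re.groups  # the (?:...) group is not counted

# (?i) at the start of the pattern makes the match caseless.
caseless = re.search(r"(?i)perl", "PERL-compatible") is not None

# Fixed-length look behind for "$" and look ahead for ".dd":
# neither assertion becomes part of the matched text.
price = re.search(r"(?<=\$)\d+(?=\.\d\d)", "total: $42.50").group(0)

# Negative look ahead: a "q" that is not followed by "u".
q_without_u = re.findall(r"q(?!u)", "qatar quit iraq")
```
.
.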
.SH "BACKREFERENCES"
.rs
.sp
  \en              reference by number (can be ambiguous)
  \egn             reference by number
  \eg{n}           reference by number
  \eg{-n}          relative reference by number
  \ek<name>        reference by name (Perl)
  \ek'name'        reference by name (Perl)
  \eg{name}        reference by name (Perl)
  \ek{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
.
.
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
.rs
.sp
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P>name)       call subpattern by name (Python)
  \eg<name>        call subpattern by name (Oniguruma)
  \eg'name'        call subpattern by name (Oniguruma)
  \eg<n>           call subpattern by absolute number (Oniguruma)
  \eg'n'           call subpattern by absolute number (Oniguruma)
  \eg<+n>          call subpattern by relative number (PCRE extension)
  \eg'+n'          call subpattern by relative number (PCRE extension)
  \eg<-n>          call subpattern by relative number (PCRE extension)
  \eg'-n'          call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
.rs
.sp
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
.sp
  (?(n)...          absolute reference condition
  (?(+n)...         relative reference condition
  (?(-n)...         relative reference condition
  (?(<name>)...     named reference condition (Perl)
  (?('name')...     named reference condition (Perl)
  (?(name)...       named reference condition (PCRE)
  (?(R)...          overall recursion condition
  (?(Rn)...         specific group recursion condition
  (?(R&name)...     specific recursion condition
  (?(DEFINE)...     define subpattern for reference
  (?(assert)...     assertion condition
.
.
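.sp
A Python-style named backreference and an absolute reference condition can be
sketched as below. Python's re module is used as a stand-in for PCRE; it
supports these two forms but not the subroutine or recursion syntax listed
above. The subjects and variable names are hypothetical.
.sp
```python
import re

# Python's re is used only as a stand-in for syntax it shares with PCRE.

# (?P=w) must match the same text that the group named "w" captured,
# so this finds an immediately repeated word.
doubled = re.search(r"\b(?P<w>\w+)\s+(?P=w)\b", "this is is a test").group("w")

# (?(1)\)) requires a closing parenthesis only if group 1 (the optional
# opening parenthesis) took part in the match.
balanced = re.compile(r"(\()?\d+(?(1)\))")
results = {
    s: balanced.fullmatch(s) is not None
    for s in ["(42)", "42", "(42", "42)"]
}
```
.sp
Only the first two subjects are accepted: the condition rejects an opening
parenthesis without a matching close, and the unconditional part rejects a
stray closing parenthesis.
.
.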
.SH "BACKTRACKING CONTROL"
.rs
.sp
The following act immediately they are reached:
.sp
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
.sp
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
.
.
.SH "NEWLINE CONVENTIONS"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
.sp
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
.
.
.SH "WHAT \eR MATCHES"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or a UTF or UCP mode.
.sp
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
.
.
.SH "CALLOUTS"
.rs
.sp
  (?C)      callout
  (?Cn)     callout with data n
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
\fBpcrematching\fP(3), \fBpcre\fP(3).
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 12 November 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi