embedaddon/pcre/doc/pcresyntax.3 - view

Revision 1.1.1.4 (vendor branch)
Mon Jul 22 08:25:56 2013 UTC (11 years, 11 months ago) by misho
Branches: pcre, MAIN
CVS tags: v8_33, HEAD

8.33

.TH PCRESYNTAX 3 "26 April 2013" "PCRE 8.33"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
.rs
.sp
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
.\" HREF
\fBpcrepattern\fP
.\"
documentation. This document contains a quick-reference summary of the syntax.
.
.
.SH "QUOTING"
.rs
.sp
  \ex         where x is non-alphanumeric is a literal x
  \eQ...\eE    treat enclosed characters as literal
.
.
.SH "CHARACTERS"
.rs
.sp
  \ea         alarm, that is, the BEL character (hex 07)
  \ecx        "control-x", where x is any ASCII character
  \ee         escape (hex 1B)
  \ef         form feed (hex 0C)
  \en         newline (hex 0A)
  \er         carriage return (hex 0D)
  \et         tab (hex 09)
  \eddd       character with octal code ddd, or backreference
  \exhh       character with hex code hh
  \ex{hhh..}  character with hex code hhh..
.
.
.SH "CHARACTER TYPES"
.rs
.sp
  .          any character except newline;
               in dotall mode, any character whatsoever
  \eC         one data unit, even in UTF mode (best avoided)
  \ed         a decimal digit
  \eD         a character that is not a decimal digit
  \eh         a horizontal white space character
  \eH         a character that is not a horizontal white space character
  \eN         a character that is not a newline
  \ep{\fIxx\fP}     a character with the \fIxx\fP property
  \eP{\fIxx\fP}     a character without the \fIxx\fP property
  \eR         a newline sequence
  \es         a white space character
  \eS         a character that is not a white space character
  \ev         a vertical white space character
  \eV         a character that is not a vertical white space character
  \ew         a "word" character
  \eW         a "non-word" character
  \eX         a Unicode extended grapheme cluster
.sp
In PCRE, by default, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII
characters, even in a UTF mode.
However, this can be changed by setting the
PCRE_UCP option.
.
.
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate
.sp
  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt
.sp
  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark
.sp
  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number
.sp
  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation
.sp
  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol
.sp
  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
.
.
.SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
.
.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Mandaic,
Meetei_Mayek,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
.
.
.SH "CHARACTER CLASSES"
.rs
.sp
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set
.sp
  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \ew
  xdigit      hexadecimal digit
.sp
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\eQ...\eE inside a character class.
.
.
.SH "QUANTIFIERS"
.rs
.sp
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
.
.
.SH "ANCHORS AND SIMPLE ASSERTIONS"
.rs
.sp
  \eb          word boundary
  \eB          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \eA          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \eZ          end of subject
               also before newline at end of subject
  \ez          end of subject
  \eG          first matching position in subject
.
.
.SH "MATCH POINT RESET"
.rs
.sp
  \eK          reset start of match
.
.
.SH "ALTERNATION"
.rs
.sp
  expr|expr|expr...
.
.
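The difference between greedy and lazy quantifiers, and the effect of multiline mode on the start-of-line anchor, may be easier to see in a short run. The following is a minimal sketch using Python's standard re module, which is not PCRE but shares this part of the syntax; the sample strings are made up for illustration, and possessive quantifiers and the match point reset are left out because older Python versions do not support them.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ takes as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.+>", text)

# Lazy: .+? takes as little as possible, so each tag matches separately.
lazy = re.findall(r"<.+?>", text)

lines = "first line\nsecond line\n"

# Without multiline mode, ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", lines)

# With multiline mode, ^ also matches after each internal newline.
starts_multiline = re.findall(r"^\w+", lines, flags=re.MULTILINE)

# Outside multiline mode, $ still matches before a newline at the very end
# of the subject, but not before the internal newline.
end_word = re.search(r"\w+$", lines).group(0)

print(greedy)
print(lazy)
print(starts_default, starts_multiline, end_word)
```

The same patterns should behave the same way under PCRE with the corresponding PCRE_MULTILINE option, though that is not exercised here.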
.SH "CAPTURING"
.rs
.sp
  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
.
.
.SH "ATOMIC GROUPS"
.rs
.sp
  (?>...)         atomic, non-capturing group
.
.
.
.
.SH "COMMENT"
.rs
.sp
  (?#....)        comment (not nestable)
.
.
.SH "OPTION SETTING"
.rs
.sp
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
.sp
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
.sp
  (*LIMIT_MATCH=d) set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
  (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
  (*UTF)          set appropriate UTF mode for the library in use
  (*UCP)          set PCRE_UCP (use Unicode properties for \ed etc)
.
.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
.
.
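Grouping, inline options, and lookaround assertions from the tables above can be sketched together as follows. This is an illustrative example using Python's standard re module rather than PCRE itself, so it uses only the Python-style named group and constructs that both engines accept; the input strings are invented for the example.

```python
import re

# (?P<name>...) is the Python-style named capturing group.
date = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2013-04-26")
year, month = date.group("year"), date.group("month")

# (?:...) groups without capturing, so only one group is numbered here.
only_group = re.search(r"(?:un)?(happy)", "unhappy").groups()

# (?i) at the start of the pattern makes the whole match caseless.
caseless = re.findall(r"(?i)pcre", "PCRE, pcre, Pcre")

# Positive lookahead: digits only when followed by "px"; "px" is not consumed.
pixels = re.findall(r"\d+(?=px)", "10px 20em 30px")

# Negative lookbehind: digits not preceded by "$" or by another digit.
# The lookbehind body has a fixed length of one character.
plain = re.findall(r"(?<![$\d])\d+", "$15 20 $300 4")

print(year, month, only_group, caseless, pixels, plain)
```

Atomic groups, the branch-reset group, and the (*...) start-of-pattern items are PCRE features that this sketch does not attempt to cover.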
.SH "BACKREFERENCES"
.rs
.sp
  \en              reference by number (can be ambiguous)
  \egn             reference by number
  \eg{n}           reference by number
  \eg{-n}          relative reference by number
  \ek<name>        reference by name (Perl)
  \ek'name'        reference by name (Perl)
  \eg{name}        reference by name (Perl)
  \ek{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
.
.
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
.rs
.sp
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P>name)       call subpattern by name (Python)
  \eg<name>        call subpattern by name (Oniguruma)
  \eg'name'        call subpattern by name (Oniguruma)
  \eg<n>           call subpattern by absolute number (Oniguruma)
  \eg'n'           call subpattern by absolute number (Oniguruma)
  \eg<+n>          call subpattern by relative number (PCRE extension)
  \eg'+n'          call subpattern by relative number (PCRE extension)
  \eg<-n>          call subpattern by relative number (PCRE extension)
  \eg'-n'          call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
.rs
.sp
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
.sp
  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(<name>)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
.
.
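A backreference and an absolute reference condition can be illustrated with a small sketch. It again uses Python's standard re module as a stand-in, on the assumption that these particular forms (reference by number, the Python-style reference by name, and a condition on a numbered group) behave the same way in PCRE; subroutine calls and recursion are not available in re and are not shown. The sample strings are hypothetical.

```python
import re

# A numbered backreference matches the same text that group 1 captured.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# (?P=name) is the Python-style reference by name: the closing quote
# must be the same character as the opening quote.
quoted = [m.group(0) for m in
          re.finditer(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' or \"bye\"")]

# Conditional: if group 1 (an opening "<") took part in the match,
# require a closing ">"; otherwise require the end of the subject.
pattern = re.compile(r"^(<)?\w+@\w+\.\w+(?(1)>|$)")
subjects = ("<user@host.com>", "user@host.com",
            "<user@host.com", "user@host.com>")
results = [bool(pattern.search(s)) for s in subjects]

print(doubled, quoted, results)
```

Only the balanced forms match: the condition ties the closing bracket to whether the opening bracket was captured.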
.SH "BACKTRACKING CONTROL"
.rs
.sp
The following act immediately they are reached:
.sp
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
.sp
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
.
.
.SH "NEWLINE CONVENTIONS"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
.sp
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
.
.
.SH "WHAT \eR MATCHES"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or a UTF or UCP mode.
.sp
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
.
.
.SH "CALLOUTS"
.rs
.sp
  (?C)      callout
  (?Cn)     callout with data n
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
\fBpcrematching\fP(3), \fBpcre\fP(3).
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 26 April 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi