Annotation of embedaddon/pcre/doc/html/pcresyntax.html, revision 1.1
1.1 ! misho 1: <html>
! 2: <head>
! 3: <title>pcresyntax specification</title>
! 4: </head>
! 5: <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
! 6: <h1>pcresyntax man page</h1>
! 7: <p>
! 8: Return to the <a href="index.html">PCRE index page</a>.
! 9: </p>
! 10: <p>
! 11: This page is part of the PCRE HTML documentation. It was generated automatically
! 12: from the original man page. If there is any nonsense in it, please consult the
! 13: man page, in case the conversion went wrong.
! 14: <br>
! 15: <ul>
! 16: <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
! 17: <li><a name="TOC2" href="#SEC2">QUOTING</a>
! 18: <li><a name="TOC3" href="#SEC3">CHARACTERS</a>
! 19: <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
! 20: <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
! 21: <li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
! 22: <li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
! 23: <li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
! 24: <li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
! 25: <li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
! 26: <li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
! 27: <li><a name="TOC12" href="#SEC12">ALTERNATION</a>
! 28: <li><a name="TOC13" href="#SEC13">CAPTURING</a>
! 29: <li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
! 30: <li><a name="TOC15" href="#SEC15">COMMENT</a>
! 31: <li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
! 32: <li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
! 33: <li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
! 34: <li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
! 35: <li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
! 36: <li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
! 37: <li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
! 38: <li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
! 39: <li><a name="TOC24" href="#SEC24">CALLOUTS</a>
! 40: <li><a name="TOC25" href="#SEC25">SEE ALSO</a>
! 41: <li><a name="TOC26" href="#SEC26">AUTHOR</a>
! 42: <li><a name="TOC27" href="#SEC27">REVISION</a>
! 43: </ul>
! 44: <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
! 45: <P>
! 46: The full syntax and semantics of the regular expressions that are supported by
! 47: PCRE are described in the
! 48: <a href="pcrepattern.html"><b>pcrepattern</b></a>
! 49: documentation. This document contains just a quick-reference summary of the
! 50: syntax.
! 51: </P>
! 52: <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
! 53: <P>
! 54: <pre>
! 55: \x where x is non-alphanumeric is a literal x
! 56: \Q...\E treat enclosed characters as literal
! 57: </PRE>
! 58: </P>
! 59: <br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
! 60: <P>
! 61: <pre>
! 62: \a alarm, that is, the BEL character (hex 07)
! 63: \cx "control-x", where x is any ASCII character
! 64: \e escape (hex 1B)
! 65: \f formfeed (hex 0C)
! 66: \n newline (hex 0A)
! 67: \r carriage return (hex 0D)
! 68: \t tab (hex 09)
! 69: \ddd character with octal code ddd, or backreference
! 70: \xhh character with hex code hh
! 71: \x{hhh..} character with hex code hhh..
! 72: </PRE>
! 73: </P>
! 74: <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
! 75: <P>
! 76: <pre>
! 77:   .          any character except newline;
! 78:                in dotall mode, any character whatsoever
! 79:   \C         one byte, even in UTF-8 mode (best avoided)
! 80:   \d         a decimal digit
! 81:   \D         a character that is not a decimal digit
! 82:   \h         a horizontal whitespace character
! 83:   \H         a character that is not a horizontal whitespace character
! 84:   \N         a character that is not a newline
! 85:   \p{<i>xx</i>}     a character with the <i>xx</i> property
! 86:   \P{<i>xx</i>}     a character without the <i>xx</i> property
! 87:   \R         a newline sequence
! 88:   \s         a whitespace character
! 89:   \S         a character that is not a whitespace character
! 90:   \v         a vertical whitespace character
! 91:   \V         a character that is not a vertical whitespace character
! 92:   \w         a "word" character
! 93:   \W         a "non-word" character
! 94:   \X         an extended Unicode sequence
! 95: </pre>
! 96: In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
! 97: characters, even in UTF-8 mode. However, this can be changed by setting the
! 98: PCRE_UCP option.
! 99: </P>
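<P>
The ASCII-only default described above can be illustrated with a small sketch.
It uses Python's <b>re</b> module, which is not PCRE and is only an assumed
stand-in here: its default for text patterns is the reverse of PCRE's (Unicode
aware), and its re.ASCII flag gives behaviour comparable to PCRE without
PCRE_UCP.
</P>

```python
import re

# U+0663 is ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit.
subject = "7\u0663"

# re.ASCII restricts \d to [0-9], comparable to PCRE without PCRE_UCP.
ascii_digits = re.findall(r"\d", subject, re.ASCII)

# Without the flag, Python's \d uses Unicode properties, comparable to PCRE_UCP.
unicode_digits = re.findall(r"\d", subject)

print(ascii_digits)         # ['7']
print(len(unicode_digits))  # 2
```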
! 100: <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
! 101: <P>
! 102: <pre>
! 103: C Other
! 104: Cc Control
! 105: Cf Format
! 106: Cn Unassigned
! 107: Co Private use
! 108: Cs Surrogate
! 109:
! 110: L Letter
! 111: Ll Lower case letter
! 112: Lm Modifier letter
! 113: Lo Other letter
! 114: Lt Title case letter
! 115: Lu Upper case letter
! 116: L& Ll, Lu, or Lt
! 117:
! 118: M Mark
! 119: Mc Spacing mark
! 120: Me Enclosing mark
! 121: Mn Non-spacing mark
! 122:
! 123: N Number
! 124: Nd Decimal number
! 125: Nl Letter number
! 126: No Other number
! 127:
! 128: P Punctuation
! 129: Pc Connector punctuation
! 130: Pd Dash punctuation
! 131: Pe Close punctuation
! 132: Pf Final punctuation
! 133: Pi Initial punctuation
! 134: Po Other punctuation
! 135: Ps Open punctuation
! 136:
! 137: S Symbol
! 138: Sc Currency symbol
! 139: Sk Modifier symbol
! 140: Sm Mathematical symbol
! 141: So Other symbol
! 142:
! 143: Z Separator
! 144: Zl Line separator
! 145: Zp Paragraph separator
! 146: Zs Space separator
! 147: </PRE>
! 148: </P>
! 149: <br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
! 150: <P>
! 151: <pre>
! 152: Xan Alphanumeric: union of properties L and N
! 153: Xps POSIX space: property Z or tab, NL, VT, FF, CR
! 154: Xsp Perl space: property Z or tab, NL, FF, CR
! 155: Xwd Perl word: property Xan or underscore
! 156: </PRE>
! 157: </P>
! 158: <br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
! 159: <P>
! 160: Arabic,
! 161: Armenian,
! 162: Avestan,
! 163: Balinese,
! 164: Bamum,
! 165: Bengali,
! 166: Bopomofo,
! 167: Braille,
! 168: Buginese,
! 169: Buhid,
! 170: Canadian_Aboriginal,
! 171: Carian,
! 172: Cham,
! 173: Cherokee,
! 174: Common,
! 175: Coptic,
! 176: Cuneiform,
! 177: Cypriot,
! 178: Cyrillic,
! 179: Deseret,
! 180: Devanagari,
! 181: Egyptian_Hieroglyphs,
! 182: Ethiopic,
! 183: Georgian,
! 184: Glagolitic,
! 185: Gothic,
! 186: Greek,
! 187: Gujarati,
! 188: Gurmukhi,
! 189: Han,
! 190: Hangul,
! 191: Hanunoo,
! 192: Hebrew,
! 193: Hiragana,
! 194: Imperial_Aramaic,
! 195: Inherited,
! 196: Inscriptional_Pahlavi,
! 197: Inscriptional_Parthian,
! 198: Javanese,
! 199: Kaithi,
! 200: Kannada,
! 201: Katakana,
! 202: Kayah_Li,
! 203: Kharoshthi,
! 204: Khmer,
! 205: Lao,
! 206: Latin,
! 207: Lepcha,
! 208: Limbu,
! 209: Linear_B,
! 210: Lisu,
! 211: Lycian,
! 212: Lydian,
! 213: Malayalam,
! 214: Meetei_Mayek,
! 215: Mongolian,
! 216: Myanmar,
! 217: New_Tai_Lue,
! 218: Nko,
! 219: Ogham,
! 220: Old_Italic,
! 221: Old_Persian,
! 222: Old_South_Arabian,
! 223: Old_Turkic,
! 224: Ol_Chiki,
! 225: Oriya,
! 226: Osmanya,
! 227: Phags_Pa,
! 228: Phoenician,
! 229: Rejang,
! 230: Runic,
! 231: Samaritan,
! 232: Saurashtra,
! 233: Shavian,
! 234: Sinhala,
! 235: Sundanese,
! 236: Syloti_Nagri,
! 237: Syriac,
! 238: Tagalog,
! 239: Tagbanwa,
! 240: Tai_Le,
! 241: Tai_Tham,
! 242: Tai_Viet,
! 243: Tamil,
! 244: Telugu,
! 245: Thaana,
! 246: Thai,
! 247: Tibetan,
! 248: Tifinagh,
! 249: Ugaritic,
! 250: Vai,
! 251: Yi.
! 252: </P>
! 253: <br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
! 254: <P>
! 255: <pre>
! 256: [...] positive character class
! 257: [^...] negative character class
! 258: [x-y] range (can be used for hex characters)
! 259: [[:xxx:]] positive POSIX named set
! 260: [[:^xxx:]] negative POSIX named set
! 261:
! 262: alnum alphanumeric
! 263: alpha alphabetic
! 264: ascii 0-127
! 265: blank space or tab
! 266: cntrl control character
! 267: digit decimal digit
! 268: graph printing, excluding space
! 269: lower lower case letter
! 270: print printing, including space
! 271: punct printing, excluding alphanumeric
! 272: space whitespace
! 273: upper upper case letter
! 274: word same as \w
! 275: xdigit hexadecimal digit
! 276: </pre>
! 277: In PCRE, POSIX character set names recognize only ASCII characters by default,
! 278: but some of them use Unicode properties if PCRE_UCP is set. You can use
! 279: \Q...\E inside a character class.
! 280: </P>
! 281: <br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
! 282: <P>
! 283: <pre>
! 284: ? 0 or 1, greedy
! 285: ?+ 0 or 1, possessive
! 286: ?? 0 or 1, lazy
! 287: * 0 or more, greedy
! 288: *+ 0 or more, possessive
! 289: *? 0 or more, lazy
! 290: + 1 or more, greedy
! 291: ++ 1 or more, possessive
! 292: +? 1 or more, lazy
! 293: {n} exactly n
! 294: {n,m} at least n, no more than m, greedy
! 295: {n,m}+ at least n, no more than m, possessive
! 296: {n,m}? at least n, no more than m, lazy
! 297: {n,} n or more, greedy
! 298: {n,}+ n or more, possessive
! 299: {n,}? n or more, lazy
! 300: </PRE>
! 301: </P>
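<P>
As a rough illustration of the greedy and lazy forms listed above, the
following sketch uses Python's <b>re</b> module, which shares this part of the
syntax but is not PCRE. Possessive quantifiers are left out because older
Python versions do not accept them.
</P>

```python
import re

subject = "aXbXb"

# Greedy: .* takes as much as possible, then backs off to find the last "b".
greedy = re.search(r"a.*b", subject).group()

# Lazy: .*? takes as little as possible, stopping at the first "b".
lazy = re.search(r"a.*?b", subject).group()

# {n,m}: five digits exceed the upper bound of four, so a full match fails.
bounded = re.fullmatch(r"\d{2,4}", "12345")

print(greedy)   # aXbXb
print(lazy)     # aXb
print(bounded)  # None
```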
! 302: <br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
! 303: <P>
! 304: <pre>
! 305:   \b          word boundary
! 306:   \B          not a word boundary
! 307:   ^           start of subject
! 308:                also after internal newline in multiline mode
! 309:   \A          start of subject
! 310:   $           end of subject
! 311:                also before newline at end of subject
! 312:                also before internal newline in multiline mode
! 313:   \Z          end of subject
! 314:                also before newline at end of subject
! 315:   \z          end of subject
! 316:   \G          first matching position in subject
! 317: </PRE>
! 318: </P>
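<P>
The effect of multiline mode on ^ and $ can be sketched as follows. The example
assumes Python's <b>re</b> module as a stand-in for PCRE; the two agree on the
behaviour shown, but note that Python's \Z corresponds to PCRE's \z, so the \Z
and \z entries above are not exercised here.
</P>

```python
import re

subject = "one\ntwo\n"

# By default ^ matches only at the start of the subject.
default_starts = re.findall(r"^\w+", subject)

# In multiline mode ^ also matches after each internal newline.
multiline_starts = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches before a newline that ends the subject, even without multiline mode.
dollar_before_final_newline = re.search(r"two$", subject) is not None

# \b matches only where a word character meets a non-word character.
whole_words = re.findall(r"\bcat\b", "cat concat cat")

print(default_starts)               # ['one']
print(multiline_starts)             # ['one', 'two']
print(dollar_before_final_newline)  # True
print(whole_words)                  # ['cat', 'cat']
```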
! 319: <br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
! 320: <P>
! 321: <pre>
! 322: \K reset start of match
! 323: </PRE>
! 324: </P>
! 325: <br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
! 326: <P>
! 327: <pre>
! 328: expr|expr|expr...
! 329: </PRE>
! 330: </P>
! 331: <br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
! 332: <P>
! 333: <pre>
! 334:   (...)           capturing group
! 335:   (?<name>...)    named capturing group (Perl)
! 336:   (?'name'...)    named capturing group (Perl)
! 337:   (?P<name>...)   named capturing group (Python)
! 338:   (?:...)         non-capturing group
! 339:   (?|...)         non-capturing group; reset group numbers for
! 340:                    capturing groups in each alternative
! 341: </PRE>
! 342: </P>
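<P>
A short sketch of named and non-capturing groups, using the Python-style
(?P&lt;name&gt;...) form from the table above. It runs under Python's <b>re</b>
module, assumed here as a stand-in for PCRE; the date string and group names
are illustrative only.
</P>

```python
import re

# Two named capturing groups with a non-capturing group between them.
date = re.match(r"(?P<year>\d{4})-(?:\d{2})-(?P<day>\d{2})", "2010-11-21")

year = date.group("year")   # access by name
captured = date.groups()    # the (?:...) group does not appear here

print(year)      # 2010
print(captured)  # ('2010', '21')
```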
! 343: <br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
! 344: <P>
! 345: <pre>
! 346: (?>...) atomic, non-capturing group
! 347: </PRE>
! 348: </P>
! 349: <br><a name="SEC15" href="#TOC1">COMMENT</a><br>
! 350: <P>
! 351: <pre>
! 352: (?#....) comment (not nestable)
! 353: </PRE>
! 354: </P>
! 355: <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
! 356: <P>
! 357: <pre>
! 358: (?i) caseless
! 359: (?J) allow duplicate names
! 360: (?m) multiline
! 361: (?s) single line (dotall)
! 362: (?U) default ungreedy (lazy)
! 363: (?x) extended (ignore white space)
! 364: (?-...) unset option(s)
! 365: </pre>
! 366: The following are recognized only at the start of a pattern or after one of the
! 367: newline-setting options with similar syntax:
! 368: <pre>
! 369: (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
! 370: (*UTF8) set UTF-8 mode (PCRE_UTF8)
! 371: (*UCP) set PCRE_UCP (use Unicode properties for \d etc)
! 372: </PRE>
! 373: </P>
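<P>
The inline options (?i), (?s) and (?x) can be tried with the sketch below. It
assumes Python's <b>re</b> module, which accepts these three at the start of a
pattern; (?J), (?U) and the (*...) items above are PCRE-specific and are not
shown.
</P>

```python
import re

# (?i): caseless matching.
caseless = re.search(r"(?i)pcre", "Uses PCRE syntax").group()

# Without (?s) the dot does not match a newline; with it, it does.
dot_default = re.match(r"a.b", "a\nb")
dot_all = re.match(r"(?s)a.b", "a\nb")

# (?x): white space in the pattern is ignored.
extended = re.match(r"(?x) \d+ \s* kg", "12 kg").group()

print(caseless)             # PCRE
print(dot_default)          # None
print(dot_all is not None)  # True
print(extended)             # 12 kg
```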
! 374: <br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
! 375: <P>
! 376: <pre>
! 377: (?=...) positive look ahead
! 378: (?!...) negative look ahead
! 379: (?<=...) positive look behind
! 380: (?<!...) negative look behind
! 381: </pre>
! 382: Each top-level branch of a look behind must be of a fixed length.
! 383: </P>
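<P>
The assertions above, and the fixed-length rule for look behind, can be
sketched with Python's <b>re</b> module, assumed here as a stand-in. Python is
stricter than the rule stated above: it requires the whole look behind, not
just each top-level branch, to have a single fixed width.
</P>

```python
import re

# Positive look behind: digits that directly follow a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", "cost $42, 7 items")

# Negative look ahead: "foo" not followed by "bar" (start offsets of matches).
not_before_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# A variable-length look behind is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_length_accepted = True
except re.error:
    variable_length_accepted = False

print(after_dollar)              # ['42']
print(not_before_bar)            # [7]
print(variable_length_accepted)  # False
```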
! 384: <br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
! 385: <P>
! 386: <pre>
! 387: \n reference by number (can be ambiguous)
! 388: \gn reference by number
! 389: \g{n} reference by number
! 390: \g{-n} relative reference by number
! 391: \k<name> reference by name (Perl)
! 392: \k'name' reference by name (Perl)
! 393: \g{name} reference by name (Perl)
! 394: \k{name} reference by name (.NET)
! 395: (?P=name) reference by name (Python)
! 396: </PRE>
! 397: </P>
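<P>
Two of the reference forms above, \n and (?P=name), are also accepted by
Python's <b>re</b> module, so they can be sketched there; the \g and \k forms
are not part of Python's pattern syntax and are omitted. The subject strings
are illustrative only.
</P>

```python
import re

# Reference by number: find a word that is immediately repeated.
by_number = re.search(r"\b(\w+) \1\b", "it is is fine").group()

# Reference by name (Python form): the same word on both sides of a hyphen.
by_name = re.search(r"(?P<w>\w+)-(?P=w)", "tick-tock bye-bye").group()

print(by_number)  # is is
print(by_name)    # bye-bye
```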
! 398: <br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
! 399: <P>
! 400: <pre>
! 401: (?R) recurse whole pattern
! 402: (?n) call subpattern by absolute number
! 403: (?+n) call subpattern by relative number
! 404: (?-n) call subpattern by relative number
! 405: (?&name) call subpattern by name (Perl)
! 406: (?P>name) call subpattern by name (Python)
! 407: \g<name> call subpattern by name (Oniguruma)
! 408: \g'name' call subpattern by name (Oniguruma)
! 409: \g<n> call subpattern by absolute number (Oniguruma)
! 410: \g'n' call subpattern by absolute number (Oniguruma)
! 411: \g<+n> call subpattern by relative number (PCRE extension)
! 412: \g'+n' call subpattern by relative number (PCRE extension)
! 413: \g<-n> call subpattern by relative number (PCRE extension)
! 414: \g'-n' call subpattern by relative number (PCRE extension)
! 415: </PRE>
! 416: </P>
! 417: <br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
! 418: <P>
! 419: <pre>
! 420: (?(condition)yes-pattern)
! 421: (?(condition)yes-pattern|no-pattern)
! 422:
! 423: (?(n)... absolute reference condition
! 424: (?(+n)... relative reference condition
! 425: (?(-n)... relative reference condition
! 426: (?(<name>)... named reference condition (Perl)
! 427: (?('name')... named reference condition (Perl)
! 428: (?(name)... named reference condition (PCRE)
! 429: (?(R)... overall recursion condition
! 430: (?(Rn)... specific group recursion condition
! 431: (?(R&name)... specific recursion condition
! 432: (?(DEFINE)... define subpattern for reference
! 433: (?(assert)... assertion condition
! 434: </PRE>
! 435: </P>
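<P>
The absolute reference condition (?(n)...) can be sketched as below: a closing
parenthesis is required only if an opening one was captured. The example
assumes Python's <b>re</b> module, which supports the (?(n)...) and
(?(name)...) forms but not the recursion, DEFINE or assertion conditions.
</P>

```python
import re

# Group 1 optionally captures "("; the condition demands ")" only if it did.
pattern = re.compile(r"(\()?\w+(?(1)\))")

results = {
    s: pattern.fullmatch(s) is not None
    for s in ["abc", "(abc)", "(abc", "abc)"]
}

print(results)
# {'abc': True, '(abc)': True, '(abc': False, 'abc)': False}
```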
! 436: <br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
! 437: <P>
! 438: The following act as soon as they are reached:
! 439: <pre>
! 440: (*ACCEPT) force successful match
! 441: (*FAIL) force backtrack; synonym (*F)
! 442: </pre>
! 443: The following act only when a subsequent match failure causes a backtrack to
! 444: reach them. They all force a match failure, but they differ in what happens
! 445: afterwards. Those that advance the start-of-match point do so only if the
! 446: pattern is not anchored.
! 447: <pre>
! 448: (*COMMIT) overall failure, no advance of starting point
! 449: (*PRUNE) advance to next starting character
! 450: (*SKIP) advance start to current matching position
! 451: (*THEN) local failure, backtrack to next alternation
! 452: </PRE>
! 453: </P>
! 454: <br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
! 455: <P>
! 456: These are recognized only at the very start of the pattern or after a
! 457: (*BSR_...) or (*UTF8) or (*UCP) option.
! 458: <pre>
! 459: (*CR) carriage return only
! 460: (*LF) linefeed only
! 461: (*CRLF) carriage return followed by linefeed
! 462: (*ANYCRLF) all three of the above
! 463: (*ANY) any Unicode newline sequence
! 464: </PRE>
! 465: </P>
! 466: <br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
! 467: <P>
! 468: These are recognized only at the very start of the pattern or after a
! 469: (*...) option that sets the newline convention or UTF-8 or UCP mode.
! 470: <pre>
! 471: (*BSR_ANYCRLF) CR, LF, or CRLF
! 472: (*BSR_UNICODE) any Unicode newline sequence
! 473: </PRE>
! 474: </P>
! 475: <br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
! 476: <P>
! 477: <pre>
! 478: (?C) callout
! 479: (?Cn) callout with data n
! 480: </PRE>
! 481: </P>
! 482: <br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
! 483: <P>
! 484: <b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
! 485: <b>pcrematching</b>(3), <b>pcre</b>(3).
! 486: </P>
! 487: <br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
! 488: <P>
! 489: Philip Hazel
! 490: <br>
! 491: University Computing Service
! 492: <br>
! 493: Cambridge CB2 3QH, England.
! 494: <br>
! 495: </P>
! 496: <br><a name="SEC27" href="#TOC1">REVISION</a><br>
! 497: <P>
! 498: Last updated: 21 November 2010
! 499: <br>
! 500: Copyright © 1997-2010 University of Cambridge.
! 501: <br>
! 502: <p>
! 503: Return to the <a href="index.html">PCRE index page</a>.
! 504: </p>