Annotation of embedaddon/pcre/doc/html/pcresyntax.html, revision 1.1.1.4

1.1       misho       1: <html>
                      2: <head>
                      3: <title>pcresyntax specification</title>
                      4: </head>
                      5: <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
                      6: <h1>pcresyntax man page</h1>
                      7: <p>
                      8: Return to the <a href="index.html">PCRE index page</a>.
                      9: </p>
                     10: <p>
                     11: This page is part of the PCRE HTML documentation. It was generated automatically
                     12: from the original man page. If there is any nonsense in it, please consult the
                     13: man page, in case the conversion went wrong.
                     14: <br>
                     15: <ul>
                     16: <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
                     17: <li><a name="TOC2" href="#SEC2">QUOTING</a>
                     18: <li><a name="TOC3" href="#SEC3">CHARACTERS</a>
                     19: <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
                     20: <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
                     21: <li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
                     22: <li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
                     23: <li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
                     24: <li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
                     25: <li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
                     26: <li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
                     27: <li><a name="TOC12" href="#SEC12">ALTERNATION</a>
                     28: <li><a name="TOC13" href="#SEC13">CAPTURING</a>
                     29: <li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
                     30: <li><a name="TOC15" href="#SEC15">COMMENT</a>
                     31: <li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
                     32: <li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
                     33: <li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
                     34: <li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
                     35: <li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
                     36: <li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
                     37: <li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
                     38: <li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
                     39: <li><a name="TOC24" href="#SEC24">CALLOUTS</a>
                     40: <li><a name="TOC25" href="#SEC25">SEE ALSO</a>
                     41: <li><a name="TOC26" href="#SEC26">AUTHOR</a>
                     42: <li><a name="TOC27" href="#SEC27">REVISION</a>
                     43: </ul>
                     44: <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
                     45: <P>
                     46: The full syntax and semantics of the regular expressions that are supported by
                     47: PCRE are described in the
                     48: <a href="pcrepattern.html"><b>pcrepattern</b></a>
1.1.1.2   misho      49: documentation. This document contains a quick-reference summary of the syntax.
1.1       misho      50: </P>
                     51: <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
                     52: <P>
                     53: <pre>
                     54:   \x         where x is non-alphanumeric is a literal x
                     55:   \Q...\E    treat enclosed characters as literal
                     56: </PRE>
                     57: </P>
                     58: <br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
                     59: <P>
                     60: <pre>
                     61:   \a         alarm, that is, the BEL character (hex 07)
                     62:   \cx        "control-x", where x is any ASCII character
                     63:   \e         escape (hex 1B)
1.1.1.3   misho      64:   \f         form feed (hex 0C)
1.1       misho      65:   \n         newline (hex 0A)
                     66:   \r         carriage return (hex 0D)
                     67:   \t         tab (hex 09)
                     68:   \ddd       character with octal code ddd, or backreference
                     69:   \xhh       character with hex code hh
                     70:   \x{hhh..}  character with hex code hhh..
                     71: </PRE>
                     72: </P>
                     73: <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
                     74: <P>
                     75: <pre>
                     76:   .          any character except newline;
                     77:                in dotall mode, any character whatsoever
1.1.1.2   misho      78:   \C         one data unit, even in UTF mode (best avoided)
1.1       misho      79:   \d         a decimal digit
                     80:   \D         a character that is not a decimal digit
1.1.1.3   misho      81:   \h         a horizontal white space character
                     82:   \H         a character that is not a horizontal white space character
1.1       misho      83:   \N         a character that is not a newline
                     84:   \p{<i>xx</i>}     a character with the <i>xx</i> property
                     85:   \P{<i>xx</i>}     a character without the <i>xx</i> property
                     86:   \R         a newline sequence
1.1.1.3   misho      87:   \s         a white space character
                     88:   \S         a character that is not a white space character
                     89:   \v         a vertical white space character
                     90:   \V         a character that is not a vertical white space character
1.1       misho      91:   \w         a "word" character
                     92:   \W         a "non-word" character
1.1.1.4 ! misho      93:   \X         a Unicode extended grapheme cluster
1.1       misho      94: </pre>
                     95: In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
1.1.1.2   misho      96: characters, even in a UTF mode. However, this can be changed by setting the
1.1       misho      97: PCRE_UCP option.
                     98: </P>
                     99: <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
                    100: <P>
                    101: <pre>
                    102:   C          Other
                    103:   Cc         Control
                    104:   Cf         Format
                    105:   Cn         Unassigned
                    106:   Co         Private use
                    107:   Cs         Surrogate
                    108: 
                    109:   L          Letter
                    110:   Ll         Lower case letter
                    111:   Lm         Modifier letter
                    112:   Lo         Other letter
                    113:   Lt         Title case letter
                    114:   Lu         Upper case letter
                    115:   L&         Ll, Lu, or Lt
                    116: 
                    117:   M          Mark
                    118:   Mc         Spacing mark
                    119:   Me         Enclosing mark
                    120:   Mn         Non-spacing mark
                    121: 
                    122:   N          Number
                    123:   Nd         Decimal number
                    124:   Nl         Letter number
                    125:   No         Other number
                    126: 
                    127:   P          Punctuation
                    128:   Pc         Connector punctuation
                    129:   Pd         Dash punctuation
                    130:   Pe         Close punctuation
                    131:   Pf         Final punctuation
                    132:   Pi         Initial punctuation
                    133:   Po         Other punctuation
                    134:   Ps         Open punctuation
                    135: 
                    136:   S          Symbol
                    137:   Sc         Currency symbol
                    138:   Sk         Modifier symbol
                    139:   Sm         Mathematical symbol
                    140:   So         Other symbol
                    141: 
                    142:   Z          Separator
                    143:   Zl         Line separator
                    144:   Zp         Paragraph separator
                    145:   Zs         Space separator
                    146: </PRE>
                    147: </P>
                    148: <br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
                    149: <P>
                    150: <pre>
                    151:   Xan        Alphanumeric: union of properties L and N
                    152:   Xps        POSIX space: property Z or tab, NL, VT, FF, CR
                    153:   Xsp        Perl space: property Z or tab, NL, FF, CR
1.1.1.4 ! misho     154:   Xuc        Universally-named character: one that can be
        !           155:                represented by a Universal Character Name
1.1       misho     156:   Xwd        Perl word: property Xan or underscore
                    157: </PRE>
                    158: </P>
                    159: <br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
                    160: <P>
                    161: Arabic,
                    162: Armenian,
                    163: Avestan,
                    164: Balinese,
                    165: Bamum,
1.1.1.3   misho     166: Batak,
1.1       misho     167: Bengali,
                    168: Bopomofo,
1.1.1.3   misho     169: Brahmi,
1.1       misho     170: Braille,
                    171: Buginese,
                    172: Buhid,
                    173: Canadian_Aboriginal,
                    174: Carian,
1.1.1.3   misho     175: Chakma,
1.1       misho     176: Cham,
                    177: Cherokee,
                    178: Common,
                    179: Coptic,
                    180: Cuneiform,
                    181: Cypriot,
                    182: Cyrillic,
                    183: Deseret,
                    184: Devanagari,
                    185: Egyptian_Hieroglyphs,
                    186: Ethiopic,
                    187: Georgian,
                    188: Glagolitic,
                    189: Gothic,
                    190: Greek,
                    191: Gujarati,
                    192: Gurmukhi,
                    193: Han,
                    194: Hangul,
                    195: Hanunoo,
                    196: Hebrew,
                    197: Hiragana,
                    198: Imperial_Aramaic,
                    199: Inherited,
                    200: Inscriptional_Pahlavi,
                    201: Inscriptional_Parthian,
                    202: Javanese,
                    203: Kaithi,
                    204: Kannada,
                    205: Katakana,
                    206: Kayah_Li,
                    207: Kharoshthi,
                    208: Khmer,
                    209: Lao,
                    210: Latin,
                    211: Lepcha,
                    212: Limbu,
                    213: Linear_B,
                    214: Lisu,
                    215: Lycian,
                    216: Lydian,
                    217: Malayalam,
1.1.1.3   misho     218: Mandaic,
1.1       misho     219: Meetei_Mayek,
1.1.1.3   misho     220: Meroitic_Cursive,
                    221: Meroitic_Hieroglyphs,
                    222: Miao,
1.1       misho     223: Mongolian,
                    224: Myanmar,
                    225: New_Tai_Lue,
                    226: Nko,
                    227: Ogham,
                    228: Old_Italic,
                    229: Old_Persian,
                    230: Old_South_Arabian,
                    231: Old_Turkic,
                    232: Ol_Chiki,
                    233: Oriya,
                    234: Osmanya,
                    235: Phags_Pa,
                    236: Phoenician,
                    237: Rejang,
                    238: Runic,
                    239: Samaritan,
                    240: Saurashtra,
1.1.1.3   misho     241: Sharada,
1.1       misho     242: Shavian,
                    243: Sinhala,
1.1.1.3   misho     244: Sora_Sompeng,
1.1       misho     245: Sundanese,
                    246: Syloti_Nagri,
                    247: Syriac,
                    248: Tagalog,
                    249: Tagbanwa,
                    250: Tai_Le,
                    251: Tai_Tham,
                    252: Tai_Viet,
1.1.1.3   misho     253: Takri,
1.1       misho     254: Tamil,
                    255: Telugu,
                    256: Thaana,
                    257: Thai,
                    258: Tibetan,
                    259: Tifinagh,
                    260: Ugaritic,
                    261: Vai,
                    262: Yi.
                    263: </P>
                    264: <br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
                    265: <P>
                    266: <pre>
                    267:   [...]       positive character class
                    268:   [^...]      negative character class
                    269:   [x-y]       range (can be used for hex characters)
                    270:   [[:xxx:]]   positive POSIX named set
                    271:   [[:^xxx:]]  negative POSIX named set
                    272: 
                    273:   alnum       alphanumeric
                    274:   alpha       alphabetic
                    275:   ascii       0-127
                    276:   blank       space or tab
                    277:   cntrl       control character
                    278:   digit       decimal digit
                    279:   graph       printing, excluding space
                    280:   lower       lower case letter
                    281:   print       printing, including space
                    282:   punct       printing, excluding alphanumeric
1.1.1.3   misho     283:   space       white space
1.1       misho     284:   upper       upper case letter
                    285:   word        same as \w
                    286:   xdigit      hexadecimal digit
                    287: </pre>
                    288: In PCRE, POSIX character set names recognize only ASCII characters by default,
                    289: but some of them use Unicode properties if PCRE_UCP is set. You can use
                    290: \Q...\E inside a character class.
                    291: </P>
                    292: <br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
                    293: <P>
                    294: <pre>
                    295:   ?           0 or 1, greedy
                    296:   ?+          0 or 1, possessive
                    297:   ??          0 or 1, lazy
                    298:   *           0 or more, greedy
                    299:   *+          0 or more, possessive
                    300:   *?          0 or more, lazy
                    301:   +           1 or more, greedy
                    302:   ++          1 or more, possessive
                    303:   +?          1 or more, lazy
                    304:   {n}         exactly n
                    305:   {n,m}       at least n, no more than m, greedy
                    306:   {n,m}+      at least n, no more than m, possessive
                    307:   {n,m}?      at least n, no more than m, lazy
                    308:   {n,}        n or more, greedy
                    309:   {n,}+       n or more, possessive
                    310:   {n,}?       n or more, lazy
                    311: </PRE>
                    312: </P>
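<P>
As a small illustration of the greedy and lazy rows above, the following sketch
uses Python's <b>re</b> module. That module is a separate engine, not PCRE, but
it accepts the same syntax for these quantifier forms. The subject strings and
variable names are made up for the example.
</P>

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then backs off to the last ">".
greedy = re.match(r"<.+>", subject).group(0)

# Lazy: .+? takes as little as it can, stopping at the first ">".
lazy = re.match(r"<.+?>", subject).group(0)

# The same distinction applies to bounded repeats.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group(0)
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group(0)

print(greedy)          # <a><b>
print(lazy)            # <a>
print(bounded_greedy)  # aaa
print(bounded_lazy)    # aa
```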
                    313: <br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
                    314: <P>
                    315: <pre>
                    316:   \b          word boundary
                    317:   \B          not a word boundary
                    318:   ^           start of subject
                    319:                also after internal newline in multiline mode
                    320:   \A          start of subject
                    321:   $           end of subject
                    322:                also before newline at end of subject
                    323:                also before internal newline in multiline mode
                    324:   \Z          end of subject
                    325:                also before newline at end of subject
                    326:   \z          end of subject
                    327:   \G          first matching position in subject
                    328: </PRE>
                    329: </P>
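<P>
The sketch below tries out ^, $ and \b from the table above. It uses Python's
<b>re</b> module as a stand-in, on the assumption that it treats these three
anchors the same way as PCRE; the subject strings are invented for the example.
Python's \Z matches only at the very end of the subject, like \z in the table,
so it is left out here.
</P>

```python
import re

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", subject)

# With (?m), ^ also matches after each internal newline.
starts_multiline = re.findall(r"(?m)^\w+", subject)

# $ also matches before a newline that ends the subject.
dollar_before_final_newline = re.search(r"two$", subject) is not None

# \b is a word boundary, so "cat" inside "concat" is not matched.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(starts_default)               # ['one']
print(starts_multiline)             # ['one', 'two']
print(dollar_before_final_newline)  # True
print(whole_words)                  # ['cat', 'cat']
```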
                    330: <br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
                    331: <P>
                    332: <pre>
                    333:   \K          reset start of match
                    334: </PRE>
                    335: </P>
                    336: <br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
                    337: <P>
                    338: <pre>
                    339:   expr|expr|expr...
                    340: </PRE>
                    341: </P>
                    342: <br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
                    343: <P>
                    344: <pre>
                    345:   (...)           capturing group
                    346:   (?&#60;name&#62;...)    named capturing group (Perl)
                    347:   (?'name'...)    named capturing group (Perl)
                    348:   (?P&#60;name&#62;...)   named capturing group (Python)
                    349:   (?:...)         non-capturing group
                    350:   (?|...)         non-capturing group; reset group numbers for
                    351:                    capturing groups in each alternative
                    352: </PRE>
                    353: </P>
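<P>
A minimal sketch of the Python-style named group and the non-capturing group
listed above, run with Python's <b>re</b> module rather than PCRE itself. The
group names "year" and "month" are hypothetical.
</P>

```python
import re

# Named groups are still numbered, so both lookups return the same text.
m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2013-04")
year_by_name = m.group("year")
year_by_number = m.group(1)
month = m.group("month")

# (?:...) groups without capturing, so only (c) is reported.
captured = re.match(r"(?:ab)+(c)", "ababc").groups()

print(year_by_name, year_by_number, month)  # 2013 2013 04
print(captured)                             # ('c',)
```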
                    354: <br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
                    355: <P>
                    356: <pre>
                    357:   (?&#62;...)         atomic, non-capturing group
                    358: </PRE>
                    359: </P>
                    360: <br><a name="SEC15" href="#TOC1">COMMENT</a><br>
                    361: <P>
                    362: <pre>
                    363:   (?#....)        comment (not nestable)
                    364: </PRE>
                    365: </P>
                    366: <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
                    367: <P>
                    368: <pre>
                    369:   (?i)            caseless
                    370:   (?J)            allow duplicate names
                    371:   (?m)            multiline
                    372:   (?s)            single line (dotall)
                    373:   (?U)            default ungreedy (lazy)
                    374:   (?x)            extended (ignore white space)
                    375:   (?-...)         unset option(s)
                    376: </pre>
                    377: The following are recognized only at the start of a pattern or after one of the
                    378: newline-setting options with similar syntax:
                    379: <pre>
1.1.1.4 ! misho     380:   (*LIMIT_MATCH=d) set the match limit to d (decimal number)
        !           381:   (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
1.1       misho     382:   (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
1.1.1.2   misho     383:   (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
                    384:   (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
1.1.1.4 ! misho     385:   (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
        !           386:   (*UTF)          set appropriate UTF mode for the library in use
1.1       misho     387:   (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
                    388: </PRE>
                    389: </P>
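<P>
The inline settings (?i), (?s) and (?x) can be tried out with the sketch below.
It uses Python's <b>re</b> module, which is not PCRE but understands these three
settings when they appear at the start of a pattern, so each one is placed
there. The (*...) items above are specific to PCRE and are not shown.
</P>

```python
import re

# (?i): caseless matching.
caseless = re.match(r"(?i)pcre", "PCRE") is not None

# By default the dot does not match a newline; (?s) turns on dotall mode.
dot_default = re.match(r"a.b", "a\nb") is not None
dot_all = re.match(r"(?s)a.b", "a\nb") is not None

# (?x): white space in the pattern is ignored.
extended = re.match(r"(?x) \d+  \s*  [a-z]+ ", "42 apples").group(0)

print(caseless)     # True
print(dot_default)  # False
print(dot_all)      # True
print(extended)     # 42 apples
```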
                    390: <br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
                    391: <P>
                    392: <pre>
                    393:   (?=...)         positive look ahead
                    394:   (?!...)         negative look ahead
                    395:   (?&#60;=...)        positive look behind
                    396:   (?&#60;!...)        negative look behind
                    397: </pre>
                    398: Each top-level branch of a look behind must be of a fixed length.
                    399: </P>
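<P>
The following sketch exercises three of the assertions above with Python's
<b>re</b> module, used here as a stand-in for PCRE. The look behinds are one
character long, which satisfies the fixed-length rule. The subject strings are
invented for the example.
</P>

```python
import re

# Positive look ahead: words that are followed by ";" (the ";" is not consumed).
before_semicolon = re.findall(r"\w+(?=;)", "foo; bar baz;")

# Positive look behind: digits that are preceded by "$".
priced = re.findall(r"(?<=\$)\d+", "cost $25 or 30")

# Negative look behind: numbers that are not preceded by "$".
unpriced = re.findall(r"(?<!\$)\b\d+", "cost $25 or 30")

print(before_semicolon)  # ['foo', 'baz']
print(priced)            # ['25']
print(unpriced)          # ['30']
```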
                    400: <br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
                    401: <P>
                    402: <pre>
                    403:   \n              reference by number (can be ambiguous)
                    404:   \gn             reference by number
                    405:   \g{n}           reference by number
                    406:   \g{-n}          relative reference by number
                    407:   \k&#60;name&#62;        reference by name (Perl)
                    408:   \k'name'        reference by name (Perl)
                    409:   \g{name}        reference by name (Perl)
                    410:   \k{name}        reference by name (.NET)
                    411:   (?P=name)       reference by name (Python)
                    412: </PRE>
                    413: </P>
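<P>
As a sketch of two of the forms above, the numbered \1 and the Python-style
(?P=name), the example below uses Python's <b>re</b> module instead of PCRE. The
group name "q" is hypothetical. The \g and \k forms are not used because this
sketch does not assume Python supports them.
</P>

```python
import re

# \1 must match the same text that group 1 captured: a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(0)

# (?P=q) refers back to the named group q: the closing quote must
# be the same character as the opening one.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group(0)

print(doubled)  # is is
print(quoted)   # 'hi'
```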
                    414: <br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
                    415: <P>
                    416: <pre>
                    417:   (?R)            recurse whole pattern
                    418:   (?n)            call subpattern by absolute number
                    419:   (?+n)           call subpattern by relative number
                    420:   (?-n)           call subpattern by relative number
                    421:   (?&name)        call subpattern by name (Perl)
                    422:   (?P&#62;name)       call subpattern by name (Python)
                    423:   \g&#60;name&#62;        call subpattern by name (Oniguruma)
                    424:   \g'name'        call subpattern by name (Oniguruma)
                    425:   \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
                    426:   \g'n'           call subpattern by absolute number (Oniguruma)
                    427:   \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
                    428:   \g'+n'          call subpattern by relative number (PCRE extension)
                    429:   \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
                    430:   \g'-n'          call subpattern by relative number (PCRE extension)
                    431: </PRE>
                    432: </P>
                    433: <br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
                    434: <P>
                    435: <pre>
                    436:   (?(condition)yes-pattern)
                    437:   (?(condition)yes-pattern|no-pattern)
                    438: 
                    439:   (?(n)...        absolute reference condition
                    440:   (?(+n)...       relative reference condition
                    441:   (?(-n)...       relative reference condition
                    442:   (?(&#60;name&#62;)...   named reference condition (Perl)
                    443:   (?('name')...   named reference condition (Perl)
                    444:   (?(name)...     named reference condition (PCRE)
                    445:   (?(R)...        overall recursion condition
                    446:   (?(Rn)...       specific group recursion condition
                    447:   (?(R&name)...   specific recursion condition
                    448:   (?(DEFINE)...   define subpattern for reference
                    449:   (?(assert)...   assertion condition
                    450: </PRE>
                    451: </P>
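<P>
A minimal sketch of the absolute reference condition (?(n)...), using Python's
<b>re</b> module as a stand-in for PCRE. The pattern is made up for the example:
it requires a closing parenthesis only if an opening one was captured by
group 1.
</P>

```python
import re

# Group 1 optionally captures "("; the condition (?(1)\)) then demands ")"
# only when group 1 took part in the match.
balanced = re.compile(r"^(\()?\d+(?(1)\))$")

results = {s: balanced.match(s) is not None
           for s in ["42", "(42)", "(42", "42)"]}

print(results)
# {'42': True, '(42)': True, '(42': False, '42)': False}
```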
                    452: <br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
                    453: <P>
                    454: The following act immediately when they are reached:
                    455: <pre>
                    456:   (*ACCEPT)       force successful match
                    457:   (*FAIL)         force backtrack; synonym (*F)
1.1.1.2   misho     458:   (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
1.1       misho     459: </pre>
                    460: The following act only when a subsequent match failure causes a backtrack to
                    461: reach them. They all force a match failure, but they differ in what happens
                    462: afterwards. Those that advance the start-of-match point do so only if the
                    463: pattern is not anchored.
                    464: <pre>
                    465:   (*COMMIT)       overall failure, no advance of starting point
                    466:   (*PRUNE)        advance to next starting character
1.1.1.2   misho     467:   (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
                    468:   (*SKIP)         advance to current matching position
                    469:   (*SKIP:NAME)    advance to position corresponding to an earlier
                    470:                   (*MARK:NAME); if not found, the (*SKIP) is ignored
1.1       misho     471:   (*THEN)         local failure, backtrack to next alternation
1.1.1.2   misho     472:   (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
1.1       misho     473: </PRE>
                    474: </P>
                    475: <br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
                    476: <P>
                    477: These are recognized only at the very start of the pattern or after a
1.1.1.4 ! misho     478: (*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
1.1       misho     479: <pre>
                    480:   (*CR)           carriage return only
                    481:   (*LF)           linefeed only
                    482:   (*CRLF)         carriage return followed by linefeed
                    483:   (*ANYCRLF)      all three of the above
                    484:   (*ANY)          any Unicode newline sequence
                    485: </PRE>
                    486: </P>
                    487: <br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
                    488: <P>
                    489: These are recognized only at the very start of the pattern or after a
1.1.1.2   misho     490: (*...) option that sets the newline convention or a UTF or UCP mode.
1.1       misho     491: <pre>
                    492:   (*BSR_ANYCRLF)  CR, LF, or CRLF
                    493:   (*BSR_UNICODE)  any Unicode newline sequence
                    494: </PRE>
                    495: </P>
                    496: <br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
                    497: <P>
                    498: <pre>
                    499:   (?C)      callout
                    500:   (?Cn)     callout with data n
                    501: </PRE>
                    502: </P>
                    503: <br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
                    504: <P>
                    505: <b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
                    506: <b>pcrematching</b>(3), <b>pcre</b>(3).
                    507: </P>
                    508: <br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
                    509: <P>
                    510: Philip Hazel
                    511: <br>
                    512: University Computing Service
                    513: <br>
                    514: Cambridge CB2 3QH, England.
                    515: <br>
                    516: </P>
                    517: <br><a name="SEC27" href="#TOC1">REVISION</a><br>
                    518: <P>
1.1.1.4 ! misho     519: Last updated: 26 April 2013
1.1       misho     520: <br>
1.1.1.4 ! misho     521: Copyright &copy; 1997-2013 University of Cambridge.
1.1       misho     522: <br>
                    523: <p>
                    524: Return to the <a href="index.html">PCRE index page</a>.
                    525: </p>
