embedaddon/pcre/doc/html/pcresyntax.html - annotate

Annotation of embedaddon/pcre/doc/html/pcresyntax.html, revision 1.1.1.2

1.1       misho       1: <html>
                      2: <head>
                      3: <title>pcresyntax specification</title>
                      4: </head>
                      5: <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
                      6: <h1>pcresyntax man page</h1>
                      7: <p>
                      8: Return to the <a href="index.html">PCRE index page</a>.
                      9: </p>
                     10: <p>
                     11: This page is part of the PCRE HTML documentation. It was generated automatically
                     12: from the original man page. If there is any nonsense in it, please consult the
                     13: man page, in case the conversion went wrong.
                     14: <br>
                     15: <ul>
                     16: <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
                     17: <li><a name="TOC2" href="#SEC2">QUOTING</a>
                     18: <li><a name="TOC3" href="#SEC3">CHARACTERS</a>
                     19: <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
                     20: <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
                     21: <li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
                     22: <li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
                     23: <li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
                     24: <li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
                     25: <li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
                     26: <li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
                     27: <li><a name="TOC12" href="#SEC12">ALTERNATION</a>
                     28: <li><a name="TOC13" href="#SEC13">CAPTURING</a>
                     29: <li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
                     30: <li><a name="TOC15" href="#SEC15">COMMENT</a>
                     31: <li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
                     32: <li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
                     33: <li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
                     34: <li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
                     35: <li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
                     36: <li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
                     37: <li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
                     38: <li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
                     39: <li><a name="TOC24" href="#SEC24">CALLOUTS</a>
                     40: <li><a name="TOC25" href="#SEC25">SEE ALSO</a>
                     41: <li><a name="TOC26" href="#SEC26">AUTHOR</a>
                     42: <li><a name="TOC27" href="#SEC27">REVISION</a>
                     43: </ul>
                     44: <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
                     45: <P>
                     46: The full syntax and semantics of the regular expressions that are supported by
                     47: PCRE are described in the
                     48: <a href="pcrepattern.html"><b>pcrepattern</b></a>
1.1.1.2 ! misho      49: documentation. This document contains a quick-reference summary of the syntax.
1.1       misho      50: </P>
                     51: <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
                     52: <P>
                     53: <pre>
                     54:   \x         where x is non-alphanumeric is a literal x
                     55:   \Q...\E    treat enclosed characters as literal
                     56: </PRE>
                     57: </P>
                     58: <br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
                     59: <P>
                     60: <pre>
                     61:   \a         alarm, that is, the BEL character (hex 07)
                     62:   \cx        "control-x", where x is any ASCII character
                     63:   \e         escape (hex 1B)
                     64:   \f         formfeed (hex 0C)
                     65:   \n         newline (hex 0A)
                     66:   \r         carriage return (hex 0D)
                     67:   \t         tab (hex 09)
                     68:   \ddd       character with octal code ddd, or backreference
                     69:   \xhh       character with hex code hh
                     70:   \x{hhh..}  character with hex code hhh..
                     71: </PRE>
                     72: </P>
                     73: <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
                     74: <P>
                     75: <pre>
                     76:   .          any character except newline;
                     77:                in dotall mode, any character whatsoever
1.1.1.2 ! misho      78:   \C         one data unit, even in UTF mode (best avoided)
1.1       misho      79:   \d         a decimal digit
                     80:   \D         a character that is not a decimal digit
                     81:   \h         a horizontal whitespace character
                     82:   \H         a character that is not a horizontal whitespace character
                     83:   \N         a character that is not a newline
                     84:   \p{<i>xx</i>}     a character with the <i>xx</i> property
                     85:   \P{<i>xx</i>}     a character without the <i>xx</i> property
                     86:   \R         a newline sequence
                     87:   \s         a whitespace character
                     88:   \S         a character that is not a whitespace character
                     89:   \v         a vertical whitespace character
                     90:   \V         a character that is not a vertical whitespace character
                     91:   \w         a "word" character
                     92:   \W         a "non-word" character
                     93:   \X         an extended Unicode sequence
                     94: </pre>
                     95: In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
1.1.1.2 ! misho      96: characters, even in a UTF mode. However, this can be changed by setting the
1.1       misho      97: PCRE_UCP option.
                     98: </P>
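<P>
As a rough illustration of the ASCII-only default described above, here is a
minimal sketch using Python's standard <b>re</b> module. This is not PCRE: it
is used here only because it shares the \d, \s, and \w escapes. The re.ASCII
flag is assumed to approximate PCRE's default behaviour, and omitting it is
assumed to be loosely comparable to setting PCRE_UCP. The sample text is made
up for the example.
</P>

```python
import re

text = "caf\u00e9 42"  # "café 42"; the accented letter is outside ASCII

# With re.ASCII, \w and \d cover only ASCII characters (like PCRE's default).
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Without re.ASCII, Python str patterns consult Unicode properties,
# which is loosely comparable to compiling a PCRE pattern with PCRE_UCP.
unicode_words = re.findall(r"\w+", text)

# \d matches decimal digits; \S matches anything that is not whitespace.
digits = re.findall(r"\d+", text, re.ASCII)
non_space = re.findall(r"\S+", text, re.ASCII)

print(ascii_words, unicode_words, digits, non_space)
```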
                     99: <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
                    100: <P>
                    101: <pre>
                    102:   C          Other
                    103:   Cc         Control
                    104:   Cf         Format
                    105:   Cn         Unassigned
                    106:   Co         Private use
                    107:   Cs         Surrogate
                    108: 
                    109:   L          Letter
                    110:   Ll         Lower case letter
                    111:   Lm         Modifier letter
                    112:   Lo         Other letter
                    113:   Lt         Title case letter
                    114:   Lu         Upper case letter
                    115:   L&         Ll, Lu, or Lt
                    116: 
                    117:   M          Mark
                    118:   Mc         Spacing mark
                    119:   Me         Enclosing mark
                    120:   Mn         Non-spacing mark
                    121: 
                    122:   N          Number
                    123:   Nd         Decimal number
                    124:   Nl         Letter number
                    125:   No         Other number
                    126: 
                    127:   P          Punctuation
                    128:   Pc         Connector punctuation
                    129:   Pd         Dash punctuation
                    130:   Pe         Close punctuation
                    131:   Pf         Final punctuation
                    132:   Pi         Initial punctuation
                    133:   Po         Other punctuation
                    134:   Ps         Open punctuation
                    135: 
                    136:   S          Symbol
                    137:   Sc         Currency symbol
                    138:   Sk         Modifier symbol
                    139:   Sm         Mathematical symbol
                    140:   So         Other symbol
                    141: 
                    142:   Z          Separator
                    143:   Zl         Line separator
                    144:   Zp         Paragraph separator
                    145:   Zs         Space separator
                    146: </PRE>
                    147: </P>
                    148: <br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
                    149: <P>
                    150: <pre>
                    151:   Xan        Alphanumeric: union of properties L and N
                    152:   Xps        POSIX space: property Z or tab, NL, VT, FF, CR
                    153:   Xsp        Perl space: property Z or tab, NL, FF, CR
                    154:   Xwd        Perl word: property Xan or underscore
                    155: </PRE>
                    156: </P>
                    157: <br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
                    158: <P>
                    159: Arabic,
                    160: Armenian,
                    161: Avestan,
                    162: Balinese,
                    163: Bamum,
                    164: Bengali,
                    165: Bopomofo,
                    166: Braille,
                    167: Buginese,
                    168: Buhid,
                    169: Canadian_Aboriginal,
                    170: Carian,
                    171: Cham,
                    172: Cherokee,
                    173: Common,
                    174: Coptic,
                    175: Cuneiform,
                    176: Cypriot,
                    177: Cyrillic,
                    178: Deseret,
                    179: Devanagari,
                    180: Egyptian_Hieroglyphs,
                    181: Ethiopic,
                    182: Georgian,
                    183: Glagolitic,
                    184: Gothic,
                    185: Greek,
                    186: Gujarati,
                    187: Gurmukhi,
                    188: Han,
                    189: Hangul,
                    190: Hanunoo,
                    191: Hebrew,
                    192: Hiragana,
                    193: Imperial_Aramaic,
                    194: Inherited,
                    195: Inscriptional_Pahlavi,
                    196: Inscriptional_Parthian,
                    197: Javanese,
                    198: Kaithi,
                    199: Kannada,
                    200: Katakana,
                    201: Kayah_Li,
                    202: Kharoshthi,
                    203: Khmer,
                    204: Lao,
                    205: Latin,
                    206: Lepcha,
                    207: Limbu,
                    208: Linear_B,
                    209: Lisu,
                    210: Lycian,
                    211: Lydian,
                    212: Malayalam,
                    213: Meetei_Mayek,
                    214: Mongolian,
                    215: Myanmar,
                    216: New_Tai_Lue,
                    217: Nko,
                    218: Ogham,
                    219: Old_Italic,
                    220: Old_Persian,
                    221: Old_South_Arabian,
                    222: Old_Turkic,
                    223: Ol_Chiki,
                    224: Oriya,
                    225: Osmanya,
                    226: Phags_Pa,
                    227: Phoenician,
                    228: Rejang,
                    229: Runic,
                    230: Samaritan,
                    231: Saurashtra,
                    232: Shavian,
                    233: Sinhala,
                    234: Sundanese,
                    235: Syloti_Nagri,
                    236: Syriac,
                    237: Tagalog,
                    238: Tagbanwa,
                    239: Tai_Le,
                    240: Tai_Tham,
                    241: Tai_Viet,
                    242: Tamil,
                    243: Telugu,
                    244: Thaana,
                    245: Thai,
                    246: Tibetan,
                    247: Tifinagh,
                    248: Ugaritic,
                    249: Vai,
                    250: Yi.
                    251: </P>
                    252: <br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
                    253: <P>
                    254: <pre>
                    255:   [...]       positive character class
                    256:   [^...]      negative character class
                    257:   [x-y]       range (can be used for hex characters)
                    258:   [[:xxx:]]   positive POSIX named set
                    259:   [[:^xxx:]]  negative POSIX named set
                    260: 
                    261:   alnum       alphanumeric
                    262:   alpha       alphabetic
                    263:   ascii       0-127
                    264:   blank       space or tab
                    265:   cntrl       control character
                    266:   digit       decimal digit
                    267:   graph       printing, excluding space
                    268:   lower       lower case letter
                    269:   print       printing, including space
                    270:   punct       printing, excluding alphanumeric
                    271:   space       whitespace
                    272:   upper       upper case letter
                    273:   word        same as \w
                    274:   xdigit      hexadecimal digit
                    275: </pre>
                    276: In PCRE, POSIX character set names recognize only ASCII characters by default,
                    277: but some of them use Unicode properties if PCRE_UCP is set. You can use
                    278: \Q...\E inside a character class.
                    279: </P>
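<P>
The bracket forms above can be tried out with a short sketch in Python's
standard <b>re</b> module, which is not PCRE but accepts the same positive,
negative, and range syntax. Python's <b>re</b> does not support the POSIX
named sets, so those are not shown. The subject string is invented for the
example.
</P>

```python
import re

subject = "abcdef 0x1F"

# [x-y]: a positive class containing the range a..c.
positive = re.findall(r"[a-c]", subject)

# [^...]: a negative class; here, anything that is neither a-f nor whitespace.
negative = re.findall(r"[^a-f\s]+", subject)

# A range whose endpoints are written as hex escapes (0x30..0x39 is '0'..'9').
hex_range = re.findall(r"[\x30-\x39]+", subject)

print(positive, negative, hex_range)
```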
                    280: <br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
                    281: <P>
                    282: <pre>
                    283:   ?           0 or 1, greedy
                    284:   ?+          0 or 1, possessive
                    285:   ??          0 or 1, lazy
                    286:   *           0 or more, greedy
                    287:   *+          0 or more, possessive
                    288:   *?          0 or more, lazy
                    289:   +           1 or more, greedy
                    290:   ++          1 or more, possessive
                    291:   +?          1 or more, lazy
                    292:   {n}         exactly n
                    293:   {n,m}       at least n, no more than m, greedy
                    294:   {n,m}+      at least n, no more than m, possessive
                    295:   {n,m}?      at least n, no more than m, lazy
                    296:   {n,}        n or more, greedy
                    297:   {n,}+       n or more, possessive
                    298:   {n,}?       n or more, lazy
                    299: </PRE>
                    300: </P>
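<P>
The difference between greedy and lazy quantifiers is easiest to see on a
small subject. The following is a minimal sketch in Python's standard
<b>re</b> module, which is not PCRE but treats these quantifiers the same way.
The possessive forms are left out because Python only accepts them from
version 3.11 onwards.
</P>

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then gives back just enough to match '>'.
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can, so the match stops at the first '>'.
lazy = re.match(r"<.+?>", subject).group()

# {n}: exactly n repetitions.
exact = re.findall(r"\d{2}", "12345")

# {n,m} greedy prefers the upper bound, {n,m}? prefers the lower bound.
bounded = re.match(r"a{2,3}", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()

print(greedy, lazy, exact, bounded, bounded_lazy)
```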
                    301: <br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
                    302: <P>
                    303: <pre>
                    304:   \b          word boundary
                    305:   \B          not a word boundary
                    306:   ^           start of subject
                    307:                also after internal newline in multiline mode
                    308:   \A          start of subject
                    309:   $           end of subject
                    310:                also before newline at end of subject
                    311:                also before internal newline in multiline mode
                    312:   \Z          end of subject
                    313:                also before newline at end of subject
                    314:   \z          end of subject
                    315:   \G          first matching position in subject
                    316: </PRE>
                    317: </P>
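<P>
A few of the anchors above, sketched with Python's standard <b>re</b> module.
This is not PCRE: \b, ^, $, and \A behave as described here, but Python's \Z
corresponds to PCRE's \z, and \G is not available, so neither is shown. The
subject strings are made up.
</P>

```python
import re

# \b: match "cat" only as a whole word.
words = re.findall(r"\bcat\b", "cat concat cat.")

# $ also matches just before a newline at the very end of the subject.
dollar_before_final_newline = re.search(r"end$", "the end\n") is not None

# In multiline mode, ^ also matches after each internal newline ...
line_starts = re.findall(r"^\w+", "one\ntwo", re.MULTILINE)

# ... whereas \A matches only at the start of the subject.
subject_start = re.findall(r"\A\w+", "one\ntwo", re.MULTILINE)

print(words, dollar_before_final_newline, line_starts, subject_start)
```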
                    318: <br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
                    319: <P>
                    320: <pre>
                    321:   \K          reset start of match
                    322: </PRE>
                    323: </P>
                    324: <br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
                    325: <P>
                    326: <pre>
                    327:   expr|expr|expr...
                    328: </PRE>
                    329: </P>
                    330: <br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
                    331: <P>
                    332: <pre>
                    333:   (...)           capturing group
                    334:   (?&#60;name&#62;...)    named capturing group (Perl)
                    335:   (?'name'...)    named capturing group (Perl)
                    336:   (?P&#60;name&#62;...)   named capturing group (Python)
                    337:   (?:...)         non-capturing group
                    338:   (?|...)         non-capturing group; reset group numbers for
                    339:                    capturing groups in each alternative
                    340: </PRE>
                    341: </P>
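<P>
The following sketch combines a named group (Python syntax), a non-capturing
group, and a plain numbered group. It uses Python's standard <b>re</b> module
rather than PCRE. The group name "year" and the date string are chosen purely
for illustration.
</P>

```python
import re

# (?P<year>...) is a named capturing group, (?:...) does not capture,
# and the final (...) is an ordinary numbered capturing group.
m = re.match(r"(?P<year>\d{4})-(?:\d{2})-(\d{2})", "2012-01-10")

groups = m.groups()         # only the two capturing groups appear here
year = m.group("year")      # access by name
day = m.group(2)            # access by number; the named group is number 1

print(groups, year, day)
```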
                    342: <br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
                    343: <P>
                    344: <pre>
                    345:   (?&#62;...)         atomic, non-capturing group
                    346: </PRE>
                    347: </P>
                    348: <br><a name="SEC15" href="#TOC1">COMMENT</a><br>
                    349: <P>
                    350: <pre>
                    351:   (?#....)        comment (not nestable)
                    352: </PRE>
                    353: </P>
                    354: <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
                    355: <P>
                    356: <pre>
                    357:   (?i)            caseless
                    358:   (?J)            allow duplicate names
                    359:   (?m)            multiline
                    360:   (?s)            single line (dotall)
                    361:   (?U)            default ungreedy (lazy)
                    362:   (?x)            extended (ignore white space)
                    363:   (?-...)         unset option(s)
                    364: </pre>
                    365: The following are recognized only at the start of a pattern or after one of the
                    366: newline-setting options with similar syntax:
                    367: <pre>
                    368:   (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
1.1.1.2 ! misho     369:   (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
        !           370:   (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
1.1       misho     371:   (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
                    372: </PRE>
                    373: </P>
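<P>
The single-letter options (?i), (?m), (?s), and (?x) can be sketched with
Python's standard <b>re</b> module, which is not PCRE but accepts these four
at the start of a pattern with the same meaning. (?J), (?U), the unscoped
(?-...) form, and the (*...) items are PCRE features that Python does not
accept, so they are not shown.
</P>

```python
import re

# (?i): caseless matching.
caseless = re.search(r"(?i)pcre", "The PCRE library") is not None

# (?m): ^ also matches after internal newlines.
multiline = re.findall(r"(?m)^\w+", "one\ntwo")

# (?s): dot matches any character, including newline.
dotall = re.match(r"(?s)a.b", "a\nb") is not None
without_dotall = re.match(r"a.b", "a\nb") is not None

# (?x): white space in the pattern is ignored.
extended = re.match(r"(?x) \d+  -  \d+ ", "10-20").group()

print(caseless, multiline, dotall, without_dotall, extended)
```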
                    374: <br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
                    375: <P>
                    376: <pre>
                    377:   (?=...)         positive look ahead
                    378:   (?!...)         negative look ahead
                    379:   (?&#60;=...)        positive look behind
                    380:   (?&#60;!...)        negative look behind
                    381: </pre>
                    382: Each top-level branch of a look behind must be of a fixed length.
                    383: </P>
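<P>
The four assertions can be sketched with Python's standard <b>re</b> module,
which is not PCRE but shares this syntax. One difference to keep in mind:
Python requires the whole look behind to have a fixed width, whereas the rule
above applies to each top-level branch separately. The last part of the sketch
only shows that an unbounded look behind is rejected. The subject strings are
invented.
</P>

```python
import re

# (?=...): a word that is followed by a colon (the colon is not consumed).
keys = re.findall(r"\w+(?=:)", "host: a, port: b")

# (?!...): "foo" not followed by "bar".
not_bar = re.findall(r"foo(?!bar)\w*", "foobar foobaz")

# (?<=...): digits preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", "cost $42, qty 17")

# (?<!...): whole numbers not preceded by a dollar sign.
no_dollar = re.findall(r"(?<!\$)\b\d+", "cost $42, qty 17")

# A look behind with no fixed length is a compile-time error.
try:
    re.compile(r"(?<=a+)b")
    variable_lookbehind_rejected = False
except re.error:
    variable_lookbehind_rejected = True

print(keys, not_bar, prices, no_dollar, variable_lookbehind_rejected)
```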
                    384: <br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
                    385: <P>
                    386: <pre>
                    387:   \n              reference by number (can be ambiguous)
                    388:   \gn             reference by number
                    389:   \g{n}           reference by number
                    390:   \g{-n}          relative reference by number
                    391:   \k&#60;name&#62;        reference by name (Perl)
                    392:   \k'name'        reference by name (Perl)
                    393:   \g{name}        reference by name (Perl)
                    394:   \k{name}        reference by name (.NET)
                    395:   (?P=name)       reference by name (Python)
                    396: </PRE>
                    397: </P>
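<P>
Two of the forms above, reference by number and the Python-style reference by
name, can be sketched with Python's standard <b>re</b> module. This is not
PCRE, and the \g and \k forms are not accepted inside Python patterns, so they
are not shown. The group name "q" and the subject strings are illustrative
only.
</P>

```python
import re

# \1 refers back to whatever group 1 matched: find a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group()

# (?P=q) refers back to the named group q: the closing quote must be
# the same character as the opening quote.
quoted = re.search(r'''(?P<q>['"])(.*?)(?P=q)''', 'say "hi" now').group(2)

# A mismatched pair of quotes does not satisfy the backreference.
mismatched = re.search(r'''(?P<q>['"])(.*?)(?P=q)''', "say \"hi' now")

print(doubled, quoted, mismatched)
```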
                    398: <br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
                    399: <P>
                    400: <pre>
                    401:   (?R)            recurse whole pattern
                    402:   (?n)            call subpattern by absolute number
                    403:   (?+n)           call subpattern by relative number
                    404:   (?-n)           call subpattern by relative number
                    405:   (?&name)        call subpattern by name (Perl)
                    406:   (?P&#62;name)       call subpattern by name (Python)
                    407:   \g&#60;name&#62;        call subpattern by name (Oniguruma)
                    408:   \g'name'        call subpattern by name (Oniguruma)
                    409:   \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
                    410:   \g'n'           call subpattern by absolute number (Oniguruma)
                    411:   \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
                    412:   \g'+n'          call subpattern by relative number (PCRE extension)
                    413:   \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
                    414:   \g'-n'          call subpattern by relative number (PCRE extension)
                    415: </PRE>
                    416: </P>
                    417: <br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
                    418: <P>
                    419: <pre>
                    420:   (?(condition)yes-pattern)
                    421:   (?(condition)yes-pattern|no-pattern)
                    422: 
                    423:   (?(n)...        absolute reference condition
                    424:   (?(+n)...       relative reference condition
                    425:   (?(-n)...       relative reference condition
                    426:   (?(&#60;name&#62;)...   named reference condition (Perl)
                    427:   (?('name')...   named reference condition (Perl)
                    428:   (?(name)...     named reference condition (PCRE)
                    429:   (?(R)...        overall recursion condition
                    430:   (?(Rn)...       specific group recursion condition
                    431:   (?(R&name)...   specific recursion condition
                    432:   (?(DEFINE)...   define subpattern for reference
                    433:   (?(assert)...   assertion condition
                    434: </PRE>
                    435: </P>
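<P>
The absolute and named reference conditions can be sketched with Python's
standard <b>re</b> module, which is not PCRE but accepts the (?(n)...) and
(?(name)...) forms. The recursion, DEFINE, relative, and assertion conditions
are not available there and are not shown. The group name "open" and the test
strings are made up.
</P>

```python
import re

# (?(1)>) requires a closing '>' only if group 1 (the opening '<') matched.
tag = re.compile(r"^(<)?\w+(?(1)>)$")
tag_results = {s: tag.match(s) is not None
               for s in ["<tag>", "tag", "<tag", "tag>"]}

# yes-pattern|no-pattern: after '(' demand ')', otherwise demand ';'.
num = re.compile(r"^(?P<open>\()?\d+(?(open)\)|;)$")
num_results = {s: num.match(s) is not None
               for s in ["(12)", "12;", "12)", "(12;"]}

print(tag_results)
print(num_results)
```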
                    436: <br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
                    437: <P>
                    438: The following act immediately when they are reached:
                    439: <pre>
                    440:   (*ACCEPT)       force successful match
                    441:   (*FAIL)         force backtrack; synonym (*F)
1.1.1.2 ! misho     442:   (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
1.1       misho     443: </pre>
                    444: The following act only when a subsequent match failure causes a backtrack to
                    445: reach them. They all force a match failure, but they differ in what happens
                    446: afterwards. Those that advance the start-of-match point do so only if the
                    447: pattern is not anchored.
                    448: <pre>
                    449:   (*COMMIT)       overall failure, no advance of starting point
                    450:   (*PRUNE)        advance to next starting character
1.1.1.2 ! misho     451:   (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
        !           452:   (*SKIP)         advance to current matching position
        !           453:   (*SKIP:NAME)    advance to position corresponding to an earlier
        !           454:                   (*MARK:NAME); if not found, the (*SKIP) is ignored
1.1       misho     455:   (*THEN)         local failure, backtrack to next alternation
1.1.1.2 ! misho     456:   (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
1.1       misho     457: </PRE>
                    458: </P>
                    459: <br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
                    460: <P>
                    461: These are recognized only at the very start of the pattern or after a
1.1.1.2 ! misho     462: (*BSR_...), (*UTF8), (*UTF16) or (*UCP) option.
1.1       misho     463: <pre>
                    464:   (*CR)           carriage return only
                    465:   (*LF)           linefeed only
                    466:   (*CRLF)         carriage return followed by linefeed
                    467:   (*ANYCRLF)      all three of the above
                    468:   (*ANY)          any Unicode newline sequence
                    469: </PRE>
                    470: </P>
                    471: <br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
                    472: <P>
                    473: These are recognized only at the very start of the pattern or after a
1.1.1.2 ! misho     474: (*...) option that sets the newline convention or a UTF or UCP mode.
1.1       misho     475: <pre>
                    476:   (*BSR_ANYCRLF)  CR, LF, or CRLF
                    477:   (*BSR_UNICODE)  any Unicode newline sequence
                    478: </PRE>
                    479: </P>
                    480: <br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
                    481: <P>
                    482: <pre>
                    483:   (?C)      callout
                    484:   (?Cn)     callout with data n
                    485: </PRE>
                    486: </P>
                    487: <br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
                    488: <P>
                    489: <b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
                    490: <b>pcrematching</b>(3), <b>pcre</b>(3).
                    491: </P>
                    492: <br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
                    493: <P>
                    494: Philip Hazel
                    495: <br>
                    496: University Computing Service
                    497: <br>
                    498: Cambridge CB2 3QH, England.
                    499: <br>
                    500: </P>
                    501: <br><a name="SEC27" href="#TOC1">REVISION</a><br>
                    502: <P>
1.1.1.2 ! misho     503: Last updated: 10 January 2012
1.1       misho     504: <br>
1.1.1.2 ! misho     505: Copyright &copy; 1997-2012 University of Cambridge.
1.1       misho     506: <br>
                    507: <p>
                    508: Return to the <a href="index.html">PCRE index page</a>.
                    509: </p>