File:  [ELWIX - Embedded LightWeight unIX -] / embedaddon / pcre / doc / html / pcresyntax.html
Revision 1.1.1.5 (vendor branch)
Sun Jun 15 19:46:05 2014 UTC (10 years, 9 months ago) by misho
Branches: pcre, MAIN
CVS tags: v8_34, HEAD
pcre 8.34

    1: <html>
    2: <head>
    3: <title>pcresyntax specification</title>
    4: </head>
    5: <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
    6: <h1>pcresyntax man page</h1>
    7: <p>
    8: Return to the <a href="index.html">PCRE index page</a>.
    9: </p>
   10: <p>
   11: This page is part of the PCRE HTML documentation. It was generated automatically
   12: from the original man page. If there is any nonsense in it, please consult the
   13: man page, in case the conversion went wrong.
   14: <br>
   15: <ul>
   16: <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
   17: <li><a name="TOC2" href="#SEC2">QUOTING</a>
   18: <li><a name="TOC3" href="#SEC3">CHARACTERS</a>
   19: <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
   20: <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
   21: <li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
   22: <li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
   23: <li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
   24: <li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
   25: <li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
   26: <li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
   27: <li><a name="TOC12" href="#SEC12">ALTERNATION</a>
   28: <li><a name="TOC13" href="#SEC13">CAPTURING</a>
   29: <li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
   30: <li><a name="TOC15" href="#SEC15">COMMENT</a>
   31: <li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
   32: <li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
   33: <li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
   34: <li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
   35: <li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
   36: <li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
   37: <li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
   38: <li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
   39: <li><a name="TOC24" href="#SEC24">CALLOUTS</a>
   40: <li><a name="TOC25" href="#SEC25">SEE ALSO</a>
   41: <li><a name="TOC26" href="#SEC26">AUTHOR</a>
   42: <li><a name="TOC27" href="#SEC27">REVISION</a>
   43: </ul>
   44: <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
   45: <P>
   46: The full syntax and semantics of the regular expressions that are supported by
   47: PCRE are described in the
   48: <a href="pcrepattern.html"><b>pcrepattern</b></a>
   49: documentation. This document contains a quick-reference summary of the syntax.
   50: </P>
   51: <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
   52: <P>
   53: <pre>
   54:   \x         where x is non-alphanumeric is a literal x
   55:   \Q...\E    treat enclosed characters as literal
   56: </PRE>
   57: </P>
   58: <br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
   59: <P>
   60: <pre>
   61:   \a         alarm, that is, the BEL character (hex 07)
   62:   \cx        "control-x", where x is any ASCII character
   63:   \e         escape (hex 1B)
   64:   \f         form feed (hex 0C)
   65:   \n         newline (hex 0A)
   66:   \r         carriage return (hex 0D)
   67:   \t         tab (hex 09)
   68:   \0dd       character with octal code 0dd
   69:   \ddd       character with octal code ddd, or backreference
   70:   \o{ddd..}  character with octal code ddd..
   71:   \xhh       character with hex code hh
   72:   \x{hhh..}  character with hex code hhh..
   73: </pre>
   74: Note that \0dd is always an octal code, and that \8 and \9 are the literal
   75: characters "8" and "9".
   76: </P>
   77: <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
   78: <P>
   79: <pre>
   80:   .          any character except newline;
   81:                in dotall mode, any character whatsoever
   82:   \C         one data unit, even in UTF mode (best avoided)
   83:   \d         a decimal digit
   84:   \D         a character that is not a decimal digit
   85:   \h         a horizontal white space character
   86:   \H         a character that is not a horizontal white space character
   87:   \N         a character that is not a newline
   88:   \p{<i>xx</i>}     a character with the <i>xx</i> property
   89:   \P{<i>xx</i>}     a character without the <i>xx</i> property
   90:   \R         a newline sequence
   91:   \s         a white space character
   92:   \S         a character that is not a white space character
   93:   \v         a vertical white space character
   94:   \V         a character that is not a vertical white space character
   95:   \w         a "word" character
   96:   \W         a "non-word" character
   97:   \X         a Unicode extended grapheme cluster
   98: </pre>
   99: By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
  100: or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
  101: happening, \s and \w may also match characters with code points in the range
  102: 128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
  103: is changed to use Unicode properties and they match many more characters.
  104: </P>
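<P>
The difference between ASCII-only and Unicode-aware matching of \d can be
illustrated with a small sketch. It uses Python's re module, not PCRE itself,
and only as an analogy: the assumption is that Python's re.ASCII flag behaves
roughly like PCRE's default, and that Python's default Unicode matching behaves
roughly like PCRE with PCRE_UCP set. The sample text is made up for the example.
</P>

```python
import re

# Assumption: Python's `re` engine stands in for PCRE in this sketch.
# The text holds an ASCII letter, an ASCII digit, and ARABIC-INDIC DIGIT THREE.
text = "a1\u0663"

# ASCII-only matching, comparable to PCRE's default behaviour for \d.
ascii_digits = re.findall(r"\d", text, re.ASCII)

# Unicode-aware matching, comparable to PCRE with PCRE_UCP set.
unicode_digits = re.findall(r"\d", text)

print(ascii_digits)
print(unicode_digits)
```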
  105: <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
  106: <P>
  107: <pre>
  108:   C          Other
  109:   Cc         Control
  110:   Cf         Format
  111:   Cn         Unassigned
  112:   Co         Private use
  113:   Cs         Surrogate
  114: 
  115:   L          Letter
  116:   Ll         Lower case letter
  117:   Lm         Modifier letter
  118:   Lo         Other letter
  119:   Lt         Title case letter
  120:   Lu         Upper case letter
  121:   L&         Ll, Lu, or Lt
  122: 
  123:   M          Mark
  124:   Mc         Spacing mark
  125:   Me         Enclosing mark
  126:   Mn         Non-spacing mark
  127: 
  128:   N          Number
  129:   Nd         Decimal number
  130:   Nl         Letter number
  131:   No         Other number
  132: 
  133:   P          Punctuation
  134:   Pc         Connector punctuation
  135:   Pd         Dash punctuation
  136:   Pe         Close punctuation
  137:   Pf         Final punctuation
  138:   Pi         Initial punctuation
  139:   Po         Other punctuation
  140:   Ps         Open punctuation
  141: 
  142:   S          Symbol
  143:   Sc         Currency symbol
  144:   Sk         Modifier symbol
  145:   Sm         Mathematical symbol
  146:   So         Other symbol
  147: 
  148:   Z          Separator
  149:   Zl         Line separator
  150:   Zp         Paragraph separator
  151:   Zs         Space separator
  152: </PRE>
  153: </P>
  154: <br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
  155: <P>
  156: <pre>
  157:   Xan        Alphanumeric: union of properties L and N
  158:   Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  159:   Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  160:   Xuc        Universally-named character: one that can be
  161:                represented by a Universal Character Name
  162:   Xwd        Perl word: property Xan or underscore
  163: </pre>
  164: Perl and POSIX space are now the same. Perl added VT to its space character set
  165: at release 5.18 and PCRE changed at release 8.34.
  166: </P>
  167: <br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
  168: <P>
  169: Arabic,
  170: Armenian,
  171: Avestan,
  172: Balinese,
  173: Bamum,
  174: Batak,
  175: Bengali,
  176: Bopomofo,
  177: Brahmi,
  178: Braille,
  179: Buginese,
  180: Buhid,
  181: Canadian_Aboriginal,
  182: Carian,
  183: Chakma,
  184: Cham,
  185: Cherokee,
  186: Common,
  187: Coptic,
  188: Cuneiform,
  189: Cypriot,
  190: Cyrillic,
  191: Deseret,
  192: Devanagari,
  193: Egyptian_Hieroglyphs,
  194: Ethiopic,
  195: Georgian,
  196: Glagolitic,
  197: Gothic,
  198: Greek,
  199: Gujarati,
  200: Gurmukhi,
  201: Han,
  202: Hangul,
  203: Hanunoo,
  204: Hebrew,
  205: Hiragana,
  206: Imperial_Aramaic,
  207: Inherited,
  208: Inscriptional_Pahlavi,
  209: Inscriptional_Parthian,
  210: Javanese,
  211: Kaithi,
  212: Kannada,
  213: Katakana,
  214: Kayah_Li,
  215: Kharoshthi,
  216: Khmer,
  217: Lao,
  218: Latin,
  219: Lepcha,
  220: Limbu,
  221: Linear_B,
  222: Lisu,
  223: Lycian,
  224: Lydian,
  225: Malayalam,
  226: Mandaic,
  227: Meetei_Mayek,
  228: Meroitic_Cursive,
  229: Meroitic_Hieroglyphs,
  230: Miao,
  231: Mongolian,
  232: Myanmar,
  233: New_Tai_Lue,
  234: Nko,
  235: Ogham,
  236: Old_Italic,
  237: Old_Persian,
  238: Old_South_Arabian,
  239: Old_Turkic,
  240: Ol_Chiki,
  241: Oriya,
  242: Osmanya,
  243: Phags_Pa,
  244: Phoenician,
  245: Rejang,
  246: Runic,
  247: Samaritan,
  248: Saurashtra,
  249: Sharada,
  250: Shavian,
  251: Sinhala,
  252: Sora_Sompeng,
  253: Sundanese,
  254: Syloti_Nagri,
  255: Syriac,
  256: Tagalog,
  257: Tagbanwa,
  258: Tai_Le,
  259: Tai_Tham,
  260: Tai_Viet,
  261: Takri,
  262: Tamil,
  263: Telugu,
  264: Thaana,
  265: Thai,
  266: Tibetan,
  267: Tifinagh,
  268: Ugaritic,
  269: Vai,
  270: Yi.
  271: </P>
  272: <br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
  273: <P>
  274: <pre>
  275:   [...]       positive character class
  276:   [^...]      negative character class
  277:   [x-y]       range (can be used for hex characters)
  278:   [[:xxx:]]   positive POSIX named set
  279:   [[:^xxx:]]  negative POSIX named set
  280: 
  281:   alnum       alphanumeric
  282:   alpha       alphabetic
  283:   ascii       0-127
  284:   blank       space or tab
  285:   cntrl       control character
  286:   digit       decimal digit
  287:   graph       printing, excluding space
  288:   lower       lower case letter
  289:   print       printing, including space
  290:   punct       printing, excluding alphanumeric
  291:   space       white space
  292:   upper       upper case letter
  293:   word        same as \w
  294:   xdigit      hexadecimal digit
  295: </pre>
  296: In PCRE, POSIX character set names recognize only ASCII characters by default,
  297: but some of them use Unicode properties if PCRE_UCP is set. You can use
  298: \Q...\E inside a character class.
  299: </P>
  300: <br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
  301: <P>
  302: <pre>
  303:   ?           0 or 1, greedy
  304:   ?+          0 or 1, possessive
  305:   ??          0 or 1, lazy
  306:   *           0 or more, greedy
  307:   *+          0 or more, possessive
  308:   *?          0 or more, lazy
  309:   +           1 or more, greedy
  310:   ++          1 or more, possessive
  311:   +?          1 or more, lazy
  312:   {n}         exactly n
  313:   {n,m}       at least n, no more than m, greedy
  314:   {n,m}+      at least n, no more than m, possessive
  315:   {n,m}?      at least n, no more than m, lazy
  316:   {n,}        n or more, greedy
  317:   {n,}+       n or more, possessive
  318:   {n,}?       n or more, lazy
  319: </PRE>
  320: </P>
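<P>
As a rough illustration of greedy versus lazy quantifiers, the sketch below
uses Python's re module, which is assumed here to treat these particular forms
the same way as PCRE. Possessive forms are left out because older Python
versions do not accept them. The subject strings are invented for the example.
</P>

```python
import re

# Assumption: Python's `re` engine stands in for PCRE in this sketch.
subject = "<a><b>"

# Greedy: .+ consumes as much as possible, then backs off to find ">".
greedy = re.search(r"<.+>", subject).group(0)

# Lazy: .+? consumes as little as possible before the first ">".
lazy = re.search(r"<.+?>", subject).group(0)

# {n,m}: five digits exceed the upper bound of four, so a full match fails.
bounded = re.fullmatch(r"\d{2,4}", "12345")

print(greedy, lazy, bounded)
```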
  321: <br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
  322: <P>
  323: <pre>
  324:   \b          word boundary
  325:   \B          not a word boundary
  326:   ^           start of subject
  327:                also after internal newline in multiline mode
  328:   \A          start of subject
  329:   $           end of subject
  330:                also before newline at end of subject
  331:                also before internal newline in multiline mode
  332:   \Z          end of subject
  333:                also before newline at end of subject
  334:   \z          end of subject
  335:   \G          first matching position in subject
  336: </PRE>
  337: </P>
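<P>
The anchors above can be tried out with a short sketch. It uses Python's re
module as a stand-in, on the assumption that ^, $, \A and \b behave as listed
here. \Z and \z are deliberately avoided, because Python's \Z matches only at
the very end of the string, which corresponds to PCRE's \z rather than its \Z.
</P>

```python
import re

# Assumption: Python's `re` engine stands in for PCRE in this sketch.
subject = "one two\nthree four\n"

# In multiline mode, ^ also matches after each internal newline.
line_starts = re.findall(r"^\w+", subject, re.MULTILINE)

# \A matches only at the start of the subject, even in multiline mode.
only_start = re.findall(r"\A\w+", subject, re.MULTILINE)

# Without multiline mode, $ still matches before a newline at the very end.
before_final_newline = re.search(r"four$", subject) is not None

# \b requires a word boundary, so only words that begin with "t" match.
t_words = re.findall(r"\bt\w*", subject)

print(line_starts, only_start, before_final_newline, t_words)
```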
  338: <br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
  339: <P>
  340: <pre>
  341:   \K          reset start of match
  342: </PRE>
  343: </P>
  344: <br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
  345: <P>
  346: <pre>
  347:   expr|expr|expr...
  348: </PRE>
  349: </P>
  350: <br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
  351: <P>
  352: <pre>
  353:   (...)           capturing group
  354:   (?&#60;name&#62;...)    named capturing group (Perl)
  355:   (?'name'...)    named capturing group (Perl)
  356:   (?P&#60;name&#62;...)   named capturing group (Python)
  357:   (?:...)         non-capturing group
  358:   (?|...)         non-capturing group; reset group numbers for
  359:                    capturing groups in each alternative
  360: </PRE>
  361: </P>
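<P>
The following sketch shows named and non-capturing groups side by side. It
uses Python's re module, so only the Python-style (?P&#60;name&#62;...) form is
exercised; the assumption is that PCRE treats this form in the same way. The
date string is just sample data.
</P>

```python
import re

# Assumption: Python's `re` engine stands in for PCRE in this sketch.
# Two named capturing groups, followed by an optional non-capturing group.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})(?:-\d{2})?")

m = pattern.search("Last updated: 2013-11-12")

whole = m.group(0)          # the whole match, including the day part
year = m.group("year")      # access a group by name
captured = m.groups()       # the (?:...) group does not appear here

print(whole, year, captured)
```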
  362: <br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
  363: <P>
  364: <pre>
  365:   (?&#62;...)         atomic, non-capturing group
  366: </PRE>
  367: </P>
  368: <br><a name="SEC15" href="#TOC1">COMMENT</a><br>
  369: <P>
  370: <pre>
  371:   (?#....)        comment (not nestable)
  372: </PRE>
  373: </P>
  374: <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
  375: <P>
  376: <pre>
  377:   (?i)            caseless
  378:   (?J)            allow duplicate names
  379:   (?m)            multiline
  380:   (?s)            single line (dotall)
  381:   (?U)            default ungreedy (lazy)
  382:   (?x)            extended (ignore white space)
  383:   (?-...)         unset option(s)
  384: </pre>
  385: The following are recognized only at the start of a pattern or after one of the
  386: newline-setting options with similar syntax:
  387: <pre>
  388:   (*LIMIT_MATCH=d) set the match limit to d (decimal number)
  389:   (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
  390:   (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  391:   (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
  392:   (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
  393:   (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
  394:   (*UTF)          set appropriate UTF mode for the library in use
  395:   (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
  396: </pre>
  397: Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
  398: limits set by the caller of pcre_exec(), not increase them.
  399: </P>
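<P>
A few of the inline option letters can be demonstrated with Python's re
module, assuming that (?i), (?s) and (?x) at the start of a pattern have the
same effect there as in PCRE. The (*...) items are PCRE-specific and are not
covered by this sketch.
</P>

```python
import re

# Assumption: Python's `re` engine stands in for PCRE in this sketch.

# (?i): caseless matching.
caseless = re.search(r"(?i)pcre", "The PCRE library").group(0)

# Without (?s), dot does not match a newline.
dot_default = re.search(r"a.b", "a\nb")

# (?s): dotall mode, dot matches any character including newline.
dot_all = re.search(r"(?s)a.b", "a\nb").group(0)

# (?x): white space in the pattern is ignored and # starts a comment.
extended = re.search(r"(?x) \d+  # one or more digits", "abc 42").group(0)

print(caseless, dot_default, repr(dot_all), extended)
```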
  400: <br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
  401: <P>
  402: <pre>
  403:   (?=...)         positive look ahead
  404:   (?!...)         negative look ahead
  405:   (?&#60;=...)        positive look behind
  406:   (?&#60;!...)        negative look behind
  407: </pre>
  408: Each top-level branch of a look behind must be of a fixed length.
  409: </P>
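<P>
The assertions above consume no characters; they only test what lies before
or after the current position. A minimal sketch, using Python's re module on
the assumption that it handles these four forms like PCRE (the lookbehinds
used are of fixed length, as required):
</P>

```python
import re

# Assumption: Python's `re` engine stands in for PCRE in this sketch.
subject = "3 items at $42, tax: $7"

# Positive lookbehind: digits that directly follow a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", subject)

# Positive lookahead: a word that is directly followed by a colon.
keys = re.findall(r"\w+(?=:)", subject)

# Negative lookbehind: whole numbers not preceded by a dollar sign.
no_dollar = re.findall(r"(?<!\$)\b\d+", subject)

# Negative lookahead: "tax" only when no colon follows it.
tax_without_colon = re.search(r"tax(?!:)", subject)

print(after_dollar, keys, no_dollar, tax_without_colon)
```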
  410: <br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
  411: <P>
  412: <pre>
  413:   \n              reference by number (can be ambiguous)
  414:   \gn             reference by number
  415:   \g{n}           reference by number
  416:   \g{-n}          relative reference by number
  417:   \k&#60;name&#62;        reference by name (Perl)
  418:   \k'name'        reference by name (Perl)
  419:   \g{name}        reference by name (Perl)
  420:   \k{name}        reference by name (.NET)
  421:   (?P=name)       reference by name (Python)
  422: </PRE>
  423: </P>
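<P>
Two of the backreference forms can be sketched with Python's re module, which
accepts \n and the Python-style (?P=name). The \g and \k forms listed above
are not accepted by Python inside a pattern and are not shown. The sample
strings and the group name q are made up for the example.
</P>

```python
import re

# Assumption: Python's `re` engine stands in for PCRE in this sketch.

# \1 must match the same text that group 1 captured: a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# (?P=q) refers back to the named group q, so the closing quote
# has to be the same character as the opening quote.
quoted = [m.group(0)
          for m in re.finditer(r"(?P<q>['\"]).*?(?P=q)", "say \"hi\" and 'bye'")]

print(doubled, quoted)
```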
  424: <br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
  425: <P>
  426: <pre>
  427:   (?R)            recurse whole pattern
  428:   (?n)            call subpattern by absolute number
  429:   (?+n)           call subpattern by relative number
  430:   (?-n)           call subpattern by relative number
  431:   (?&name)        call subpattern by name (Perl)
  432:   (?P&#62;name)       call subpattern by name (Python)
  433:   \g&#60;name&#62;        call subpattern by name (Oniguruma)
  434:   \g'name'        call subpattern by name (Oniguruma)
  435:   \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
  436:   \g'n'           call subpattern by absolute number (Oniguruma)
  437:   \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
  438:   \g'+n'          call subpattern by relative number (PCRE extension)
  439:   \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
  440:   \g'-n'          call subpattern by relative number (PCRE extension)
  441: </PRE>
  442: </P>
  443: <br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
  444: <P>
  445: <pre>
  446:   (?(condition)yes-pattern)
  447:   (?(condition)yes-pattern|no-pattern)
  448: 
  449:   (?(n)...        absolute reference condition
  450:   (?(+n)...       relative reference condition
  451:   (?(-n)...       relative reference condition
  452:   (?(&#60;name&#62;)...   named reference condition (Perl)
  453:   (?('name')...   named reference condition (Perl)
  454:   (?(name)...     named reference condition (PCRE)
  455:   (?(R)...        overall recursion condition
  456:   (?(Rn)...       specific group recursion condition
  457:   (?(R&name)...   specific recursion condition
  458:   (?(DEFINE)...   define subpattern for reference
  459:   (?(assert)...   assertion condition
  460: </PRE>
  461: </P>
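<P>
The absolute reference condition can be illustrated as follows. The sketch
uses Python's re module, which supports the (?(n)yes-pattern) form, and
assumes it behaves like PCRE for this case. The pattern accepts a number that
is either bare or wrapped in a matching pair of parentheses.
</P>

```python
import re

# Assumption: Python's `re` engine stands in for PCRE in this sketch.
# Group 1 optionally matches "(". The condition (?(1)\)) then requires
# a closing ")" only if group 1 took part in the match.
pattern = re.compile(r"(\()?\d+(?(1)\))")

results = {s: pattern.fullmatch(s) is not None
           for s in ["12", "(12)", "(12", "12)"]}

print(results)
```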
  462: <br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
  463: <P>
  464: The following act as soon as they are reached:
  465: <pre>
  466:   (*ACCEPT)       force successful match
  467:   (*FAIL)         force backtrack; synonym (*F)
  468:   (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
  469: </pre>
  470: The following act only when a subsequent match failure causes a backtrack to
  471: reach them. They all force a match failure, but they differ in what happens
  472: afterwards. Those that advance the start-of-match point do so only if the
  473: pattern is not anchored.
  474: <pre>
  475:   (*COMMIT)       overall failure, no advance of starting point
  476:   (*PRUNE)        advance to next starting character
  477:   (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  478:   (*SKIP)         advance to current matching position
  479:   (*SKIP:NAME)    advance to position corresponding to an earlier
  480:                   (*MARK:NAME); if not found, the (*SKIP) is ignored
  481:   (*THEN)         local failure, backtrack to next alternation
  482:   (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
  483: </PRE>
  484: </P>
  485: <br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
  486: <P>
  487: These are recognized only at the very start of the pattern or after a
  488: (*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
  489: <pre>
  490:   (*CR)           carriage return only
  491:   (*LF)           linefeed only
  492:   (*CRLF)         carriage return followed by linefeed
  493:   (*ANYCRLF)      all three of the above
  494:   (*ANY)          any Unicode newline sequence
  495: </PRE>
  496: </P>
  497: <br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
  498: <P>
  499: These are recognized only at the very start of the pattern or after a
  500: (*...) option that sets the newline convention or a UTF or UCP mode.
  501: <pre>
  502:   (*BSR_ANYCRLF)  CR, LF, or CRLF
  503:   (*BSR_UNICODE)  any Unicode newline sequence
  504: </PRE>
  505: </P>
  506: <br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
  507: <P>
  508: <pre>
  509:   (?C)      callout
  510:   (?Cn)     callout with data n
  511: </PRE>
  512: </P>
  513: <br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
  514: <P>
  515: <b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
  516: <b>pcrematching</b>(3), <b>pcre</b>(3).
  517: </P>
  518: <br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
  519: <P>
  520: Philip Hazel
  521: <br>
  522: University Computing Service
  523: <br>
  524: Cambridge CB2 3QH, England.
  525: <br>
  526: </P>
  527: <br><a name="SEC27" href="#TOC1">REVISION</a><br>
  528: <P>
  529: Last updated: 12 November 2013
  530: <br>
  531: Copyright &copy; 1997-2013 University of Cambridge.
  532: <br>
  533: <p>
  534: Return to the <a href="index.html">PCRE index page</a>.
  535: </p>
