Sun Jun 15 19:46:05 2014 UTC (10 years, 9 months ago) by misho
Branches: pcre, MAIN
CVS tags: v8_34, HEAD
pcre 8.34

    1: .TH PCRESYNTAX 3 "12 November 2013" "PCRE 8.34"
    2: .SH NAME
    3: PCRE - Perl-compatible regular expressions
    4: .SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
    5: .rs
    6: .sp
    7: The full syntax and semantics of the regular expressions that are supported by
    8: PCRE are described in the
    9: .\" HREF
   10: \fBpcrepattern\fP
   11: .\"
   12: documentation. This document contains a quick-reference summary of the syntax.
   13: .
   14: .
   15: .SH "QUOTING"
   16: .rs
   17: .sp
   18:   \ex         where x is non-alphanumeric is a literal x
   19:   \eQ...\eE    treat enclosed characters as literal
   20: .
   21: .
   22: .SH "CHARACTERS"
   23: .rs
   24: .sp
   25:   \ea         alarm, that is, the BEL character (hex 07)
   26:   \ecx        "control-x", where x is any ASCII character
   27:   \ee         escape (hex 1B)
   28:   \ef         form feed (hex 0C)
   29:   \en         newline (hex 0A)
   30:   \er         carriage return (hex 0D)
   31:   \et         tab (hex 09)
   32:   \e0dd       character with octal code 0dd
   33:   \eddd       character with octal code ddd, or backreference
   34:   \eo{ddd..}  character with octal code ddd..
   35:   \exhh       character with hex code hh
   36:   \ex{hhh..}  character with hex code hhh..
   37: .sp
   38: Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
   39: characters "8" and "9".
   40: .
   41: .
   42: .SH "CHARACTER TYPES"
   43: .rs
   44: .sp
   45:   .          any character except newline;
   46:                in dotall mode, any character whatsoever
   47:   \eC         one data unit, even in UTF mode (best avoided)
   48:   \ed         a decimal digit
   49:   \eD         a character that is not a decimal digit
   50:   \eh         a horizontal white space character
   51:   \eH         a character that is not a horizontal white space character
   52:   \eN         a character that is not a newline
   53:   \ep{\fIxx\fP}     a character with the \fIxx\fP property
   54:   \eP{\fIxx\fP}     a character without the \fIxx\fP property
   55:   \eR         a newline sequence
   56:   \es         a white space character
   57:   \eS         a character that is not a white space character
   58:   \ev         a vertical white space character
   59:   \eV         a character that is not a vertical white space character
   60:   \ew         a "word" character
   61:   \eW         a "non-word" character
   62:   \eX         a Unicode extended grapheme cluster
   63: .sp
   64: By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
   65: or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
   66: happening, \es and \ew may also match characters with code points in the range
   67: 128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
   68: is changed to use Unicode properties and they match many more characters.
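As an illustration of the ASCII-only default versus Unicode-property matching, here is a minimal sketch. It uses Python's standard re module as a stand-in, not PCRE itself: Python's re.ASCII flag is assumed here to approximate PCRE's default behaviour, and Python's default Unicode matching to approximate PCRE_UCP. The sample text is made up for the example.

```python
import re

# Sample subject containing two non-ASCII letters (i-diaeresis, e-acute).
text = "na\u00efve caf\u00e9 42"

# re.ASCII restricts the word-character escape to [a-zA-Z0-9_],
# which resembles PCRE's default (ASCII-only) behaviour.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Without re.ASCII, Python str patterns consult Unicode properties,
# loosely comparable to compiling a PCRE pattern with PCRE_UCP.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)    # ['na', 've', 'caf', '42']
print(unicode_words)  # ['naïve', 'café', '42']
```

Note that the defaults are reversed between the two engines: PCRE needs an option to enable Unicode properties, whereas Python needs a flag to disable them.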
   69: .
   70: .
   71: .SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
   72: .rs
   73: .sp
   74:   C          Other
   75:   Cc         Control
   76:   Cf         Format
   77:   Cn         Unassigned
   78:   Co         Private use
   79:   Cs         Surrogate
   80: .sp
   81:   L          Letter
   82:   Ll         Lower case letter
   83:   Lm         Modifier letter
   84:   Lo         Other letter
   85:   Lt         Title case letter
   86:   Lu         Upper case letter
   87:   L&         Ll, Lu, or Lt
   88: .sp
   89:   M          Mark
   90:   Mc         Spacing mark
   91:   Me         Enclosing mark
   92:   Mn         Non-spacing mark
   93: .sp
   94:   N          Number
   95:   Nd         Decimal number
   96:   Nl         Letter number
   97:   No         Other number
   98: .sp
   99:   P          Punctuation
  100:   Pc         Connector punctuation
  101:   Pd         Dash punctuation
  102:   Pe         Close punctuation
  103:   Pf         Final punctuation
  104:   Pi         Initial punctuation
  105:   Po         Other punctuation
  106:   Ps         Open punctuation
  107: .sp
  108:   S          Symbol
  109:   Sc         Currency symbol
  110:   Sk         Modifier symbol
  111:   Sm         Mathematical symbol
  112:   So         Other symbol
  113: .sp
  114:   Z          Separator
  115:   Zl         Line separator
  116:   Zp         Paragraph separator
  117:   Zs         Space separator
  118: .
  119: .
  120: .SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
  121: .rs
  122: .sp
  123:   Xan        Alphanumeric: union of properties L and N
  124:   Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  125:   Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  126:   Xuc        Universally-named character: one that can be
  127:                represented by a Universal Character Name
  128:   Xwd        Perl word: property Xan or underscore
  129: .sp
  130: Perl and POSIX space are now the same. Perl added VT to its space character set
  131: at release 5.18, and PCRE followed at release 8.34.
  132: .
  133: .
  134: .SH "SCRIPT NAMES FOR \ep AND \eP"
  135: .rs
  136: .sp
  137: Arabic,
  138: Armenian,
  139: Avestan,
  140: Balinese,
  141: Bamum,
  142: Batak,
  143: Bengali,
  144: Bopomofo,
  145: Brahmi,
  146: Braille,
  147: Buginese,
  148: Buhid,
  149: Canadian_Aboriginal,
  150: Carian,
  151: Chakma,
  152: Cham,
  153: Cherokee,
  154: Common,
  155: Coptic,
  156: Cuneiform,
  157: Cypriot,
  158: Cyrillic,
  159: Deseret,
  160: Devanagari,
  161: Egyptian_Hieroglyphs,
  162: Ethiopic,
  163: Georgian,
  164: Glagolitic,
  165: Gothic,
  166: Greek,
  167: Gujarati,
  168: Gurmukhi,
  169: Han,
  170: Hangul,
  171: Hanunoo,
  172: Hebrew,
  173: Hiragana,
  174: Imperial_Aramaic,
  175: Inherited,
  176: Inscriptional_Pahlavi,
  177: Inscriptional_Parthian,
  178: Javanese,
  179: Kaithi,
  180: Kannada,
  181: Katakana,
  182: Kayah_Li,
  183: Kharoshthi,
  184: Khmer,
  185: Lao,
  186: Latin,
  187: Lepcha,
  188: Limbu,
  189: Linear_B,
  190: Lisu,
  191: Lycian,
  192: Lydian,
  193: Malayalam,
  194: Mandaic,
  195: Meetei_Mayek,
  196: Meroitic_Cursive,
  197: Meroitic_Hieroglyphs,
  198: Miao,
  199: Mongolian,
  200: Myanmar,
  201: New_Tai_Lue,
  202: Nko,
  203: Ogham,
  204: Old_Italic,
  205: Old_Persian,
  206: Old_South_Arabian,
  207: Old_Turkic,
  208: Ol_Chiki,
  209: Oriya,
  210: Osmanya,
  211: Phags_Pa,
  212: Phoenician,
  213: Rejang,
  214: Runic,
  215: Samaritan,
  216: Saurashtra,
  217: Sharada,
  218: Shavian,
  219: Sinhala,
  220: Sora_Sompeng,
  221: Sundanese,
  222: Syloti_Nagri,
  223: Syriac,
  224: Tagalog,
  225: Tagbanwa,
  226: Tai_Le,
  227: Tai_Tham,
  228: Tai_Viet,
  229: Takri,
  230: Tamil,
  231: Telugu,
  232: Thaana,
  233: Thai,
  234: Tibetan,
  235: Tifinagh,
  236: Ugaritic,
  237: Vai,
  238: Yi.
  239: .
  240: .
  241: .SH "CHARACTER CLASSES"
  242: .rs
  243: .sp
  244:   [...]       positive character class
  245:   [^...]      negative character class
  246:   [x-y]       range (can be used for hex characters)
  247:   [[:xxx:]]   positive POSIX named set
  248:   [[:^xxx:]]  negative POSIX named set
  249: .sp
  250:   alnum       alphanumeric
  251:   alpha       alphabetic
  252:   ascii       0-127
  253:   blank       space or tab
  254:   cntrl       control character
  255:   digit       decimal digit
  256:   graph       printing, excluding space
  257:   lower       lower case letter
  258:   print       printing, including space
  259:   punct       printing, excluding alphanumeric
  260:   space       white space
  261:   upper       upper case letter
  262:   word        same as \ew
  263:   xdigit      hexadecimal digit
  264: .sp
  265: In PCRE, POSIX character set names recognize only ASCII characters by default,
  266: but some of them use Unicode properties if PCRE_UCP is set. You can use
  267: \eQ...\eE inside a character class.
  268: .
  269: .
  270: .SH "QUANTIFIERS"
  271: .rs
  272: .sp
  273:   ?           0 or 1, greedy
  274:   ?+          0 or 1, possessive
  275:   ??          0 or 1, lazy
  276:   *           0 or more, greedy
  277:   *+          0 or more, possessive
  278:   *?          0 or more, lazy
  279:   +           1 or more, greedy
  280:   ++          1 or more, possessive
  281:   +?          1 or more, lazy
  282:   {n}         exactly n
  283:   {n,m}       at least n, no more than m, greedy
  284:   {n,m}+      at least n, no more than m, possessive
  285:   {n,m}?      at least n, no more than m, lazy
  286:   {n,}        n or more, greedy
  287:   {n,}+       n or more, possessive
  288:   {n,}?       n or more, lazy
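The difference between greedy and lazy quantifiers in the table above can be sketched as follows. The example uses Python's standard re module as a stand-in for PCRE, since both accept these particular forms. Possessive quantifiers are left out because older Python versions do not accept them. The subject strings are invented for the example.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then backs off to the last '>'.
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can, stopping at the first '>'.
lazy = re.match(r"<.+?>", subject).group()

# {n,m} is greedy by default; a trailing ? makes it lazy.
bounded = re.match(r"a{2,3}", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()

print(greedy)        # <a><b>
print(lazy)          # <a>
print(bounded)       # aaa
print(bounded_lazy)  # aa
```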
  289: .
  290: .
  291: .SH "ANCHORS AND SIMPLE ASSERTIONS"
  292: .rs
  293: .sp
  294:   \eb          word boundary
  295:   \eB          not a word boundary
  296:   ^           start of subject
  297:                also after internal newline in multiline mode
  298:   \eA          start of subject
  299:   $           end of subject
  300:                also before newline at end of subject
  301:                also before internal newline in multiline mode
  302:   \eZ          end of subject
  303:                also before newline at end of subject
  304:   \ez          end of subject
  305:   \eG          first matching position in subject
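A short sketch of how the circumflex, dollar, start-of-subject and word-boundary assertions behave, using Python's standard re module as a stand-in for PCRE. The end-of-subject escapes are deliberately omitted because Python's \Z does not have the same meaning as PCRE's (it behaves like PCRE's lower-case form). The subject strings are invented for the example.

```python
import re

subject = "one\ntwo\n"

# By default, ^ matches only at the very start of the subject.
start_only = re.findall(r"^\w+", subject)

# In multiline mode, ^ also matches after each internal newline.
each_line = re.findall(r"^\w+", subject, re.MULTILINE)

# \A stays anchored to the start of the subject even in multiline mode.
abs_start = re.findall(r"\A\w+", subject, re.MULTILINE)

# $ also matches just before a newline that ends the subject.
before_final_newline = re.search(r"two$", subject) is not None

# \b matches between a word character and a non-word character.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(start_only)            # ['one']
print(each_line)             # ['one', 'two']
print(abs_start)             # ['one']
print(before_final_newline)  # True
print(whole_words)           # ['cat', 'cat']
```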
  306: .
  307: .
  308: .SH "MATCH POINT RESET"
  309: .rs
  310: .sp
  311:   \eK          reset start of match
  312: .
  313: .
  314: .SH "ALTERNATION"
  315: .rs
  316: .sp
  317:   expr|expr|expr...
  318: .
  319: .
  320: .SH "CAPTURING"
  321: .rs
  322: .sp
  323:   (...)           capturing group
  324:   (?<name>...)    named capturing group (Perl)
  325:   (?'name'...)    named capturing group (Perl)
  326:   (?P<name>...)   named capturing group (Python)
  327:   (?:...)         non-capturing group
  328:   (?|...)         non-capturing group; reset group numbers for
  329:                    capturing groups in each alternative
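To show how capturing, named and non-capturing groups are numbered, here is a minimal sketch using Python's standard re module as a stand-in for PCRE. Only the Python-style spelling of named groups is used, since that is the form both engines are known to share. The date string is just sample data.

```python
import re

# Group 1 is named "year"; the month is matched by a non-capturing
# group and so gets no number; the day becomes group 2.
m = re.match(r"(?P<year>\d{4})-(?:\d{2})-(\d{2})", "2013-11-12")

print(m.group("year"))  # 2013
print(m.group(1))       # 2013 (named groups are numbered too)
print(m.group(2))       # 12
print(m.groups())       # ('2013', '12')
```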
  330: .
  331: .
  332: .SH "ATOMIC GROUPS"
  333: .rs
  334: .sp
  335:   (?>...)         atomic, non-capturing group
  336: .
  337: .
  338: .
  339: .
  340: .SH "COMMENT"
  341: .rs
  342: .sp
  343:   (?#....)        comment (not nestable)
  344: .
  345: .
  346: .SH "OPTION SETTING"
  347: .rs
  348: .sp
  349:   (?i)            caseless
  350:   (?J)            allow duplicate names
  351:   (?m)            multiline
  352:   (?s)            single line (dotall)
  353:   (?U)            default ungreedy (lazy)
  354:   (?x)            extended (ignore white space)
  355:   (?-...)         unset option(s)
  356: .sp
  357: The following are recognized only at the start of a pattern or after one of the
  358: newline-setting options with similar syntax:
  359: .sp
  360:   (*LIMIT_MATCH=d) set the match limit to d (decimal number)
  361:   (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
  362:   (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  363:   (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
  364:   (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
  365:   (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
  366:   (*UTF)          set appropriate UTF mode for the library in use
  367:   (*UCP)          set PCRE_UCP (use Unicode properties for \ed etc)
  368: .sp
  369: Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
  370: limits set by the caller of pcre_exec(), not increase them.
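The inline letter options listed at the top of this section can be sketched as follows, using Python's standard re module as a stand-in for PCRE. Only the caseless, dotall and extended options placed at the very start of the pattern are shown, because that is the subset both engines are known to share; the (*...) items are specific to PCRE and have no counterpart in this sketch. The subject strings are invented.

```python
import re

# (?i): caseless matching.
caseless = re.search(r"(?i)pcre", "Uses PCRE 8.34").group()

# Without (?s), a dot does not match a newline.
dot_default = re.search(r"a.b", "a\nb")

# (?s): dotall mode, a dot matches any character including newline.
dot_all = re.search(r"(?s)a.b", "a\nb")

# (?x): extended mode, unescaped white space in the pattern is ignored.
extended = re.search(r"(?x) \d+ \. \d+ ", "version 8.34").group()

print(caseless)             # PCRE
print(dot_default)          # None
print(dot_all is not None)  # True
print(extended)             # 8.34
```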
  371: .
  372: .
  373: .SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
  374: .rs
  375: .sp
  376:   (?=...)         positive look ahead
  377:   (?!...)         negative look ahead
  378:   (?<=...)        positive look behind
  379:   (?<!...)        negative look behind
  380: .sp
  381: Each top-level branch of a look behind must be of a fixed length.
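The four assertion forms and the fixed-length restriction can be illustrated with a small sketch. It uses Python's standard re module as a stand-in for PCRE. Python's restriction is somewhat stricter than the per-branch rule stated above, so only the clearly variable-length case is shown being rejected. The subject strings are invented.

```python
import re

subject = "cost $30, not 40"

# Positive look behind: digits that are preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", subject)

# Negative look behind: whole numbers that are not preceded by one.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", subject)

# Positive look ahead: a word that is followed by a colon.
keys = re.findall(r"\w+(?=:)", "key: value")

# Negative look ahead: words starting with "foo" but not "foobar".
not_foobar = re.findall(r"\bfoo(?!bar)\w*", "foobar foobaz foo")

# A look behind whose length can vary is rejected at compile time.
try:
    re.compile(r"(?<=a+)b")
    variable_ok = True
except re.error:
    variable_ok = False

print(after_dollar)      # ['30']
print(not_after_dollar)  # ['40']
print(keys)              # ['key']
print(not_foobar)        # ['foobaz', 'foo']
print(variable_ok)       # False
```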
  382: .
  383: .
  384: .SH "BACKREFERENCES"
  385: .rs
  386: .sp
  387:   \en              reference by number (can be ambiguous)
  388:   \egn             reference by number
  389:   \eg{n}           reference by number
  390:   \eg{-n}          relative reference by number
  391:   \ek<name>        reference by name (Perl)
  392:   \ek'name'        reference by name (Perl)
  393:   \eg{name}        reference by name (Perl)
  394:   \ek{name}        reference by name (.NET)
  395:   (?P=name)       reference by name (Python)
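A minimal sketch of a numbered and a named backreference, using Python's standard re module as a stand-in for PCRE. Only the plain numbered form and the Python-style named form are shown, since those are the spellings both engines are known to share. The subject strings are invented.

```python
import re

# Numbered reference: find a word that is immediately repeated.
by_number = re.search(r"\b(\w+) \1\b", "this is is a test")

# Named reference: the closing quote must be the same character
# as the opening quote captured in the group called "quote".
by_name = re.search(r"(?P<quote>['\"]).*?(?P=quote)", "say 'hi' now")

print(by_number.group(1))  # is
print(by_name.group())     # 'hi'
```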
  396: .
  397: .
  398: .SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
  399: .rs
  400: .sp
  401:   (?R)            recurse whole pattern
  402:   (?n)            call subpattern by absolute number
  403:   (?+n)           call subpattern by relative number
  404:   (?-n)           call subpattern by relative number
  405:   (?&name)        call subpattern by name (Perl)
  406:   (?P>name)       call subpattern by name (Python)
  407:   \eg<name>        call subpattern by name (Oniguruma)
  408:   \eg'name'        call subpattern by name (Oniguruma)
  409:   \eg<n>           call subpattern by absolute number (Oniguruma)
  410:   \eg'n'           call subpattern by absolute number (Oniguruma)
  411:   \eg<+n>          call subpattern by relative number (PCRE extension)
  412:   \eg'+n'          call subpattern by relative number (PCRE extension)
  413:   \eg<-n>          call subpattern by relative number (PCRE extension)
  414:   \eg'-n'          call subpattern by relative number (PCRE extension)
  415: .
  416: .
  417: .SH "CONDITIONAL PATTERNS"
  418: .rs
  419: .sp
  420:   (?(condition)yes-pattern)
  421:   (?(condition)yes-pattern|no-pattern)
  422: .sp
  423:   (?(n)...        absolute reference condition
  424:   (?(+n)...       relative reference condition
  425:   (?(-n)...       relative reference condition
  426:   (?(<name>)...   named reference condition (Perl)
  427:   (?('name')...   named reference condition (Perl)
  428:   (?(name)...     named reference condition (PCRE)
  429:   (?(R)...        overall recursion condition
  430:   (?(Rn)...       specific group recursion condition
  431:   (?(R&name)...   specific recursion condition
  432:   (?(DEFINE)...   define subpattern for reference
  433:   (?(assert)...   assertion condition
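The absolute reference condition can be sketched as follows, using Python's standard re module as a stand-in for PCRE. Python is only assumed here to share the numbered and named reference conditions; the recursion, DEFINE and assertion conditions are not covered by this sketch. The pattern and subjects are invented.

```python
import re

# If group 1 (an opening angle bracket) took part in the match,
# a closing bracket is required; otherwise nothing more is required.
pattern = re.compile(r"^(<)?\w+(?(1)>)$")

results = {s: bool(pattern.match(s)) for s in ["<tag>", "tag", "<tag", "tag>"]}

print(results)
# {'<tag>': True, 'tag': True, '<tag': False, 'tag>': False}
```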
  434: .
  435: .
  436: .SH "BACKTRACKING CONTROL"
  437: .rs
  438: .sp
  439: The following act immediately when they are reached:
  440: .sp
  441:   (*ACCEPT)       force successful match
  442:   (*FAIL)         force backtrack; synonym (*F)
  443:   (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
  444: .sp
  445: The following act only when a subsequent match failure causes a backtrack to
  446: reach them. They all force a match failure, but they differ in what happens
  447: afterwards. Those that advance the start-of-match point do so only if the
  448: pattern is not anchored.
  449: .sp
  450:   (*COMMIT)       overall failure, no advance of starting point
  451:   (*PRUNE)        advance to next starting character
  452:   (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  453:   (*SKIP)         advance to current matching position
  454:   (*SKIP:NAME)    advance to position corresponding to an earlier
  455:                   (*MARK:NAME); if not found, the (*SKIP) is ignored
  456:   (*THEN)         local failure, backtrack to next alternation
  457:   (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
  458: .
  459: .
  460: .SH "NEWLINE CONVENTIONS"
  461: .rs
  462: .sp
  463: These are recognized only at the very start of the pattern or after a
  464: (*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
  465: .sp
  466:   (*CR)           carriage return only
  467:   (*LF)           linefeed only
  468:   (*CRLF)         carriage return followed by linefeed
  469:   (*ANYCRLF)      all three of the above
  470:   (*ANY)          any Unicode newline sequence
  471: .
  472: .
  473: .SH "WHAT \eR MATCHES"
  474: .rs
  475: .sp
  476: These are recognized only at the very start of the pattern or after a
  477: (*...) option that sets the newline convention or a UTF or UCP mode.
  478: .sp
  479:   (*BSR_ANYCRLF)  CR, LF, or CRLF
  480:   (*BSR_UNICODE)  any Unicode newline sequence
  481: .
  482: .
  483: .SH "CALLOUTS"
  484: .rs
  485: .sp
  486:   (?C)      callout
  487:   (?Cn)     callout with data n
  488: .
  489: .
  490: .SH "SEE ALSO"
  491: .rs
  492: .sp
  493: \fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
  494: \fBpcrematching\fP(3), \fBpcre\fP(3).
  495: .
  496: .
  497: .SH AUTHOR
  498: .rs
  499: .sp
  500: .nf
  501: Philip Hazel
  502: University Computing Service
  503: Cambridge CB2 3QH, England.
  504: .fi
  505: .
  506: .
  507: .SH REVISION
  508: .rs
  509: .sp
  510: .nf
  511: Last updated: 12 November 2013
  512: Copyright (c) 1997-2013 University of Cambridge.
  513: .fi
