File:  [ELWIX - Embedded LightWeight unIX -] / embedaddon / pcre / doc / pcresyntax.3
Revision 1.1.1.4 (vendor branch)
Mon Jul 22 08:25:56 2013 UTC by misho
Branches: pcre, MAIN
CVS tags: v8_33, HEAD
8.33

    1: .TH PCRESYNTAX 3 "26 April 2013" "PCRE 8.33"
    2: .SH NAME
    3: PCRE - Perl-compatible regular expressions
    4: .SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
    5: .rs
    6: .sp
    7: The full syntax and semantics of the regular expressions that are supported by
    8: PCRE are described in the
    9: .\" HREF
   10: \fBpcrepattern\fP
   11: .\"
   12: documentation. This document contains a quick-reference summary of the syntax.
   13: .
   14: .
   15: .SH "QUOTING"
   16: .rs
   17: .sp
   18:   \ex         where x is non-alphanumeric is a literal x
   19:   \eQ...\eE    treat enclosed characters as literal
   20: .
   21: .
   22: .SH "CHARACTERS"
   23: .rs
   24: .sp
   25:   \ea         alarm, that is, the BEL character (hex 07)
   26:   \ecx        "control-x", where x is any ASCII character
   27:   \ee         escape (hex 1B)
   28:   \ef         form feed (hex 0C)
   29:   \en         newline (hex 0A)
   30:   \er         carriage return (hex 0D)
   31:   \et         tab (hex 09)
   32:   \eddd       character with octal code ddd, or backreference
   33:   \exhh       character with hex code hh
   34:   \ex{hhh..}  character with hex code hhh..
   35: .
   36: .
   37: .SH "CHARACTER TYPES"
   38: .rs
   39: .sp
   40:   .          any character except newline;
   41:                in dotall mode, any character whatsoever
   42:   \eC         one data unit, even in UTF mode (best avoided)
   43:   \ed         a decimal digit
   44:   \eD         a character that is not a decimal digit
   45:   \eh         a horizontal white space character
   46:   \eH         a character that is not a horizontal white space character
   47:   \eN         a character that is not a newline
   48:   \ep{\fIxx\fP}     a character with the \fIxx\fP property
   49:   \eP{\fIxx\fP}     a character without the \fIxx\fP property
   50:   \eR         a newline sequence
   51:   \es         a white space character
   52:   \eS         a character that is not a white space character
   53:   \ev         a vertical white space character
   54:   \eV         a character that is not a vertical white space character
   55:   \ew         a "word" character
   56:   \eW         a "non-word" character
   57:   \eX         a Unicode extended grapheme cluster
   58: .sp
   59: In PCRE, by default, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII
   60: characters, even in a UTF mode. However, this can be changed by setting the
   61: PCRE_UCP option.
   62: .
   63: .
   64: .SH "GENERAL CATEGORY PROPERTIES FOR \ep AND \eP"
   65: .rs
   66: .sp
   67:   C          Other
   68:   Cc         Control
   69:   Cf         Format
   70:   Cn         Unassigned
   71:   Co         Private use
   72:   Cs         Surrogate
   73: .sp
   74:   L          Letter
   75:   Ll         Lower case letter
   76:   Lm         Modifier letter
   77:   Lo         Other letter
   78:   Lt         Title case letter
   79:   Lu         Upper case letter
   80:   L&         Ll, Lu, or Lt
   81: .sp
   82:   M          Mark
   83:   Mc         Spacing mark
   84:   Me         Enclosing mark
   85:   Mn         Non-spacing mark
   86: .sp
   87:   N          Number
   88:   Nd         Decimal number
   89:   Nl         Letter number
   90:   No         Other number
   91: .sp
   92:   P          Punctuation
   93:   Pc         Connector punctuation
   94:   Pd         Dash punctuation
   95:   Pe         Close punctuation
   96:   Pf         Final punctuation
   97:   Pi         Initial punctuation
   98:   Po         Other punctuation
   99:   Ps         Open punctuation
  100: .sp
  101:   S          Symbol
  102:   Sc         Currency symbol
  103:   Sk         Modifier symbol
  104:   Sm         Mathematical symbol
  105:   So         Other symbol
  106: .sp
  107:   Z          Separator
  108:   Zl         Line separator
  109:   Zp         Paragraph separator
  110:   Zs         Space separator
  111: .
  112: .
  113: .SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep AND \eP"
  114: .rs
  115: .sp
  116:   Xan        Alphanumeric: union of properties L and N
  117:   Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  118:   Xsp        Perl space: property Z or tab, NL, FF, CR
  119:   Xuc        Universally-named character: one that can be
  120:                represented by a Universal Character Name
  121:   Xwd        Perl word: property Xan or underscore
  122: .
  123: .
  124: .SH "SCRIPT NAMES FOR \ep AND \eP"
  125: .rs
  126: .sp
  127: Arabic,
  128: Armenian,
  129: Avestan,
  130: Balinese,
  131: Bamum,
  132: Batak,
  133: Bengali,
  134: Bopomofo,
  135: Brahmi,
  136: Braille,
  137: Buginese,
  138: Buhid,
  139: Canadian_Aboriginal,
  140: Carian,
  141: Chakma,
  142: Cham,
  143: Cherokee,
  144: Common,
  145: Coptic,
  146: Cuneiform,
  147: Cypriot,
  148: Cyrillic,
  149: Deseret,
  150: Devanagari,
  151: Egyptian_Hieroglyphs,
  152: Ethiopic,
  153: Georgian,
  154: Glagolitic,
  155: Gothic,
  156: Greek,
  157: Gujarati,
  158: Gurmukhi,
  159: Han,
  160: Hangul,
  161: Hanunoo,
  162: Hebrew,
  163: Hiragana,
  164: Imperial_Aramaic,
  165: Inherited,
  166: Inscriptional_Pahlavi,
  167: Inscriptional_Parthian,
  168: Javanese,
  169: Kaithi,
  170: Kannada,
  171: Katakana,
  172: Kayah_Li,
  173: Kharoshthi,
  174: Khmer,
  175: Lao,
  176: Latin,
  177: Lepcha,
  178: Limbu,
  179: Linear_B,
  180: Lisu,
  181: Lycian,
  182: Lydian,
  183: Malayalam,
  184: Mandaic,
  185: Meetei_Mayek,
  186: Meroitic_Cursive,
  187: Meroitic_Hieroglyphs,
  188: Miao,
  189: Mongolian,
  190: Myanmar,
  191: New_Tai_Lue,
  192: Nko,
  193: Ogham,
  194: Old_Italic,
  195: Old_Persian,
  196: Old_South_Arabian,
  197: Old_Turkic,
  198: Ol_Chiki,
  199: Oriya,
  200: Osmanya,
  201: Phags_Pa,
  202: Phoenician,
  203: Rejang,
  204: Runic,
  205: Samaritan,
  206: Saurashtra,
  207: Sharada,
  208: Shavian,
  209: Sinhala,
  210: Sora_Sompeng,
  211: Sundanese,
  212: Syloti_Nagri,
  213: Syriac,
  214: Tagalog,
  215: Tagbanwa,
  216: Tai_Le,
  217: Tai_Tham,
  218: Tai_Viet,
  219: Takri,
  220: Tamil,
  221: Telugu,
  222: Thaana,
  223: Thai,
  224: Tibetan,
  225: Tifinagh,
  226: Ugaritic,
  227: Vai,
  228: Yi.
  229: .
  230: .
  231: .SH "CHARACTER CLASSES"
  232: .rs
  233: .sp
  234:   [...]       positive character class
  235:   [^...]      negative character class
  236:   [x-y]       range (can be used for hex characters)
  237:   [[:xxx:]]   positive POSIX named set
  238:   [[:^xxx:]]  negative POSIX named set
  239: .sp
  240:   alnum       alphanumeric
  241:   alpha       alphabetic
  242:   ascii       0-127
  243:   blank       space or tab
  244:   cntrl       control character
  245:   digit       decimal digit
  246:   graph       printing, excluding space
  247:   lower       lower case letter
  248:   print       printing, including space
  249:   punct       printing, excluding alphanumeric
  250:   space       white space
  251:   upper       upper case letter
  252:   word        same as \ew
  253:   xdigit      hexadecimal digit
  254: .sp
  255: In PCRE, POSIX character set names recognize only ASCII characters by default,
  256: but some of them use Unicode properties if PCRE_UCP is set. You can use
  257: \eQ...\eE inside a character class.
  258: .
  259: .
  260: .SH "QUANTIFIERS"
  261: .rs
  262: .sp
  263:   ?           0 or 1, greedy
  264:   ?+          0 or 1, possessive
  265:   ??          0 or 1, lazy
  266:   *           0 or more, greedy
  267:   *+          0 or more, possessive
  268:   *?          0 or more, lazy
  269:   +           1 or more, greedy
  270:   ++          1 or more, possessive
  271:   +?          1 or more, lazy
  272:   {n}         exactly n
  273:   {n,m}       at least n, no more than m, greedy
  274:   {n,m}+      at least n, no more than m, possessive
  275:   {n,m}?      at least n, no more than m, lazy
  276:   {n,}        n or more, greedy
  277:   {n,}+       n or more, possessive
  278:   {n,}?       n or more, lazy
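
The difference between greedy and lazy quantifiers can be sketched as
follows. This is an illustration only: it uses Python's re module, which is
a separate engine from PCRE but accepts the same syntax for the quantifiers
shown. Possessive quantifiers are left out because older Python versions do
not accept them. The subject strings are made up for the example.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then gives back just enough
# for the final '>' to match, so the whole subject is consumed.
greedy = re.search(r"<.+>", subject).group(0)

# Lazy: .+? takes as little as it can, so the match ends at the first '>'.
lazy = re.search(r"<.+?>", subject).group(0)

# {n,m}: at least 2 and no more than 3 digits; the greedy form takes 3.
bounded = re.search(r"\d{2,3}", "12345").group(0)

# {n,m}?: the lazy form stops as soon as the minimum of 2 is satisfied.
bounded_lazy = re.search(r"\d{2,3}?", "12345").group(0)

print(greedy, lazy, bounded, bounded_lazy)
```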
  279: .
  280: .
  281: .SH "ANCHORS AND SIMPLE ASSERTIONS"
  282: .rs
  283: .sp
  284:   \eb          word boundary
  285:   \eB          not a word boundary
  286:   ^           start of subject
  287:                also after internal newline in multiline mode
  288:   \eA          start of subject
  289:   $           end of subject
  290:                also before newline at end of subject
  291:                also before internal newline in multiline mode
  292:   \eZ          end of subject
  293:                also before newline at end of subject
  294:   \ez          end of subject
  295:   \eG          first matching position in subject
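
A short sketch of how the anchors behave with and without multiline mode.
It uses Python's re module as a stand-in, on the assumption that ^, $, \A
and \b behave there as described above. \Z, \z and \G are deliberately
avoided because Python treats them differently or does not support them.

```python
import re

subject = "one\ntwo\n"

# Default mode: ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", subject)

# Multiline mode: ^ also matches after each internal newline.
starts_multiline = re.findall(r"(?m)^\w+", subject)

# \A matches only at the start of the subject, even in multiline mode.
starts_absolute = re.findall(r"(?m)\A\w+", subject)

# Default mode: $ matches before the newline that ends the subject,
# but not before the internal newline after "one".
last_word = re.search(r"\w+$", subject).group(0)

# \b: "cat" is matched only where it stands as a whole word.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(starts_default, starts_multiline, starts_absolute, last_word, whole_words)
```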
  296: .
  297: .
  298: .SH "MATCH POINT RESET"
  299: .rs
  300: .sp
  301:   \eK          reset start of match
  302: .
  303: .
  304: .SH "ALTERNATION"
  305: .rs
  306: .sp
  307:   expr|expr|expr...
  308: .
  309: .
  310: .SH "CAPTURING"
  311: .rs
  312: .sp
  313:   (...)           capturing group
  314:   (?<name>...)    named capturing group (Perl)
  315:   (?'name'...)    named capturing group (Perl)
  316:   (?P<name>...)   named capturing group (Python)
  317:   (?:...)         non-capturing group
  318:   (?|...)         non-capturing group; reset group numbers for
  319:                    capturing groups in each alternative
  320: .
  321: .
  322: .SH "ATOMIC GROUPS"
  323: .rs
  324: .sp
  325:   (?>...)         atomic, non-capturing group
  326: .
  327: .
  328: .
  329: .
  330: .SH "COMMENT"
  331: .rs
  332: .sp
  333:   (?#....)        comment (not nestable)
  334: .
  335: .
  336: .SH "OPTION SETTING"
  337: .rs
  338: .sp
  339:   (?i)            caseless
  340:   (?J)            allow duplicate names
  341:   (?m)            multiline
  342:   (?s)            single line (dotall)
  343:   (?U)            default ungreedy (lazy)
  344:   (?x)            extended (ignore white space)
  345:   (?-...)         unset option(s)
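
The effect of some of the inline options above can be sketched as follows,
again using Python's re module purely for illustration. Python accepts
(?i), (?s) and (?x) only at the start of a pattern, and does not support
(?J), (?U) or a bare (?-...), so those are not shown.

```python
import re

# (?i): letters match regardless of case.
caseless = re.search(r"(?i)pcre", "Uses PCRE 8.33").group(0)

# Without (?s), dot does not match a newline, so there is no match.
dot_default = re.search(r"a.b", "a\nb")

# With (?s), dot matches any character, including newline.
dot_all = re.search(r"(?s)a.b", "a\nb").group(0)

# (?x): white space in the pattern is ignored, so it can be spaced out.
extended = re.search(r"(?x) \d+  \.  \d+ ", "PCRE 8.33").group(0)

print(caseless, dot_default, repr(dot_all), extended)
```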
  346: .sp
  347: The following are recognized only at the start of a pattern or after one of the
  348: newline-setting options with similar syntax:
  349: .sp
  350:   (*LIMIT_MATCH=d) set the match limit to d (decimal number)
  351:   (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
  352:   (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  353:   (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
  354:   (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
  355:   (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
  356:   (*UTF)          set appropriate UTF mode for the library in use
  357:   (*UCP)          set PCRE_UCP (use Unicode properties for \ed etc)
  358: .
  359: .
  360: .SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
  361: .rs
  362: .sp
  363:   (?=...)         positive look ahead
  364:   (?!...)         negative look ahead
  365:   (?<=...)        positive look behind
  366:   (?<!...)        negative look behind
  367: .sp
  368: Each top-level branch of a look behind must be of a fixed length.
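
The four assertions can be sketched as follows, using Python's re module
as an illustrative stand-in for PCRE. The subject strings are invented for
the example. Note that Python is stricter than PCRE about lookbehind
length: it requires the whole lookbehind to have one fixed width, whereas
PCRE lets each top-level branch have its own fixed length.

```python
import re

sizes = "10px 20em 30px"
prices = "cost $42, qty 7"

# Positive lookahead: digits followed by "px"; "px" is not consumed.
px_values = re.findall(r"\d+(?=px)", sizes)

# Negative lookahead: whole numbers not followed by "px". The \d in the
# assertion stops the engine from backtracking to a shorter number.
non_px_values = re.findall(r"\b\d+(?!px|\d)", sizes)

# Positive lookbehind: digits preceded by a dollar sign.
dollar_amounts = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: whole numbers not preceded by a dollar sign.
plain_numbers = re.findall(r"(?<!\$)\b\d+", prices)

# A lookbehind with no fixed length is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_length_accepted = True
except re.error:
    variable_length_accepted = False

print(px_values, non_px_values, dollar_amounts, plain_numbers)
```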
  369: .
  370: .
  371: .SH "BACKREFERENCES"
  372: .rs
  373: .sp
  374:   \en              reference by number (can be ambiguous)
  375:   \egn             reference by number
  376:   \eg{n}           reference by number
  377:   \eg{-n}          relative reference by number
  378:   \ek<name>        reference by name (Perl)
  379:   \ek'name'        reference by name (Perl)
  380:   \eg{name}        reference by name (Perl)
  381:   \ek{name}        reference by name (.NET)
  382:   (?P=name)       reference by name (Python)
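
A minimal sketch of a numbered and a named backreference. It uses Python's
re module, which accepts the \en form and the Python-style (?P<name>...)
and (?P=name) forms listed above. It does not accept the \eg or \ek forms,
so those are not shown. The subject strings are made up for the example.

```python
import re

# \1 refers back to whatever group 1 matched: here, a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
doubled_text = doubled.group(0)
doubled_word = doubled.group(1)

# (?P=q) must match the same quote character that the group named q
# captured, so each quoted string must close with the quote that opened it.
quoted_pattern = re.compile(r"""(?P<q>['"]).*?(?P=q)""")
quoted = [m.group(0) for m in quoted_pattern.finditer("""say "hi" and 'bye'""")]

print(doubled_text, doubled_word, quoted)
```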
  383: .
  384: .
  385: .SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
  386: .rs
  387: .sp
  388:   (?R)            recurse whole pattern
  389:   (?n)            call subpattern by absolute number
  390:   (?+n)           call subpattern by relative number
  391:   (?-n)           call subpattern by relative number
  392:   (?&name)        call subpattern by name (Perl)
  393:   (?P>name)       call subpattern by name (Python)
  394:   \eg<name>        call subpattern by name (Oniguruma)
  395:   \eg'name'        call subpattern by name (Oniguruma)
  396:   \eg<n>           call subpattern by absolute number (Oniguruma)
  397:   \eg'n'           call subpattern by absolute number (Oniguruma)
  398:   \eg<+n>          call subpattern by relative number (PCRE extension)
  399:   \eg'+n'          call subpattern by relative number (PCRE extension)
  400:   \eg<-n>          call subpattern by relative number (PCRE extension)
  401:   \eg'-n'          call subpattern by relative number (PCRE extension)
  402: .
  403: .
  404: .SH "CONDITIONAL PATTERNS"
  405: .rs
  406: .sp
  407:   (?(condition)yes-pattern)
  408:   (?(condition)yes-pattern|no-pattern)
  409: .sp
  410:   (?(n)...        absolute reference condition
  411:   (?(+n)...       relative reference condition
  412:   (?(-n)...       relative reference condition
  413:   (?(<name>)...   named reference condition (Perl)
  414:   (?('name')...   named reference condition (Perl)
  415:   (?(name)...     named reference condition (PCRE)
  416:   (?(R)...        overall recursion condition
  417:   (?(Rn)...       specific group recursion condition
  418:   (?(R&name)...   specific recursion condition
  419:   (?(DEFINE)...   define subpattern for reference
  420:   (?(assert)...   assertion condition
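
An absolute reference condition can be sketched as follows. Python's re
module is used for illustration; of the forms above it supports only
(?(n)...) and (?(name)...). The pattern requires a closing parenthesis
only if an opening one was captured by group 1.

```python
import re

# Group 1 optionally captures "(". The condition (?(1)\)) then demands ")"
# if group 1 took part in the match, and nothing otherwise.
balanced = re.compile(r"^(\()?\w+(?(1)\))$")

with_parens = balanced.search("(abc)") is not None
without_parens = balanced.search("abc") is not None
missing_close = balanced.search("(abc") is not None
stray_close = balanced.search("abc)") is not None

print(with_parens, without_parens, missing_close, stray_close)
```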
  421: .
  422: .
  423: .SH "BACKTRACKING CONTROL"
  424: .rs
  425: .sp
  426: The following act as soon as they are reached:
  427: .sp
  428:   (*ACCEPT)       force successful match
  429:   (*FAIL)         force backtrack; synonym (*F)
  430:   (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
  431: .sp
  432: The following act only when a subsequent match failure causes a backtrack to
  433: reach them. They all force a match failure, but they differ in what happens
  434: afterwards. Those that advance the start-of-match point do so only if the
  435: pattern is not anchored.
  436: .sp
  437:   (*COMMIT)       overall failure, no advance of starting point
  438:   (*PRUNE)        advance to next starting character
  439:   (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  440:   (*SKIP)         advance to current matching position
  441:   (*SKIP:NAME)    advance to position corresponding to an earlier
  442:                   (*MARK:NAME); if not found, the (*SKIP) is ignored
  443:   (*THEN)         local failure, backtrack to next alternation
  444:   (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
  445: .
  446: .
  447: .SH "NEWLINE CONVENTIONS"
  448: .rs
  449: .sp
  450: These are recognized only at the very start of the pattern or after a
  451: (*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
  452: .sp
  453:   (*CR)           carriage return only
  454:   (*LF)           linefeed only
  455:   (*CRLF)         carriage return followed by linefeed
  456:   (*ANYCRLF)      all three of the above
  457:   (*ANY)          any Unicode newline sequence
  458: .
  459: .
  460: .SH "WHAT \eR MATCHES"
  461: .rs
  462: .sp
  463: These are recognized only at the very start of the pattern or after a
  464: (*...) option that sets the newline convention or a UTF or UCP mode.
  465: .sp
  466:   (*BSR_ANYCRLF)  CR, LF, or CRLF
  467:   (*BSR_UNICODE)  any Unicode newline sequence
  468: .
  469: .
  470: .SH "CALLOUTS"
  471: .rs
  472: .sp
  473:   (?C)      callout
  474:   (?Cn)     callout with data n
  475: .
  476: .
  477: .SH "SEE ALSO"
  478: .rs
  479: .sp
  480: \fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
  481: \fBpcrematching\fP(3), \fBpcre\fP(3).
  482: .
  483: .
  484: .SH AUTHOR
  485: .rs
  486: .sp
  487: .nf
  488: Philip Hazel
  489: University Computing Service
  490: Cambridge CB2 3QH, England.
  491: .fi
  492: .
  493: .
  494: .SH REVISION
  495: .rs
  496: .sp
  497: .nf
  498: Last updated: 26 April 2013
  499: Copyright (c) 1997-2013 University of Cambridge.
  500: .fi
