1.1       misho       1: .TH PCRECPP 3
                      2: .SH NAME
                      3: PCRE - Perl-compatible regular expressions.
                      4: .SH "SYNOPSIS OF C++ WRAPPER"
                      5: .rs
                      6: .sp
                      7: .B #include <pcrecpp.h>
                      8: .
                      9: .SH DESCRIPTION
                     10: .rs
                     11: .sp
                     12: The C++ wrapper for PCRE was provided by Google Inc. Some additional
                     13: functionality was added by Giuseppe Maxia. This brief man page was constructed
                     14: from the notes in the \fIpcrecpp.h\fP file, which should be consulted for
                     15: further details.
                     16: .
                     17: .
                     18: .SH "MATCHING INTERFACE"
                     19: .rs
                     20: .sp
                     21: The "FullMatch" operation checks that supplied text matches a supplied pattern
                     22: exactly. If pointer arguments are supplied, it copies the sub-strings matched
                     23: by sub-patterns into them.
                     24: .sp
                     25:   Example: successful match
                     26:      pcrecpp::RE re("h.*o");
                     27:      re.FullMatch("hello");
                     28: .sp
                     29:   Example: unsuccessful match (requires full match):
                     30:      pcrecpp::RE re("e");
                     31:      !re.FullMatch("hello");
                     32: .sp
                     33:   Example: creating a temporary RE object:
                     34:      pcrecpp::RE("h.*o").FullMatch("hello");
                     35: .sp
                     36: You can pass in a "const char*" or a "string" for "text". The examples below
                     37: tend to use a const char*. You can, as in the different examples above, store
                     38: the RE object explicitly in a variable or use a temporary RE object. The
                     39: examples below use one mode or the other arbitrarily. Either could correctly be
                     40: used for any of these examples.
                     41: .P
                     42: You must supply extra pointer arguments to extract matched subpieces.
                     43: .sp
                     44:   Example: extracts "ruby" into "s" and 1234 into "i"
                     45:      int i;
                     46:      string s;
                     47:      pcrecpp::RE re("(\e\ew+):(\e\ed+)");
                     48:      re.FullMatch("ruby:1234", &s, &i);
                     49: .sp
                     50:   Example: does not try to extract any extra sub-patterns
                     51:      re.FullMatch("ruby:1234", &s);
                     52: .sp
                     53:   Example: does not try to extract into NULL
                     54:      re.FullMatch("ruby:1234", NULL, &i);
                     55: .sp
                     56:   Example: integer overflow causes failure
                     57:      !re.FullMatch("ruby:1234567891234", NULL, &i);
                     58: .sp
                     59:   Example: fails because there aren't enough sub-patterns:
                     60:      !pcrecpp::RE("\e\ew+:\e\ed+").FullMatch("ruby:1234", &s);
                     61: .sp
                     62:   Example: fails because string cannot be stored in integer
                     63:      !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
                     64: .sp
                     65: The provided pointer arguments can be pointers to any scalar numeric
                     66: type, or one of:
                     67: .sp
                     68:    string        (matched piece is copied to string)
                     69:    StringPiece   (StringPiece is mutated to point to matched piece)
                     70:    T             (where "bool T::ParseFrom(const char*, int)" exists)
                     71:    NULL          (the corresponding matched sub-pattern is not copied)
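As an illustration of the "T" row above, the following is a minimal sketch of a user-defined type that provides the required ParseFrom member. The HostPort type and its "host:port" text format are hypothetical and invented for this example; only the ParseFrom signature is taken from the table.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical user type; only the ParseFrom signature comes from the text.
struct HostPort {
  std::string host;
  int port;

  HostPort() : port(0) {}

  // Parses "host:port" from the n bytes starting at str (not NUL-terminated).
  // Returns false, leaving *this unchanged, if the text is malformed.
  bool ParseFrom(const char* str, int n) {
    if (str == NULL || n <= 0) return false;
    const std::string text(str, static_cast<std::string::size_type>(n));
    const std::string::size_type colon = text.rfind(':');
    if (colon == std::string::npos || colon == 0 || colon + 1 == text.size())
      return false;
    int value = 0;
    for (std::string::size_type i = colon + 1; i < text.size(); ++i) {
      if (text[i] < '0' || text[i] > '9') return false;
      value = value * 10 + (text[i] - '0');
      if (value > 65535) return false;  // larger than any valid port number
    }
    host = text.substr(0, colon);
    port = value;
    return true;
  }
};
```

A pointer to such an object should then be usable as an extra argument to the matching functions, in which case the captured text is handed to ParseFrom and a false return makes the match fail.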
                     72: .sp
                     73: The function returns true iff all of the following conditions are satisfied:
                     74: .sp
                     75:   a. "text" matches "pattern" exactly;
                     76: .sp
                     77:   b. The number of matched sub-patterns is >= number of supplied
                     78:      pointers;
                     79: .sp
                     80:   c. The "i"th argument has a suitable type for holding the
                     81:      string captured as the "i"th sub-pattern. If you pass in
                     82:      void * NULL for the "i"th argument, or a non-void * NULL
                     83:      of the correct type, or pass fewer arguments than the
                     84:      number of sub-patterns, the "i"th captured sub-pattern is
                     85:      ignored.
                     86: .sp
                     87: CAVEAT: An optional sub-pattern that does not exist in the matched
                     88: string is assigned the empty string. Therefore, the following will
                     89: return false (because the empty string is not a valid number):
                     90: .sp
                     91:    int number;
                     92:    pcrecpp::RE("[a-z]+(\e\ed+)?").FullMatch("abc", &number);
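The failure cases above (overflow, non-numeric text, and the empty string from an unmatched optional group) all follow from a strict text-to-integer conversion. The following is a minimal standalone sketch of such a conversion built on strtol. The function name parse_decimal_int is hypothetical, and this is an assumption about the kind of check involved, not pcrecpp's own code.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Strict base-10 conversion: the whole captured text must be a number that
// fits in an int. Hypothetical helper, used only to model the documented rules.
bool parse_decimal_int(const std::string& captured, int* out) {
  if (captured.empty()) return false;  // e.g. an unmatched optional group
  if (std::isspace(static_cast<unsigned char>(captured[0])))
    return false;                      // strtol would silently skip this
  errno = 0;
  char* end = NULL;
  const long v = std::strtol(captured.c_str(), &end, 10);
  if (errno == ERANGE) return false;             // overflowed long
  if (*end != '\0') return false;                // leftover non-digit text
  if (v < INT_MIN || v > INT_MAX) return false;  // does not fit in an int
  *out = static_cast<int>(v);
  return true;
}
```

On failure the destination is left untouched, so a caller cannot mistake a stale value for a parsed one unless it ignores the return value.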
                     93: .sp
                     94: The matching interface supports at most 16 arguments per call.
                     95: If you need more, consider using the more general interface
                     96: \fBpcrecpp::RE::DoMatch\fP. See \fBpcrecpp.h\fP for the signature for
                     97: \fBDoMatch\fP.
                     98: .P
                     99: NOTE: Do not use \fBno_arg\fP, which is used internally to mark the end of a
                    100: list of optional arguments, as a placeholder for missing arguments, as this can
                    101: lead to segfaults.
                    102: .
                    103: .
                    104: .SH "QUOTING METACHARACTERS"
                    105: .rs
                    106: .sp
                    107: You can use the "QuoteMeta" operation to insert backslashes before all
                    108: potentially meaningful characters in a string. The returned string, used as a
                    109: regular expression, will exactly match the original string.
                    110: .sp
                    111:   Example:
                    112:      string quoted = RE::QuoteMeta(unquoted);
                    113: .sp
                    114: Note that it's legal to escape a character even if it has no special meaning in
                    115: a regular expression -- so this function does that. (This also makes it
                    116: identical to the Perl function of the same name; see "perldoc -f quotemeta".)
                    117: For example, "1.5-2.0?" becomes "1\e.5\e-2\e.0\e?".
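The quoting rule described above can be sketched in a few lines. This is a standalone approximation for illustration, not the library's implementation: the name quote_meta_sketch is hypothetical, and passing bytes with the high bit set through unescaped (so that multi-byte characters survive) is an assumption of the sketch. NUL bytes are not given any special treatment here.

```cpp
#include <cassert>
#include <string>

// Puts a backslash before every ASCII byte that is not a letter, a digit,
// or an underscore. Bytes >= 128 are copied unchanged (sketch assumption).
std::string quote_meta_sketch(const std::string& unquoted) {
  std::string result;
  result.reserve(unquoted.size() * 2);
  for (std::string::size_type i = 0; i < unquoted.size(); ++i) {
    const unsigned char c = static_cast<unsigned char>(unquoted[i]);
    const bool word_char = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                           (c >= '0' && c <= '9') || c == '_';
    if (!word_char && c < 128) result += '\\';
    result += unquoted[i];
  }
  return result;
}
```

Escaping every non-word character, rather than only the known metacharacters, keeps the rule simple and safe even if new metacharacters are introduced later.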
                    118: .
                    119: .SH "PARTIAL MATCHES"
                    120: .rs
                    121: .sp
                    122: You can use the "PartialMatch" operation when you want the pattern
                    123: to match any substring of the text.
                    124: .sp
                    125:   Example: simple search for a string:
                    126:      pcrecpp::RE("ell").PartialMatch("hello");
                    127: .sp
                    128:   Example: find first number in a string:
                    129:      int number;
                    130:      pcrecpp::RE re("(\e\ed+)");
                    131:      re.PartialMatch("x*100 + 20", &number);
                    132:      assert(number == 100);
                    133: .
                    134: .
                    135: .SH "UTF-8 AND THE MATCHING INTERFACE"
                    136: .rs
                    137: .sp
                    138: By default, pattern and text are plain text, one byte per character. The UTF8
                    139: flag, passed to the constructor, causes both pattern and string to be treated
                    140: as UTF-8 text, still a byte stream but potentially multiple bytes per
                    141: character. In practice, the text is likelier to be UTF-8 than the pattern, but
                    142: the match returned may depend on the UTF8 flag, so always use it when matching
                    143: UTF-8 text. For example, "." matches one byte normally, but with UTF8 set it
                    144: matches all the bytes of a multi-byte character.
                    145: .sp
                    146:   Example:
                    147:      pcrecpp::RE_Options options;
                    148:      options.set_utf8();
                    149:      pcrecpp::RE re(utf8_pattern, options);
                    150:      re.FullMatch(utf8_string);
                    151: .sp
                    152:   Example: using the convenience function UTF8():
                    153:      pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
                    154:      re.FullMatch(utf8_string);
                    155: .sp
                    156: NOTE: The UTF8 flag is ignored if PCRE was not configured with the
                    157:       --enable-utf8 flag.
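To see why the flag changes how many bytes a single character such as "." consumes, the following minimal sketch derives the byte length of one UTF-8 character from its first byte. It is standalone illustration code, not part of pcrecpp; utf8_sequence_length is a hypothetical helper, and it checks only the bit pattern of the lead byte, not whether the full sequence is valid.

```cpp
#include <cassert>
#include <cstddef>

// Returns how many bytes the UTF-8 sequence starting with `lead` occupies,
// or 0 if `lead` cannot start a sequence (for example a continuation byte).
std::size_t utf8_sequence_length(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: plain ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}
```

Without the UTF8 flag, each of those bytes is treated as a separate character, which is why the same pattern can produce different matches on the same text.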
                    158: .
                    159: .
                    160: .SH "PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE"
                    161: .rs
                    162: .sp
                    163: PCRE defines some modifiers to change the behavior of the regular expression
                    164: engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
                    165: pass such modifiers to a RE class. Currently, the following modifiers are
                    166: supported:
                    167: .sp
                    168:    modifier              description                 Perl equivalent
                    169: .sp
                    170:    PCRE_CASELESS         case insensitive match      /i
                    171:    PCRE_MULTILINE        multiple lines match        /m
                    172:    PCRE_DOTALL           dot matches newlines        /s
                    173:    PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
                    174:    PCRE_EXTRA            strict escape parsing       N/A
                    175:    PCRE_EXTENDED         ignore whitespace           /x
                    176:    PCRE_UTF8             handles UTF8 chars          built-in
                    177:    PCRE_UNGREEDY         reverses * and *?           N/A
                    178:    PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
                    179: .sp
                    180: (*) Both Perl and PCRE allow non-capturing parentheses by means of the
                    181: "?:" modifier within the pattern itself. For example, (?:ab|cd) does not
                    182: capture, while (ab|cd) does.
                    183: .P
                    184: For a full account of how each modifier works, see the
                    185: PCRE API reference page.
                    186: .P
                    187: For each modifier, there are two member functions whose names are formed
                    188: from the modifier name in lowercase, without the "PCRE_" prefix. For
                    189: instance, PCRE_CASELESS is handled by
                    190: .sp
                    191:   bool caseless()
                    192: .sp
                    193: which returns true if the modifier is set, and
                    194: .sp
                    195:   RE_Options & set_caseless(bool)
                    196: .sp
                    197: which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
                    198: accessed through the \fBset_match_limit()\fP and \fBmatch_limit()\fP member
                    199: functions. Setting \fImatch_limit\fP to a non-zero value will limit the
                    200: execution of pcre to keep it from doing bad things like blowing the stack or
                    201: taking an eternity to return a result. A value of 5000 is good enough to stop
                    202: stack blowup in a 2MB thread stack. Setting \fImatch_limit\fP to zero disables
                    203: match limiting. Alternatively, you can call \fBset_match_limit_recursion()\fP,
                    204: which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
                    205: recurses. \fBmatch_limit()\fP limits the number of matches PCRE does;
                    206: \fBmatch_limit_recursion()\fP limits the depth of internal recursion, and
                    207: therefore the amount of stack that is used.
                    208: .P
                    209: Normally, to pass one or more modifiers to a RE class, you declare
                    210: a \fIRE_Options\fP object, set the appropriate options, and pass this
                    211: object to a RE constructor. Example:
                    212: .sp
                    213:    RE_Options opt;
                    214:    opt.set_caseless(true);
                    215:    if (RE("HELLO", opt).PartialMatch("hello world")) ...
                    216: .sp
                    217: RE_Options has two constructors. The default constructor takes no arguments and
                    218: creates a set of flags that are all off. The second takes an
                    219: \fIoption_flags\fP parameter, to ease the transfer of legacy code from C programs.
                    220: This lets you do
                    221: .sp
                    222:    RE(pattern,
                    223:      RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
                    224: .sp
                    225: However, new code is better off doing
                    226: .sp
                    227:    RE(pattern,
                    228:      RE_Options().set_caseless(true).set_multiline(true))
                    229:        .PartialMatch(str);
                    230: .sp
                    231: If you are going to pass one of the most used modifiers, there are some
                    232: convenience functions that return a RE_Options object with the
                    233: appropriate modifier already set: \fBCASELESS()\fP, \fBUTF8()\fP,
                    234: \fBMULTILINE()\fP, \fBDOTALL()\fP, and \fBEXTENDED()\fP.
                    235: .P
                    236: If you need to set several options at once, and you don't want to go to the
                    237: trouble of declaring a RE_Options object and setting each option separately,
                    238: you can do it on the fly in a single expression. You can chain
                    239: several \fBset_xxxxx()\fP member functions, since each of them returns a
                    240: reference to its class object. For example, to pass PCRE_CASELESS,
                    241: PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
                    242: .sp
                    243:    RE(" ^ xyz \e\es+ .* blah$",
                    244:      RE_Options()
                    245:        .set_caseless(true)
                    246:        .set_extended(true)
                    247:        .set_multiline(true)).PartialMatch(sometext);
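The chaining shown above works because each setter returns a reference to the object it was called on. The following minimal sketch shows that pattern in isolation. DemoOptions and the DEMO_* values are hypothetical names invented for this example; they are not RE_Options or the PCRE_* constants, and the sketch makes no claim about how RE_Options stores its flags.

```cpp
#include <cassert>

// Hypothetical flag values for this sketch; not the PCRE_* constants.
enum { DEMO_CASELESS = 1, DEMO_MULTILINE = 2, DEMO_EXTENDED = 4 };

// Hypothetical options class modelled on the description of RE_Options.
class DemoOptions {
 public:
  DemoOptions() : flags_(0) {}                                  // all off
  explicit DemoOptions(int option_flags) : flags_(option_flags) {}

  bool caseless() const { return (flags_ & DEMO_CASELESS) != 0; }
  bool multiline() const { return (flags_ & DEMO_MULTILINE) != 0; }
  bool extended() const { return (flags_ & DEMO_EXTENDED) != 0; }
  int all_options() const { return flags_; }

  // Each setter returns *this so calls can be chained, even on a temporary.
  DemoOptions& set_caseless(bool on) { return set(DEMO_CASELESS, on); }
  DemoOptions& set_multiline(bool on) { return set(DEMO_MULTILINE, on); }
  DemoOptions& set_extended(bool on) { return set(DEMO_EXTENDED, on); }

 private:
  DemoOptions& set(int bit, bool on) {
    if (on) flags_ |= bit; else flags_ &= ~bit;
    return *this;
  }
  int flags_;
};
```

In this sketch, DemoOptions().set_caseless(true).set_multiline(true) and DemoOptions(DEMO_CASELESS | DEMO_MULTILINE) produce the same flags; the chained form names each option explicitly and can also switch one off again.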
                    248: .sp
                    249: .
                    250: .
                    251: .SH "SCANNING TEXT INCREMENTALLY"
                    252: .rs
                    253: .sp
                    254: The "Consume" operation may be useful if you want to repeatedly
                    255: match regular expressions at the front of a string and skip over
                    256: them as they match. This requires use of the "StringPiece" type,
                    257: which represents a sub-range of a real string. Like RE, StringPiece
                    258: is defined in the pcrecpp namespace.
                    259: .sp
                    260:   Example: read lines of the form "var = value" from a string.
                    261:      string contents = ...;                 // Fill string somehow
                    262:      pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
                    263: .sp
                    264:      string var;
                    265:      int value;
                    266:      pcrecpp::RE re("(\e\ew+) = (\e\ed+)\en");
                    267:      while (re.Consume(&input, &var, &value)) {
                    268:        ...;
                    269:      }
                    270: .sp
                    271: Each successful call to "Consume" sets "var" and "value", and also
                    272: advances "input" so it points past the matched text.
                    273: .P
                    274: The "FindAndConsume" operation is similar to "Consume" but does not
                    275: anchor your match at the beginning of the string. For example, you
                    276: could extract all words from a string by repeatedly calling
                    277: .sp
                    278:   pcrecpp::RE("(\e\ew+)").FindAndConsume(&input, &word)
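The anchored match-and-advance behaviour of "Consume" can be modelled with the standard library. In the following minimal sketch, std::regex stands in for pcrecpp and an iterator into a std::string stands in for the StringPiece; the function consume_assignment is hypothetical and hard-codes the "var = value" pattern from the example above.

```cpp
#include <cassert>
#include <regex>
#include <stdexcept>
#include <string>

// Tries to match "var = value\n" exactly at `pos`. On success, stores the
// captures and advances `pos` past the matched text; on failure, leaves
// everything unchanged.
bool consume_assignment(std::string::const_iterator& pos,
                        std::string::const_iterator end,
                        std::string* var, int* value) {
  static const std::regex re("(\\w+) = (\\d+)\\n");
  std::smatch m;
  // match_continuous anchors the match at `pos`, as "Consume" does.
  if (!std::regex_search(pos, end, m, re,
                         std::regex_constants::match_continuous)) {
    return false;
  }
  int parsed = 0;
  try {
    parsed = std::stoi(m[2].str());
  } catch (const std::out_of_range&) {
    return false;  // the digits do not fit in an int
  }
  *var = m[1].str();
  *value = parsed;
  pos = m[0].second;  // skip over the consumed text
  return true;
}
```

Dropping the match_continuous flag would let the search start anywhere in the remaining text, which corresponds to the difference between "Consume" and "FindAndConsume".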
                    279: .
                    280: .
                    281: .SH "PARSING HEX/OCTAL/C-RADIX NUMBERS"
                    282: .rs
                    283: .sp
                    284: By default, if you pass a pointer to a numeric value, the
                    285: corresponding text is interpreted as a base-10 number. You can
                    286: instead wrap the pointer with a call to one of the operators Hex(),
                    287: Octal(), or CRadix() to interpret the text in another base. The
                    288: CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
                    289: prefixes, but defaults to base-10.
                    290: .sp
                    291:   Example:
                    292:     int a, b, c, d;
                    293:     pcrecpp::RE re("(.*) (.*) (.*) (.*)");
                    294:     re.FullMatch("100 40 0100 0x40",
                    295:                  pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                    296:                  pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
                    297: .sp
                    298: will leave 64 in a, b, c, and d.
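The same four conversions can be reproduced with the C library function strtol, whose radix argument of 0 selects the C-style prefix rule. The following is a minimal standalone sketch; parse_with_radix is a hypothetical helper, and the use of strtol is an illustration of the radix rules rather than a statement about how the wrapper is implemented.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Converts all of `text` to a long in the given radix. Radix 0 selects the
// C-style rule: "0x" prefix means hex, a leading "0" means octal, otherwise
// decimal. Returns false if the text is empty, has trailing junk, or overflows.
bool parse_with_radix(const char* text, int radix, long* out) {
  errno = 0;
  char* end = NULL;
  const long v = std::strtol(text, &end, radix);
  if (end == text || *end != '\0' || errno == ERANGE) return false;
  *out = v;
  return true;
}
```

Called with ("100", 8), ("40", 16), ("0100", 0), and ("0x40", 0), this sketch yields 64 each time, matching the values of a, b, c, and d in the example above.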
                    299: .
                    300: .
                    301: .SH "REPLACING PARTS OF STRINGS"
                    302: .rs
                    303: .sp
                    304: You can replace the first match of "pattern" in "str" with "rewrite".
                    305: Within "rewrite", backslash-escaped digits (\e1 to \e9) can be
                    306: used to insert the text matching the corresponding parenthesized group
                    307: from the pattern. \e0 in "rewrite" refers to the entire matching
                    308: text. For example:
                    309: .sp
                    310:   string s = "yabba dabba doo";
                    311:   pcrecpp::RE("b+").Replace("d", &s);
                    312: .sp
                    313: will leave "s" containing "yada dabba doo". The result is true if the pattern
                    314: matches and a replacement occurs, false otherwise.
                    315: .P
                    316: \fBGlobalReplace\fP is like \fBReplace\fP except that it replaces all
                    317: occurrences of the pattern in the string with the rewrite. Replacements are
                    318: not subject to re-matching. For example:
                    319: .sp
                    320:   string s = "yabba dabba doo";
                    321:   pcrecpp::RE("b+").GlobalReplace("d", &s);
                    322: .sp
                    323: will leave "s" containing "yada dada doo". It returns the number of
                    324: replacements made.
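The difference between replacing the first match and replacing every match, along with the two return values, can be modelled with the standard library. In the following minimal sketch, std::regex stands in for pcrecpp; replace_first and replace_all are hypothetical helpers, and their rewrite strings use std::regex's "$1" syntax rather than the backslash-digit syntax described above.

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <string>

// Replaces only the first match of `pattern` in *str.
// Returns true if a replacement happened.
bool replace_first(const std::string& pattern, const std::string& rewrite,
                   std::string* str) {
  const std::regex re(pattern);
  if (!std::regex_search(*str, re)) return false;
  *str = std::regex_replace(*str, re, rewrite,
                            std::regex_constants::format_first_only);
  return true;
}

// Replaces every non-overlapping match of `pattern` in *str.
// Returns the number of replacements made.
int replace_all(const std::string& pattern, const std::string& rewrite,
                std::string* str) {
  const std::regex re(pattern);
  const int count = static_cast<int>(std::distance(
      std::sregex_iterator(str->begin(), str->end(), re),
      std::sregex_iterator()));
  *str = std::regex_replace(*str, re, rewrite);
  return count;
}
```

Both helpers scan the original text from left to right, so text produced by a replacement is never matched again.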
                    325: .P
                    326: \fBExtract\fP is like \fBReplace\fP, except that if the pattern matches,
                    327: "rewrite" is copied into "out" (an additional argument) with substitutions.
                    328: The non-matching portions of "text" are ignored. Returns true iff a match
                    329: occurred and the extraction happened successfully; if no match occurs,
                    330: "out" is left unchanged.
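The behaviour described for Extract, where only the rewritten match reaches the output and a failed match leaves the output alone, can be modelled as follows. This is a minimal sketch in which std::regex stands in for pcrecpp; extract_sketch is a hypothetical helper, and its rewrite string uses std::regex's "$1" syntax rather than backslash-escaped digits.

```cpp
#include <cassert>
#include <regex>
#include <string>

// If `pattern` matches somewhere in `text`, stores the expanded `rewrite`
// in *out and returns true. Otherwise returns false and leaves *out alone.
bool extract_sketch(const std::string& pattern, const std::string& rewrite,
                    const std::string& text, std::string* out) {
  const std::regex re(pattern);
  std::smatch m;
  if (!std::regex_search(text, m, re)) return false;
  *out = m.format(rewrite);  // only the rewritten match, not the text around it
  return true;
}
```

For instance, applying the pattern "(\\w+)@(\\w+)" with the rewrite "$2!$1" to the text "mail boris@kremvax now" stores "kremvax!boris" and discards the surrounding words.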
                    331: .
                    332: .
                    333: .SH AUTHOR
                    334: .rs
                    335: .sp
                    336: .nf
                    337: The C++ wrapper was contributed by Google Inc.
                    338: Copyright (c) 2007 Google Inc.
                    339: .fi
                    340: .
                    341: .
                    342: .SH REVISION
                    343: .rs
                    344: .sp
                    345: .nf
                    346: Last updated: 17 March 2009
                    347: Minor typo fixed: 25 July 2011
                    348: .fi