Annotation of embedaddon/pcre/doc/html/pcrecpp.html, revision 1.1.1.4

<html>
<head>
<title>pcrecpp specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrecpp man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS OF C++ WRAPPER</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">MATCHING INTERFACE</a>
<li><a name="TOC4" href="#SEC4">QUOTING METACHARACTERS</a>
<li><a name="TOC5" href="#SEC5">PARTIAL MATCHES</a>
<li><a name="TOC6" href="#SEC6">UTF-8 AND THE MATCHING INTERFACE</a>
<li><a name="TOC7" href="#SEC7">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a>
<li><a name="TOC8" href="#SEC8">SCANNING TEXT INCREMENTALLY</a>
<li><a name="TOC9" href="#SEC9">PARSING HEX/OCTAL/C-RADIX NUMBERS</a>
<li><a name="TOC10" href="#SEC10">REPLACING PARTS OF STRINGS</a>
<li><a name="TOC11" href="#SEC11">AUTHOR</a>
<li><a name="TOC12" href="#SEC12">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS OF C++ WRAPPER</a><br>
<P>
<b>#include &#60;pcrecpp.h&#62;</b>
</P>
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
<P>
The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was constructed
from the notes in the <i>pcrecpp.h</i> file, which should be consulted for
further details. Note that the C++ wrapper supports only the original 8-bit
PCRE library. There is no 16-bit or 32-bit support at present.
</P>
<br><a name="SEC3" href="#TOC1">MATCHING INTERFACE</a><br>
<P>
The "FullMatch" operation checks that supplied text matches a supplied pattern
exactly. If pointer arguments are supplied, it copies matched sub-strings that
match sub-patterns into them.
<pre>
  Example: successful match
     pcrecpp::RE re("h.*o");
     re.FullMatch("hello");

  Example: unsuccessful match (requires full match):
     pcrecpp::RE re("e");
     !re.FullMatch("hello");

  Example: creating a temporary RE object:
     pcrecpp::RE("h.*o").FullMatch("hello");
</pre>
You can pass in a "const char*" or a "string" for "text". The examples below
tend to use a const char*. You can, as in the different examples above, store
the RE object explicitly in a variable or use a temporary RE object. The
examples below use one mode or the other arbitrarily. Either could correctly be
used for any of these examples.
</P>
<P>
You must supply extra pointer arguments to extract matched subpieces.
<pre>
  Example: extracts "ruby" into "s" and 1234 into "i"
     int i;
     string s;
     pcrecpp::RE re("(\\w+):(\\d+)");
     re.FullMatch("ruby:1234", &s, &i);

  Example: does not try to extract any extra sub-patterns
     re.FullMatch("ruby:1234", &s);

  Example: does not try to extract into NULL
     re.FullMatch("ruby:1234", NULL, &i);

  Example: integer overflow causes failure
     !re.FullMatch("ruby:1234567891234", NULL, &i);

  Example: fails because there aren't enough sub-patterns:
     !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);

  Example: fails because string cannot be stored in integer
     !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
</pre>
The provided pointer arguments can be pointers to any scalar numeric
type, or one of:
<pre>
   string        (matched piece is copied to string)
   StringPiece   (StringPiece is mutated to point to matched piece)
   T             (where "bool T::ParseFrom(const char*, int)" exists)
   NULL          (the corresponding matched sub-pattern is not copied)
</pre>
The function returns true if and only if all of the following conditions are
satisfied:
<pre>
  a. "text" matches "pattern" exactly;

  b. The number of matched sub-patterns is &#62;= number of supplied
     pointers;

  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     void * NULL for the "i"th argument, or a non-void * NULL
     of the correct type, or pass fewer arguments than the
     number of sub-patterns, the "i"th captured sub-pattern is
     ignored.
</pre>
CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):
<pre>
   int number;
   pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
</pre>
The matching interface supports at most 16 arguments per call.
If you need more, consider using the more general interface
<b>pcrecpp::RE::DoMatch</b>. See <b>pcrecpp.h</b> for the signature for
<b>DoMatch</b>.
</P>
<P>
NOTE: Do not use <b>no_arg</b>, which is used internally to mark the end of a
list of optional arguments, as a placeholder for missing arguments, as this can
lead to segfaults.
</P>
<br><a name="SEC4" href="#TOC1">QUOTING METACHARACTERS</a><br>
<P>
You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string, used as a
regular expression, will exactly match the original string.
<pre>
  Example:
     string quoted = pcrecpp::RE::QuoteMeta(unquoted);
</pre>
It is legal to escape a character even if it has no special meaning in a
regular expression, so this function escapes such characters too. (This also
makes it identical to the Perl function of the same name; see
"perldoc -f quotemeta".) For example, "1.5-2.0?" becomes "1\.5\-2\.0\?".
</P>
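<P>
The escaping rule described above can be sketched in plain C++ without the
library. This is only an illustration of the rule, not the wrapper's own code:
<i>quote_meta_sketch</i> is a hypothetical helper, and it assumes that every
byte other than an ASCII letter, digit, or underscore gets a backslash. The real
QuoteMeta may treat some bytes (for example NUL or non-ASCII bytes) differently.
</P>

```cpp
#include <string>

// Hypothetical helper (not part of pcrecpp): put a backslash in front of
// every byte that is not an ASCII letter, an ASCII digit, or an underscore.
std::string quote_meta_sketch(const std::string& unquoted) {
  std::string quoted;
  quoted.reserve(unquoted.size() * 2);  // worst case: every byte is escaped
  for (char c : unquoted) {
    const bool plain = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                       (c >= '0' && c <= '9') || c == '_';
    if (!plain) {
      quoted += '\\';
    }
    quoted += c;
  }
  return quoted;
}
```
<P>
With this sketch, quote_meta_sketch("1.5-2.0?") yields the same text as the
example in this section, and a string such as "abc_123" is returned unchanged.
</P>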
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHES</a><br>
<P>
You can use the "PartialMatch" operation when you want the pattern
to match any substring of the text.
<pre>
  Example: simple search for a string:
     pcrecpp::RE("ell").PartialMatch("hello");

  Example: find first number in a string:
     int number;
     pcrecpp::RE re("(\\d+)");
     re.PartialMatch("x*100 + 20", &number);
     assert(number == 100);
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">UTF-8 AND THE MATCHING INTERFACE</a><br>
<P>
By default, pattern and text are plain text, one byte per character. The UTF8
flag, passed to the constructor, causes both pattern and string to be treated
as UTF-8 text, still a byte stream but potentially multiple bytes per
character. In practice, the text is likelier to be UTF-8 than the pattern, but
the match returned may depend on the UTF8 flag, so always use it when matching
UTF-8 text. For example, "." will match one byte normally, but with UTF8 set it
matches one whole character, which may occupy several bytes.
<pre>
  Example:
     pcrecpp::RE_Options options;
     options.set_utf8();
     pcrecpp::RE re(utf8_pattern, options);
     re.FullMatch(utf8_string);

  Example: using the convenience function UTF8():
     pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
     re.FullMatch(utf8_string);
</pre>
NOTE: The UTF8 flag is ignored if PCRE was not configured with the
--enable-utf8 flag.
</P>
<br><a name="SEC7" href="#TOC1">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a><br>
<P>
PCRE defines some modifiers to change the behavior of the regular expression
engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
pass such modifiers to a RE class. Currently, the following modifiers are
supported:
<pre>
   modifier              description               Perl equivalent

   PCRE_CASELESS         case insensitive match      /i
   PCRE_MULTILINE        multiple lines match        /m
   PCRE_DOTALL           dot matches newlines        /s
   PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
   PCRE_EXTRA            strict escape parsing       N/A
   PCRE_EXTENDED         ignore whitespace           /x
   PCRE_UTF8             handles UTF8 chars          built-in
   PCRE_UNGREEDY         reverses * and *?           N/A
   PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
</pre>
(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself. For example, (?:ab|cd) does not
capture, while (ab|cd) does.
</P>
<P>
For a full account of how each modifier works, please check the
PCRE API reference page.
</P>
<P>
For each modifier, there are two member functions whose names are derived
from the modifier name in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
<pre>
  bool caseless()
</pre>
which returns true if the modifier is set, and
<pre>
  RE_Options & set_caseless(bool)
</pre>
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
accessed through the <b>set_match_limit()</b> and <b>match_limit()</b> member
functions. Setting <i>match_limit</i> to a non-zero value will limit the
execution of PCRE to keep it from doing bad things like blowing the stack or
taking an eternity to return a result. A value of 5000 is good enough to stop
stack blowup in a 2MB thread stack. Setting <i>match_limit</i> to zero disables
match limiting. Alternatively, you can call <b>set_match_limit_recursion()</b>,
which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
recurses. The match limit bounds the number of times PCRE's internal matching
function is called; the recursion limit bounds the depth of internal recursion,
and therefore the amount of stack that is used.
</P>
<P>
Normally, to pass one or more modifiers to a RE class, you declare
a <i>RE_Options</i> object, set the appropriate options, and pass this
object to a RE constructor. Example:
<pre>
   RE_Options opt;
   opt.set_caseless(true);
   if (RE("HELLO", opt).PartialMatch("hello world")) ...
</pre>
RE_Options has two constructors. The default constructor takes no arguments and
creates a set of flags that are off by default. The second constructor takes an
<i>option_flags</i> parameter, which is there to facilitate transfer of legacy
code from C programs. This lets you do
<pre>
   RE(pattern,
     RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
</pre>
However, new code is better off doing
<pre>
   RE(pattern,
     RE_Options().set_caseless(true).set_multiline(true))
       .PartialMatch(str);
</pre>
If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options class with the
appropriate modifier already set: <b>CASELESS()</b>, <b>UTF8()</b>,
<b>MULTILINE()</b>, <b>DOTALL()</b>, and <b>EXTENDED()</b>.
</P>
<P>
If you need to set several options at once, and you don't want to go through
the trouble of declaring a RE_Options object and setting several options, there
is a parallel method that gives you this ability on the fly. You can chain
several <b>set_xxxxx()</b> member functions, since each of them returns a
reference to its class object. For example, to pass PCRE_CASELESS,
PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
<pre>
   RE(" ^ xyz \\s+ .* blah$",
     RE_Options()
       .set_caseless(true)
       .set_extended(true)
       .set_multiline(true)).PartialMatch(sometext);

</PRE>
</P>
<br><a name="SEC8" href="#TOC1">SCANNING TEXT INCREMENTALLY</a><br>
<P>
The "Consume" operation may be useful if you want to repeatedly
match regular expressions at the front of a string and skip over
them as they match. This requires use of the "StringPiece" type,
which represents a sub-range of a real string. Like RE, StringPiece
is defined in the pcrecpp namespace.
<pre>
  Example: read lines of the form "var = value" from a string.
     string contents = ...;                 // Fill string somehow
     pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

     string var;
     int value;
     pcrecpp::RE re("(\\w+) = (\\d+)\n");
     while (re.Consume(&input, &var, &value)) {
       ...;
     }
</pre>
Each successful call to "Consume" will set "var" and "value", and also
advance "input" so it points past the matched text.
</P>
<P>
The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling
<pre>
  pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
</PRE>
</P>
<br><a name="SEC9" href="#TOC1">PARSING HEX/OCTAL/C-RADIX NUMBERS</a><br>
<P>
By default, if you pass a pointer to a numeric value, the
corresponding text is interpreted as a base-10 number. You can
instead wrap the pointer with a call to one of the operators Hex(),
Octal(), or CRadix() to interpret the text in another base. The
CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
prefixes, but defaults to base-10.
<pre>
  Example:
    int a, b, c, d;
    pcrecpp::RE re("(.*) (.*) (.*) (.*)");
    re.FullMatch("100 40 0100 0x40",
                 pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                 pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
</pre>
will leave 64 in a, b, c, and d.
</P>
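<P>
The arithmetic behind the example above can be checked with the standard
library alone. The sketch below uses std::strtol, whose base 8, base 16, and
base 0 (C-style prefix detection) modes follow the same conventions as the
Octal, Hex, and CRadix descriptions. The helper <i>parse_in_base</i> is
hypothetical, and this page does not say whether the wrapper uses strtol
internally.
</P>

```cpp
#include <cstdlib>

// Hypothetical helper: parse text as an integer in the given base.
// Base 0 asks strtol to detect a leading "0" (octal) or "0x" (hex) prefix
// and to fall back to decimal otherwise.
long parse_in_base(const char* text, int base) {
  return std::strtol(text, nullptr, base);
}
```
<P>
Here "100" in base 8, "40" in base 16, and both "0100" and "0x40" with prefix
detection all evaluate to 64, which agrees with the result stated above.
</P>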
<br><a name="SEC10" href="#TOC1">REPLACING PARTS OF STRINGS</a><br>
<P>
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be
used to insert the text matching the corresponding parenthesized group
from the pattern. \0 in "rewrite" refers to the entire matching
text. For example:
<pre>
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").Replace("d", &s);
</pre>
will leave "s" containing "yada dabba doo". The result is true if the pattern
matches and a replacement occurs, false otherwise.
</P>
<P>
<b>GlobalReplace</b> is like <b>Replace</b> except that it replaces all
occurrences of the pattern in the string with the rewrite. Replacements are
not subject to re-matching. For example:
<pre>
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").GlobalReplace("d", &s);
</pre>
will leave "s" containing "yada dada doo". It returns the number of
replacements made.
</P>
<P>
<b>Extract</b> is like <b>Replace</b>, except that if the pattern matches,
"rewrite" is copied into "out" (an additional argument) with substitutions.
The non-matching portions of "text" are ignored. Returns true iff a match
occurred and the extraction happened successfully; if no match occurs, the
string is left unaffected.
</P>
<br><a name="SEC11" href="#TOC1">AUTHOR</a><br>
<P>
The C++ wrapper was contributed by Google Inc.
<br>
Copyright &copy; 2007 Google Inc.
<br>
</P>
<br><a name="SEC12" href="#TOC1">REVISION</a><br>
<P>
Last updated: 08 January 2012
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>