    1: .TH PCRECPP 3 "08 January 2012" "PCRE 8.30"
    2: .SH NAME
    3: PCRE - Perl-compatible regular expressions.
    4: .SH "SYNOPSIS OF C++ WRAPPER"
    5: .rs
    6: .sp
    7: .B #include <pcrecpp.h>
    8: .
    9: .SH DESCRIPTION
   10: .rs
   11: .sp
   12: The C++ wrapper for PCRE was provided by Google Inc. Some additional
   13: functionality was added by Giuseppe Maxia. This brief man page was constructed
   14: from the notes in the \fIpcrecpp.h\fP file, which should be consulted for
   15: further details. Note that the C++ wrapper supports only the original 8-bit
   16: PCRE library. There is no 16-bit or 32-bit support at present.
   17: .
   18: .
   19: .SH "MATCHING INTERFACE"
   20: .rs
   21: .sp
   22: The "FullMatch" operation checks that supplied text matches a supplied pattern
   23: exactly. If pointer arguments are supplied, it copies matched sub-strings that
   24: match sub-patterns into them.
   25: .sp
   26:   Example: successful match
   27:      pcrecpp::RE re("h.*o");
   28:      re.FullMatch("hello");
   29: .sp
   30:   Example: unsuccessful match (requires full match):
   31:      pcrecpp::RE re("e");
   32:      !re.FullMatch("hello");
   33: .sp
   34:   Example: creating a temporary RE object:
   35:      pcrecpp::RE("h.*o").FullMatch("hello");
   36: .sp
   37: You can pass in a "const char*" or a "string" for "text". The examples below
   38: tend to use a const char*. You can, as in the different examples above, store
   39: the RE object explicitly in a variable or use a temporary RE object. The
   40: examples below use one mode or the other arbitrarily. Either could correctly be
   41: used for any of these examples.
   42: .P
   43: You must supply extra pointer arguments to extract matched subpieces.
   44: .sp
   45:   Example: extracts "ruby" into "s" and 1234 into "i"
   46:      int i;
   47:      string s;
   48:      pcrecpp::RE re("(\e\ew+):(\e\ed+)");
   49:      re.FullMatch("ruby:1234", &s, &i);
   50: .sp
   51:   Example: does not try to extract any extra sub-patterns
   52:      re.FullMatch("ruby:1234", &s);
   53: .sp
   54:   Example: does not try to extract into NULL
   55:      re.FullMatch("ruby:1234", NULL, &i);
   56: .sp
   57:   Example: integer overflow causes failure
   58:      !re.FullMatch("ruby:1234567891234", NULL, &i);
   59: .sp
   60:   Example: fails because there aren't enough sub-patterns:
   61:      !pcrecpp::RE("\e\ew+:\e\ed+").FullMatch("ruby:1234", &s);
   62: .sp
   63:   Example: fails because string cannot be stored in integer
   64:      !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
   65: .sp
   66: The provided pointer arguments can be pointers to any scalar numeric
   67: type, or one of:
   68: .sp
   69:    string        (matched piece is copied to string)
   70:    StringPiece   (StringPiece is mutated to point to matched piece)
   71:    T             (where "bool T::ParseFrom(const char*, int)" exists)
   72:    NULL          (the corresponding matched sub-pattern is not copied)
   73: .sp
   74: The function returns true iff all of the following conditions are satisfied:
   75: .sp
   76:   a. "text" matches "pattern" exactly;
   77: .sp
   78:   b. The number of matched sub-patterns is >= number of supplied
   79:      pointers;
   80: .sp
   81:   c. The "i"th argument has a suitable type for holding the
   82:      string captured as the "i"th sub-pattern. If you pass in
   83:      void * NULL for the "i"th argument, or a non-void * NULL
   84:      of the correct type, or pass fewer arguments than the
   85:      number of sub-patterns, the "i"th captured sub-pattern is
   86:      ignored.
   87: .sp
   88: CAVEAT: An optional sub-pattern that does not exist in the matched
   89: string is assigned the empty string. Therefore, the following will
   90: return false (because the empty string is not a valid number):
   91: .sp
   92:    int number;
   93:    pcrecpp::RE("[a-z]+(\e\ed+)?").FullMatch("abc", &number);
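The numeric failure cases described in this section (an empty capture, non-numeric text, and integer overflow) can be illustrated without PCRE. The following is a minimal sketch using only the C++ standard library. The helper name parse_int_sketch is hypothetical, it is not part of pcrecpp, and it is not claimed to be how the wrapper performs its conversions; it only shows the kind of checks that make such a conversion fail.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of pcrecpp): converts a captured
// sub-string to int and reports failure rather than guessing a value.
bool parse_int_sketch(const std::string& captured, int* out) {
  if (captured.empty()) return false;  // e.g. an unmatched optional group
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(captured.c_str(), &end, 10);
  if (*end != '\0') return false;      // leftover characters, e.g. "ruby"
  if (errno == ERANGE) return false;   // does not even fit in a long
  if (value < INT_MIN || value > INT_MAX) return false;  // too big for int
  *out = static_cast<int>(value);
  return true;
}
```

Under these assumptions, "1234" converts, while "", "ruby", and "1234567891234" are all rejected, mirroring the three failure examples given above.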
   94: .sp
   95: The matching interface supports at most 16 arguments per call.
   96: If you need more, consider using the more general interface
   97: \fBpcrecpp::RE::DoMatch\fP. See \fBpcrecpp.h\fP for the signature for
   98: \fBDoMatch\fP.
   99: .P
  100: NOTE: Do not use \fBno_arg\fP, which is used internally to mark the end of a
  101: list of optional arguments, as a placeholder for missing arguments, as this can
  102: lead to segfaults.
  103: .
  104: .
  105: .SH "QUOTING METACHARACTERS"
  106: .rs
  107: .sp
  108: You can use the "QuoteMeta" operation to insert backslashes before all
  109: potentially meaningful characters in a string. The returned string, used as a
  110: regular expression, will exactly match the original string.
  111: .sp
  112:   Example:
  113:      string quoted = RE::QuoteMeta(unquoted);
  114: .sp
  115: Note that it's legal to escape a character even if it has no special meaning in
  116: a regular expression -- so this function does that. (This also makes it
  117: identical to the Perl function of the same name; see "perldoc -f quotemeta".)
  118: For example, "1.5-2.0?" becomes "1\e.5\e-2\e.0\e?".
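The quoting rule described above can be sketched in a few lines of standard C++. This is an illustration only: quote_meta_sketch is a hypothetical name, and leaving bytes with the high bit set unescaped (so that multi-byte UTF-8 sequences stay intact) is an assumption of this sketch. It also does not special-case NUL bytes.

```cpp
#include <cassert>
#include <string>

// Hypothetical re-implementation for illustration; not pcrecpp's code.
// Puts a backslash before every ASCII byte that is not a letter, a
// digit, or an underscore.
std::string quote_meta_sketch(const std::string& unquoted) {
  std::string quoted;
  quoted.reserve(unquoted.size() * 2);
  for (unsigned char c : unquoted) {
    const bool is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                         (c >= '0' && c <= '9') || c == '_';
    // Assumption: bytes >= 0x80 are copied as-is to keep UTF-8 intact.
    if (!is_word && c < 0x80) quoted += '\\';
    quoted += static_cast<char>(c);
  }
  return quoted;
}
```

With this sketch, the input 1.5-2.0? gains a backslash before each dot, the hyphen, and the question mark, which is the result shown in the example above.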
  119: .
  120: .SH "PARTIAL MATCHES"
  121: .rs
  122: .sp
  123: You can use the "PartialMatch" operation when you want the pattern
  124: to match any substring of the text.
  125: .sp
  126:   Example: simple search for a string:
  127:      pcrecpp::RE("ell").PartialMatch("hello");
  128: .sp
  129:   Example: find first number in a string:
  130:      int number;
  131:      pcrecpp::RE re("(\e\ed+)");
  132:      re.PartialMatch("x*100 + 20", &number);
  133:      assert(number == 100);
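The difference between a full match and a partial match can also be shown with the C++ standard library, for readers who want to try the idea without linking PCRE. The sketch below swaps in std::regex, which uses the ECMAScript grammar and is not PCRE, so it only mirrors the anchoring behaviour, not the pcrecpp API. All three function names are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Whole text must match the pattern (comparable to FullMatch).
bool full_match_sketch(const std::string& text, const std::string& pattern) {
  return std::regex_match(text, std::regex(pattern));
}

// Any substring may match the pattern (comparable to PartialMatch).
bool partial_match_sketch(const std::string& text, const std::string& pattern) {
  return std::regex_search(text, std::regex(pattern));
}

// Extracts the first run of digits; std::stoi throws if it is too large.
bool first_number_sketch(const std::string& text, int* number) {
  std::smatch m;
  if (!std::regex_search(text, m, std::regex("(\\d+)"))) return false;
  *number = std::stoi(m[1].str());
  return true;
}
```

Here the pattern "e" fails as a full match against "hello" but succeeds as a partial match, and the first number found in "x*100 + 20" is 100.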
  134: .
  135: .
  136: .SH "UTF-8 AND THE MATCHING INTERFACE"
  137: .rs
  138: .sp
  139: By default, pattern and text are plain text, one byte per character. The UTF8
  140: flag, passed to the constructor, causes both pattern and string to be treated
  141: as UTF-8 text, still a byte stream but potentially multiple bytes per
  142: character. In practice, the text is likelier to be UTF-8 than the pattern, but
  143: the match returned may depend on the UTF8 flag, so always use it when matching
  144: UTF-8 text. For example, "." will match one byte normally but with UTF8 set may
  145: match up to four bytes of a multi-byte character.
  146: .sp
  147:   Example:
  148:      pcrecpp::RE_Options options;
  149:      options.set_utf8();
  150:      pcrecpp::RE re(utf8_pattern, options);
  151:      re.FullMatch(utf8_string);
  152: .sp
  153:   Example: using the convenience function UTF8():
  154:      pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
  155:      re.FullMatch(utf8_string);
  156: .sp
  157: NOTE: The UTF8 flag is ignored if PCRE was not configured with the
  158:       --enable-utf8 flag.
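To see why a byte-oriented match and a UTF-8 match can differ, it helps to look at how many bytes one character occupies. The sketch below is independent of PCRE and uses hypothetical helper names. It classifies a lead byte by its high bits and counts characters in a byte string, and it does not validate the input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` is a continuation byte or otherwise not a lead byte.
std::size_t utf8_sequence_length(unsigned char lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;
}

// Counts characters rather than bytes; stray bytes count as one unit.
std::size_t utf8_char_count(const std::string& bytes) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < bytes.size(); ++count) {
    std::size_t len = utf8_sequence_length(static_cast<unsigned char>(bytes[i]));
    i += (len == 0) ? 1 : len;
  }
  return count;
}
```

A string such as "h\xC3\xA9llo" is six bytes long but holds five characters, so a pattern written in terms of characters only behaves as intended when the UTF8 option is set.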
  159: .
  160: .
  161: .SH "PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE"
  162: .rs
  163: .sp
  164: PCRE defines some modifiers to change the behavior of the regular expression
  165: engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
  166: pass such modifiers to a RE class. Currently, the following modifiers are
  167: supported:
  168: .sp
  169:    modifier              description                 Perl equivalent
  170: .sp
  171:    PCRE_CASELESS         case-insensitive match      /i
  172:    PCRE_MULTILINE        multiple lines match        /m
  173:    PCRE_DOTALL           dot matches newlines        /s
  174:    PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
  175:    PCRE_EXTRA            strict escape parsing       N/A
  176:    PCRE_EXTENDED         ignore whitespace           /x
  177:    PCRE_UTF8             handles UTF8 chars          built-in
  178:    PCRE_UNGREEDY         reverses * and *?           N/A
  179:    PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
  180: .sp
  181: (*) Both Perl and PCRE allow non-capturing parentheses by means of the
  182: "?:" modifier within the pattern itself. For example, (?:ab|cd) does not
  183: capture, while (ab|cd) does.
  184: .P
  185: For a full account of how each modifier works, see the
  186: PCRE API reference page.
  187: .P
  188: For each modifier, there are two member functions whose names are formed
  189: from the modifier name in lowercase, without the "PCRE_" prefix. For
  190: instance, PCRE_CASELESS is handled by
  191: .sp
  192:   bool caseless()
  193: .sp
  194: which returns true if the modifier is set, and
  195: .sp
  196:   RE_Options & set_caseless(bool)
  197: .sp
  198: which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
  199: accessed through the \fBset_match_limit()\fP and \fBmatch_limit()\fP member
  200: functions. Setting \fImatch_limit\fP to a non-zero value limits the
  201: execution of PCRE, to keep it from blowing the stack or taking an
  202: eternity to return a result. A value of 5000 is good enough to stop
  203: stack blowup in a 2MB thread stack. Setting \fImatch_limit\fP to zero disables
  204: match limiting. Alternatively, you can call \fBset_match_limit_recursion()\fP,
  205: which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
  206: recurses. \fBmatch_limit()\fP limits the number of matches PCRE does;
  207: \fBmatch_limit_recursion()\fP limits the depth of internal recursion, and
  208: therefore the amount of stack that is used.
  209: .P
  210: Normally, to pass one or more modifiers to a RE class, you declare
  211: a \fIRE_Options\fP object, set the appropriate options, and pass this
  212: object to a RE constructor. Example:
  213: .sp
  214:    RE_Options opt;
  215:    opt.set_caseless(true);
  216:    if (RE("HELLO", opt).PartialMatch("hello world")) ...
  217: .sp
  218: RE_Options has two constructors. The default constructor takes no arguments and
  219: creates a set of flags that are off by default. The second takes an
  220: \fIoption_flags\fP argument, intended to ease porting of legacy C code.
  221: This lets you do
  222: .sp
  223:    RE(pattern,
  224:      RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
  225: .sp
  226: However, new code is better off doing
  227: .sp
  228:    RE(pattern,
  229:      RE_Options().set_caseless(true).set_multiline(true))
  230:        .PartialMatch(str);
  231: .sp
  232: If you are going to pass one of the most used modifiers, there are some
  233: convenience functions that return a RE_Options object with the
  234: appropriate modifier already set: \fBCASELESS()\fP, \fBUTF8()\fP,
  235: \fBMULTILINE()\fP, \fBDOTALL()\fP, and \fBEXTENDED()\fP.
  236: .P
  237: If you need to set several options at once, and you don't want to go to
  238: the trouble of declaring a RE_Options object and setting each option, there
  239: is an alternative that gives you this ability on the fly. You can chain
  240: several \fBset_xxxxx()\fP member functions, since each of them returns a
  241: reference to its class object. For example, to pass PCRE_CASELESS,
  242: PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
  243: .sp
  244:    RE(" ^ xyz \e\es+ .* blah$",
  245:      RE_Options()
  246:        .set_caseless(true)
  247:        .set_extended(true)
  248:        .set_multiline(true)).PartialMatch(sometext);
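Chaining works because each setter returns a reference to the object it was called on. The following is a minimal sketch of that pattern, not the real RE_Options class. The class name OptionsSketch, the flags() accessor, and the flag values are all hypothetical, and the real PCRE_* constants from pcre.h are deliberately not reproduced.

```cpp
#include <cassert>

// Hypothetical flag values, for illustration only.
enum SketchFlag { kCaseless = 1 << 0, kMultiline = 1 << 1, kExtended = 1 << 2 };

class OptionsSketch {
 public:
  OptionsSketch() : flags_(0) {}                                 // all off
  explicit OptionsSketch(int option_flags) : flags_(option_flags) {}

  // Getter/setter pair per modifier; setters return *this for chaining.
  bool caseless() const { return (flags_ & kCaseless) != 0; }
  OptionsSketch& set_caseless(bool on) { return set(kCaseless, on); }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  OptionsSketch& set_multiline(bool on) { return set(kMultiline, on); }
  bool extended() const { return (flags_ & kExtended) != 0; }
  OptionsSketch& set_extended(bool on) { return set(kExtended, on); }

  int flags() const { return flags_; }

 private:
  OptionsSketch& set(int flag, bool on) {
    if (on) flags_ |= flag; else flags_ &= ~flag;
    return *this;
  }
  int flags_;
};
```

In this sketch, OptionsSketch().set_caseless(true).set_multiline(true) yields the same flag word as OptionsSketch(kCaseless | kMultiline), which is the equivalence between the two construction styles described above.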
  249: .sp
  250: .
  251: .
  252: .SH "SCANNING TEXT INCREMENTALLY"
  253: .rs
  254: .sp
  255: The "Consume" operation may be useful if you want to repeatedly
  256: match regular expressions at the front of a string and skip over
  257: them as they match. This requires use of the "StringPiece" type,
  258: which represents a sub-range of a real string. Like RE, StringPiece
  259: is defined in the pcrecpp namespace.
  260: .sp
  261:   Example: read lines of the form "var = value" from a string.
  262:      string contents = ...;                 // Fill string somehow
  263:      pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
  264: .sp
  265:      string var;
  266:      int value;
  267:      pcrecpp::RE re("(\e\ew+) = (\e\ed+)\en");
  268:      while (re.Consume(&input, &var, &value)) {
  269:        ...;
  270:      }
  271: .sp
  272: Each successful call to "Consume" will set "var" and "value", and also
  273: advance "input" so it points past the matched text.
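The consume-and-advance loop can be tried with the standard library alone. The sketch below swaps in std::regex (ECMAScript grammar, not PCRE) and an iterator in place of a StringPiece. The match_continuous flag forces each match to begin exactly at the current position, which is the anchoring that "Consume" is described as performing. The function name is hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Reads leading "var = value\n" lines and stops at the first line that
// does not fit. Note: std::stoi throws if a value does not fit in int.
std::vector<std::pair<std::string, int> > consume_assignments_sketch(
    const std::string& contents) {
  static const std::regex re("(\\w+) = (\\d+)\\n");
  std::vector<std::pair<std::string, int> > result;
  std::string::const_iterator pos = contents.begin();
  std::smatch m;
  while (std::regex_search(pos, contents.end(), m, re,
                           std::regex_constants::match_continuous)) {
    result.push_back(std::make_pair(m[1].str(), std::stoi(m[2].str())));
    pos = m[0].second;  // advance past the text that was just matched
  }
  return result;
}
```

Given "a = 1\nbb = 22\nrest", the loop yields two pairs and then stops, because "rest" does not match at the current position.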
  274: .P
  275: The "FindAndConsume" operation is similar to "Consume" but does not
  276: anchor your match at the beginning of the string. For example, you
  277: could extract all words from a string by repeatedly calling
  278: .sp
  279:   pcrecpp::RE("(\e\ew+)").FindAndConsume(&input, &word)
  280: .
  281: .
  282: .SH "PARSING HEX/OCTAL/C-RADIX NUMBERS"
  283: .rs
  284: .sp
  285: By default, if you pass a pointer to a numeric value, the
  286: corresponding text is interpreted as a base-10 number. You can
  287: instead wrap the pointer with a call to one of the operators Hex(),
  288: Octal(), or CRadix() to interpret the text in another base. The
  289: CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
  290: prefixes, but defaults to base-10.
  291: .sp
  292:   Example:
  293:     int a, b, c, d;
  294:     pcrecpp::RE re("(.*) (.*) (.*) (.*)");
  295:     re.FullMatch("100 40 0100 0x40",
  296:                  pcrecpp::Octal(&a), pcrecpp::Hex(&b),
  297:                  pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
  298: .sp
  299: will leave 64 in a, b, c, and d.
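The arithmetic behind that result can be checked with the C library function strtol, whose base argument accepts 8, 16, or 0 (where 0 means "detect a C-style prefix"). This is only a sketch of the number-base rules under that assumption. The helper name is hypothetical, and nothing here is claimed about how Hex(), Octal(), or CRadix() are implemented.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Hypothetical helper: parses all of `text` in the given base.
// base 8 ~ Octal(), base 16 ~ Hex(), base 0 ~ CRadix() prefix detection.
bool parse_radix_sketch(const char* text, int base, long* out) {
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(text, &end, base);
  // Fail on no digits, trailing characters, or out-of-range values.
  if (end == text || *end != '\0' || errno == ERANGE) return false;
  *out = value;
  return true;
}
```

Parsing "100" in base 8, "40" in base 16, and "0100" or "0x40" with prefix detection each gives 64, matching the example above.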
  300: .
  301: .
  302: .SH "REPLACING PARTS OF STRINGS"
  303: .rs
  304: .sp
  305: You can replace the first match of "pattern" in "str" with "rewrite".
  306: Within "rewrite", backslash-escaped digits (\e1 to \e9) can be
  307: used to insert the text matching the corresponding parenthesized group
  308: from the pattern. \e0 in "rewrite" refers to the entire matching
  309: text. For example:
  310: .sp
  311:   string s = "yabba dabba doo";
  312:   pcrecpp::RE("b+").Replace("d", &s);
  313: .sp
  314: will leave "s" containing "yada dabba doo". The result is true if the pattern
  315: matches and a replacement occurs, false otherwise.
  316: .P
  317: \fBGlobalReplace\fP is like \fBReplace\fP except that it replaces all
  318: occurrences of the pattern in the string with the rewrite. Replacements are
  319: not subject to re-matching. For example:
  320: .sp
  321:   string s = "yabba dabba doo";
  322:   pcrecpp::RE("b+").GlobalReplace("d", &s);
  323: .sp
  324: will leave "s" containing "yada dada doo". It returns the number of
  325: replacements made.
  326: .P
  327: \fBExtract\fP is like \fBReplace\fP, except that if the pattern matches,
  328: "rewrite" is copied into "out" (an additional argument) with substitutions.
  329: The non-matching portions of "text" are ignored. Returns true iff a match
  330: occurred and the extraction happened successfully; if no match occurs,
  331: "out" is left unaffected.
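The three operations in this section can be compared side by side using the C++ standard library. The sketch below swaps in std::regex_replace, which is not PCRE and uses $1-style references in the rewrite string instead of backslash-escaped digits. The function names are hypothetical and only mirror the documented return values.

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <string>

namespace fmt = std::regex_constants;

// Replaces only the first match; true if a replacement occurred.
bool replace_first_sketch(const std::string& pattern,
                          const std::string& rewrite, std::string* str) {
  const std::regex re(pattern);
  if (!std::regex_search(*str, re)) return false;
  *str = std::regex_replace(*str, re, rewrite, fmt::format_first_only);
  return true;
}

// Replaces every match; returns the number of replacements made.
int global_replace_sketch(const std::string& pattern,
                          const std::string& rewrite, std::string* str) {
  const std::regex re(pattern);
  const int count = static_cast<int>(std::distance(
      std::sregex_iterator(str->cbegin(), str->cend(), re),
      std::sregex_iterator()));
  *str = std::regex_replace(*str, re, rewrite);
  return count;
}

// Writes only the expanded rewrite into *out; unmatched text is dropped.
bool extract_sketch(const std::string& pattern, const std::string& rewrite,
                    const std::string& text, std::string* out) {
  const std::regex re(pattern);
  if (!std::regex_search(text, re)) return false;  // *out left untouched
  *out = std::regex_replace(text, re, rewrite,
                            fmt::format_first_only | fmt::format_no_copy);
  return true;
}
```

With the pattern "b+" and the rewrite "d", the first function turns "yabba dabba doo" into "yada dabba doo" and the second turns it into "yada dada doo" with a count of 2, matching the examples above.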
  332: .
  333: .
  334: .SH AUTHOR
  335: .rs
  336: .sp
  337: .nf
  338: The C++ wrapper was contributed by Google Inc.
  339: Copyright (c) 2007 Google Inc.
  340: .fi
  341: .
  342: .
  343: .SH REVISION
  344: .rs
  345: .sp
  346: .nf
  347: Last updated: 08 January 2012
  348: .fi