File: [ELWIX - Embedded LightWeight unIX -] / embedaddon / pcre / doc / html / pcrecpp.html
Mon Jul 22 08:25:57 2013 UTC (11 years, 11 months ago) by misho
Branches: pcre, MAIN
CVS tags: v8_34, v8_33, HEAD

8.33

<html>
<head>
<title>pcrecpp specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrecpp man page</h1>

Return to the <a href="index.html">PCRE index page</a>.

This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.

<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS OF C++ WRAPPER</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">MATCHING INTERFACE</a>
<li><a name="TOC4" href="#SEC4">QUOTING METACHARACTERS</a>
<li><a name="TOC5" href="#SEC5">PARTIAL MATCHES</a>
<li><a name="TOC6" href="#SEC6">UTF-8 AND THE MATCHING INTERFACE</a>
<li><a name="TOC7" href="#SEC7">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a>
<li><a name="TOC8" href="#SEC8">SCANNING TEXT INCREMENTALLY</a>
<li><a name="TOC9" href="#SEC9">PARSING HEX/OCTAL/C-RADIX NUMBERS</a>
<li><a name="TOC10" href="#SEC10">REPLACING PARTS OF STRINGS</a>
<li><a name="TOC11" href="#SEC11">AUTHOR</a>
<li><a name="TOC12" href="#SEC12">REVISION</a>
</ul>

<a name="SEC1" href="#TOC1">SYNOPSIS OF C++ WRAPPER</a>

#include &lt;pcrecpp.h&gt;

<a name="SEC2" href="#TOC1">DESCRIPTION</a>

The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was constructed
from the notes in the pcrecpp.h file, which should be consulted for
further details. Note that the C++ wrapper supports only the original 8-bit
PCRE library. There is no 16-bit or 32-bit support at present.

<a name="SEC3" href="#TOC1">MATCHING INTERFACE</a>

The "FullMatch" operation checks that supplied text matches a supplied pattern
exactly.
If pointer arguments are supplied, it copies matched sub-strings that
match sub-patterns into them.
<pre>
  Example: successful match
     pcrecpp::RE re("h.*o");
     re.FullMatch("hello");

  Example: unsuccessful match (requires full match):
     pcrecpp::RE re("e");
     !re.FullMatch("hello");

  Example: creating a temporary RE object:
     pcrecpp::RE("h.*o").FullMatch("hello");
</pre>
You can pass in a "const char*" or a "string" for "text". The examples below
tend to use a const char*. You can, as in the different examples above, store
the RE object explicitly in a variable or use a temporary RE object. The
examples below use one mode or the other arbitrarily. Either could correctly be
used for any of these examples.

You must supply extra pointer arguments to extract matched subpieces.
<pre>
  Example: extracts "ruby" into "s" and 1234 into "i"
     int i;
     string s;
     pcrecpp::RE re("(\\w+):(\\d+)");
     re.FullMatch("ruby:1234", &s, &i);

  Example: does not try to extract any extra sub-patterns
     re.FullMatch("ruby:1234", &s);

  Example: does not try to extract into NULL
     re.FullMatch("ruby:1234", NULL, &i);

  Example: integer overflow causes failure
     !re.FullMatch("ruby:1234567891234", NULL, &i);

  Example: fails because there aren't enough sub-patterns:
     !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);

  Example: fails because string cannot be stored in integer
     !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
</pre>
The provided pointer arguments can be pointers to any scalar numeric
type, or one of:
<pre>
   string        (matched piece is copied to string)
   StringPiece   (StringPiece is mutated to point to matched piece)
   T             (where "bool T::ParseFrom(const char*, int)" exists)
   NULL          (the corresponding matched sub-pattern is not copied)
</pre>
The function returns true iff all of the following conditions are satisfied:
<pre>
  a. "text" matches "pattern" exactly;

  b. The number of matched sub-patterns is >= number of supplied
     pointers;

  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     void * NULL for the "i"th argument, or a non-void * NULL
     of the correct type, or pass fewer arguments than the
     number of sub-patterns, "i"th captured sub-pattern is
     ignored.
</pre>
CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):
<pre>
   int number;
   pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
</pre>
The matching interface supports at most 16 arguments per call.
If you need more, consider using the more general interface
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for
DoMatch.

NOTE: Do not use no_arg, which is used internally to mark the end of a
list of optional arguments, as a placeholder for missing arguments, as this can
lead to segfaults.

<a name="SEC4" href="#TOC1">QUOTING METACHARACTERS</a>

You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string, used as a
regular expression, will exactly match the original string.
<pre>
  Example:
     string quoted = RE::QuoteMeta(unquoted);
</pre>
Note that it's legal to escape a character even if it has no special meaning in
a regular expression -- so this function does that. (This also makes it
identical to the perl function of the same name; see "perldoc -f quotemeta".)
For example, "1.5-2.0?" becomes "1\.5\-2\.0\?".
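The quoting rule described above can be illustrated with a small standalone sketch. It does not use pcrecpp and is not the library's implementation: the helper name quote_meta_sketch is hypothetical, and two behaviours are assumptions made for illustration. Bytes with the high bit set are copied unescaped so that multi-byte UTF-8 sequences stay intact, and a NUL byte gets no special treatment.

```cpp
#include <string>

// Hypothetical helper, NOT part of pcrecpp: puts a backslash before every
// byte that is not an ASCII letter, digit, or underscore.
std::string quote_meta_sketch(const std::string& unquoted) {
    std::string quoted;
    quoted.reserve(unquoted.size() * 2);  // worst case: every byte is escaped
    for (unsigned char c : unquoted) {
        const bool is_word = (c >= 'a' && c <= 'z') ||
                             (c >= 'A' && c <= 'Z') ||
                             (c >= '0' && c <= '9') || c == '_';
        // Assumption: high-bit bytes are left alone so that multi-byte
        // UTF-8 sequences are not broken up by inserted backslashes.
        const bool is_high_bit = c >= 0x80;
        if (!is_word && !is_high_bit) {
            quoted += '\\';
        }
        quoted += static_cast<char>(c);
    }
    return quoted;
}
```

Calling quote_meta_sketch("1.5-2.0?") reproduces the escaped string given in the example above, and a string made only of letters, digits, and underscores comes back unchanged.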

<a name="SEC5" href="#TOC1">PARTIAL MATCHES</a>

You can use the "PartialMatch" operation when you want the pattern
to match any substring of the text.
<pre>
  Example: simple search for a string:
     pcrecpp::RE("ell").PartialMatch("hello");

  Example: find first number in a string:
     int number;
     pcrecpp::RE re("(\\d+)");
     re.PartialMatch("x*100 + 20", &number);
     assert(number == 100);
</PRE>

<a name="SEC6" href="#TOC1">UTF-8 AND THE MATCHING INTERFACE</a>

By default, pattern and text are plain text, one byte per character. The UTF8
flag, passed to the constructor, causes both pattern and string to be treated
as UTF-8 text, still a byte stream but potentially multiple bytes per
character. In practice, the text is likelier to be UTF-8 than the pattern, but
the match returned may depend on the UTF8 flag, so always use it when matching
UTF-8 text. For example, "." will match one byte normally but with UTF8 set may
match up to three bytes of a multi-byte character.
<pre>
  Example:
     pcrecpp::RE_Options options;
     options.set_utf8();
     pcrecpp::RE re(utf8_pattern, options);
     re.FullMatch(utf8_string);

  Example: using the convenience function UTF8():
     pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
     re.FullMatch(utf8_string);
</pre>
NOTE: The UTF8 flag is ignored if pcre was not configured with the
<pre>
      --enable-utf8 flag.
</PRE>

<a name="SEC7" href="#TOC1">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a>

PCRE defines some modifiers to change the behavior of the regular expression
engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
pass such modifiers to a RE class.
Currently, the following modifiers are
supported:
<pre>
   modifier              description                 Perl equivalent

   PCRE_CASELESS         case insensitive match      /i
   PCRE_MULTILINE        multiple lines match        /m
   PCRE_DOTALL           dot matches newlines        /s
   PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
   PCRE_EXTRA            strict escape parsing       N/A
   PCRE_EXTENDED         ignore whitespace           /x
   PCRE_UTF8             handles UTF8 chars          built-in
   PCRE_UNGREEDY         reverses * and *?           N/A
   PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
</pre>
(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself, e.g. (?:ab|cd) does not
capture, while (ab|cd) does.

For a full account of how each modifier works, please check the
PCRE API reference page.

For each modifier, there are two member functions whose names are made
out of the modifier in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
<pre>
  bool caseless()
</pre>
which returns true if the modifier is set, and
<pre>
  RE_Options & set_caseless(bool)
</pre>
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
accessed through the set_match_limit() and match_limit() member
functions. Setting match_limit to a non-zero value will limit the
execution of pcre to keep it from doing bad things like blowing the stack or
taking an eternity to return a result. A value of 5000 is good enough to stop
stack blowup in a 2MB thread stack. Setting match_limit to zero disables
match limiting. Alternatively, you can call match_limit_recursion()
which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
recurses. match_limit() limits the number of matches PCRE does;
match_limit_recursion() limits the depth of internal recursion, and
therefore the amount of stack that is used.

Normally, to pass one or more modifiers to a RE class, you declare
a RE_Options object, set the appropriate options, and pass this
object to a RE constructor. Example:
<pre>
   RE_Options opt;
   opt.set_caseless(true);
   if (RE("HELLO", opt).PartialMatch("hello world")) ...
</pre>
RE_Options has two constructors. The default constructor takes no arguments and
creates a set of flags that are all off. The second takes an option_flags
parameter, which is intended to ease the transfer of legacy code from C programs.
This lets you do
<pre>
   RE(pattern,
     RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
</pre>
However, new code is better off doing
<pre>
   RE(pattern,
     RE_Options().set_caseless(true).set_multiline(true))
       .PartialMatch(str);
</pre>
If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options class with the
appropriate modifier already set: CASELESS(), UTF8(),
MULTILINE(), DOTALL(), and EXTENDED().

If you need to set several options at once, and you don't want to go through
the pains of declaring a RE_Options object and setting several options, there
is a parallel method that gives you this ability on the fly. You can chain
several set_xxxxx() member functions, since each of them returns a
reference to its class object.
For example, to pass PCRE_CASELESS,
PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
<pre>
   RE(" ^ xyz \\s+ .* blah$",
     RE_Options()
       .set_caseless(true)
       .set_extended(true)
       .set_multiline(true)).PartialMatch(sometext);

</PRE>

<a name="SEC8" href="#TOC1">SCANNING TEXT INCREMENTALLY</a>

The "Consume" operation may be useful if you want to repeatedly
match regular expressions at the front of a string and skip over
them as they match. This requires use of the "StringPiece" type,
which represents a sub-range of a real string. Like RE, StringPiece
is defined in the pcrecpp namespace.
<pre>
  Example: read lines of the form "var = value" from a string.
     string contents = ...;                 // Fill string somehow
     pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

     string var;
     int value;
     pcrecpp::RE re("(\\w+) = (\\d+)\n");
     while (re.Consume(&input, &var, &value)) {
       ...;
     }
</pre>
Each successful call to "Consume" will set "var/value", and also
advance "input" so it points past the matched text.

The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling
<pre>
  pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
</PRE>

<a name="SEC9" href="#TOC1">PARSING HEX/OCTAL/C-RADIX NUMBERS</a>

By default, if you pass a pointer to a numeric value, the
corresponding text is interpreted as a base-10 number. You can
instead wrap the pointer with a call to one of the operators Hex(),
Octal(), or CRadix() to interpret the text in another base. The
CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
prefixes, but defaults to base-10.
<pre>
  Example:
    int a, b, c, d;
    pcrecpp::RE re("(.*) (.*) (.*) (.*)");
    re.FullMatch("100 40 0100 0x40",
                 pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                 pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
</pre>
will leave 64 in a, b, c, and d.

<a name="SEC10" href="#TOC1">REPLACING PARTS OF STRINGS</a>

You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be
used to insert text matching corresponding parenthesized group
from the pattern. \0 in "rewrite" refers to the entire matching
text. For example:
<pre>
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").Replace("d", &s);
</pre>
will leave "s" containing "yada dabba doo". The result is true if the pattern
matches and a replacement occurs, false otherwise.

GlobalReplace is like Replace except that it replaces all
occurrences of the pattern in the string with the rewrite. Replacements are
not subject to re-matching. For example:
<pre>
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").GlobalReplace("d", &s);
</pre>
will leave "s" containing "yada dada doo". It returns the number of
replacements made.

Extract is like Replace, except that if the pattern matches,
"rewrite" is copied into "out" (an additional argument) with substitutions.
The non-matching portions of "text" are ignored. Returns true iff a match
occurred and the extraction happened successfully; if no match occurs, the
string is left unaffected.

<a name="SEC11" href="#TOC1">AUTHOR</a>

The C++ wrapper was contributed by Google Inc.

Copyright © 2007 Google Inc.

<a name="SEC12" href="#TOC1">REVISION</a>

Last updated: 08 January 2012

Return to the <a href="index.html">PCRE index page</a>.