<html>
<head>
<title>pcrecpp specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrecpp man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">SYNOPSIS OF C++ WRAPPER</a>
<li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
<li><a name="TOC3" href="#SEC3">MATCHING INTERFACE</a>
<li><a name="TOC4" href="#SEC4">QUOTING METACHARACTERS</a>
<li><a name="TOC5" href="#SEC5">PARTIAL MATCHES</a>
<li><a name="TOC6" href="#SEC6">UTF-8 AND THE MATCHING INTERFACE</a>
<li><a name="TOC7" href="#SEC7">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a>
<li><a name="TOC8" href="#SEC8">SCANNING TEXT INCREMENTALLY</a>
<li><a name="TOC9" href="#SEC9">PARSING HEX/OCTAL/C-RADIX NUMBERS</a>
<li><a name="TOC10" href="#SEC10">REPLACING PARTS OF STRINGS</a>
<li><a name="TOC11" href="#SEC11">AUTHOR</a>
<li><a name="TOC12" href="#SEC12">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">SYNOPSIS OF C++ WRAPPER</a><br>
<P>
<b>#include &lt;pcrecpp.h&gt;</b>
</P>
<br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
<P>
The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was constructed
from the notes in the <i>pcrecpp.h</i> file, which should be consulted for
further details.
</P>
<br><a name="SEC3" href="#TOC1">MATCHING INTERFACE</a><br>
<P>
The "FullMatch" operation checks that supplied text matches a supplied pattern
exactly.
If pointer arguments are supplied, it copies matched sub-strings that
match sub-patterns into them.
<pre>
  Example: successful match
    pcrecpp::RE re("h.*o");
    re.FullMatch("hello");

  Example: unsuccessful match (requires full match):
    pcrecpp::RE re("e");
    !re.FullMatch("hello");

  Example: creating a temporary RE object:
    pcrecpp::RE("h.*o").FullMatch("hello");
</pre>
You can pass in a "const char*" or a "string" for "text". The examples below
tend to use a const char*. You can, as in the different examples above, store
the RE object explicitly in a variable or use a temporary RE object. The
examples below use one mode or the other arbitrarily. Either could correctly be
used for any of these examples.
</P>
<P>
You must supply extra pointer arguments to extract matched subpieces.
<pre>
  Example: extracts "ruby" into "s" and 1234 into "i"
    int i;
    string s;
    pcrecpp::RE re("(\\w+):(\\d+)");
    re.FullMatch("ruby:1234", &amp;s, &amp;i);

  Example: does not try to extract any extra sub-patterns
    re.FullMatch("ruby:1234", &amp;s);

  Example: does not try to extract into NULL
    re.FullMatch("ruby:1234", NULL, &amp;i);

  Example: integer overflow causes failure
    !re.FullMatch("ruby:1234567891234", NULL, &amp;i);

  Example: fails because there aren't enough sub-patterns:
    !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &amp;s);

  Example: fails because string cannot be stored in integer
    !pcrecpp::RE("(.*)").FullMatch("ruby", &amp;i);
</pre>
The provided pointer arguments can be pointers to any scalar numeric
type, or one of:
<pre>
  string        (matched piece is copied to string)
  StringPiece   (StringPiece is mutated to point to matched piece)
  T             (where "bool T::ParseFrom(const char*, int)" exists)
  NULL          (the corresponding matched sub-pattern is not copied)
</pre>
The function returns true iff all of the following conditions are satisfied:
<pre>
  a. "text" matches "pattern" exactly;

  b. The number of matched sub-patterns is &gt;= number of supplied
     pointers;

  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     void * NULL for the "i"th argument, or a non-void * NULL
     of the correct type, or pass fewer arguments than the
     number of sub-patterns, the "i"th captured sub-pattern is
     ignored.
</pre>
CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):
<pre>
    int number;
    pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &amp;number);
</pre>
The matching interface supports at most 16 arguments per call.
If you need more, consider using the more general interface
<b>pcrecpp::RE::DoMatch</b>. See <b>pcrecpp.h</b> for the signature for
<b>DoMatch</b>.
</P>
<P>
NOTE: Do not use <b>no_arg</b> as a placeholder for missing arguments. It is
used internally to mark the end of a list of optional arguments, and misusing
it can lead to segfaults.
</P>
<br><a name="SEC4" href="#TOC1">QUOTING METACHARACTERS</a><br>
<P>
You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string, used as a
regular expression, will exactly match the original string.
<pre>
  Example:
    string quoted = RE::QuoteMeta(unquoted);
</pre>
Note that it's legal to escape a character even if it has no special meaning in
a regular expression, so this function does that. (This also makes it
identical to the Perl function of the same name; see "perldoc -f quotemeta".)
For example, "1.5-2.0?" becomes "1\.5\-2\.0\?".
</P>
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHES</a><br>
<P>
You can use the "PartialMatch" operation when you want the pattern
to match any substring of the text.
<pre>
  Example: simple search for a string:
    pcrecpp::RE("ell").PartialMatch("hello");

  Example: find first number in a string:
    int number;
    pcrecpp::RE re("(\\d+)");
    re.PartialMatch("x*100 + 20", &amp;number);
    assert(number == 100);
</pre>
</P>
<br><a name="SEC6" href="#TOC1">UTF-8 AND THE MATCHING INTERFACE</a><br>
<P>
By default, pattern and text are plain text, one byte per character. The UTF8
flag, passed to the constructor, causes both pattern and string to be treated
as UTF-8 text, still a byte stream but potentially multiple bytes per
character. In practice, the text is likelier to be UTF-8 than the pattern, but
the match returned may depend on the UTF8 flag, so always use it when matching
UTF-8 text. For example, "." will match one byte normally, but with UTF8 set it
may match every byte of a multi-byte character (up to four bytes in standard
UTF-8).
<pre>
  Example:
    pcrecpp::RE_Options options;
    options.set_utf8();
    pcrecpp::RE re(utf8_pattern, options);
    re.FullMatch(utf8_string);

  Example: using the convenience function UTF8():
    pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
    re.FullMatch(utf8_string);
</pre>
NOTE: The UTF8 flag is ignored if pcre was not configured with the
--enable-utf8 flag.
</P>
<br><a name="SEC7" href="#TOC1">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a><br>
<P>
PCRE defines some modifiers to change the behavior of the regular expression
engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
pass such modifiers to a RE class. Currently, the following modifiers are
supported:
<pre>
  modifier              description               Perl equivalent

  PCRE_CASELESS         case insensitive match      /i
  PCRE_MULTILINE        multiple lines match        /m
  PCRE_DOTALL           dot matches newlines        /s
  PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
  PCRE_EXTRA            strict escape parsing       N/A
  PCRE_EXTENDED         ignore whitespace           /x
  PCRE_UTF8             handles UTF8 chars          built-in
  PCRE_UNGREEDY         reverses * and *?           N/A
  PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
</pre>
(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself. For example, (?:ab|cd) does not
capture, while (ab|cd) does.
</P>
<P>
For a full account of how each modifier works, please check the
PCRE API reference page.
</P>
<P>
For each modifier, there are two member functions whose names are derived
from the modifier name in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
<pre>
  bool caseless()
</pre>
which returns true if the modifier is set, and
<pre>
  RE_Options &amp; set_caseless(bool)
</pre>
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
accessed through the <b>set_match_limit()</b> and <b>match_limit()</b> member
functions. Setting <i>match_limit</i> to a non-zero value will limit the
execution of pcre to keep it from doing bad things like blowing the stack or
taking an eternity to return a result. A value of 5000 is good enough to stop
stack blowup in a 2MB thread stack. Setting <i>match_limit</i> to zero disables
match limiting.
Alternatively, you can call <b>set_match_limit_recursion()</b>,
which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
recurses. The match limit caps the number of times PCRE calls its internal
match() function; the recursion limit caps the depth of internal recursion, and
therefore the amount of stack that is used.
</P>
<P>
Normally, to pass one or more modifiers to a RE class, you declare
a <i>RE_Options</i> object, set the appropriate options, and pass this
object to a RE constructor. Example:
<pre>
  RE_Options opt;
  opt.set_caseless(true);
  if (RE("HELLO", opt).PartialMatch("hello world")) ...
</pre>
RE_Options has two constructors. The default constructor takes no arguments and
creates a set of flags that are all off. The other takes an
<i>option_flags</i> argument, intended to ease porting legacy code from C
programs. This lets you do
<pre>
  RE(pattern,
     RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
</pre>
However, new code is better off doing
<pre>
  RE(pattern,
     RE_Options().set_caseless(true).set_multiline(true))
       .PartialMatch(str);
</pre>
If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options object with the
appropriate modifier already set: <b>CASELESS()</b>, <b>UTF8()</b>,
<b>MULTILINE()</b>, <b>DOTALL()</b>, and <b>EXTENDED()</b>.
</P>
<P>
If you need to set several options at once, and you don't want to go through
the trouble of declaring a RE_Options object and setting each option separately,
there is a parallel method that lets you do so on the fly. You can chain
several <b>set_xxxxx()</b> member functions, since each of them returns a
reference to its class object. For example, to pass PCRE_CASELESS,
PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
<pre>
  RE(" ^ xyz \\s+ .* blah$",
     RE_Options()
       .set_caseless(true)
       .set_extended(true)
       .set_multiline(true)).PartialMatch(sometext);

</pre>
</P>
<br><a name="SEC8" href="#TOC1">SCANNING TEXT INCREMENTALLY</a><br>
<P>
The "Consume" operation may be useful if you want to repeatedly
match regular expressions at the front of a string and skip over
them as they match. This requires use of the "StringPiece" type,
which represents a sub-range of a real string. Like RE, StringPiece
is defined in the pcrecpp namespace.
<pre>
  Example: read lines of the form "var = value" from a string.
    string contents = ...;                 // Fill string somehow
    pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

    string var;
    int value;
    pcrecpp::RE re("(\\w+) = (\\d+)\n");
    while (re.Consume(&amp;input, &amp;var, &amp;value)) {
      ...;
    }
</pre>
Each successful call to "Consume" will set "var/value", and also
advance "input" so it points past the matched text.
</P>
<P>
The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling
<pre>
  pcrecpp::RE("(\\w+)").FindAndConsume(&amp;input, &amp;word)
</pre>
</P>
<br><a name="SEC9" href="#TOC1">PARSING HEX/OCTAL/C-RADIX NUMBERS</a><br>
<P>
By default, if you pass a pointer to a numeric value, the
corresponding text is interpreted as a base-10 number. You can
instead wrap the pointer with a call to one of the operators Hex(),
Octal(), or CRadix() to interpret the text in another base. The
CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
prefixes, but defaults to base-10.
<pre>
  Example:
    int a, b, c, d;
    pcrecpp::RE re("(.*) (.*) (.*) (.*)");
    re.FullMatch("100 40 0100 0x40",
                 pcrecpp::Octal(&amp;a), pcrecpp::Hex(&amp;b),
                 pcrecpp::CRadix(&amp;c), pcrecpp::CRadix(&amp;d));
</pre>
will leave 64 in a, b, c, and d.
</P>
<br><a name="SEC10" href="#TOC1">REPLACING PARTS OF STRINGS</a><br>
<P>
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be
used to insert the text matching the corresponding parenthesized group
from the pattern. \0 in "rewrite" refers to the entire matching
text. For example:
<pre>
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").Replace("d", &amp;s);
</pre>
will leave "s" containing "yada dabba doo". The result is true if the pattern
matches and a replacement occurs, false otherwise.
</P>
<P>
<b>GlobalReplace</b> is like <b>Replace</b> except that it replaces all
occurrences of the pattern in the string with the rewrite. Replacements are
not subject to re-matching. For example:
<pre>
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").GlobalReplace("d", &amp;s);
</pre>
will leave "s" containing "yada dada doo". It returns the number of
replacements made.
</P>
<P>
<b>Extract</b> is like <b>Replace</b>, except that if the pattern matches,
"rewrite" is copied into "out" (an additional argument) with substitutions.
The non-matching portions of "text" are ignored. Returns true iff a match
occurred and the extraction happened successfully; if no match occurs, the
string is left unaffected.
</P>
<br><a name="SEC11" href="#TOC1">AUTHOR</a><br>
<P>
The C++ wrapper was contributed by Google Inc.
<br>
Copyright © 2007 Google Inc.
<br>
</P>
<br><a name="SEC12" href="#TOC1">REVISION</a><br>
<P>
Last updated: 17 March 2009
<br>
Minor typo fixed: 25 July 2011
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>