1: <html>
2: <head>
3: <title>pcrecpp specification</title>
4: </head>
5: <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6: <h1>pcrecpp man page</h1>
7: <p>
8: Return to the <a href="index.html">PCRE index page</a>.
9: </p>
10: <p>
11: This page is part of the PCRE HTML documentation. It was generated automatically
12: from the original man page. If there is any nonsense in it, please consult the
13: man page, in case the conversion went wrong.
14: <br>
15: <ul>
16: <li><a name="TOC1" href="#SEC1">SYNOPSIS OF C++ WRAPPER</a>
17: <li><a name="TOC2" href="#SEC2">DESCRIPTION</a>
18: <li><a name="TOC3" href="#SEC3">MATCHING INTERFACE</a>
19: <li><a name="TOC4" href="#SEC4">QUOTING METACHARACTERS</a>
20: <li><a name="TOC5" href="#SEC5">PARTIAL MATCHES</a>
21: <li><a name="TOC6" href="#SEC6">UTF-8 AND THE MATCHING INTERFACE</a>
22: <li><a name="TOC7" href="#SEC7">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a>
23: <li><a name="TOC8" href="#SEC8">SCANNING TEXT INCREMENTALLY</a>
24: <li><a name="TOC9" href="#SEC9">PARSING HEX/OCTAL/C-RADIX NUMBERS</a>
25: <li><a name="TOC10" href="#SEC10">REPLACING PARTS OF STRINGS</a>
26: <li><a name="TOC11" href="#SEC11">AUTHOR</a>
27: <li><a name="TOC12" href="#SEC12">REVISION</a>
28: </ul>
29: <br><a name="SEC1" href="#TOC1">SYNOPSIS OF C++ WRAPPER</a><br>
30: <P>
31: <b>#include <pcrecpp.h></b>
32: </P>
33: <br><a name="SEC2" href="#TOC1">DESCRIPTION</a><br>
34: <P>
35: The C++ wrapper for PCRE was provided by Google Inc. Some additional
36: functionality was added by Giuseppe Maxia. This brief man page was constructed
37: from the notes in the <i>pcrecpp.h</i> file, which should be consulted for
38: further details. Note that the C++ wrapper supports only the original 8-bit
39: PCRE library. There is no 16-bit or 32-bit support at present.
40: </P>
41: <br><a name="SEC3" href="#TOC1">MATCHING INTERFACE</a><br>
42: <P>
The "FullMatch" operation checks that supplied text matches a supplied pattern
exactly. If pointer arguments are supplied, it copies the sub-strings matched
by sub-patterns into them.
46: <pre>
47: Example: successful match
48: pcrecpp::RE re("h.*o");
49: re.FullMatch("hello");
50:
51: Example: unsuccessful match (requires full match):
52: pcrecpp::RE re("e");
53: !re.FullMatch("hello");
54:
55: Example: creating a temporary RE object:
56: pcrecpp::RE("h.*o").FullMatch("hello");
57: </pre>
58: You can pass in a "const char*" or a "string" for "text". The examples below
59: tend to use a const char*. You can, as in the different examples above, store
60: the RE object explicitly in a variable or use a temporary RE object. The
61: examples below use one mode or the other arbitrarily. Either could correctly be
62: used for any of these examples.
63: </P>
64: <P>
65: You must supply extra pointer arguments to extract matched subpieces.
66: <pre>
67: Example: extracts "ruby" into "s" and 1234 into "i"
68: int i;
69: string s;
70: pcrecpp::RE re("(\\w+):(\\d+)");
71: re.FullMatch("ruby:1234", &s, &i);
72:
73: Example: does not try to extract any extra sub-patterns
74: re.FullMatch("ruby:1234", &s);
75:
76: Example: does not try to extract into NULL
77: re.FullMatch("ruby:1234", NULL, &i);
78:
79: Example: integer overflow causes failure
80: !re.FullMatch("ruby:1234567891234", NULL, &i);
81:
82: Example: fails because there aren't enough sub-patterns:
83: !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
84:
85: Example: fails because string cannot be stored in integer
86: !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
87: </pre>
88: The provided pointer arguments can be pointers to any scalar numeric
89: type, or one of:
90: <pre>
91: string (matched piece is copied to string)
92: StringPiece (StringPiece is mutated to point to matched piece)
93: T (where "bool T::ParseFrom(const char*, int)" exists)
94: NULL (the corresponding matched sub-pattern is not copied)
95: </pre>
96: The function returns true iff all of the following conditions are satisfied:
97: <pre>
  a. "text" matches "pattern" exactly;

  b. The number of matched sub-patterns is >= the number of supplied
     pointers;

  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     void * NULL for the "i"th argument, or a non-void * NULL
     of the correct type, or pass fewer arguments than the
     number of sub-patterns, the "i"th captured sub-pattern is
     ignored.
109: </pre>
110: CAVEAT: An optional sub-pattern that does not exist in the matched
111: string is assigned the empty string. Therefore, the following will
112: return false (because the empty string is not a valid number):
113: <pre>
int number;
pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
116: </pre>
117: The matching interface supports at most 16 arguments per call.
118: If you need more, consider using the more general interface
119: <b>pcrecpp::RE::DoMatch</b>. See <b>pcrecpp.h</b> for the signature for
120: <b>DoMatch</b>.
121: </P>
122: <P>
123: NOTE: Do not use <b>no_arg</b>, which is used internally to mark the end of a
124: list of optional arguments, as a placeholder for missing arguments, as this can
125: lead to segfaults.
126: </P>
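<P>
The overflow and empty-string failures described above are what a strict
numeric conversion produces. The following is a minimal sketch of such a
conversion using only the standard library. <b>parse_int_sketch</b> is a
hypothetical helper written for illustration; it is not the code that pcrecpp
uses internally.
</P>

```cpp
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper: converts the whole of `text` to an int in base 10.
// Returns false for an empty string, trailing junk, or an out-of-range value.
bool parse_int_sketch(const std::string& text, int* out) {
  if (text.empty()) return false;  // the empty string is not a valid number
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(text.c_str(), &end, 10);
  if (end != text.c_str() + text.size()) return false;   // e.g. "ruby"
  if (errno == ERANGE) return false;                     // does not fit in long
  if (value < INT_MIN || value > INT_MAX) return false;  // does not fit in int
  *out = static_cast<int>(value);
  return true;
}
```

<P>
With this helper, "1234" converts successfully, while "1234567891234", "ruby",
and the empty string are all rejected.
</P>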
127: <br><a name="SEC4" href="#TOC1">QUOTING METACHARACTERS</a><br>
128: <P>
129: You can use the "QuoteMeta" operation to insert backslashes before all
130: potentially meaningful characters in a string. The returned string, used as a
131: regular expression, will exactly match the original string.
132: <pre>
133: Example:
134: string quoted = RE::QuoteMeta(unquoted);
135: </pre>
Note that it is legal to escape a character even if it has no special meaning
in a regular expression, so this function does that. (This also makes it
identical to the Perl function of the same name; see "perldoc -f quotemeta".)
For example, "1.5-2.0?" becomes "1\.5\-2\.0\?".
140: </P>
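<P>
The escaping rule can be sketched in a few lines of standard C++.
<b>quote_meta_sketch</b> is a hypothetical stand-alone function, not the
library's implementation. It assumes that bytes with the high bit set are
copied unchanged so that multi-byte characters stay intact, and it does not
treat NUL bytes specially.
</P>

```cpp
#include <string>

// Hypothetical re-implementation of the rule described above: put a
// backslash before every ASCII byte that is not a letter, digit, or '_'.
// Assumption: bytes >= 0x80 are copied as-is; NUL bytes are not handled.
std::string quote_meta_sketch(const std::string& unquoted) {
  std::string result;
  result.reserve(unquoted.size() * 2);
  for (unsigned char c : unquoted) {
    const bool is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                         (c >= '0' && c <= '9') || c == '_';
    if (!is_word && c < 0x80) result += '\\';
    result += static_cast<char>(c);
  }
  return result;
}
```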
141: <br><a name="SEC5" href="#TOC1">PARTIAL MATCHES</a><br>
142: <P>
143: You can use the "PartialMatch" operation when you want the pattern
144: to match any substring of the text.
145: <pre>
146: Example: simple search for a string:
147: pcrecpp::RE("ell").PartialMatch("hello");
148:
149: Example: find first number in a string:
150: int number;
151: pcrecpp::RE re("(\\d+)");
152: re.PartialMatch("x*100 + 20", &number);
153: assert(number == 100);
154: </PRE>
155: </P>
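<P>
The difference between a full match and a partial match can be tried without
PCRE installed. The sketch below uses the standard <b>std::regex</b> library
purely as a stand-in: <b>std::regex_match</b> plays the role of FullMatch and
<b>std::regex_search</b> the role of PartialMatch. The function names are
hypothetical, and std::regex syntax differs from PCRE in places.
</P>

```cpp
#include <regex>
#include <string>

// Stand-in for FullMatch: the pattern must cover the whole text.
bool full_match_sketch(const std::string& pattern, const std::string& text) {
  return std::regex_match(text, std::regex(pattern));
}

// Stand-in for PartialMatch: the pattern may match any substring.
bool partial_match_sketch(const std::string& pattern, const std::string& text) {
  return std::regex_search(text, std::regex(pattern));
}

// Returns the first run of digits in `text` as an int, or -1 if there is none.
int first_number_sketch(const std::string& text) {
  std::smatch m;
  if (!std::regex_search(text, m, std::regex("(\\d+)"))) return -1;
  return std::stoi(m[1].str());
}
```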
156: <br><a name="SEC6" href="#TOC1">UTF-8 AND THE MATCHING INTERFACE</a><br>
157: <P>
By default, pattern and text are plain text, one byte per character. The UTF8
flag, passed to the constructor, causes both pattern and string to be treated
as UTF-8 text, still a byte stream but potentially multiple bytes per
character. In practice, the text is likelier to be UTF-8 than the pattern, but
the match returned may depend on the UTF8 flag, so always use it when matching
UTF-8 text. For example, "." normally matches one byte, but with UTF8 set it
matches one character, which may span several bytes (up to four in well-formed
UTF-8).
165: <pre>
166: Example:
167: pcrecpp::RE_Options options;
168: options.set_utf8();
169: pcrecpp::RE re(utf8_pattern, options);
170: re.FullMatch(utf8_string);
171:
172: Example: using the convenience function UTF8():
173: pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
174: re.FullMatch(utf8_string);
175: </pre>
NOTE: The UTF8 flag is ignored if PCRE was not configured with the
--enable-utf8 flag.
180: </P>
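<P>
To see why byte-wise and character-wise matching differ, it helps to count
characters rather than bytes. The sketch below is independent of PCRE; both
function names are hypothetical. It derives the length of a UTF-8 sequence from
its lead byte, following the ranges allowed for well-formed UTF-8 (RFC 3629).
</P>

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence introduced by `lead`, or 0 if
// `lead` cannot start a well-formed sequence (continuation or invalid byte).
std::size_t utf8_sequence_length(unsigned char lead) {
  if (lead < 0x80) return 1;                   // ASCII
  if (lead >= 0xC2 && lead <= 0xDF) return 2;  // two-byte sequence
  if (lead >= 0xE0 && lead <= 0xEF) return 3;  // three-byte sequence
  if (lead >= 0xF0 && lead <= 0xF4) return 4;  // four-byte sequence
  return 0;
}

// Counts characters, not bytes, assuming `text` is well-formed UTF-8.
// Stray bytes are counted as one character each so the loop always advances.
std::size_t utf8_char_count(const std::string& text) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < text.size();) {
    const std::size_t len =
        utf8_sequence_length(static_cast<unsigned char>(text[i]));
    i += (len == 0 ? 1 : len);
    ++count;
  }
  return count;
}
```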
181: <br><a name="SEC7" href="#TOC1">PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE</a><br>
182: <P>
183: PCRE defines some modifiers to change the behavior of the regular expression
184: engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
185: pass such modifiers to a RE class. Currently, the following modifiers are
186: supported:
187: <pre>
   modifier              description                 Perl equivalent

   PCRE_CASELESS         case insensitive match      /i
   PCRE_MULTILINE        multiple lines match        /m
   PCRE_DOTALL           dot matches newlines        /s
   PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
   PCRE_EXTRA            strict escape parsing       N/A
   PCRE_EXTENDED         ignore white spaces         /x
   PCRE_UTF8             handles UTF8 chars          built-in
   PCRE_UNGREEDY         reverses * and *?           N/A
   PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
199: </pre>
(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself. For example, (?:ab|cd) does not
capture, while (ab|cd) does.
203: </P>
204: <P>
For a full account of how each modifier works, please check the
PCRE API reference page.
207: </P>
208: <P>
For each modifier, there are two member functions whose names are formed
from the modifier in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
212: <pre>
213: bool caseless()
214: </pre>
215: which returns true if the modifier is set, and
216: <pre>
217: RE_Options & set_caseless(bool)
218: </pre>
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
accessed through the <b>set_match_limit()</b> and <b>match_limit()</b> member
functions. Setting <i>match_limit</i> to a non-zero value limits the
execution of PCRE, to keep it from blowing the stack or taking an excessively
long time to return a result. A value of 5000 is good enough to stop
stack blowup in a 2MB thread stack. Setting <i>match_limit</i> to zero disables
match limiting. Alternatively, you can call <b>set_match_limit_recursion()</b>,
which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
recurses. <b>match_limit()</b> limits the number of times PCRE calls its
internal match() function;
<b>match_limit_recursion()</b> limits the depth of internal recursion, and
therefore the amount of stack that is used.
230: </P>
231: <P>
232: Normally, to pass one or more modifiers to a RE class, you declare
233: a <i>RE_Options</i> object, set the appropriate options, and pass this
234: object to a RE constructor. Example:
235: <pre>
236: RE_Options opt;
237: opt.set_caseless(true);
238: if (RE("HELLO", opt).PartialMatch("hello world")) ...
239: </pre>
RE_Options has two constructors. The default constructor takes no arguments and
creates a set of flags that are all off. The other takes an
<i>option_flags</i> parameter, which is intended to ease the transfer of legacy
code from C programs. This lets you do
244: <pre>
245: RE(pattern,
246: RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
247: </pre>
248: However, new code is better off doing
249: <pre>
250: RE(pattern,
251: RE_Options().set_caseless(true).set_multiline(true))
252: .PartialMatch(str);
253: </pre>
If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options object with the
appropriate modifier already set: <b>CASELESS()</b>, <b>UTF8()</b>,
<b>MULTILINE()</b>, <b>DOTALL()</b>, and <b>EXTENDED()</b>.
258: </P>
259: <P>
If you need to set several options at once, and you don't want to go to the
trouble of declaring a RE_Options object and setting each option separately,
there is an alternative that lets you do so on the fly. You can chain
several <b>set_xxxxx()</b> member functions, since each of them returns a
reference to its class object. For example, to pass PCRE_CASELESS,
PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one statement, you may write:
266: <pre>
267: RE(" ^ xyz \\s+ .* blah$",
268: RE_Options()
269: .set_caseless(true)
270: .set_extended(true)
271: .set_multiline(true)).PartialMatch(sometext);
272:
273: </PRE>
274: </P>
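<P>
Chaining works because every setter returns a reference to the object it was
called on. The miniature class below sketches that pattern. <b>OptionsSketch</b>
and its flag values are hypothetical and exist only for this illustration; they
are not the real RE_Options class or PCRE's real constants.
</P>

```cpp
// Hypothetical miniature of an options class. The flag values are made up
// for this sketch and are not PCRE's real constants.
class OptionsSketch {
 public:
  enum { kCaseless = 1, kMultiline = 2, kExtended = 4 };

  OptionsSketch() : flags_(0) {}
  explicit OptionsSketch(int option_flags) : flags_(option_flags) {}

  // Getters report whether a modifier is currently set.
  bool caseless() const { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  bool extended() const { return (flags_ & kExtended) != 0; }

  // Setters return *this so that calls can be chained.
  OptionsSketch& set_caseless(bool on) { return set(kCaseless, on); }
  OptionsSketch& set_multiline(bool on) { return set(kMultiline, on); }
  OptionsSketch& set_extended(bool on) { return set(kExtended, on); }

  int all_options() const { return flags_; }

 private:
  OptionsSketch& set(int flag, bool on) {
    if (on) {
      flags_ |= flag;
    } else {
      flags_ &= ~flag;
    }
    return *this;
  }

  int flags_;
};
```

<P>
Because each call yields the same object, an expression such as
OptionsSketch().set_caseless(true).set_extended(true) builds a fully configured
temporary in a single statement.
</P>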
275: <br><a name="SEC8" href="#TOC1">SCANNING TEXT INCREMENTALLY</a><br>
276: <P>
277: The "Consume" operation may be useful if you want to repeatedly
278: match regular expressions at the front of a string and skip over
279: them as they match. This requires use of the "StringPiece" type,
280: which represents a sub-range of a real string. Like RE, StringPiece
281: is defined in the pcrecpp namespace.
282: <pre>
283: Example: read lines of the form "var = value" from a string.
284: string contents = ...; // Fill string somehow
285: pcrecpp::StringPiece input(contents); // Wrap in a StringPiece
286:
287: string var;
288: int value;
289: pcrecpp::RE re("(\\w+) = (\\d+)\n");
290: while (re.Consume(&input, &var, &value)) {
291: ...;
292: }
293: </pre>
Each successful call to "Consume" will set "var" and "value", and also
advance "input" so it points past the matched text.
296: </P>
297: <P>
298: The "FindAndConsume" operation is similar to "Consume" but does not
299: anchor your match at the beginning of the string. For example, you
300: could extract all words from a string by repeatedly calling
301: <pre>
302: pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
303: </PRE>
304: </P>
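<P>
The anchored and unanchored scanning loops can be approximated with the
standard library for experimentation. In this sketch <b>std::regex</b> is a
stand-in for PCRE, an iterator takes the place of the StringPiece, and the
<b>match_continuous</b> flag supplies the anchoring that "Consume" performs.
The function names are hypothetical.
</P>

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Consume-style loop: each match must start exactly where the last one ended.
std::vector<std::pair<std::string, int> > read_assignments_sketch(
    const std::string& contents) {
  static const std::regex line_re("(\\w+) = (\\d+)\n");
  std::vector<std::pair<std::string, int> > out;
  std::string::const_iterator pos = contents.begin();
  std::smatch m;
  while (std::regex_search(pos, contents.end(), m, line_re,
                           std::regex_constants::match_continuous)) {
    out.push_back(std::make_pair(m[1].str(), std::stoi(m[2].str())));
    pos = m[0].second;  // advance past the matched text
  }
  return out;
}

// FindAndConsume-style loop: the match may start anywhere after `pos`.
std::vector<std::string> extract_words_sketch(const std::string& contents) {
  static const std::regex word_re("(\\w+)");
  std::vector<std::string> words;
  std::string::const_iterator pos = contents.begin();
  std::smatch m;
  while (std::regex_search(pos, contents.end(), m, word_re)) {
    words.push_back(m[1].str());
    pos = m[0].second;
  }
  return words;
}
```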
305: <br><a name="SEC9" href="#TOC1">PARSING HEX/OCTAL/C-RADIX NUMBERS</a><br>
306: <P>
307: By default, if you pass a pointer to a numeric value, the
308: corresponding text is interpreted as a base-10 number. You can
309: instead wrap the pointer with a call to one of the operators Hex(),
310: Octal(), or CRadix() to interpret the text in another base. The
311: CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
312: prefixes, but defaults to base-10.
313: <pre>
314: Example:
315: int a, b, c, d;
316: pcrecpp::RE re("(.*) (.*) (.*) (.*)");
317: re.FullMatch("100 40 0100 0x40",
318: pcrecpp::Octal(&a), pcrecpp::Hex(&b),
319: pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
320: </pre>
321: will leave 64 in a, b, c, and d.
322: </P>
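<P>
The four conversions in the example can be checked with the C library function
<b>strtol</b>, which accepts a base argument and treats base 0 as "honor C-style
prefixes". The helper below is a hypothetical sketch; this page does not state
how pcrecpp performs the conversion internally.
</P>

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: parses all of `text` in the given base using strtol.
// Base 0 asks strtol to honor C-style "0" (octal) and "0x" (hex) prefixes
// and to fall back to base 10 otherwise. Range checking is omitted.
bool parse_radix_sketch(const std::string& text, int base, long* out) {
  if (text.empty()) return false;
  char* end = nullptr;
  const long value = std::strtol(text.c_str(), &end, base);
  if (end != text.c_str() + text.size()) return false;  // trailing junk
  *out = value;
  return true;
}
```

<P>
Parsing "100" in base 8, "40" in base 16, and "0100" or "0x40" in base 0 each
yields 64, matching the result described above.
</P>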
323: <br><a name="SEC10" href="#TOC1">REPLACING PARTS OF STRINGS</a><br>
324: <P>
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be
used to insert the text matching the corresponding parenthesized group
from the pattern. \0 in "rewrite" refers to the entire matching
text. For example:
330: <pre>
331: string s = "yabba dabba doo";
332: pcrecpp::RE("b+").Replace("d", &s);
333: </pre>
334: will leave "s" containing "yada dabba doo". The result is true if the pattern
335: matches and a replacement occurs, false otherwise.
336: </P>
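<P>
How the backslash-digit references in "rewrite" are expanded can be sketched
independently of the regular expression engine. <b>expand_rewrite_sketch</b> is
a hypothetical helper that takes the already-captured groups as input. It
assumes that a doubled backslash yields one backslash and that a reference to a
missing group inserts nothing; the library's handling of those cases may
differ.
</P>

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper. groups[0] is the whole match and groups[n] is the
// text of the n-th parenthesized group. A backslash followed by a digit
// inserts that group. Assumptions: "\\\\" yields a single backslash, and a
// reference to a group that does not exist inserts nothing.
std::string expand_rewrite_sketch(const std::string& rewrite,
                                  const std::vector<std::string>& groups) {
  std::string out;
  for (std::size_t i = 0; i < rewrite.size(); ++i) {
    const char c = rewrite[i];
    if (c == '\\' && i + 1 < rewrite.size()) {
      const char next = rewrite[i + 1];
      if (next >= '0' && next <= '9') {
        const std::size_t n = static_cast<std::size_t>(next - '0');
        if (n < groups.size()) out += groups[n];
        ++i;  // skip the digit
        continue;
      }
      if (next == '\\') {
        out += '\\';
        ++i;  // skip the second backslash
        continue;
      }
    }
    out += c;  // ordinary character, copied as-is
  }
  return out;
}
```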
337: <P>
338: <b>GlobalReplace</b> is like <b>Replace</b> except that it replaces all
339: occurrences of the pattern in the string with the rewrite. Replacements are
340: not subject to re-matching. For example:
341: <pre>
342: string s = "yabba dabba doo";
343: pcrecpp::RE("b+").GlobalReplace("d", &s);
344: </pre>
345: will leave "s" containing "yada dada doo". It returns the number of
346: replacements made.
347: </P>
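<P>
The first-match and all-matches behaviors can be compared side by side with
the standard library. In this sketch <b>std::regex_replace</b> stands in for
the pcrecpp calls, and the function names are hypothetical. Note that
std::regex uses "$1" rather than "\1" for group references in the replacement
text, so only literal replacements are shown.
</P>

```cpp
#include <regex>
#include <string>

// Stand-in for Replace: rewrites only the first match.
std::string replace_first_sketch(const std::string& text,
                                 const std::string& pattern,
                                 const std::string& rewrite) {
  return std::regex_replace(text, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only);
}

// Stand-in for GlobalReplace: rewrites every non-overlapping match. Text
// produced by a replacement is not scanned again.
std::string replace_all_sketch(const std::string& text,
                               const std::string& pattern,
                               const std::string& rewrite) {
  return std::regex_replace(text, std::regex(pattern), rewrite);
}
```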
348: <P>
349: <b>Extract</b> is like <b>Replace</b>, except that if the pattern matches,
350: "rewrite" is copied into "out" (an additional argument) with substitutions.
351: The non-matching portions of "text" are ignored. Returns true iff a match
352: occurred and the extraction happened successfully; if no match occurs, the
353: string is left unaffected.
354: </P>
355: <br><a name="SEC11" href="#TOC1">AUTHOR</a><br>
356: <P>
357: The C++ wrapper was contributed by Google Inc.
358: <br>
359: Copyright © 2007 Google Inc.
360: <br>
361: </P>
362: <br><a name="SEC12" href="#TOC1">REVISION</a><br>
363: <P>
364: Last updated: 08 January 2012
365: <br>
366: <p>
367: Return to the <a href="index.html">PCRE index page</a>.
368: </p>