.TH PCRECPP 3 "08 January 2012" "PCRE 8.30"
.SH NAME
PCRE - Perl-compatible regular expressions.
.SH "SYNOPSIS OF C++ WRAPPER"
.rs
.sp
.B #include <pcrecpp.h>
.
.SH DESCRIPTION
.rs
.sp
The C++ wrapper for PCRE was provided by Google Inc. Some additional
functionality was added by Giuseppe Maxia. This brief man page was constructed
from the notes in the \fIpcrecpp.h\fP file, which should be consulted for
further details. Note that the C++ wrapper supports only the original 8-bit
PCRE library. There is no 16-bit or 32-bit support at present.
.
.
.SH "MATCHING INTERFACE"
.rs
.sp
The "FullMatch" operation checks that supplied text matches a supplied pattern
exactly. If pointer arguments are supplied, it copies matched sub-strings that
match sub-patterns into them.
.sp
   Example: successful match
      pcrecpp::RE re("h.*o");
      re.FullMatch("hello");
.sp
   Example: unsuccessful match (requires full match):
      pcrecpp::RE re("e");
      !re.FullMatch("hello");
.sp
   Example: creating a temporary RE object:
      pcrecpp::RE("h.*o").FullMatch("hello");
.sp
You can pass in a "const char*" or a "string" for "text". The examples below
tend to use a const char*. You can, as in the different examples above, store
the RE object explicitly in a variable or use a temporary RE object. The
examples below use one mode or the other arbitrarily. Either could correctly be
used for any of these examples.
.P
You must supply extra pointer arguments to extract matched subpieces.
.sp
   Example: extracts "ruby" into "s" and 1234 into "i"
      int i;
      string s;
      pcrecpp::RE re("(\e\ew+):(\e\ed+)");
      re.FullMatch("ruby:1234", &s, &i);
.sp
   Example: does not try to extract any extra sub-patterns
      re.FullMatch("ruby:1234", &s);
.sp
   Example: does not try to extract into NULL
      re.FullMatch("ruby:1234", NULL, &i);
.sp
   Example: integer overflow causes failure
      !re.FullMatch("ruby:1234567891234", NULL, &i);
.sp
   Example: fails because there aren't enough sub-patterns:
      !pcrecpp::RE("\e\ew+:\e\ed+").FullMatch("ruby:1234", &s);
.sp
   Example: fails because string cannot be stored in integer
      !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
.sp
The provided pointer arguments can be pointers to any scalar numeric
type, or one of:
.sp
   string        (matched piece is copied to string)
   StringPiece   (StringPiece is mutated to point to matched piece)
   T             (where "bool T::ParseFrom(const char*, int)" exists)
   NULL          (the corresponding matched sub-pattern is not copied)
.sp
The function returns true iff all of the following conditions are satisfied:
.sp
  a. "text" matches "pattern" exactly;
.sp
  b. The number of matched sub-patterns is >= the number of supplied
     pointers;
.sp
  c. The "i"th argument has a suitable type for holding the
     string captured as the "i"th sub-pattern. If you pass in
     void * NULL for the "i"th argument, or a non-void * NULL
     of the correct type, or pass fewer arguments than the
     number of sub-patterns, the "i"th captured sub-pattern is
     ignored.
.sp
CAVEAT: An optional sub-pattern that does not exist in the matched
string is assigned the empty string. Therefore, the following will
return false (because the empty string is not a valid number):
.sp
   int number;
   pcrecpp::RE("[a-z]+(\e\ed+)?").FullMatch("abc", &number);
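.P
The failure cases above (integer overflow, a non-numeric capture, and an empty
optional capture) all come down to converting captured text into an int. The
following is a minimal standalone sketch of that conversion step, using only
the C++ standard library. The helper name parse_int_sketch is hypothetical and
is not part of pcrecpp; the library's own conversion code may differ in detail.

```cpp
#include <cassert>
#include <cerrno>
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of pcrecpp: convert captured text to an int.
// Returns false for empty text, trailing non-digits, or out-of-range values.
bool parse_int_sketch(const std::string& text, int* out) {
  if (text.empty()) return false;     // e.g. an optional group that did not match
  errno = 0;
  char* end = nullptr;
  const long value = std::strtol(text.c_str(), &end, 10);
  if (*end != '\0') return false;     // e.g. "ruby" is not a number
  if (errno == ERANGE) return false;  // does not fit in a long
  if (value < INT_MIN || value > INT_MAX) return false;  // does not fit in an int
  *out = static_cast<int>(value);
  return true;
}
```

.P
With this sketch, "1234" converts successfully, while "", "ruby", and
"1234567891234" are all rejected, mirroring the FullMatch failures shown above.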
.sp
The matching interface supports at most 16 arguments per call.
If you need more, consider using the more general interface
\fBpcrecpp::RE::DoMatch\fP. See \fBpcrecpp.h\fP for the signature for
\fBDoMatch\fP.
.P
NOTE: Do not use \fBno_arg\fP, which is used internally to mark the end of a
list of optional arguments, as a placeholder for missing arguments, as this can
lead to segfaults.
.
.
.SH "QUOTING METACHARACTERS"
.rs
.sp
You can use the "QuoteMeta" operation to insert backslashes before all
potentially meaningful characters in a string. The returned string, used as a
regular expression, will exactly match the original string.
.sp
   Example:
      string quoted = RE::QuoteMeta(unquoted);
.sp
Note that it's legal to escape a character even if it has no special meaning in
a regular expression -- so this function does that. (This also makes it
identical to the perl function of the same name; see "perldoc -f quotemeta".)
For example, "1.5-2.0?" becomes "1\e.5\e-2\e.0\e?".
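.P
The quoting rule described above can be sketched as a short standalone
function. This is an illustration only: the name quote_meta_sketch is
hypothetical, and the sketch assumes that every byte other than an ASCII
letter, digit, or underscore gets a backslash. The real QuoteMeta may treat
some bytes (for instance NUL or non-ASCII bytes) differently, so consult
pcrecpp.cc for the actual rules.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for illustration; not the pcrecpp implementation.
// Puts a backslash before every byte that is not an ASCII word character.
std::string quote_meta_sketch(const std::string& unquoted) {
  std::string result;
  result.reserve(unquoted.size() * 2);
  for (std::string::size_type i = 0; i < unquoted.size(); ++i) {
    const char c = unquoted[i];
    const bool is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                         (c >= '0' && c <= '9') || c == '_';
    if (!is_word) result += '\\';  // escape even if the byte is not special
    result += c;
  }
  return result;
}
```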
.
.
.SH "PARTIAL MATCHES"
.rs
.sp
You can use the "PartialMatch" operation when you want the pattern
to match any substring of the text.
.sp
   Example: simple search for a string:
      pcrecpp::RE("ell").PartialMatch("hello");
.sp
   Example: find first number in a string:
      int number;
      pcrecpp::RE re("(\e\ed+)");
      re.PartialMatch("x*100 + 20", &number);
      assert(number == 100);
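.P
The difference between matching the whole text and matching any substring can
be tried without PCRE installed. The sketch below uses std::regex from the C++
standard library as a stand-in: std::regex_match plays the role of FullMatch
and std::regex_search the role of PartialMatch. The function names are
hypothetical, and std::regex uses ECMAScript syntax by default, so this shows
the anchoring behaviour only, not the pcrecpp API.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in for FullMatch: the pattern must cover the whole text.
bool full_match_sketch(const std::string& text, const std::string& pattern) {
  return std::regex_match(text, std::regex(pattern));
}

// Stand-in for PartialMatch: the pattern may match any substring.
bool partial_match_sketch(const std::string& text, const std::string& pattern) {
  return std::regex_search(text, std::regex(pattern));
}

// Stand-in for PartialMatch with one int argument: store the first run of
// digits found in the text. std::stoi throws if the digits overflow an int.
bool first_number_sketch(const std::string& text, int* out) {
  std::smatch m;
  if (!std::regex_search(text, m, std::regex("([0-9]+)"))) return false;
  *out = std::stoi(m[1].str());
  return true;
}
```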
.
.
.SH "UTF-8 AND THE MATCHING INTERFACE"
.rs
.sp
By default, pattern and text are plain text, one byte per character. The UTF8
flag, passed to the constructor, causes both pattern and string to be treated
as UTF-8 text, still a byte stream but potentially multiple bytes per
character. In practice, the text is likelier to be UTF-8 than the pattern, but
the match returned may depend on the UTF8 flag, so always use it when matching
UTF-8 text. For example, "." will match one byte normally but with UTF8 set may
match all the bytes that together encode one multi-byte character.
.sp
   Example:
      pcrecpp::RE_Options options;
      options.set_utf8();
      pcrecpp::RE re(utf8_pattern, options);
      re.FullMatch(utf8_string);
.sp
   Example: using the convenience function UTF8():
      pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
      re.FullMatch(utf8_string);
.sp
NOTE: The UTF8 flag is ignored if pcre was not configured with the
      --enable-utf8 flag.
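.P
To see why "." can cover more than one byte in UTF-8 mode, it helps to look at
how a UTF-8 lead byte announces the length of its sequence. The sketch below is
standalone and does not use PCRE; the helper names are hypothetical. It assumes
standard UTF-8, where a character occupies one to four bytes, and it does not
validate continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the UTF-8 sequence that starts with
// lead byte b, or 0 if b cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char b) {
  if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                          // continuation byte or invalid lead byte
}

// Count characters rather than bytes in text assumed to be valid UTF-8.
std::size_t utf8_char_count(const std::string& text) {
  std::size_t count = 0;
  for (std::size_t i = 0; i < text.size();) {
    const std::size_t n =
        utf8_sequence_length(static_cast<unsigned char>(text[i]));
    i += (n == 0 ? 1 : n);  // step over a stray byte instead of looping forever
    ++count;
  }
  return count;
}
```

.P
A string holding "h", a two-byte character, and "llo" is six bytes long but
five characters long, which is the distinction the UTF8 flag asks the matcher
to respect.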
.
.
.SH "PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE"
.rs
.sp
PCRE defines some modifiers to change the behavior of the regular expression
engine. The C++ wrapper defines an auxiliary class, RE_Options, as a vehicle to
pass such modifiers to an RE object. Currently, the following modifiers are
supported:
.sp
   modifier              description                 Perl equivalent
.sp
   PCRE_CASELESS         case insensitive match      /i
   PCRE_MULTILINE        multiple lines match        /m
   PCRE_DOTALL           dot matches newlines        /s
   PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
   PCRE_EXTRA            strict escape parsing       N/A
   PCRE_EXTENDED         ignore whitespace           /x
   PCRE_UTF8             handles UTF8 chars          built-in
   PCRE_UNGREEDY         reverses * and *?           N/A
   PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
.sp
(*) Both Perl and PCRE allow non-capturing parentheses by means of the
"?:" modifier within the pattern itself, e.g. (?:ab|cd) does not
capture, while (ab|cd) does.
.P
For a full account of how each modifier works, please check the
PCRE API reference page.
.P
For each modifier, there are two member functions whose names are made
out of the modifier in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
.sp
  bool caseless()
.sp
which returns true if the modifier is set, and
.sp
  RE_Options & set_caseless(bool)
.sp
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can be
accessed through the \fBset_match_limit()\fP and \fBmatch_limit()\fP member
functions. Setting \fImatch_limit\fP to a non-zero value will limit the
execution of pcre to keep it from doing bad things like blowing the stack or
taking an eternity to return a result. A value of 5000 is good enough to stop
stack blowup in a 2MB thread stack. Setting \fImatch_limit\fP to zero disables
match limiting. Alternatively, you can call \fBset_match_limit_recursion()\fP,
which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to limit how much PCRE
recurses. The match limit bounds how many times PCRE calls its internal
matching function; the recursion limit bounds the depth of internal recursion,
and therefore the amount of stack that is used.
.P
Normally, to pass one or more modifiers to an RE object, you declare
a \fIRE_Options\fP object, set the appropriate options, and pass this
object to an RE constructor. Example:
.sp
   RE_Options opt;
   opt.set_caseless(true);
   if (RE("HELLO", opt).PartialMatch("hello world")) ...
.sp
RE_Options has two constructors. The default constructor takes no arguments and
creates a set of flags that are all off. The other takes an
\fIoption_flags\fP argument, which facilitates the transfer of legacy code from
C programs. This lets you do
.sp
   RE(pattern,
     RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
.sp
However, new code is better off doing
.sp
   RE(pattern,
     RE_Options().set_caseless(true).set_multiline(true))
       .PartialMatch(str);
.sp
If you are going to pass one of the most used modifiers, there are some
convenience functions that return an RE_Options object with the
appropriate modifier already set: \fBCASELESS()\fP, \fBUTF8()\fP,
\fBMULTILINE()\fP, \fBDOTALL()\fP, and \fBEXTENDED()\fP.
.P
If you need to set several options at once, and you don't want to go through
the pains of declaring an RE_Options object and setting several options, there
is a parallel method that gives you this ability on the fly. You can chain
several \fBset_xxxxx()\fP member functions, since each of them returns a
reference to its class object. For example, to pass PCRE_CASELESS,
PCRE_EXTENDED, and PCRE_MULTILINE to an RE with one statement, you may write:
.sp
   RE(" ^ xyz \e\es+ .* blah$",
     RE_Options()
       .set_caseless(true)
       .set_extended(true)
       .set_multiline(true)).PartialMatch(sometext);
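.P
Chaining works because each setter returns a reference to the object it was
called on. The class below is a minimal sketch of that pattern. It is not the
real RE_Options: the class name, the enum, and the flag values are all
hypothetical (the real PCRE_* constants are defined in pcre.h), and only three
modifiers are modelled.

```cpp
#include <cassert>

// Hypothetical flag values for illustration only.
enum SketchFlags { kCaseless = 1, kMultiline = 2, kExtended = 8 };

// Hypothetical stand-in that mimics the getter/setter naming described above.
class OptionsSketch {
 public:
  OptionsSketch() : flags_(0) {}                      // all flags off
  explicit OptionsSketch(int option_flags) : flags_(option_flags) {}

  bool caseless() const { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  bool extended() const { return (flags_ & kExtended) != 0; }

  OptionsSketch& set_caseless(bool on) { return set_flag(kCaseless, on); }
  OptionsSketch& set_multiline(bool on) { return set_flag(kMultiline, on); }
  OptionsSketch& set_extended(bool on) { return set_flag(kExtended, on); }

  int all_options() const { return flags_; }

 private:
  OptionsSketch& set_flag(int flag, bool on) {
    if (on) {
      flags_ |= flag;
    } else {
      flags_ &= ~flag;
    }
    return *this;  // returning *this is what makes chaining possible
  }

  int flags_;
};
```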
.sp
.
.
.SH "SCANNING TEXT INCREMENTALLY"
.rs
.sp
The "Consume" operation may be useful if you want to repeatedly
match regular expressions at the front of a string and skip over
them as they match. This requires use of the "StringPiece" type,
which represents a sub-range of a real string. Like RE, StringPiece
is defined in the pcrecpp namespace.
.sp
   Example: read lines of the form "var = value" from a string.
      string contents = ...;                 // Fill string somehow
      pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
.sp
      string var;
      int value;
      pcrecpp::RE re("(\e\ew+) = (\e\ed+)\en");
      while (re.Consume(&input, &var, &value)) {
        ...;
      }
.sp
Each successful call to "Consume" will set "var" and "value", and also
advance "input" so it points past the matched text.
.P
The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling
.sp
  pcrecpp::RE("(\e\ew+)").FindAndConsume(&input, &word)
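.P
The consume-and-advance loop can be sketched without PCRE by using std::regex
and an iterator in place of a StringPiece. This is a stand-in under stated
assumptions, not the pcrecpp API: the function name is hypothetical, the
pattern is written in ECMAScript syntax, and the match_continuous flag is used
to imitate the way Consume anchors each match at the front of the remaining
input.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Stand-in for a Consume loop: repeatedly match "var = value" followed by a
// newline at the front of the remaining input, advancing past each match.
std::vector<std::pair<std::string, int> > consume_assignments_sketch(
    const std::string& contents) {
  std::vector<std::pair<std::string, int> > result;
  const std::regex re("([A-Za-z0-9_]+) = ([0-9]+)\n");
  std::string::const_iterator pos = contents.begin();
  std::smatch m;
  // match_continuous only accepts a match that starts exactly at pos.
  while (std::regex_search(pos, contents.end(), m, re,
                           std::regex_constants::match_continuous)) {
    result.push_back(std::make_pair(m[1].str(), std::stoi(m[2].str())));
    pos = m[0].second;  // advance past the matched text
  }
  return result;
}
```

.P
Dropping the match_continuous flag would let each search skip ahead to the
next match, which is roughly the behaviour described for FindAndConsume.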
.
.
.SH "PARSING HEX/OCTAL/C-RADIX NUMBERS"
.rs
.sp
By default, if you pass a pointer to a numeric value, the
corresponding text is interpreted as a base-10 number. You can
instead wrap the pointer with a call to one of the operators Hex(),
Octal(), or CRadix() to interpret the text in another base. The
CRadix operator interprets C-style "0" (base-8) and "0x" (base-16)
prefixes, but defaults to base-10.
.sp
   Example:
     int a, b, c, d;
     pcrecpp::RE re("(.*) (.*) (.*) (.*)");
     re.FullMatch("100 40 0100 0x40",
                  pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                  pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
.sp
will leave 64 in a, b, c, and d.
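.P
The arithmetic behind this example can be checked with the standard strtol
function, whose base argument behaves much like the three operators: base 8
for Octal(), base 16 for Hex(), and base 0 (prefix-driven) for CRadix(). The
helper below is a hypothetical sketch that assumes strtol-like semantics; it
omits range checks for brevity and is not the pcrecpp implementation.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: parse all of text as an integer in the given base.
// Base 0 lets a leading "0" select octal and a leading "0x" select hex.
bool parse_radix_sketch(const std::string& text, int base, long* out) {
  if (text.empty()) return false;
  char* end = nullptr;
  const long value = std::strtol(text.c_str(), &end, base);
  if (*end != '\0') return false;  // reject trailing characters
  *out = value;
  return true;
}
```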
.
.
.SH "REPLACING PARTS OF STRINGS"
.rs
.sp
You can use \fBReplace\fP to replace the first match of "pattern" in "str"
with "rewrite". Within "rewrite", backslash-escaped digits (\e1 to \e9) can be
used to insert the text matching the corresponding parenthesized group
from the pattern. \e0 in "rewrite" refers to the entire matching
text. For example:
.sp
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").Replace("d", &s);
.sp
will leave "s" containing "yada dabba doo". The result is true if the pattern
matches and a replacement occurs, false otherwise.
.P
\fBGlobalReplace\fP is like \fBReplace\fP except that it replaces all
occurrences of the pattern in the string with the rewrite. Replacements are
not subject to re-matching. For example:
.sp
  string s = "yabba dabba doo";
  pcrecpp::RE("b+").GlobalReplace("d", &s);
.sp
will leave "s" containing "yada dada doo". It returns the number of
replacements made.
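.P
The two results quoted above can be reproduced with std::regex as a stand-in
for readers who want to experiment without PCRE installed. The function names
below are hypothetical. Note two differences from the interface documented
here: std::regex_replace returns a new string instead of modifying its
argument, and its rewrite strings refer to groups as $1 rather than \e1.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in for Replace: rewrite only the first match.
std::string replace_first_sketch(const std::string& s,
                                 const std::string& pattern,
                                 const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite,
                            std::regex_constants::format_first_only);
}

// Stand-in for GlobalReplace: rewrite every non-overlapping match.
std::string replace_all_sketch(const std::string& s,
                               const std::string& pattern,
                               const std::string& rewrite) {
  return std::regex_replace(s, std::regex(pattern), rewrite);
}
```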
.P
\fBExtract\fP is like \fBReplace\fP, except that if the pattern matches,
"rewrite" is copied into "out" (an additional argument) with substitutions.
The non-matching portions of "text" are ignored. Returns true iff a match
occurred and the extraction happened successfully; if no match occurs, the
string is left unaffected.
.
.
.SH AUTHOR
.rs
.sp
.nf
The C++ wrapper was contributed by Google Inc.
Copyright (c) 2007 Google Inc.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 08 January 2012
.fi