Annotation of embedaddon/pcre/doc/html/pcreapi.html, revision 1.1.1.4
1.1 misho 1: <html>
2: <head>
3: <title>pcreapi specification</title>
4: </head>
5: <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6: <h1>pcreapi man page</h1>
7: <p>
8: Return to the <a href="index.html">PCRE index page</a>.
9: </p>
10: <p>
11: This page is part of the PCRE HTML documentation. It was generated automatically
12: from the original man page. If there is any nonsense in it, please consult the
13: man page, in case the conversion went wrong.
14: <br>
15: <ul>
16: <li><a name="TOC1" href="#SEC1">PCRE NATIVE API BASIC FUNCTIONS</a>
1.1.1.2 misho 17: <li><a name="TOC2" href="#SEC2">PCRE NATIVE API STRING EXTRACTION FUNCTIONS</a>
18: <li><a name="TOC3" href="#SEC3">PCRE NATIVE API AUXILIARY FUNCTIONS</a>
19: <li><a name="TOC4" href="#SEC4">PCRE NATIVE API INDIRECTED FUNCTIONS</a>
1.1.1.4 ! misho 20: <li><a name="TOC5" href="#SEC5">PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a>
1.1.1.2 misho 21: <li><a name="TOC6" href="#SEC6">PCRE API OVERVIEW</a>
22: <li><a name="TOC7" href="#SEC7">NEWLINES</a>
23: <li><a name="TOC8" href="#SEC8">MULTITHREADING</a>
24: <li><a name="TOC9" href="#SEC9">SAVING PRECOMPILED PATTERNS FOR LATER USE</a>
25: <li><a name="TOC10" href="#SEC10">CHECKING BUILD-TIME OPTIONS</a>
26: <li><a name="TOC11" href="#SEC11">COMPILING A PATTERN</a>
27: <li><a name="TOC12" href="#SEC12">COMPILATION ERROR CODES</a>
28: <li><a name="TOC13" href="#SEC13">STUDYING A PATTERN</a>
29: <li><a name="TOC14" href="#SEC14">LOCALE SUPPORT</a>
30: <li><a name="TOC15" href="#SEC15">INFORMATION ABOUT A PATTERN</a>
31: <li><a name="TOC16" href="#SEC16">REFERENCE COUNTS</a>
32: <li><a name="TOC17" href="#SEC17">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a>
33: <li><a name="TOC18" href="#SEC18">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a>
34: <li><a name="TOC19" href="#SEC19">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a>
35: <li><a name="TOC20" href="#SEC20">DUPLICATE SUBPATTERN NAMES</a>
36: <li><a name="TOC21" href="#SEC21">FINDING ALL POSSIBLE MATCHES</a>
37: <li><a name="TOC22" href="#SEC22">OBTAINING AN ESTIMATE OF STACK USAGE</a>
38: <li><a name="TOC23" href="#SEC23">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a>
39: <li><a name="TOC24" href="#SEC24">SEE ALSO</a>
40: <li><a name="TOC25" href="#SEC25">AUTHOR</a>
41: <li><a name="TOC26" href="#SEC26">REVISION</a>
1.1 misho 42: </ul>
43: <P>
44: <b>#include &lt;pcre.h&gt;</b>
45: </P>
1.1.1.2 misho 46: <br><a name="SEC1" href="#TOC1">PCRE NATIVE API BASIC FUNCTIONS</a><br>
1.1 misho 47: <P>
48: <b>pcre *pcre_compile(const char *<i>pattern</i>, int <i>options</i>,</b>
49: <b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
50: <b>const unsigned char *<i>tableptr</i>);</b>
51: </P>
52: <P>
53: <b>pcre *pcre_compile2(const char *<i>pattern</i>, int <i>options</i>,</b>
54: <b>int *<i>errorcodeptr</i>,</b>
55: <b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
56: <b>const unsigned char *<i>tableptr</i>);</b>
57: </P>
58: <P>
59: <b>pcre_extra *pcre_study(const pcre *<i>code</i>, int <i>options</i>,</b>
60: <b>const char **<i>errptr</i>);</b>
61: </P>
62: <P>
63: <b>void pcre_free_study(pcre_extra *<i>extra</i>);</b>
64: </P>
65: <P>
66: <b>int pcre_exec(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
67: <b>const char *<i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
68: <b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>);</b>
69: </P>
70: <P>
71: <b>int pcre_dfa_exec(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
72: <b>const char *<i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
73: <b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>,</b>
74: <b>int *<i>workspace</i>, int <i>wscount</i>);</b>
75: </P>
1.1.1.2 misho 76: <br><a name="SEC2" href="#TOC1">PCRE NATIVE API STRING EXTRACTION FUNCTIONS</a><br>
1.1 misho 77: <P>
78: <b>int pcre_copy_named_substring(const pcre *<i>code</i>,</b>
79: <b>const char *<i>subject</i>, int *<i>ovector</i>,</b>
80: <b>int <i>stringcount</i>, const char *<i>stringname</i>,</b>
81: <b>char *<i>buffer</i>, int <i>buffersize</i>);</b>
82: </P>
83: <P>
84: <b>int pcre_copy_substring(const char *<i>subject</i>, int *<i>ovector</i>,</b>
85: <b>int <i>stringcount</i>, int <i>stringnumber</i>, char *<i>buffer</i>,</b>
86: <b>int <i>buffersize</i>);</b>
87: </P>
88: <P>
89: <b>int pcre_get_named_substring(const pcre *<i>code</i>,</b>
90: <b>const char *<i>subject</i>, int *<i>ovector</i>,</b>
91: <b>int <i>stringcount</i>, const char *<i>stringname</i>,</b>
92: <b>const char **<i>stringptr</i>);</b>
93: </P>
94: <P>
95: <b>int pcre_get_stringnumber(const pcre *<i>code</i>,</b>
96: <b>const char *<i>name</i>);</b>
97: </P>
98: <P>
99: <b>int pcre_get_stringtable_entries(const pcre *<i>code</i>,</b>
100: <b>const char *<i>name</i>, char **<i>first</i>, char **<i>last</i>);</b>
101: </P>
102: <P>
103: <b>int pcre_get_substring(const char *<i>subject</i>, int *<i>ovector</i>,</b>
104: <b>int <i>stringcount</i>, int <i>stringnumber</i>,</b>
105: <b>const char **<i>stringptr</i>);</b>
106: </P>
107: <P>
108: <b>int pcre_get_substring_list(const char *<i>subject</i>,</b>
109: <b>int *<i>ovector</i>, int <i>stringcount</i>, const char ***<i>listptr</i>);</b>
110: </P>
111: <P>
112: <b>void pcre_free_substring(const char *<i>stringptr</i>);</b>
113: </P>
114: <P>
115: <b>void pcre_free_substring_list(const char **<i>stringptr</i>);</b>
116: </P>
1.1.1.2 misho 117: <br><a name="SEC3" href="#TOC1">PCRE NATIVE API AUXILIARY FUNCTIONS</a><br>
118: <P>
1.1.1.4 ! misho 119: <b>int pcre_jit_exec(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
! 120: <b>const char *<i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
! 121: <b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>,</b>
! 122: <b>pcre_jit_stack *<i>jstack</i>);</b>
! 123: </P>
! 124: <P>
1.1.1.2 misho 125: <b>pcre_jit_stack *pcre_jit_stack_alloc(int <i>startsize</i>, int <i>maxsize</i>);</b>
126: </P>
127: <P>
128: <b>void pcre_jit_stack_free(pcre_jit_stack *<i>stack</i>);</b>
129: </P>
130: <P>
131: <b>void pcre_assign_jit_stack(pcre_extra *<i>extra</i>,</b>
132: <b>pcre_jit_callback <i>callback</i>, void *<i>data</i>);</b>
133: </P>
1.1 misho 134: <P>
135: <b>const unsigned char *pcre_maketables(void);</b>
136: </P>
137: <P>
138: <b>int pcre_fullinfo(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
139: <b>int <i>what</i>, void *<i>where</i>);</b>
140: </P>
141: <P>
142: <b>int pcre_refcount(pcre *<i>code</i>, int <i>adjust</i>);</b>
143: </P>
144: <P>
145: <b>int pcre_config(int <i>what</i>, void *<i>where</i>);</b>
146: </P>
147: <P>
1.1.1.2 misho 148: <b>const char *pcre_version(void);</b>
1.1 misho 149: </P>
1.1.1.2 misho 150: <P>
151: <b>int pcre_pattern_to_host_byte_order(pcre *<i>code</i>,</b>
152: <b>pcre_extra *<i>extra</i>, const unsigned char *<i>tables</i>);</b>
153: </P>
154: <br><a name="SEC4" href="#TOC1">PCRE NATIVE API INDIRECTED FUNCTIONS</a><br>
1.1 misho 155: <P>
156: <b>void *(*pcre_malloc)(size_t);</b>
157: </P>
158: <P>
159: <b>void (*pcre_free)(void *);</b>
160: </P>
161: <P>
162: <b>void *(*pcre_stack_malloc)(size_t);</b>
163: </P>
164: <P>
165: <b>void (*pcre_stack_free)(void *);</b>
166: </P>
167: <P>
168: <b>int (*pcre_callout)(pcre_callout_block *);</b>
169: </P>
1.1.1.4 ! misho 170: <br><a name="SEC5" href="#TOC1">PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES</a><br>
1.1.1.2 misho 171: <P>
1.1.1.4 ! misho 172: As well as support for 8-bit character strings, PCRE also supports 16-bit
! 173: strings (from release 8.30) and 32-bit strings (from release 8.32), by means of
! 174: two additional libraries. They can be built as well as, or instead of, the
! 175: 8-bit library. To avoid too much complication, this document describes the
! 176: 8-bit versions of the functions, with only occasional references to the 16-bit
! 177: and 32-bit libraries.
! 178: </P>
! 179: <P>
! 180: The 16-bit and 32-bit functions operate in the same way as their 8-bit
! 181: counterparts; they just use different data types for their arguments and
! 182: results, and their names start with <b>pcre16_</b> or <b>pcre32_</b> instead of
! 183: <b>pcre_</b>. For every option that has UTF8 in its name (for example,
! 184: PCRE_UTF8), there are corresponding 16-bit and 32-bit names with UTF8 replaced
! 185: by UTF16 or UTF32, respectively. This facility is in fact just cosmetic; the
! 186: 16-bit and 32-bit option names define the same bit values.
1.1.1.2 misho 187: </P>
188: <P>
189: References to bytes and UTF-8 in this document should be read as references to
1.1.1.4 ! misho 190: 16-bit data units and UTF-16 when using the 16-bit library, or 32-bit data
! 191: units and UTF-32 when using the 32-bit library, unless specified otherwise.
! 192: More details of the specific differences for the 16-bit and 32-bit libraries
! 193: are given in the
1.1.1.2 misho 194: <a href="pcre16.html"><b>pcre16</b></a>
1.1.1.4 ! misho 195: and
! 196: <a href="pcre32.html"><b>pcre32</b></a>
! 197: pages.
1.1.1.2 misho 198: </P>
199: <br><a name="SEC6" href="#TOC1">PCRE API OVERVIEW</a><br>
1.1 misho 200: <P>
201: PCRE has its own native API, which is described in this document. There are
1.1.1.2 misho 202: also some wrapper functions (for the 8-bit library only) that correspond to the
203: POSIX regular expression API, but they do not give access to all the
204: functionality. They are described in the
1.1 misho 205: <a href="pcreposix.html"><b>pcreposix</b></a>
206: documentation. Both of these APIs define a set of C function calls. A C++
1.1.1.2 misho 207: wrapper (again for the 8-bit library only) is also distributed with PCRE. It is
208: documented in the
1.1 misho 209: <a href="pcrecpp.html"><b>pcrecpp</b></a>
210: page.
211: </P>
212: <P>
213: The native API C function prototypes are defined in the header file
1.1.1.2 misho 214: <b>pcre.h</b>, and on Unix-like systems the (8-bit) library itself is called
215: <b>libpcre</b>. It can normally be accessed by adding <b>-lpcre</b> to the
216: command for linking an application that uses PCRE. The header file defines the
217: macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release numbers
218: for the library. Applications can use these to include support for different
219: releases of PCRE.
1.1 misho 220: </P>
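<P>
As a minimal sketch of how PCRE_MAJOR and PCRE_MINOR might be used to support several releases, the fragment below tests the release number both at run time and in the preprocessor. The names <b>release_at_least()</b>, PCRE_AT_LEAST, and <b>widest_library_hint()</b> are hypothetical, and the stand-in values 8 and 32 exist only so that the fragment compiles without &lt;pcre.h&gt;; a real program would take both macros from the header.
</P>

```c
/* Stand-in values so this sketch compiles without pcre.h (an assumption of
   the example); a real program gets both macros from <pcre.h>. */
#ifndef PCRE_MAJOR
#define PCRE_MAJOR 8
#define PCRE_MINOR 32
#endif

/* Returns 1 if release major.minor is at least want_major.want_minor. */
int release_at_least(int major, int minor, int want_major, int want_minor)
{
    return major > want_major ||
           (major == want_major && minor >= want_minor);
}

/* Preprocessor form of the same comparison, usable in #if directives. */
#define PCRE_AT_LEAST(maj, min) \
    (PCRE_MAJOR > (maj) || (PCRE_MAJOR == (maj) && PCRE_MINOR >= (min)))

/* Reports which library widths the release could provide. Whether they
   were actually built is a separate, build-time question. */
const char *widest_library_hint(void)
{
#if PCRE_AT_LEAST(8, 32)
    return "8-bit, 16-bit, and 32-bit libraries are possible";
#elif PCRE_AT_LEAST(8, 30)
    return "8-bit and 16-bit libraries are possible";
#else
    return "8-bit library only";
#endif
}
```

<P>
The thresholds 8.30 and 8.32 are the releases named earlier in this document for 16-bit and 32-bit support.
</P>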
221: <P>
222: In a Windows environment, if you want to statically link an application program
223: against a non-dll <b>pcre.a</b> file, you must define PCRE_STATIC before
224: including <b>pcre.h</b> or <b>pcrecpp.h</b>, because otherwise the
225: <b>pcre_malloc()</b> and <b>pcre_free()</b> exported functions will be declared
226: <b>__declspec(dllimport)</b>, with unwanted results.
227: </P>
228: <P>
229: The functions <b>pcre_compile()</b>, <b>pcre_compile2()</b>, <b>pcre_study()</b>,
230: and <b>pcre_exec()</b> are used for compiling and matching regular expressions
231: in a Perl-compatible manner. A sample program that demonstrates the simplest
232: way of using them is provided in the file called <i>pcredemo.c</i> in the PCRE
233: source distribution. A listing of this program is given in the
234: <a href="pcredemo.html"><b>pcredemo</b></a>
235: documentation, and the
236: <a href="pcresample.html"><b>pcresample</b></a>
237: documentation describes how to compile and run it.
238: </P>
239: <P>
240: Just-in-time compiler support is an optional feature of PCRE that can be built
241: in appropriate hardware environments. It greatly speeds up the matching
242: performance of many patterns. Simple programs can easily request that it be
243: used if available, by setting an option that is ignored when it is not
244: relevant. More complicated programs might need to make use of the functions
245: <b>pcre_jit_stack_alloc()</b>, <b>pcre_jit_stack_free()</b>, and
246: <b>pcre_assign_jit_stack()</b> in order to control the JIT code's memory usage.
1.1.1.4 ! misho 247: </P>
! 248: <P>
! 249: From release 8.32 there is also a direct interface for JIT execution, which
! 250: gives improved performance. The JIT-specific functions are discussed in the
1.1 misho 251: <a href="pcrejit.html"><b>pcrejit</b></a>
252: documentation.
253: </P>
254: <P>
255: A second matching function, <b>pcre_dfa_exec()</b>, which is not
256: Perl-compatible, is also provided. This uses a different algorithm for the
257: matching. The alternative algorithm finds all possible matches (at a given
258: point in the subject), and scans the subject just once (unless there are
259: lookbehind assertions). However, this algorithm does not return captured
260: substrings. A description of the two matching algorithms and their advantages
261: and disadvantages is given in the
262: <a href="pcrematching.html"><b>pcrematching</b></a>
263: documentation.
264: </P>
265: <P>
266: In addition to the main compiling and matching functions, there are convenience
267: functions for extracting captured substrings from a subject string that is
268: matched by <b>pcre_exec()</b>. They are:
269: <pre>
270: <b>pcre_copy_substring()</b>
271: <b>pcre_copy_named_substring()</b>
272: <b>pcre_get_substring()</b>
273: <b>pcre_get_named_substring()</b>
274: <b>pcre_get_substring_list()</b>
275: <b>pcre_get_stringnumber()</b>
276: <b>pcre_get_stringtable_entries()</b>
277: </pre>
278: <b>pcre_free_substring()</b> and <b>pcre_free_substring_list()</b> are also
279: provided, to free the memory used for extracted strings.
280: </P>
281: <P>
282: The function <b>pcre_maketables()</b> is used to build a set of character tables
283: in the current locale for passing to <b>pcre_compile()</b>, <b>pcre_exec()</b>,
284: or <b>pcre_dfa_exec()</b>. This is an optional facility that is provided for
285: specialist use. Most commonly, no special tables are passed, in which case
286: internal tables that are generated when PCRE is built are used.
287: </P>
288: <P>
289: The function <b>pcre_fullinfo()</b> is used to find out information about a
1.1.1.2 misho 290: compiled pattern. The function <b>pcre_version()</b> returns a pointer to a
291: string containing the version of PCRE and its date of release.
1.1 misho 292: </P>
293: <P>
294: The function <b>pcre_refcount()</b> maintains a reference count in a data block
295: containing a compiled pattern. This is provided for the benefit of
296: object-oriented applications.
297: </P>
298: <P>
299: The global variables <b>pcre_malloc</b> and <b>pcre_free</b> initially contain
300: the entry points of the standard <b>malloc()</b> and <b>free()</b> functions,
301: respectively. PCRE calls the memory management functions via these variables,
302: so a calling program can replace them if it wishes to intercept the calls. This
303: should be done before calling any PCRE functions.
304: </P>
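<P>
A replacement pair for <b>pcre_malloc</b> and <b>pcre_free</b> might look like the following sketch, which counts calls so that leaks of PCRE-owned blocks can be spotted. The function and counter names are hypothetical, and the counters are not protected against concurrent access, so this form suits single-threaded use only. To keep the fragment free of dependencies, the assignment to the PCRE variables appears in a comment.
</P>

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping for this sketch; not part of PCRE. */
static unsigned long alloc_calls = 0;
static unsigned long free_calls = 0;

/* Same signature as the pcre_malloc variable: void *(*)(size_t). */
void *counting_malloc(size_t size)
{
    alloc_calls++;
    return malloc(size);
}

/* Same signature as the pcre_free variable: void (*)(void *). */
void counting_free(void *block)
{
    free_calls++;
    free(block);
}

/* Number of blocks obtained through the hook and not yet released. */
long outstanding_blocks(void)
{
    return (long)alloc_calls - (long)free_calls;
}

/* In a program that includes <pcre.h>, the hooks would be installed once,
   before any other PCRE function is called:

       pcre_malloc = counting_malloc;
       pcre_free   = counting_free;
*/
```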
305: <P>
306: The global variables <b>pcre_stack_malloc</b> and <b>pcre_stack_free</b> are also
307: indirections to memory management functions. These special functions are used
308: only when PCRE is compiled to use the heap for remembering data, instead of
309: recursive function calls, when running the <b>pcre_exec()</b> function. See the
310: <a href="pcrebuild.html"><b>pcrebuild</b></a>
311: documentation for details of how to do this. It is a non-standard way of
312: building PCRE, for use in environments that have limited stacks. Because of the
313: greater use of memory management, it runs more slowly. Separate functions are
314: provided so that special-purpose external code can be used for this case. When
315: used, these functions are always called in a stack-like manner (last obtained,
316: first freed), and always for memory blocks of the same size. There is a
317: discussion about PCRE's stack usage in the
318: <a href="pcrestack.html"><b>pcrestack</b></a>
319: documentation.
320: </P>
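<P>
Because the blocks are requested in last-obtained, first-freed order and all have the same size, a simple fixed-slot pool is one plausible special-purpose implementation. The sketch below is an illustration under assumptions, not code from PCRE: the names and the constants SLOT_BYTES and POOL_SLOTS are hypothetical, it is not thread-safe, and it falls back to <b>malloc()</b> when a request does not fit or the pool is exhausted.
</P>

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOT_BYTES 512   /* hypothetical upper bound on one block */
#define POOL_SLOTS 64    /* hypothetical depth served without malloc() */

/* The extra members only force a generous alignment for each slot. */
typedef union {
    long double   align_ld;
    long long     align_ll;
    void         *align_ptr;
    unsigned char bytes[SLOT_BYTES];
} pool_slot;

static pool_slot pool[POOL_SLOTS];
static size_t pool_top = 0;          /* slots [0, pool_top) are in use */

/* Returns 1 if block points into the static pool. */
static int in_pool(const void *block)
{
    uintptr_t p  = (uintptr_t)block;
    uintptr_t lo = (uintptr_t)&pool[0];
    uintptr_t hi = (uintptr_t)&pool[POOL_SLOTS];
    return p >= lo && p < hi;
}

/* Same signature as pcre_stack_malloc: void *(*)(size_t). */
void *lifo_malloc(size_t size)
{
    if (size <= SLOT_BYTES && pool_top < POOL_SLOTS)
        return pool[pool_top++].bytes;
    return malloc(size);             /* too large, or pool exhausted */
}

/* Same signature as pcre_stack_free: void (*)(void *). */
void lifo_free(void *block)
{
    if (block == NULL)
        return;
    if (in_pool(block))
        pool_top--;   /* stack-like order: this is always the top slot */
    else
        free(block);
}
```

<P>
The pool relies on the stack-like calling order described above. If blocks were freed out of order, the simple decrement of the top index would be wrong.
</P>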
321: <P>
322: The global variable <b>pcre_callout</b> initially contains NULL. It can be set
323: by the caller to a "callout" function, which PCRE will then call at specified
324: points during a matching operation. Details are given in the
325: <a href="pcrecallout.html"><b>pcrecallout</b></a>
326: documentation.
327: <a name="newlines"></a></P>
1.1.1.2 misho 328: <br><a name="SEC7" href="#TOC1">NEWLINES</a><br>
1.1 misho 329: <P>
330: PCRE supports five different conventions for indicating line breaks in
331: strings: a single CR (carriage return) character, a single LF (linefeed)
332: character, the two-character sequence CRLF, any of the three preceding, or any
333: Unicode newline sequence. The Unicode newline sequences are the three just
1.1.1.3 misho 334: mentioned, plus the single characters VT (vertical tab, U+000B), FF (form feed,
1.1 misho 335: U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
336: (paragraph separator, U+2029).
337: </P>
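<P>
The five conventions can be expressed as a small test on a sequence of code points. The following sketch is illustrative only and is not how PCRE itself is implemented; the enumeration and function names are hypothetical. It works on already-decoded code points, so it sidesteps the question of 8-bit, 16-bit, or 32-bit data units.
</P>

```c
#include <stddef.h>

/* Hypothetical names for the five conventions described above. */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY } nl_convention;

/* Returns the number of code points (0, 1, or 2) that form a newline
   starting at text[pos] under the given convention. 0 means that no
   newline starts at that position. */
size_t newline_length(const unsigned int *text, size_t len, size_t pos,
                      nl_convention conv)
{
    unsigned int c, next;

    if (pos >= len)
        return 0;
    c = text[pos];
    next = (pos + 1 < len) ? text[pos + 1] : 0;

    switch (conv) {
    case NL_CR:
        return c == 0x0D;
    case NL_LF:
        return c == 0x0A;
    case NL_CRLF:
        return (c == 0x0D && next == 0x0A) ? 2 : 0;
    case NL_ANYCRLF:
    case NL_ANY:
        if (c == 0x0D)
            return (next == 0x0A) ? 2 : 1;   /* prefer CRLF as a pair */
        if (c == 0x0A)
            return 1;
        if (conv == NL_ANYCRLF)
            return 0;
        /* Remaining Unicode newlines: VT, FF, NEL, LS, PS. */
        return c == 0x0B || c == 0x0C || c == 0x85 ||
               c == 0x2028 || c == 0x2029;
    }
    return 0;
}
```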
338: <P>
339: Each of the first three conventions is used by at least one operating system as
340: its standard newline sequence. When PCRE is built, a default can be specified.
341: If no build-time default is given, LF is used, which is the Unix standard. When PCRE is run, the
342: default can be overridden, either when a pattern is compiled, or when it is
343: matched.
344: </P>
345: <P>
346: At compile time, the newline convention can be specified by the <i>options</i>
347: argument of <b>pcre_compile()</b>, or it can be specified by special text at the
348: start of the pattern itself; this overrides any other settings. See the
349: <a href="pcrepattern.html"><b>pcrepattern</b></a>
350: page for details of the special character sequences.
351: </P>
352: <P>
353: In the PCRE documentation the word "newline" is used to mean "the character or
354: pair of characters that indicate a line break". The choice of newline
355: convention affects the handling of the dot, circumflex, and dollar
356: metacharacters, the handling of #-comments in /x mode, and, when CRLF is a
357: recognized line ending sequence, the match position advancement for a
358: non-anchored pattern. There is more detail about this in the
359: <a href="#execoptions">section on <b>pcre_exec()</b> options</a>
360: below.
361: </P>
362: <P>
363: The choice of newline convention does not affect the interpretation of
364: the \n or \r escape sequences, nor does it affect what \R matches, which is
365: controlled in a similar way, but by separate options.
366: </P>
1.1.1.2 misho 367: <br><a name="SEC8" href="#TOC1">MULTITHREADING</a><br>
1.1 misho 368: <P>
369: The PCRE functions can be used in multithreaded applications, with the
370: proviso that the memory management functions pointed to by <b>pcre_malloc</b>,
371: <b>pcre_free</b>, <b>pcre_stack_malloc</b>, and <b>pcre_stack_free</b>, and the
372: callout function pointed to by <b>pcre_callout</b>, are shared by all threads.
373: </P>
374: <P>
375: The compiled form of a regular expression is not altered during matching, so
376: the same compiled pattern can safely be used by several threads at once.
377: </P>
378: <P>
379: If the just-in-time optimization feature is being used, it needs separate
380: memory stack areas for each thread. See the
381: <a href="pcrejit.html"><b>pcrejit</b></a>
382: documentation for more details.
383: </P>
1.1.1.2 misho 384: <br><a name="SEC9" href="#TOC1">SAVING PRECOMPILED PATTERNS FOR LATER USE</a><br>
1.1 misho 385: <P>
386: The compiled form of a regular expression can be saved and re-used at a later
387: time, possibly by a different program, and even on a host other than the one on
388: which it was compiled. Details are given in the
389: <a href="pcreprecompile.html"><b>pcreprecompile</b></a>
1.1.1.2 misho 390: documentation, which includes a description of the
391: <b>pcre_pattern_to_host_byte_order()</b> function. However, compiling a regular
392: expression with one version of PCRE for use with a different version is not
393: guaranteed to work and may cause crashes.
1.1 misho 394: </P>
1.1.1.2 misho 395: <br><a name="SEC10" href="#TOC1">CHECKING BUILD-TIME OPTIONS</a><br>
1.1 misho 396: <P>
397: <b>int pcre_config(int <i>what</i>, void *<i>where</i>);</b>
398: </P>
399: <P>
400: The function <b>pcre_config()</b> makes it possible for a PCRE client to
401: discover which optional features have been compiled into the PCRE library. The
402: <a href="pcrebuild.html"><b>pcrebuild</b></a>
403: documentation has more details about these optional features.
404: </P>
405: <P>
406: The first argument for <b>pcre_config()</b> is an integer, specifying which
407: information is required; the second argument is a pointer to a variable into
1.1.1.2 misho 408: which the information is placed. The returned value is zero on success, or the
409: negative error code PCRE_ERROR_BADOPTION if the value in the first argument is
410: not recognized. The following information is available:
1.1 misho 411: <pre>
412: PCRE_CONFIG_UTF8
413: </pre>
414: The output is an integer that is set to one if UTF-8 support is available;
1.1.1.4 ! misho 415: otherwise it is set to zero. This value should normally be given to the 8-bit
! 416: version of this function, <b>pcre_config()</b>. If it is given to the 16-bit
! 417: or 32-bit version of this function, the result is PCRE_ERROR_BADOPTION.
1.1.1.2 misho 418: <pre>
419: PCRE_CONFIG_UTF16
420: </pre>
421: The output is an integer that is set to one if UTF-16 support is available;
422: otherwise it is set to zero. This value should normally be given to the 16-bit
423: version of this function, <b>pcre16_config()</b>. If it is given to the 8-bit
1.1.1.4 ! misho 424: or 32-bit version of this function, the result is PCRE_ERROR_BADOPTION.
! 425: <pre>
! 426: PCRE_CONFIG_UTF32
! 427: </pre>
! 428: The output is an integer that is set to one if UTF-32 support is available;
! 429: otherwise it is set to zero. This value should normally be given to the 32-bit
! 430: version of this function, <b>pcre32_config()</b>. If it is given to the 8-bit
! 431: or 16-bit version of this function, the result is PCRE_ERROR_BADOPTION.
1.1 misho 432: <pre>
433: PCRE_CONFIG_UNICODE_PROPERTIES
434: </pre>
435: The output is an integer that is set to one if support for Unicode character
436: properties is available; otherwise it is set to zero.
437: <pre>
438: PCRE_CONFIG_JIT
439: </pre>
440: The output is an integer that is set to one if support for just-in-time
441: compiling is available; otherwise it is set to zero.
442: <pre>
1.1.1.2 misho 443: PCRE_CONFIG_JITTARGET
444: </pre>
445: The output is a pointer to a zero-terminated "const char *" string. If JIT
446: support is available, the string contains the name of the architecture for
447: which the JIT compiler is configured, for example "x86 32bit (little endian +
448: unaligned)". If JIT support is not available, the result is NULL.
449: <pre>
1.1 misho 450: PCRE_CONFIG_NEWLINE
451: </pre>
452: The output is an integer whose value specifies the default character sequence
1.1.1.4 ! misho 453: that is recognized as meaning "newline". The values that are supported in
! 454: ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for
! 455: ANYCRLF, and -1 for ANY. In EBCDIC environments, CR, ANYCRLF, and ANY yield the
! 456: same values. However, the value for LF is normally 21, though some EBCDIC
! 457: environments use 37. The corresponding values for CRLF are 3349 and 3365. The
! 458: default should normally correspond to the standard sequence for your operating
! 459: system.
1.1 misho 460: <pre>
461: PCRE_CONFIG_BSR
462: </pre>
463: The output is an integer whose value indicates what character sequences the \R
464: escape sequence matches by default. A value of 0 means that \R matches any
465: Unicode line ending sequence; a value of 1 means that \R matches only CR, LF,
466: or CRLF. The default can be overridden when a pattern is compiled or matched.
467: <pre>
468: PCRE_CONFIG_LINK_SIZE
469: </pre>
470: The output is an integer that contains the number of bytes used for internal
1.1.1.2 misho 471: linkage in compiled regular expressions. For the 8-bit library, the value can
472: be 2, 3, or 4. For the 16-bit library, the value is either 2 or 4 and is still
1.1.1.4 ! misho 473: a number of bytes. For the 32-bit library, the value is either 2 or 4 and is
! 474: still a number of bytes. The default value of 2 is sufficient for all but the
! 475: most massive patterns, since it allows the compiled pattern to be up to 64K in
! 476: size. Larger values allow larger regular expressions to be compiled, at the
! 477: expense of slower matching.
1.1 misho 478: <pre>
479: PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
480: </pre>
481: The output is an integer that contains the threshold above which the POSIX
482: interface uses <b>malloc()</b> for output vectors. Further details are given in
483: the
484: <a href="pcreposix.html"><b>pcreposix</b></a>
485: documentation.
486: <pre>
487: PCRE_CONFIG_MATCH_LIMIT
488: </pre>
489: The output is a long integer that gives the default limit for the number of
490: internal matching function calls in a <b>pcre_exec()</b> execution. Further
491: details are given with <b>pcre_exec()</b> below.
492: <pre>
493: PCRE_CONFIG_MATCH_LIMIT_RECURSION
494: </pre>
495: The output is a long integer that gives the default limit for the depth of
496: recursion when calling the internal matching function in a <b>pcre_exec()</b>
497: execution. Further details are given with <b>pcre_exec()</b> below.
498: <pre>
499: PCRE_CONFIG_STACKRECURSE
500: </pre>
501: The output is an integer that is set to one if internal recursion when running
502: <b>pcre_exec()</b> is implemented by recursive function calls that use the stack
503: to remember their state. This is the usual way that PCRE is compiled. The
504: output is zero if PCRE was compiled to use blocks of data on the heap instead
505: of recursive function calls. In this case, <b>pcre_stack_malloc</b> and
506: <b>pcre_stack_free</b> are called to manage memory blocks on the heap, thus
507: avoiding the use of the stack.
508: </P>
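<P>
Two of the values above are easier to use once decoded. The sketch below interprets a PCRE_CONFIG_NEWLINE result for an ASCII/Unicode environment and shows the arithmetic behind the CRLF and link-size figures. The helper names are hypothetical. In a real program the inputs would come from calls such as <b>pcre_config(PCRE_CONFIG_NEWLINE, &amp;value)</b> with an <b>int</b> variable.
</P>

```c
/* Names a PCRE_CONFIG_NEWLINE value as listed for ASCII/Unicode
   environments. Anything else is reported as "unknown". */
const char *newline_config_name(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return "unknown";
    }
}

/* The documented CRLF values are consistent with CR * 256 + LF:
   13 * 256 + 10 = 3338, 13 * 256 + 21 = 3349, 13 * 256 + 37 = 3365. */
int crlf_config_value(int cr, int lf)
{
    return cr * 256 + lf;
}

/* Number of distinct values that a link of link_size bytes can hold.
   A size of 2 gives 65536, which matches the 64K figure quoted above.
   Returns 0 for sizes outside the range 1 to 4. */
unsigned long long link_capacity(int link_size)
{
    if (link_size < 1 || link_size > 4)
        return 0;
    return 1ULL << (8 * link_size);
}
```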
1.1.1.2 misho 509: <br><a name="SEC11" href="#TOC1">COMPILING A PATTERN</a><br>
1.1 misho 510: <P>
511: <b>pcre *pcre_compile(const char *<i>pattern</i>, int <i>options</i>,</b>
512: <b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
513: <b>const unsigned char *<i>tableptr</i>);</b>
514: <b>pcre *pcre_compile2(const char *<i>pattern</i>, int <i>options</i>,</b>
515: <b>int *<i>errorcodeptr</i>,</b>
516: <b>const char **<i>errptr</i>, int *<i>erroffset</i>,</b>
517: <b>const unsigned char *<i>tableptr</i>);</b>
518: </P>
519: <P>
520: Either of the functions <b>pcre_compile()</b> or <b>pcre_compile2()</b> can be
521: called to compile a pattern into an internal form. The only difference between
522: the two interfaces is that <b>pcre_compile2()</b> has an additional argument,
523: <i>errorcodeptr</i>, via which a numerical error code can be returned. To avoid
524: too much repetition, we refer just to <b>pcre_compile()</b> below, but the
525: information applies equally to <b>pcre_compile2()</b>.
526: </P>
527: <P>
528: The pattern is a C string terminated by a binary zero, and is passed in the
529: <i>pattern</i> argument. A pointer to a single block of memory that is obtained
530: via <b>pcre_malloc</b> is returned. This contains the compiled code and related
531: data. The <b>pcre</b> type is defined for the returned block; this is a typedef
532: for a structure whose contents are not externally defined. It is up to the
533: caller to free the memory (via <b>pcre_free</b>) when it is no longer required.
534: </P>
535: <P>
536: Although the compiled code of a PCRE regex is relocatable, that is, it does not
537: depend on memory location, the complete <b>pcre</b> data block is not
538: fully relocatable, because it may contain a copy of the <i>tableptr</i>
539: argument, which is an address (see below).
540: </P>
541: <P>
542: The <i>options</i> argument contains various bit settings that affect the
543: compilation. It should be zero if no options are required. The available
544: options are described below. Some of them (in particular, those that are
545: compatible with Perl, but some others as well) can also be set and unset from
546: within the pattern (see the detailed description in the
547: <a href="pcrepattern.html"><b>pcrepattern</b></a>
548: documentation). For those options that can be different in different parts of
549: the pattern, the contents of the <i>options</i> argument specifies their
550: settings at the start of compilation and execution. The PCRE_ANCHORED,
551: PCRE_BSR_<i>xxx</i>, PCRE_NEWLINE_<i>xxx</i>, PCRE_NO_UTF8_CHECK, and
1.1.1.3 misho 552: PCRE_NO_START_OPTIMIZE options can be set at the time of matching as well as at
1.1 misho 553: compile time.
554: </P>
555: <P>
556: If <i>errptr</i> is NULL, <b>pcre_compile()</b> returns NULL immediately.
557: Otherwise, if compilation of a pattern fails, <b>pcre_compile()</b> returns
558: NULL, and sets the variable pointed to by <i>errptr</i> to point to a textual
559: error message. This is a static string that is part of the library. You must
560: not try to free it. Normally, the offset from the start of the pattern to the
1.1.1.4 ! misho 561: data unit that was being processed when the error was discovered is placed in
! 562: the variable pointed to by <i>erroffset</i>, which must not be NULL (if it is,
! 563: an immediate error is given). However, for an invalid UTF-8 or UTF-16 string,
! 564: the offset is that of the first data unit of the failing character.
1.1 misho 565: </P>
566: <P>
1.1.1.2 misho 567: Some errors are not detected until the whole pattern has been scanned; in these
568: cases, the offset passed back is the length of the pattern. Note that the
1.1.1.4 ! misho 569: offset is in data units, not characters, even in a UTF mode. It may sometimes
! 570: point into the middle of a UTF-8 or UTF-16 character.
1.1 misho 571: </P>
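<P>
Since the offset counts data units, an application that wants to report a character position for a UTF-8 pattern has to convert it. One possible conversion for the 8-bit library is sketched below. The function name is hypothetical, and the sketch assumes that the pattern holds valid UTF-8 for at least <i>byte_offset</i> bytes.
</P>

```c
#include <stddef.h>

/* Counts the UTF-8 characters that begin before byte_offset in pattern.
   Continuation bytes have the bit form 10xxxxxx and never start a
   character, so every other byte marks the start of one. If byte_offset
   points into the middle of a character, that character is included. */
size_t chars_before_offset(const char *pattern, size_t byte_offset)
{
    size_t i, count = 0;

    for (i = 0; i < byte_offset; i++) {
        unsigned char b = (unsigned char)pattern[i];
        if ((b & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

<P>
For the pattern "a", U+00E9, "(" encoded as the four bytes 61 C3 A9 28, a byte offset of 3 corresponds to two characters.
</P>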
572: <P>
573: If <b>pcre_compile2()</b> is used instead of <b>pcre_compile()</b>, and the
574: <i>errorcodeptr</i> argument is not NULL, a non-zero error code number is
575: returned via this argument in the event of an error. This is in addition to the
576: textual error message. Error codes and messages are listed below.
577: </P>
578: <P>
579: If the final argument, <i>tableptr</i>, is NULL, PCRE uses a default set of
580: character tables that are built when PCRE is compiled, using the default C
581: locale. Otherwise, <i>tableptr</i> must be an address that is the result of a
582: call to <b>pcre_maketables()</b>. This value is stored with the compiled
583: pattern, and used again by <b>pcre_exec()</b>, unless another table pointer is
584: passed to it. For more discussion, see the section on locale support below.
585: </P>
586: <P>
587: This code fragment shows a typical straightforward call to <b>pcre_compile()</b>:
588: <pre>
  pcre *re;
  const char *error;
  int erroffset;
  re = pcre_compile(
    "^A.*Z",          /* the pattern */
    0,                /* default options */
    &error,           /* for error message */
    &erroffset,       /* for error offset */
    NULL);            /* use default character tables */
598: </pre>
599: The following names for option bits are defined in the <b>pcre.h</b> header
600: file:
601: <pre>
602: PCRE_ANCHORED
603: </pre>
604: If this bit is set, the pattern is forced to be "anchored", that is, it is
605: constrained to match only at the first matching point in the string that is
606: being searched (the "subject string"). This effect can also be achieved by
607: appropriate constructs in the pattern itself, which is the only way to do it in
608: Perl.
609: <pre>
610: PCRE_AUTO_CALLOUT
611: </pre>
612: If this bit is set, <b>pcre_compile()</b> automatically inserts callout items,
613: all with number 255, before each pattern item. For discussion of the callout
614: facility, see the
615: <a href="pcrecallout.html"><b>pcrecallout</b></a>
616: documentation.
617: <pre>
618: PCRE_BSR_ANYCRLF
619: PCRE_BSR_UNICODE
620: </pre>
621: These options (which are mutually exclusive) control what the \R escape
622: sequence matches. The choice is either to match only CR, LF, or CRLF, or to
623: match any Unicode newline sequence. The default is specified when PCRE is
624: built. It can be overridden from within the pattern, or by setting an option
625: when a compiled pattern is matched.
626: <pre>
627: PCRE_CASELESS
628: </pre>
629: If this bit is set, letters in the pattern match both upper and lower case
630: letters. It is equivalent to Perl's /i option, and it can be changed within a
631: pattern by a (?i) option setting. In UTF-8 mode, PCRE always understands the
632: concept of case for characters whose values are less than 128, so caseless
633: matching is always possible. For characters with higher values, the concept of
634: case is supported if PCRE is compiled with Unicode property support, but not
635: otherwise. If you want to use caseless matching for characters 128 and above,
636: you must ensure that PCRE is compiled with Unicode property support as well as
637: with UTF-8 support.
638: <pre>
639: PCRE_DOLLAR_ENDONLY
640: </pre>
641: If this bit is set, a dollar metacharacter in the pattern matches only at the
642: end of the subject string. Without this option, a dollar also matches
643: immediately before a newline at the end of the string (but not before any other
644: newlines). The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
645: There is no equivalent to this option in Perl, and no way to set it within a
646: pattern.
647: <pre>
648: PCRE_DOTALL
649: </pre>
650: If this bit is set, a dot metacharacter in the pattern matches a character of
651: any value, including one that indicates a newline. However, it only ever
652: matches one character, even if newlines are coded as CRLF. Without this option,
653: a dot does not match when the current position is at a newline. This option is
654: equivalent to Perl's /s option, and it can be changed within a pattern by a
655: (?s) option setting. A negative class such as [^a] always matches newline
656: characters, independent of the setting of this option.
657: <pre>
658: PCRE_DUPNAMES
659: </pre>
660: If this bit is set, names used to identify capturing subpatterns need not be
661: unique. This can be helpful for certain types of pattern when it is known that
662: only one instance of the named subpattern can ever be matched. There are more
663: details of named subpatterns below; see also the
664: <a href="pcrepattern.html"><b>pcrepattern</b></a>
665: documentation.
666: <pre>
667: PCRE_EXTENDED
668: </pre>
1.1.1.3 misho 669: If this bit is set, white space data characters in the pattern are totally
670: ignored except when escaped or inside a character class. White space does not
1.1 misho 671: include the VT character (code 11). In addition, characters between an
672: unescaped # outside a character class and the next newline, inclusive, are also
673: ignored. This is equivalent to Perl's /x option, and it can be changed within a
674: pattern by a (?x) option setting.
675: </P>
676: <P>
677: Which characters are interpreted as newlines is controlled by the options
678: passed to <b>pcre_compile()</b> or by a special sequence at the start of the
679: pattern, as described in the section entitled
680: <a href="pcrepattern.html#newlines">"Newline conventions"</a>
681: in the <b>pcrepattern</b> documentation. Note that the end of this type of
682: comment is a literal newline sequence in the pattern; escape sequences that
683: happen to represent a newline do not count.
684: </P>
685: <P>
686: This option makes it possible to include comments inside complicated patterns.
1.1.1.3 misho 687: Note, however, that this applies only to data characters. White space characters
1.1 misho 688: may never appear within special character sequences in a pattern, for example
689: within the sequence (?( that introduces a conditional subpattern.
690: <pre>
691: PCRE_EXTRA
692: </pre>
693: This option was invented in order to turn on additional functionality of PCRE
694: that is incompatible with Perl, but it is currently of very little use. When
695: set, any backslash in a pattern that is followed by a letter that has no
696: special meaning causes an error, thus reserving these combinations for future
697: expansion. By default, as in Perl, a backslash followed by a letter with no
698: special meaning is treated as a literal. (Perl can, however, be persuaded to
699: give an error for this, by running it with the -w option.) There are at present
700: no other features controlled by this option. It can also be set by a (?X)
701: option setting within a pattern.
702: <pre>
703: PCRE_FIRSTLINE
704: </pre>
705: If this option is set, an unanchored pattern is required to match before or at
706: the first newline in the subject string, though the matched text may continue
707: over the newline.
708: <pre>
709: PCRE_JAVASCRIPT_COMPAT
710: </pre>
711: If this option is set, PCRE's behaviour is changed in some ways so that it is
712: compatible with JavaScript rather than Perl. The changes are as follows:
713: </P>
714: <P>
715: (1) A lone closing square bracket in a pattern causes a compile-time error,
716: because this is illegal in JavaScript (by default it is treated as a data
717: character). Thus, the pattern AB]CD becomes illegal when this option is set.
718: </P>
719: <P>
720: (2) At run time, a back reference to an unset subpattern group matches an empty
721: string (by default this causes the current matching alternative to fail). A
722: pattern such as (\1)(a) succeeds when this option is set (assuming it can find
723: an "a" in the subject), whereas it fails by default, for Perl compatibility.
724: </P>
725: <P>
726: (3) \U matches an upper case "U" character; by default \U causes a compile
727: time error (Perl uses \U to upper case subsequent characters).
728: </P>
729: <P>
730: (4) \u matches a lower case "u" character unless it is followed by four
731: hexadecimal digits, in which case the hexadecimal number defines the code point
732: to match. By default, \u causes a compile time error (Perl uses it to upper
733: case the following character).
734: </P>
735: <P>
736: (5) \x matches a lower case "x" character unless it is followed by two
737: hexadecimal digits, in which case the hexadecimal number defines the code point
738: to match. By default, as in Perl, a hexadecimal number is always expected after
739: \x, but it may have zero, one, or two digits (so, for example, \xz matches a
740: binary zero character followed by z).
741: <pre>
742: PCRE_MULTILINE
743: </pre>
1.1.1.4 ! misho 744: By default, for the purposes of matching "start of line" and "end of line",
! 745: PCRE treats the subject string as consisting of a single line of characters,
! 746: even if it actually contains newlines. The "start of line" metacharacter (^)
! 747: matches only at the start of the string, and the "end of line" metacharacter
! 748: ($) matches only at the end of the string, or before a terminating newline
! 749: (except when PCRE_DOLLAR_ENDONLY is set). Note, however, that unless
! 750: PCRE_DOTALL is set, the "any character" metacharacter (.) does not match at a
! 751: newline. This behaviour (for ^, $, and dot) is the same as Perl.
1.1 misho 752: </P>
753: <P>
754: When PCRE_MULTILINE is set, the "start of line" and "end of line" constructs
755: match immediately following or immediately before internal newlines in the
756: subject string, respectively, as well as at the very start and end. This is
757: equivalent to Perl's /m option, and it can be changed within a pattern by a
758: (?m) option setting. If there are no newlines in a subject string, or no
759: occurrences of ^ or $ in a pattern, setting PCRE_MULTILINE has no effect.
760: <pre>
1.1.1.4 ! misho 761: PCRE_NEVER_UTF
! 762: </pre>
! 763: This option locks out interpretation of the pattern as UTF-8 (or UTF-16 or
! 764: UTF-32 in the 16-bit and 32-bit libraries). In particular, it prevents the
! 765: creator of the pattern from switching to UTF interpretation by starting the
! 766: pattern with (*UTF). This may be useful in applications that process patterns
! 767: from external sources. The combination of PCRE_UTF8 and PCRE_NEVER_UTF also
! 768: causes an error.
! 769: <pre>
1.1 misho 770: PCRE_NEWLINE_CR
771: PCRE_NEWLINE_LF
772: PCRE_NEWLINE_CRLF
773: PCRE_NEWLINE_ANYCRLF
774: PCRE_NEWLINE_ANY
775: </pre>
776: These options override the default newline definition that was chosen when PCRE
777: was built. Setting the first or the second specifies that a newline is
778: indicated by a single character (CR or LF, respectively). Setting
779: PCRE_NEWLINE_CRLF specifies that a newline is indicated by the two-character
780: CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies that any of the three
781: preceding sequences should be recognized. Setting PCRE_NEWLINE_ANY specifies
1.1.1.4 ! misho 782: that any Unicode newline sequence should be recognized.
! 783: </P>
! 784: <P>
! 785: In an ASCII/Unicode environment, the Unicode newline sequences are the three
! 786: just mentioned, plus the single characters VT (vertical tab, U+000B), FF (form
! 787: feed, U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
! 788: (paragraph separator, U+2029). For the 8-bit library, the last two are
! 789: recognized only in UTF-8 mode.
! 790: </P>
! 791: <P>
! 792: When PCRE is compiled to run in an EBCDIC (mainframe) environment, the code for
! 793: CR is 0x0d, the same as ASCII. However, the character code for LF is normally
! 794: 0x15, though in some EBCDIC environments 0x25 is used. Whichever of these is
! 795: not LF is made to correspond to Unicode's NEL character. EBCDIC codes are all
! 796: less than 256. For more details, see the
! 797: <a href="pcrebuild.html"><b>pcrebuild</b></a>
! 798: documentation.
1.1 misho 799: </P>
800: <P>
801: The newline setting in the options word uses three bits that are treated
802: as a number, giving eight possibilities. Currently only six are used (default
803: plus the five values above). This means that if you set more than one newline
804: option, the combination may or may not be sensible. For example,
805: PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to PCRE_NEWLINE_CRLF, but
806: other combinations may yield unused numbers and cause an error.
807: </P>
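<P>
The three-bit field can be pictured with the following sketch. The NL_* names
are illustrative stand-ins; their values are copied from the PCRE_NEWLINE_*
definitions as they are believed to appear in <b>pcre.h</b> for the 8.3x
releases, so check your own header before relying on them. The helper
<b>newline_field_in_use</b> is hypothetical and only mirrors the rule stated
above.
</P>

```c
/* Illustrative copies of the PCRE_NEWLINE_* bit values (assumed to match
   pcre.h in the 8.3x releases). They are renamed so that this sketch
   compiles on its own without the PCRE header. */
enum {
  NL_CR      = 0x00100000,
  NL_LF      = 0x00200000,
  NL_CRLF    = 0x00300000,
  NL_ANY     = 0x00400000,
  NL_ANYCRLF = 0x00500000,
  NL_MASK    = 0x00700000   /* the three bits that hold the setting */
};

/* Hypothetical check: returns 1 if the newline field of an options word
   holds one of the six values currently in use, 0 otherwise. */
static int newline_field_in_use(int options)
{
  switch (options & NL_MASK) {
    case 0:                 /* no newline option: build-time default */
    case NL_CR:
    case NL_LF:
    case NL_CRLF:
    case NL_ANY:
    case NL_ANYCRLF:
      return 1;
    default:                /* the two remaining bit patterns are unused */
      return 0;
  }
}
```

<P>
Because the field is a number rather than a set of independent flags, OR-ing
two options together simply produces another number: CR with LF gives the CRLF
value, whereas LF with ANY gives a value that is not in use.
</P>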
808: <P>
809: The only time that a line break in a pattern is specially recognized when
1.1.1.3 misho 810: compiling is when PCRE_EXTENDED is set. CR and LF are white space characters,
1.1 misho 811: and so are ignored in this mode. Also, an unescaped # outside a character class
812: indicates a comment that lasts until after the next line break sequence. In
813: other circumstances, line break sequences in patterns are treated as literal
814: data.
815: </P>
816: <P>
817: The newline option that is set at compile time becomes the default that is used
818: for <b>pcre_exec()</b> and <b>pcre_dfa_exec()</b>, but it can be overridden.
819: <pre>
820: PCRE_NO_AUTO_CAPTURE
821: </pre>
822: If this option is set, it disables the use of numbered capturing parentheses in
823: the pattern. Any opening parenthesis that is not followed by ? behaves as if it
824: were followed by ?: but named parentheses can still be used for capturing (and
825: they acquire numbers in the usual way). There is no equivalent of this option
826: in Perl.
827: <pre>
1.1.1.4 ! misho 828: PCRE_NO_START_OPTIMIZE
1.1 misho 829: </pre>
830: This is an option that acts at matching time; that is, it is really an option
831: for <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. If it is set at compile time,
1.1.1.4 ! misho 832: it is remembered with the compiled pattern and assumed at matching time. This
! 833: is necessary if you want to use JIT execution, because the JIT compiler needs
! 834: to know whether or not this option is set. For details see the discussion of
! 835: PCRE_NO_START_OPTIMIZE
1.1 misho 836: <a href="#execoptions">below.</a>
837: <pre>
838: PCRE_UCP
839: </pre>
840: This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
841: \w, and some of the POSIX character classes. By default, only ASCII characters
842: are recognized, but if PCRE_UCP is set, Unicode properties are used instead to
843: classify characters. More details are given in the section on
844: <a href="pcrepattern.html#genericchartypes">generic character types</a>
845: in the
846: <a href="pcrepattern.html"><b>pcrepattern</b></a>
847: page. If you set PCRE_UCP, matching one of the items it affects takes much
848: longer. The option is available only if PCRE has been compiled with Unicode
849: property support.
850: <pre>
851: PCRE_UNGREEDY
852: </pre>
853: This option inverts the "greediness" of the quantifiers so that they are not
854: greedy by default, but become greedy if followed by "?". It is not compatible
855: with Perl. It can also be set by a (?U) option setting within the pattern.
856: <pre>
857: PCRE_UTF8
858: </pre>
859: This option causes PCRE to regard both the pattern and the subject as strings
1.1.1.2 misho 860: of UTF-8 characters instead of single-byte strings. However, it is available
861: only when PCRE is built to include UTF support. If not, the use of this option
862: provokes an error. Details of how this option changes the behaviour of PCRE are
863: given in the
1.1 misho 864: <a href="pcreunicode.html"><b>pcreunicode</b></a>
865: page.
866: <pre>
867: PCRE_NO_UTF8_CHECK
868: </pre>
1.1.1.4 ! misho 869: When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
! 870: automatically checked. There is a discussion about the
1.1.1.2 misho 871: <a href="pcreunicode.html#utf8strings">validity of UTF-8 strings</a>
872: in the
873: <a href="pcreunicode.html"><b>pcreunicode</b></a>
874: page. If an invalid UTF-8 sequence is found, <b>pcre_compile()</b> returns an
875: error. If you already know that your pattern is valid, and you want to skip
876: this check for performance reasons, you can set the PCRE_NO_UTF8_CHECK option.
877: When it is set, the effect of passing an invalid UTF-8 string as a pattern is
878: undefined. It may cause your program to crash. Note that this option can also
879: be passed to <b>pcre_exec()</b> and <b>pcre_dfa_exec()</b>, to suppress the
1.1.1.4 ! misho 880: validity checking of subject strings only. If the same string is being matched
! 881: many times, the option can be safely set for the second and subsequent
! 882: matchings to improve performance.
1.1 misho 883: </P>
1.1.1.2 misho 884: <br><a name="SEC12" href="#TOC1">COMPILATION ERROR CODES</a><br>
1.1 misho 885: <P>
886: The following table lists the error codes that may be returned by
887: <b>pcre_compile2()</b>, along with the error messages that may be returned by
1.1.1.2 misho 888: both compiling functions. Note that error messages are always 8-bit ASCII
1.1.1.4 ! misho 889: strings, even in 16-bit or 32-bit mode. As PCRE has developed, some error codes
! 890: have fallen out of use. To avoid confusion, they have not been re-used.
1.1 misho 891: <pre>
892: 0 no error
893: 1 \ at end of pattern
894: 2 \c at end of pattern
895: 3 unrecognized character follows \
896: 4 numbers out of order in {} quantifier
897: 5 number too big in {} quantifier
898: 6 missing terminating ] for character class
899: 7 invalid escape sequence in character class
900: 8 range out of order in character class
901: 9 nothing to repeat
902: 10 [this code is not in use]
903: 11 internal error: unexpected repeat
904: 12 unrecognized character after (? or (?-
905: 13 POSIX named classes are supported only within a class
906: 14 missing )
907: 15 reference to non-existent subpattern
908: 16 erroffset passed as NULL
909: 17 unknown option bit(s) set
910: 18 missing ) after comment
911: 19 [this code is not in use]
912: 20 regular expression is too large
913: 21 failed to get memory
914: 22 unmatched parentheses
915: 23 internal error: code overflow
916: 24 unrecognized character after (?<
917: 25 lookbehind assertion is not fixed length
918: 26 malformed number or name after (?(
919: 27 conditional group contains more than two branches
920: 28 assertion expected after (?(
921: 29 (?R or (?[+-]digits must be followed by )
922: 30 unknown POSIX class name
923: 31 POSIX collating elements are not supported
1.1.1.2 misho 924: 32 this version of PCRE is compiled without UTF support
1.1 misho 925: 33 [this code is not in use]
926: 34 character value in \x{...} sequence is too large
927: 35 invalid condition (?(0)
928: 36 \C not allowed in lookbehind assertion
929: 37 PCRE does not support \L, \l, \N{name}, \U, or \u
930: 38 number after (?C is > 255
931: 39 closing ) for (?C expected
932: 40 recursive call could loop indefinitely
933: 41 unrecognized character after (?P
934: 42 syntax error in subpattern name (missing terminator)
935: 43 two named subpatterns have the same name
1.1.1.2 misho 936: 44 invalid UTF-8 string (specifically UTF-8)
1.1 misho 937: 45 support for \P, \p, and \X has not been compiled
938: 46 malformed \P or \p sequence
939: 47 unknown property name after \P or \p
940: 48 subpattern name is too long (maximum 32 characters)
941: 49 too many named subpatterns (maximum 10000)
942: 50 [this code is not in use]
1.1.1.2 misho 943: 51 octal value is greater than \377 in 8-bit non-UTF-8 mode
1.1 misho 944: 52 internal error: overran compiling workspace
945: 53 internal error: previously-checked referenced subpattern
946: not found
947: 54 DEFINE group contains more than one branch
948: 55 repeating a DEFINE group is not allowed
949: 56 inconsistent NEWLINE options
950: 57 \g is not followed by a braced, angle-bracketed, or quoted
951: name/number or by a plain number
952: 58 a numbered reference must not be zero
953: 59 an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
1.1.1.4 ! misho 954: 60 (*VERB) not recognized or malformed
1.1 misho 955: 61 number is too big
956: 62 subpattern name expected
957: 63 digit expected after (?+
958: 64 ] is an invalid data character in JavaScript compatibility mode
959: 65 different names for subpatterns of the same number are
960: not allowed
961: 66 (*MARK) must have an argument
1.1.1.2 misho 962: 67 this version of PCRE is not compiled with Unicode property
963: support
1.1 misho 964: 68 \c must be followed by an ASCII character
965: 69 \k is not followed by a braced, angle-bracketed, or quoted name
1.1.1.2 misho 966: 70 internal error: unknown opcode in find_fixedlength()
967: 71 \N is not supported in a class
968: 72 too many forward references
969: 73 disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
970: 74 invalid UTF-16 string (specifically UTF-16)
1.1.1.3 misho 971: 75 name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
972: 76 character value in \u.... sequence is too large
1.1.1.4 ! misho 973: 77 invalid UTF-32 string (specifically UTF-32)
1.1 misho 974: </pre>
975: The numbers 32 and 10000 in errors 48 and 49 are defaults; different values may
976: be used if the limits were changed when PCRE was built.
977: <a name="studyingapattern"></a></P>
1.1.1.2 misho 978: <br><a name="SEC13" href="#TOC1">STUDYING A PATTERN</a><br>
1.1 misho 979: <P>
980: <b>pcre_extra *pcre_study(const pcre *<i>code</i>, int <i>options</i>,</b>
981: <b>const char **<i>errptr</i>);</b>
982: </P>
983: <P>
984: If a compiled pattern is going to be used several times, it is worth spending
985: more time analyzing it in order to speed up the time taken for matching. The
986: function <b>pcre_study()</b> takes a pointer to a compiled pattern as its first
987: argument. If studying the pattern produces additional information that will
988: help speed up matching, <b>pcre_study()</b> returns a pointer to a
989: <b>pcre_extra</b> block, in which the <i>study_data</i> field points to the
990: results of the study.
991: </P>
992: <P>
993: The returned value from <b>pcre_study()</b> can be passed directly to
994: <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. However, a <b>pcre_extra</b> block
995: also contains other fields that can be set by the caller before the block is
996: passed; these are described
997: <a href="#extradata">below</a>
998: in the section on matching a pattern.
999: </P>
1000: <P>
1001: If studying the pattern does not produce any useful information,
1.1.1.4 ! misho 1002: <b>pcre_study()</b> returns NULL by default. In that circumstance, if the
! 1003: calling program wants to pass any of the other fields to <b>pcre_exec()</b> or
! 1004: <b>pcre_dfa_exec()</b>, it must set up its own <b>pcre_extra</b> block. However,
! 1005: if <b>pcre_study()</b> is called with the PCRE_STUDY_EXTRA_NEEDED option, it
! 1006: returns a <b>pcre_extra</b> block even if studying did not find any additional
! 1007: information. It may still return NULL, however, if an error occurs in
! 1008: <b>pcre_study()</b>.
1.1 misho 1009: </P>
1010: <P>
1.1.1.3 misho 1011: The second argument of <b>pcre_study()</b> contains option bits. There are three
1.1.1.4 ! misho 1012: further options in addition to PCRE_STUDY_EXTRA_NEEDED:
1.1.1.3 misho 1013: <pre>
1014: PCRE_STUDY_JIT_COMPILE
1015: PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
1016: PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
1017: </pre>
1018: If any of these are set, and the just-in-time compiler is available, the
1019: pattern is further compiled into machine code that executes much faster than
1020: the <b>pcre_exec()</b> interpretive matching function. If the just-in-time
1.1.1.4 ! misho 1021: compiler is not available, these options are ignored. All undefined bits in the
1.1.1.3 misho 1022: <i>options</i> argument must be zero.
1.1 misho 1023: </P>
1024: <P>
1025: JIT compilation is a heavyweight optimization. It can take some time for
1026: patterns to be analyzed, and for one-off matches and simple patterns the
1027: benefit of faster execution might be offset by a much slower study time.
1028: Not all patterns can be optimized by the JIT compiler. For those that cannot be
1029: handled, matching automatically falls back to the <b>pcre_exec()</b>
1030: interpreter. For more details, see the
1031: <a href="pcrejit.html"><b>pcrejit</b></a>
1032: documentation.
1033: </P>
1034: <P>
1035: The third argument for <b>pcre_study()</b> is a pointer for an error message. If
1036: studying succeeds (even if no data is returned), the variable it points to is
1037: set to NULL. Otherwise it is set to point to a textual error message. This is a
1038: static string that is part of the library. You must not try to free it. You
1039: should test the error pointer for NULL after calling <b>pcre_study()</b>, to be
1040: sure that it has run successfully.
1041: </P>
1042: <P>
1043: When you are finished with a pattern, you can free the memory used for the
1044: study data by calling <b>pcre_free_study()</b>. This function was added to the
1045: API for release 8.20. For earlier versions, the memory could be freed with
1046: <b>pcre_free()</b>, just like the pattern itself. This will still work in cases
1.1.1.3 misho 1047: where JIT optimization is not used, but it is advisable to change to the new
1048: function when convenient.
1.1 misho 1049: </P>
1050: <P>
1051: This is a typical way in which <b>pcre_study()</b> is used (except that in a
1052: real application there should be tests for errors):
1053: <pre>
1054:   int rc, erroroffset, ovector[30];
1055:   const char *error;
1056:   pcre *re; pcre_extra *sd;
1057:   re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
1058:   sd = pcre_study(
1059:     re,             /* result of pcre_compile() */
1060:     0,              /* no options */
1061:     &error);        /* set to NULL or points to a message */
1062:   rc = pcre_exec(   /* see below for details of pcre_exec() options */
1063:     re, sd, "subject", 7, 0, 0, ovector, 30);
1064:   ...
1065:   pcre_free_study(sd);
1066:   pcre_free(re);
1067: </pre>
1068: Studying a pattern does two things: first, a lower bound for the length of
1069: subject string that is needed to match the pattern is computed. This does not
1070: mean that there are any strings of that length that match, but it does
1.1.1.4 ! misho 1071: guarantee that no shorter strings match. The value is used to avoid wasting
! 1072: time by trying to match strings that are shorter than the lower bound. You can
! 1073: find out the value in a calling program via the <b>pcre_fullinfo()</b> function.
1.1 misho 1074: </P>
1075: <P>
1076: Studying a pattern is also useful for non-anchored patterns that do not have a
1077: single fixed starting character. A bitmap of possible starting bytes is
1078: created. This speeds up finding a position in the subject at which to start
1.1.1.4 ! misho 1079: matching. (In 16-bit mode, the bitmap is used for 16-bit values less than 256.
! 1080: In 32-bit mode, the bitmap is used for 32-bit values less than 256.)
1.1 misho 1081: </P>
1082: <P>
1083: These two optimizations apply to both <b>pcre_exec()</b> and
1.1.1.3 misho 1084: <b>pcre_dfa_exec()</b>, and the information is also used by the JIT compiler.
1.1.1.4 ! misho 1085: The optimizations can be disabled by setting the PCRE_NO_START_OPTIMIZE option.
! 1086: You might want to do this if your pattern contains callouts or (*MARK) and you
! 1087: want to make use of these facilities in cases where matching fails.
! 1088: </P>
! 1089: <P>
! 1090: PCRE_NO_START_OPTIMIZE can be specified at either compile time or execution
! 1091: time. However, if PCRE_NO_START_OPTIMIZE is passed to <b>pcre_exec()</b> (that
! 1092: is, after any JIT compilation has happened), JIT execution is disabled. For JIT
! 1093: execution to work with PCRE_NO_START_OPTIMIZE, the option must be set at
! 1094: compile time.
! 1095: </P>
! 1096: <P>
! 1097: There is a longer discussion of PCRE_NO_START_OPTIMIZE
1.1 misho 1098: <a href="#execoptions">below.</a>
1099: <a name="localesupport"></a></P>
1.1.1.2 misho 1100: <br><a name="SEC14" href="#TOC1">LOCALE SUPPORT</a><br>
1.1 misho 1101: <P>
1102: PCRE handles caseless matching, and determines whether characters are letters,
1103: digits, or whatever, by reference to a set of tables, indexed by character
1.1.1.2 misho 1104: value. When running in UTF-8 mode, this applies only to characters
1105: with codes less than 128. By default, higher-valued codes never match escapes
1106: such as \w or \d, but they can be tested with \p if PCRE is built with
1107: Unicode character property support. Alternatively, the PCRE_UCP option can be
1108: set at compile time; this causes \w and friends to use Unicode property
1109: support instead of built-in tables. The use of locales with Unicode is
1110: discouraged. If you are handling characters with codes of 128 or above, you
1111: should either use UTF-8 and Unicode, or use locales, but not try to mix the
1112: two.
1.1 misho 1113: </P>
1114: <P>
1115: PCRE contains an internal set of tables that are used when the final argument
1116: of <b>pcre_compile()</b> is NULL. These are sufficient for many applications.
1117: Normally, the internal tables recognize only ASCII characters. However, when
1118: PCRE is built, it is possible to cause the internal tables to be rebuilt in the
1119: default "C" locale of the local system, which may cause them to be different.
1120: </P>
1121: <P>
1122: The internal tables can always be overridden by tables supplied by the
1123: application that calls PCRE. These may be created in a different locale from
1124: the default. As more and more applications change to using Unicode, the need
1125: for this locale support is expected to die away.
1126: </P>
1127: <P>
1128: External tables are built by calling the <b>pcre_maketables()</b> function,
1129: which has no arguments, in the relevant locale. The result can then be passed
1130: to <b>pcre_compile()</b> or <b>pcre_exec()</b> as often as necessary. For
1131: example, to build and use tables that are appropriate for the French locale
1132: (where accented characters with values greater than 128 are treated as letters),
1133: the following code could be used:
1134: <pre>
1135: setlocale(LC_CTYPE, "fr_FR");
1136: tables = pcre_maketables();
1137: re = pcre_compile(..., tables);
1138: </pre>
1139: The locale name "fr_FR" is used on Linux and other Unix-like systems; if you
1140: are using Windows, the name for the French locale is "french".
1141: </P>
1142: <P>
1143: When <b>pcre_maketables()</b> runs, the tables are built in memory that is
1144: obtained via <b>pcre_malloc</b>. It is the caller's responsibility to ensure
1145: that the memory containing the tables remains available for as long as it is
1146: needed.
1147: </P>
1148: <P>
1149: The pointer that is passed to <b>pcre_compile()</b> is saved with the compiled
1150: pattern, and the same tables are used via this pointer by <b>pcre_study()</b>
1151: and normally also by <b>pcre_exec()</b>. Thus, by default, for any single
1152: pattern, compilation, studying and matching all happen in the same locale, but
1153: different patterns can be compiled in different locales.
1154: </P>
1155: <P>
1156: It is possible to pass a table pointer or NULL (indicating the use of the
1157: internal tables) to <b>pcre_exec()</b>. Although not intended for this purpose,
1158: this facility could be used to match a pattern in a different locale from the
1159: one in which it was compiled. Passing table pointers at run time is discussed
1160: below in the section on matching a pattern.
1161: <a name="infoaboutpattern"></a></P>
1.1.1.2 misho 1162: <br><a name="SEC15" href="#TOC1">INFORMATION ABOUT A PATTERN</a><br>
1.1 misho 1163: <P>
1164: <b>int pcre_fullinfo(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
1165: <b>int <i>what</i>, void *<i>where</i>);</b>
1166: </P>
1167: <P>
1168: The <b>pcre_fullinfo()</b> function returns information about a compiled
1.1.1.2 misho 1169: pattern. It replaces the <b>pcre_info()</b> function, which was removed from the
1170: library at version 8.30, after more than 10 years of obsolescence.
1.1 misho 1171: </P>
1172: <P>
1173: The first argument for <b>pcre_fullinfo()</b> is a pointer to the compiled
1174: pattern. The second argument is the result of <b>pcre_study()</b>, or NULL if
1175: the pattern was not studied. The third argument specifies which piece of
1176: information is required, and the fourth argument is a pointer to a variable
1177: to receive the data. The yield of the function is zero for success, or one of
1178: the following negative numbers:
1179: <pre>
1.1.1.2 misho 1180:   PCRE_ERROR_NULL           the argument <i>code</i> was NULL
1181:                             the argument <i>where</i> was NULL
1182:   PCRE_ERROR_BADMAGIC       the "magic number" was not found
1183:   PCRE_ERROR_BADENDIANNESS  the pattern was compiled with different
1184:                             endianness
1185:   PCRE_ERROR_BADOPTION      the value of <i>what</i> was invalid
1.1.1.4 ! misho 1186:   PCRE_ERROR_UNSET          the requested field is not set
1.1 misho 1187: </pre>
1188: The "magic number" is placed at the start of each compiled pattern as a simple
1.1.1.2 misho 1189: check against passing an arbitrary memory pointer. The endianness error can
1190: occur if a compiled pattern is saved and reloaded on a different host. Here is
1191: a typical call of <b>pcre_fullinfo()</b>, to obtain the length of the compiled
1192: pattern:
1.1 misho 1193: <pre>
1194:   int rc;
1195:   size_t length;
1196:   rc = pcre_fullinfo(
1197:     re,               /* result of pcre_compile() */
1198:     sd,               /* result of pcre_study(), or NULL */
1199:     PCRE_INFO_SIZE,   /* what is required */
1200:     &length);         /* where to put the data */
1201: </pre>
1202: The possible values for the third argument are defined in <b>pcre.h</b>, and are
1203: as follows:
1204: <pre>
1205: PCRE_INFO_BACKREFMAX
1206: </pre>
1207: Return the number of the highest back reference in the pattern. The fourth
1208: argument should point to an <b>int</b> variable. Zero is returned if there are
1209: no back references.
1210: <pre>
1211: PCRE_INFO_CAPTURECOUNT
1212: </pre>
1213: Return the number of capturing subpatterns in the pattern. The fourth argument
1214: should point to an <b>int</b> variable.
1215: <pre>
1216: PCRE_INFO_DEFAULT_TABLES
1217: </pre>
1218: Return a pointer to the internal default character tables within PCRE. The
1219: fourth argument should point to an <b>unsigned char *</b> variable. This
1220: information call is provided for internal use by the <b>pcre_study()</b>
1221: function. External callers can cause PCRE to use its internal tables by passing
1222: a NULL table pointer.
1223: <pre>
1224: PCRE_INFO_FIRSTBYTE
1225: </pre>
1.1.1.2 misho 1226: Return information about the first data unit of any matched string, for a
1227: non-anchored pattern. (The name of this option refers to the 8-bit library,
1228: where data units are bytes.) The fourth argument should point to an <b>int</b>
1229: variable.
1230: </P>
1231: <P>
1232: If there is a fixed first value, for example, the letter "c" from a pattern
1233: such as (cat|cow|coyote), its value is returned. In the 8-bit library, the
1.1.1.4 ! misho 1234: value is always less than 256. In the 16-bit library the value can be up to
! 1235: 0xffff. In the 32-bit library the value can be up to 0x10ffff.
1.1 misho 1236: </P>
1237: <P>
1.1.1.2 misho 1238: If there is no fixed first value, and if either
1.1 misho 1239: <br>
1240: <br>
1241: (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
1242: starts with "^", or
1243: <br>
1244: <br>
1245: (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
1246: (if it were set, the pattern would be anchored),
1247: <br>
1248: <br>
1249: -1 is returned, indicating that the pattern matches only at the start of a
1250: subject string or after any newline within the string. Otherwise -2 is
1251: returned. For anchored patterns, -2 is returned.
1.1.1.4 ! misho 1252: </P>
! 1253: <P>
! 1254: In the 32-bit library in non-UTF-32 mode, this request cannot return the full
! 1255: 32-bit range of character values, so it is deprecated;
! 1256: the PCRE_INFO_FIRSTCHARACTERFLAGS and PCRE_INFO_FIRSTCHARACTER values
! 1257: should be used instead.
1.1 misho 1258: <pre>
1259: PCRE_INFO_FIRSTTABLE
1260: </pre>
1261: If the pattern was studied, and this resulted in the construction of a 256-bit
1.1.1.2 misho 1262: table indicating a fixed set of values for the first data unit in any matching
1.1 misho 1263: string, a pointer to the table is returned. Otherwise NULL is returned. The
1264: fourth argument should point to an <b>unsigned char *</b> variable.
1265: <pre>
1266: PCRE_INFO_HASCRORLF
1267: </pre>
1268: Return 1 if the pattern contains any explicit matches for CR or LF characters,
1269: otherwise 0. The fourth argument should point to an <b>int</b> variable. An
1270: explicit match is either a literal CR or LF character, or \r or \n.
1271: <pre>
1272: PCRE_INFO_JCHANGED
1273: </pre>
1274: Return 1 if the (?J) or (?-J) option setting is used in the pattern, otherwise
1275: 0. The fourth argument should point to an <b>int</b> variable. (?J) and
1276: (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
1277: <pre>
1278: PCRE_INFO_JIT
1279: </pre>
1.1.1.3 misho 1280: Return 1 if the pattern was studied with one of the JIT options, and
1.1 misho 1281: just-in-time compiling was successful. The fourth argument should point to an
1282: <b>int</b> variable. A return value of 0 means that JIT support is not available
1.1.1.3 misho 1283: in this version of PCRE, or that the pattern was not studied with a JIT option,
1284: or that the JIT compiler could not handle this particular pattern. See the
1.1 misho 1285: <a href="pcrejit.html"><b>pcrejit</b></a>
1286: documentation for details of what can and cannot be handled.
1287: <pre>
1288: PCRE_INFO_JITSIZE
1289: </pre>
1.1.1.3 misho 1290: If the pattern was successfully studied with a JIT option, return the size of
1291: the JIT compiled code, otherwise return zero. The fourth argument should point
1292: to a <b>size_t</b> variable.
1.1 misho 1293: <pre>
1294: PCRE_INFO_LASTLITERAL
1295: </pre>
1.1.1.2 misho 1296: Return the value of the rightmost literal data unit that must exist in any
1297: matched string, other than at its start, if such a value has been recorded. The
1298: fourth argument should point to an <b>int</b> variable. If there is no such
1299: value, -1 is returned. For anchored patterns, a last literal value is recorded
1300: only if it follows something of variable length. For example, for the pattern
1.1 misho 1301: /^a\d+z\d+/ the returned value is "z", but for /^a\dz\d/ the returned value
1302: is -1.
1.1.1.4 ! misho 1303: </P>
! 1304: <P>
! 1305: In the 32-bit library in non-UTF-32 mode, this request cannot return the full
! 1306: 32-bit range of character values, so it is deprecated;
! 1307: the PCRE_INFO_REQUIREDCHARFLAGS and PCRE_INFO_REQUIREDCHAR values should
! 1308: be used instead.
! 1309: <pre>
! 1310: PCRE_INFO_MATCHLIMIT
! 1311: </pre>
! 1312: If the pattern set a match limit by including an item of the form
! 1313: (*LIMIT_MATCH=nnnn) at the start, the value is returned. The fourth argument
! 1314: should point to an unsigned 32-bit integer. If no such value has been set, the
! 1315: call to <b>pcre_fullinfo()</b> returns the error PCRE_ERROR_UNSET.
1.1 misho 1316: <pre>
1.1.1.3 misho 1317: PCRE_INFO_MAXLOOKBEHIND
1318: </pre>
1.1.1.4 ! misho 1319: Return the number of characters (NB not data units) in the longest lookbehind
! 1320: assertion in the pattern. This information is useful when doing multi-segment
! 1321: matching using the partial matching facilities. Note that the simple assertions
! 1322: \b and \B require a one-character lookbehind. \A also registers a
! 1323: one-character lookbehind, though it does not actually inspect the previous
! 1324: character. This is to ensure that at least one character from the old segment
! 1325: is retained when a new segment is processed. Otherwise, if there are no
! 1326: lookbehinds in the pattern, \A might match incorrectly at the start of a new
! 1327: segment.
1.1.1.3 misho 1328: <pre>
1.1 misho 1329: PCRE_INFO_MINLENGTH
1330: </pre>
1331: If the pattern was studied and a minimum length for matching subject strings
1332: was computed, its value is returned. Otherwise the returned value is -1. The
1.1.1.4 ! misho 1333: value is a number of characters, which in UTF mode may be different from the
! 1334: number of data units. The fourth argument should point to an <b>int</b>
! 1335: variable. A non-negative value is a lower bound to the length of any matching
! 1336: string. There may not be any strings of that length that do actually match, but
! 1337: every string that does match is at least that long.
1.1 misho 1338: <pre>
1339: PCRE_INFO_NAMECOUNT
1340: PCRE_INFO_NAMEENTRYSIZE
1341: PCRE_INFO_NAMETABLE
1342: </pre>
1343: PCRE supports the use of named as well as numbered capturing parentheses. The
1344: names are just an additional way of identifying the parentheses, which still
1345: acquire numbers. Several convenience functions such as
1346: <b>pcre_get_named_substring()</b> are provided for extracting captured
1347: substrings by name. It is also possible to extract the data directly, by first
1348: converting the name to a number in order to access the correct pointers in the
1349: output vector (described with <b>pcre_exec()</b> below). To do the conversion,
1350: you need to use the name-to-number map, which is described by these three
1351: values.
1352: </P>
1353: <P>
1354: The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT gives
1355: the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size of each
1356: entry; both of these return an <b>int</b> value. The entry size depends on the
1357: length of the longest name. PCRE_INFO_NAMETABLE returns a pointer to the first
1.1.1.2 misho 1358: entry of the table. This is a pointer to <b>char</b> in the 8-bit library, where
1359: the first two bytes of each entry are the number of the capturing parenthesis,
1360: most significant byte first. In the 16-bit library, the pointer points to
1.1.1.4 ! misho 1361: 16-bit data units, the first of which contains the parenthesis number. In the
! 1362: 32-bit library, the pointer points to 32-bit data units, the first of which
! 1363: contains the parenthesis number. The rest of the entry is the corresponding
! 1364: name, zero terminated.
1.1 misho 1365: </P>
1366: <P>
1367: The names are in alphabetical order. Duplicate names may appear if (?| is used
1368: to create multiple groups with the same number, as described in the
1369: <a href="pcrepattern.html#dupsubpatternnumber">section on duplicate subpattern numbers</a>
1370: in the
1371: <a href="pcrepattern.html"><b>pcrepattern</b></a>
1372: page. Duplicate names for subpatterns with different numbers are permitted only
1373: if PCRE_DUPNAMES is set. In all cases of duplicate names, they appear in the
1374: table in the order in which they were found in the pattern. In the absence of
1375: (?| this is the order of increasing number; when (?| is used this is not
1376: necessarily the case because later subpatterns may have lower numbers.
1377: </P>
1378: <P>
1379: As a simple example of the name/number table, consider the following pattern
1.1.1.2 misho 1380: after compilation by the 8-bit library (assume PCRE_EXTENDED is set, so white
1381: space - including newlines - is ignored):
1.1 misho 1382: <pre>
1383: (?<date> (?<year>(\d\d)?\d\d) - (?<month>\d\d) - (?<day>\d\d) )
1384: </pre>
1385: There are four named subpatterns, so the table has four entries, and each entry
1386: in the table is eight bytes long. The table is as follows, with non-printing
1387: bytes shown in hexadecimal, and undefined bytes shown as ??:
1388: <pre>
1389: 00 01 d a t e 00 ??
1390: 00 05 d a y 00 ?? ??
1391: 00 04 m o n t h 00
1392: 00 02 y e a r 00 ??
1393: </pre>
1394: When writing code to extract data from named subpatterns using the
1395: name-to-number map, remember that the length of the entries is likely to be
1396: different for each compiled pattern.
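The table layout described above can be walked with a few lines of C. The following is a minimal sketch for the 8-bit library only. The helper name <b>lookup_group_number()</b> and the array <b>example_table</b> are hypothetical, not part of PCRE; in a real program the table pointer, entry count, and entry size would come from <b>pcre_fullinfo()</b> with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and PCRE_INFO_NAMEENTRYSIZE. Here the four-entry example table is hard-coded, with the undefined padding bytes written as zero.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: look up a group name in an 8-bit name table.
   Each entry is entry_size bytes long: a two-byte group number (most
   significant byte first) followed by the zero-terminated name.
   Returns the group number, or -1 if the name is not in the table. */
static int lookup_group_number(const unsigned char *table, int count,
                               int entry_size, const char *name)
{
    assert(entry_size > 2);  /* two number bytes plus at least a terminator */
    for (int i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The example table shown above: four entries of eight bytes each.
   Bytes shown as ?? in the text are arbitrarily set to zero here. */
static const unsigned char example_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00
};
```

Because the entry size is passed in rather than assumed, the same loop works for any compiled pattern. A linear scan is used for simplicity; it does not rely on the alphabetical ordering of the names.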
1397: <pre>
1398: PCRE_INFO_OKPARTIAL
1399: </pre>
1400: Return 1 if the pattern can be used for partial matching with
1401: <b>pcre_exec()</b>, otherwise 0. The fourth argument should point to an
1402: <b>int</b> variable. From release 8.00, this always returns 1, because the
1403: restrictions that previously applied to partial matching have been lifted. The
1404: <a href="pcrepartial.html"><b>pcrepartial</b></a>
1405: documentation gives details of partial matching.
1406: <pre>
1407: PCRE_INFO_OPTIONS
1408: </pre>
1409: Return a copy of the options with which the pattern was compiled. The fourth
1410: argument should point to an <b>unsigned long int</b> variable. These option bits
1411: are those specified in the call to <b>pcre_compile()</b>, modified by any
1412: top-level option settings at the start of the pattern itself. In other words,
1413: they are the options that will be in force when matching starts. For example,
1414: if the pattern /(?im)abc(?-i)d/ is compiled with the PCRE_EXTENDED option, the
1415: result is PCRE_CASELESS, PCRE_MULTILINE, and PCRE_EXTENDED.
1416: </P>
1417: <P>
1418: A pattern is automatically anchored by PCRE if all of its top-level
1419: alternatives begin with one of the following:
1420: <pre>
1421: ^ unless PCRE_MULTILINE is set
1422: \A always
1423: \G always
1424: .* if PCRE_DOTALL is set and there are no back references to the subpattern in which .* appears
1425: </pre>
1426: For such patterns, the PCRE_ANCHORED bit is set in the options returned by
1427: <b>pcre_fullinfo()</b>.
1428: <pre>
1.1.1.4 ! misho 1429: PCRE_INFO_RECURSIONLIMIT
! 1430: </pre>
! 1431: If the pattern set a recursion limit by including an item of the form
! 1432: (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The fourth
! 1433: argument should point to an unsigned 32-bit integer. If no such value has been
! 1434: set, the call to <b>pcre_fullinfo()</b> returns the error PCRE_ERROR_UNSET.
! 1435: <pre>
1.1 misho 1436: PCRE_INFO_SIZE
1437: </pre>
1.1.1.4 ! misho 1438: Return the size of the compiled pattern in bytes (for all three libraries). The
1.1.1.2 misho 1439: fourth argument should point to a <b>size_t</b> variable. This value does not
1440: include the size of the <b>pcre</b> structure that is returned by
1441: <b>pcre_compile()</b>. The value that is passed as the argument to
1442: <b>pcre_malloc()</b> when <b>pcre_compile()</b> is getting memory in which to
1443: place the compiled data is the value returned by this option plus the size of
1444: the <b>pcre</b> structure. Studying a compiled pattern, with or without JIT,
1445: does not alter the value returned by this option.
1.1 misho 1446: <pre>
1447: PCRE_INFO_STUDYSIZE
1448: </pre>
1.1.1.4 ! misho 1449: Return the size in bytes (for all three libraries) of the data block pointed to
! 1450: by the <i>study_data</i> field in a <b>pcre_extra</b> block. If <b>pcre_extra</b>
! 1451: is NULL, or there is no study data, zero is returned. The fourth argument
! 1452: should point to a <b>size_t</b> variable. The <i>study_data</i> field is set by
! 1453: <b>pcre_study()</b> to record information that will speed up matching (see the
! 1454: section entitled
1.1 misho 1455: <a href="#studyingapattern">"Studying a pattern"</a>
1456: above). The format of the <i>study_data</i> block is private, but its length
1457: is made available via this option so that it can be saved and restored (see the
1458: <a href="pcreprecompile.html"><b>pcreprecompile</b></a>
1459: documentation for details).
1.1.1.4 ! misho 1460: <pre>
! 1461: PCRE_INFO_FIRSTCHARACTERFLAGS
! 1462: </pre>
! 1463: Return information about the first data unit of any matched string, for a
! 1464: non-anchored pattern. The fourth argument should point to an <b>int</b>
! 1465: variable.
! 1466: </P>
! 1467: <P>
! 1468: If there is a fixed first value, for example, the letter "c" from a pattern
! 1469: such as (cat|cow|coyote), 1 is returned, and the character value can be
! 1470: retrieved using PCRE_INFO_FIRSTCHARACTER.
! 1471: </P>
! 1472: <P>
! 1473: If there is no fixed first value, and if either
! 1474: <br>
! 1475: <br>
! 1476: (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
! 1477: starts with "^", or
! 1478: <br>
! 1479: <br>
! 1480: (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
! 1481: (if it were set, the pattern would be anchored),
! 1482: <br>
! 1483: <br>
! 1484: 2 is returned, indicating that the pattern matches only at the start of a
! 1485: subject string or after any newline within the string. Otherwise 0 is
! 1486: returned. For anchored patterns, 0 is returned.
! 1487: <pre>
! 1488: PCRE_INFO_FIRSTCHARACTER
! 1489: </pre>
! 1490: Return the fixed first character value, if PCRE_INFO_FIRSTCHARACTERFLAGS
! 1491: returned 1; otherwise return 0. The fourth argument should point to a
! 1492: <b>uint32_t</b> variable.
! 1493: </P>
! 1494: <P>
! 1495: In the 8-bit library, the value is always less than 256. In the 16-bit library
! 1496: the value can be up to 0xffff. In the 32-bit library in UTF-32 mode the value
! 1497: can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32 mode.
! 1498: </P>
! 1499: <P>
! 1500: If there is no fixed first value, and if either
! 1501: <br>
! 1502: <br>
! 1503: (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
! 1504: starts with "^", or
! 1505: <br>
! 1506: <br>
! 1507: (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
! 1508: (if it were set, the pattern would be anchored),
! 1509: <br>
! 1510: <br>
! 1511: -1 is returned, indicating that the pattern matches only at the start of a
! 1512: subject string or after any newline within the string. Otherwise -2 is
! 1513: returned. For anchored patterns, -2 is returned.
! 1514: <pre>
! 1515: PCRE_INFO_REQUIREDCHARFLAGS
! 1516: </pre>
! 1517: Returns 1 if there is a rightmost literal data unit that must exist in any
! 1518: matched string, other than at its start. The fourth argument should point to
! 1519: an <b>int</b> variable. If there is no such value, 0 is returned. If returning
! 1520: 1, the character value itself can be retrieved using PCRE_INFO_REQUIREDCHAR.
! 1521: </P>
! 1522: <P>
! 1523: For anchored patterns, a last literal value is recorded only if it follows
! 1524: something of variable length. For example, for the pattern /^a\d+z\d+/ the
! 1525: returned value is 1 (with "z" returned from PCRE_INFO_REQUIREDCHAR), but for
! 1526: /^a\dz\d/ the returned value is 0.
! 1527: <pre>
! 1528: PCRE_INFO_REQUIREDCHAR
! 1529: </pre>
! 1530: Return the value of the rightmost literal data unit that must exist in any
! 1531: matched string, other than at its start, if such a value has been recorded. The
! 1532: fourth argument should point to a <b>uint32_t</b> variable. If there is no such
! 1533: value, 0 is returned.
1.1 misho 1534: </P>
1.1.1.2 misho 1535: <br><a name="SEC16" href="#TOC1">REFERENCE COUNTS</a><br>
1.1 misho 1536: <P>
1537: <b>int pcre_refcount(pcre *<i>code</i>, int <i>adjust</i>);</b>
1538: </P>
1539: <P>
1540: The <b>pcre_refcount()</b> function is used to maintain a reference count in the
1541: data block that contains a compiled pattern. It is provided for the benefit of
1542: applications that operate in an object-oriented manner, where different parts
1543: of the application may be using the same compiled pattern, but you want to free
1544: the block when they are all done.
1545: </P>
1546: <P>
1547: When a pattern is compiled, the reference count field is initialized to zero.
1548: It is changed only by calling this function, whose action is to add the
1549: <i>adjust</i> value (which may be positive or negative) to it. The yield of the
1550: function is the new value. However, the value of the count is constrained to
1551: lie between 0 and 65535, inclusive. If the new value is outside these limits,
1552: it is forced to the appropriate limit value.
1553: </P>
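<P>
The clamping arithmetic described above can be modelled in a few lines. This is only a sketch of the documented behaviour, not PCRE's own code; the function name <b>adjusted_refcount()</b> is hypothetical, and it takes the current count as an argument instead of reading it from a compiled pattern.

```c
#include <assert.h>

/* Hypothetical model of the documented rule: add adjust to the current
   count and force the result into the range 0..65535 inclusive. */
static int adjusted_refcount(int current, int adjust)
{
    assert(current >= 0 && current <= 65535);
    long long value = (long long)current + adjust;  /* avoid int overflow */
    if (value < 0) value = 0;
    if (value > 65535) value = 65535;
    return (int)value;
}
```

One consequence of the clamping is that decrementing a count that is already zero leaves it at zero, so an application cannot use a negative result to detect an unbalanced release.
</P>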
1554: <P>
1555: Except when it is zero, the reference count is not correctly preserved if a
1556: pattern is compiled on one host and then transferred to a host whose byte-order
1557: is different. (This seems a highly unlikely scenario.)
1558: </P>
1.1.1.2 misho 1559: <br><a name="SEC17" href="#TOC1">MATCHING A PATTERN: THE TRADITIONAL FUNCTION</a><br>
1.1 misho 1560: <P>
1561: <b>int pcre_exec(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
1562: <b>const char *<i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
1563: <b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>);</b>
1564: </P>
1565: <P>
1566: The function <b>pcre_exec()</b> is called to match a subject string against a
1567: compiled pattern, which is passed in the <i>code</i> argument. If the
1568: pattern was studied, the result of the study should be passed in the
1569: <i>extra</i> argument. You can call <b>pcre_exec()</b> with the same <i>code</i>
1570: and <i>extra</i> arguments as many times as you like, in order to match
1571: different subject strings with the same pattern.
1572: </P>
1573: <P>
1574: This function is the main matching facility of the library, and it operates in
1575: a Perl-like manner. For specialist use there is also an alternative matching
1576: function, which is described
1577: <a href="#dfamatch">below</a>
1578: in the section about the <b>pcre_dfa_exec()</b> function.
1579: </P>
1580: <P>
1581: In most applications, the pattern will have been compiled (and optionally
1582: studied) in the same process that calls <b>pcre_exec()</b>. However, it is
1583: possible to save compiled patterns and study data, and then use them later
1584: in different processes, possibly even on different hosts. For a discussion
1585: about this, see the
1586: <a href="pcreprecompile.html"><b>pcreprecompile</b></a>
1587: documentation.
1588: </P>
1589: <P>
1590: Here is an example of a simple call to <b>pcre_exec()</b>:
1591: <pre>
1592: int rc;
1593: int ovector[30];
1594: rc = pcre_exec(
1595: re, /* result of pcre_compile() */
1596: NULL, /* we didn't study the pattern */
1597: "some string", /* the subject string */
1598: 11, /* the length of the subject string */
1599: 0, /* start at offset 0 in the subject */
1600: 0, /* default options */
1601: ovector, /* vector of integers for substring information */
1602: 30); /* number of elements (NOT size in bytes) */
1603: <a name="extradata"></a></PRE>
1604: </P>
1605: <br><b>
1606: Extra data for <b>pcre_exec()</b>
1607: </b><br>
1608: <P>
1609: If the <i>extra</i> argument is not NULL, it must point to a <b>pcre_extra</b>
1610: data block. The <b>pcre_study()</b> function returns such a block (when it
1611: doesn't return NULL), but you can also create one for yourself, and pass
1612: additional information in it. The <b>pcre_extra</b> block contains the following
1613: fields (not necessarily in this order):
1614: <pre>
1615: unsigned long int <i>flags</i>;
1616: void *<i>study_data</i>;
1617: void *<i>executable_jit</i>;
1618: unsigned long int <i>match_limit</i>;
1619: unsigned long int <i>match_limit_recursion</i>;
1620: void *<i>callout_data</i>;
1621: const unsigned char *<i>tables</i>;
1622: unsigned char **<i>mark</i>;
1623: </pre>
1.1.1.2 misho 1624: In the 16-bit version of this structure, the <i>mark</i> field has type
1625: "PCRE_UCHAR16 **".
1.1.1.4 ! misho 1626: <br>
! 1627: <br>
! 1628: In the 32-bit version of this structure, the <i>mark</i> field has type
! 1629: "PCRE_UCHAR32 **".
1.1.1.2 misho 1630: </P>
1631: <P>
1.1.1.3 misho 1632: The <i>flags</i> field is used to specify which of the other fields are set. The
1633: flag bits are:
1.1 misho 1634: <pre>
1.1.1.3 misho 1635: PCRE_EXTRA_CALLOUT_DATA
1.1 misho 1636: PCRE_EXTRA_EXECUTABLE_JIT
1.1.1.3 misho 1637: PCRE_EXTRA_MARK
1.1 misho 1638: PCRE_EXTRA_MATCH_LIMIT
1639: PCRE_EXTRA_MATCH_LIMIT_RECURSION
1.1.1.3 misho 1640: PCRE_EXTRA_STUDY_DATA
1.1 misho 1641: PCRE_EXTRA_TABLES
1642: </pre>
1643: Other flag bits should be set to zero. The <i>study_data</i> field and sometimes
1644: the <i>executable_jit</i> field are set in the <b>pcre_extra</b> block that is
1645: returned by <b>pcre_study()</b>, together with the appropriate flag bits. You
1.1.1.3 misho 1646: should not set these yourself, but you may add to the block by setting other
1647: fields and their corresponding flag bits.
1.1 misho 1648: </P>
1649: <P>
1650: The <i>match_limit</i> field provides a means of preventing PCRE from using up a
1651: vast amount of resources when running patterns that are not going to match,
1652: but which have a very large number of possibilities in their search trees. The
1653: classic example is a pattern that uses nested unlimited repeats.
1654: </P>
1655: <P>
1656: Internally, <b>pcre_exec()</b> uses a function called <b>match()</b>, which it
1657: calls repeatedly (sometimes recursively). The limit set by <i>match_limit</i> is
1658: imposed on the number of times this function is called during a match, which
1659: has the effect of limiting the amount of backtracking that can take place. For
1660: patterns that are not anchored, the count restarts from zero for each position
1661: in the subject string.
1662: </P>
1663: <P>
1664: When <b>pcre_exec()</b> is called with a pattern that was successfully studied
1.1.1.3 misho 1665: with a JIT option, the way that the matching is executed is entirely different.
1666: However, there is still the possibility of runaway matching that goes on for a
1667: very long time, and so the <i>match_limit</i> value is also used in this case
1668: (but in a different way) to limit how long the matching can continue.
1.1 misho 1669: </P>
1670: <P>
1671: The default value for the limit can be set when PCRE is built; the default
1672: default is 10 million, which handles all but the most extreme cases. You can
1673: override the default by supplying <b>pcre_exec()</b> with a <b>pcre_extra</b>
1674: block in which <i>match_limit</i> is set, and PCRE_EXTRA_MATCH_LIMIT is set in
1675: the <i>flags</i> field. If the limit is exceeded, <b>pcre_exec()</b> returns
1676: PCRE_ERROR_MATCHLIMIT.
1677: </P>
1678: <P>
1.1.1.4 ! misho 1679: A value for the match limit may also be supplied by an item at the start of a
! 1680: pattern of the form
! 1681: <pre>
! 1682: (*LIMIT_MATCH=d)
! 1683: </pre>
! 1684: where d is a decimal number. However, such a setting is ignored unless d is
! 1685: less than the limit set by the caller of <b>pcre_exec()</b> or, if no such limit
! 1686: is set, less than the default.
! 1687: </P>
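<P>
The interaction between the built-in default, a caller-supplied <i>match_limit</i>, and a (*LIMIT_MATCH=d) item can be summarised as a small function. This is a sketch of the rule as stated above, not PCRE's implementation; the name <b>effective_match_limit()</b> and its parameters are hypothetical.

```c
#include <assert.h>

/* Hypothetical model of the documented rule.
   caller_set is nonzero when PCRE_EXTRA_MATCH_LIMIT is set in the flags
   field; pattern_set is nonzero when the pattern starts with
   (*LIMIT_MATCH=d). A pattern value can only lower the limit. */
static unsigned long effective_match_limit(unsigned long default_limit,
                                           int caller_set,
                                           unsigned long caller_limit,
                                           int pattern_set,
                                           unsigned long pattern_limit)
{
    unsigned long limit = caller_set ? caller_limit : default_limit;
    if (pattern_set && pattern_limit < limit)
        limit = pattern_limit;
    return limit;
}

/* Usage: with a caller limit of 5000, a pattern asking for 9000 is ignored,
   while a pattern asking for 100 takes effect. */
static void example_usage(void)
{
    assert(effective_match_limit(10000000UL, 1, 5000UL, 1, 9000UL) == 5000UL);
    assert(effective_match_limit(10000000UL, 1, 5000UL, 1, 100UL) == 100UL);
}
```

The same shape applies to the recursion limit and (*LIMIT_RECURSION=d): a value embedded in a pattern can tighten the limit but never relax one chosen by the application.
</P>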
! 1688: <P>
1.1 misho 1689: The <i>match_limit_recursion</i> field is similar to <i>match_limit</i>, but
1690: instead of limiting the total number of times that <b>match()</b> is called, it
1691: limits the depth of recursion. The recursion depth is a smaller number than the
1692: total number of calls, because not all calls to <b>match()</b> are recursive.
1693: This limit is of use only if it is set smaller than <i>match_limit</i>.
1694: </P>
1695: <P>
1696: Limiting the recursion depth limits the amount of machine stack that can be
1697: used, or, when PCRE has been compiled to use memory on the heap instead of the
1698: stack, the amount of heap memory that can be used. This limit is not relevant,
1.1.1.3 misho 1699: and is ignored, when matching is done using JIT compiled code.
1.1 misho 1700: </P>
1701: <P>
1702: The default value for <i>match_limit_recursion</i> can be set when PCRE is
1703: built; the default default is the same value as the default for
1704: <i>match_limit</i>. You can override the default by supplying <b>pcre_exec()</b>
1705: with a <b>pcre_extra</b> block in which <i>match_limit_recursion</i> is set, and
1706: PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the <i>flags</i> field. If the limit
1707: is exceeded, <b>pcre_exec()</b> returns PCRE_ERROR_RECURSIONLIMIT.
1708: </P>
1709: <P>
1.1.1.4 ! misho 1710: A value for the recursion limit may also be supplied by an item at the start of
! 1711: a pattern of the form
! 1712: <pre>
! 1713: (*LIMIT_RECURSION=d)
! 1714: </pre>
! 1715: where d is a decimal number. However, such a setting is ignored unless d is
! 1716: less than the limit set by the caller of <b>pcre_exec()</b> or, if no such limit
! 1717: is set, less than the default.
! 1718: </P>
! 1719: <P>
1.1 misho 1720: The <i>callout_data</i> field is used in conjunction with the "callout" feature,
1721: and is described in the
1722: <a href="pcrecallout.html"><b>pcrecallout</b></a>
1723: documentation.
1724: </P>
1725: <P>
1726: The <i>tables</i> field is used to pass a character tables pointer to
1727: <b>pcre_exec()</b>; this overrides the value that is stored with the compiled
1728: pattern. A non-NULL value is stored with the compiled pattern only if custom
1729: tables were supplied to <b>pcre_compile()</b> via its <i>tableptr</i> argument.
1730: If NULL is passed to <b>pcre_exec()</b> using this mechanism, it forces PCRE's
1731: internal tables to be used. This facility is helpful when re-using patterns
1732: that have been saved after compiling with an external set of tables, because
1733: the external tables might be at a different address when <b>pcre_exec()</b> is
1734: called. See the
1735: <a href="pcreprecompile.html"><b>pcreprecompile</b></a>
1736: documentation for a discussion of saving compiled patterns for later use.
1737: </P>
1738: <P>
1739: If PCRE_EXTRA_MARK is set in the <i>flags</i> field, the <i>mark</i> field must
1.1.1.2 misho 1740: be set to point to a suitable variable. If the pattern contains any
1.1 misho 1741: backtracking control verbs such as (*MARK:NAME), and the execution ends up with
1742: a name to pass back, a pointer to the name string (zero terminated) is placed
1743: in the variable pointed to by the <i>mark</i> field. The names are within the
1744: compiled pattern; if you wish to retain such a name you must copy it before
1745: freeing the memory of a compiled pattern. If there is no name to pass back, the
1.1.1.2 misho 1746: variable pointed to by the <i>mark</i> field is set to NULL. For details of the
1.1 misho 1747: backtracking control verbs, see the section entitled
1748: <a href="pcrepattern.html#backtrackcontrol">"Backtracking control"</a>
1749: in the
1750: <a href="pcrepattern.html"><b>pcrepattern</b></a>
1751: documentation.
1752: <a name="execoptions"></a></P>
1753: <br><b>
1754: Option bits for <b>pcre_exec()</b>
1755: </b><br>
1756: <P>
1757: The unused bits of the <i>options</i> argument for <b>pcre_exec()</b> must be
1758: zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_<i>xxx</i>,
1759: PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
1.1.1.3 misho 1760: PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, and
1761: PCRE_PARTIAL_SOFT.
1.1 misho 1762: </P>
1763: <P>
1.1.1.3 misho 1764: If the pattern was successfully studied with one of the just-in-time (JIT)
1765: compile options, the only supported options for JIT execution are
1766: PCRE_NO_UTF8_CHECK, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
1767: PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If an
1768: unsupported option is used, JIT execution is disabled and the normal
1769: interpretive code in <b>pcre_exec()</b> is run.
1.1 misho 1770: <pre>
1771: PCRE_ANCHORED
1772: </pre>
1773: The PCRE_ANCHORED option limits <b>pcre_exec()</b> to matching at the first
1774: matching position. If a pattern was compiled with PCRE_ANCHORED, or turned out
1775: to be anchored by virtue of its contents, it cannot be made unanchored at
1776: matching time.
1777: <pre>
1778: PCRE_BSR_ANYCRLF
1779: PCRE_BSR_UNICODE
1780: </pre>
1781: These options (which are mutually exclusive) control what the \R escape
1782: sequence matches. The choice is either to match only CR, LF, or CRLF, or to
1783: match any Unicode newline sequence. These options override the choice that was
1784: made or defaulted when the pattern was compiled.
1785: <pre>
1786: PCRE_NEWLINE_CR
1787: PCRE_NEWLINE_LF
1788: PCRE_NEWLINE_CRLF
1789: PCRE_NEWLINE_ANYCRLF
1790: PCRE_NEWLINE_ANY
1791: </pre>
1792: These options override the newline definition that was chosen or defaulted when
1793: the pattern was compiled. For details, see the description of
1794: <b>pcre_compile()</b> above. During matching, the newline choice affects the
1795: behaviour of the dot, circumflex, and dollar metacharacters. It may also alter
1796: the way the match position is advanced after a match failure for an unanchored
1797: pattern.
1798: </P>
1799: <P>
1800: When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is set, and a
1801: match attempt for an unanchored pattern fails when the current position is at a
1802: CRLF sequence, and the pattern contains no explicit matches for CR or LF
1803: characters, the match position is advanced by two characters instead of one, in
1804: other words, to after the CRLF.
1805: </P>
1806: <P>
1807: The above rule is a compromise that makes the most common cases work as
1808: expected. For example, if the pattern is .+A (and the PCRE_DOTALL option is not
1809: set), it does not match the string "\r\nA" because, after failing at the
1810: start, it skips both the CR and the LF before retrying. However, the pattern
1811: [\r\n]A does match that string, because it contains an explicit CR or LF
1812: reference, and so advances only by one character after the first failure.
1813: </P>
1814: <P>
1815: An explicit match for CR or LF is either a literal appearance of one of those
1816: characters, or one of the \r or \n escape sequences. Implicit matches such as
1817: [^X] do not count, nor does \s (which includes CR and LF in the characters
1818: that it matches).
1819: </P>
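<P>
The rule for advancing the match position after a failure can be modelled as follows. This is a sketch under simplifying assumptions, not PCRE's code: it covers the 8-bit library in non-UTF mode, where one character is one byte, and the function name <b>next_start_offset()</b> and its parameters are hypothetical. The <i>has_crorlf</i> argument corresponds to the value that PCRE_INFO_HASCRORLF reports for the pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the documented advance rule.
   crlf_is_newline: nonzero when PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF,
                    or PCRE_NEWLINE_ANY is in force.
   has_crorlf:      nonzero when the pattern contains an explicit match
                    for CR or LF.
   Returns the offset at which the next match attempt would start. */
static size_t next_start_offset(const char *subject, size_t length,
                                size_t offset, int crlf_is_newline,
                                int has_crorlf)
{
    assert(offset < length);
    if (crlf_is_newline && !has_crorlf &&
        offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;  /* skip the whole CRLF sequence */
    return offset + 1;      /* ordinary one-character advance */
}
```

For the subject "\r\nA", a pattern such as .+A has no explicit CR or LF, so the model advances from offset 0 to offset 2; for [\r\n]A it advances to offset 1, which is why that pattern can still match.
</P>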
1820: <P>
1821: Notwithstanding the above, anomalous effects may still occur when CRLF is a
1822: valid newline sequence and explicit \r or \n escapes appear in the pattern.
1823: <pre>
1824: PCRE_NOTBOL
1825: </pre>
1826: This option specifies that the first character of the subject string is not the
1827: beginning of a line, so the circumflex metacharacter should not match before
1828: it. Setting this without PCRE_MULTILINE (at compile time) causes circumflex
1829: never to match. This option affects only the behaviour of the circumflex
1830: metacharacter. It does not affect \A.
1831: <pre>
1832: PCRE_NOTEOL
1833: </pre>
1834: This option specifies that the end of the subject string is not the end of a
1835: line, so the dollar metacharacter should not match it nor (except in multiline
1836: mode) a newline immediately before it. Setting this without PCRE_MULTILINE (at
1837: compile time) causes dollar never to match. This option affects only the
1838: behaviour of the dollar metacharacter. It does not affect \Z or \z.
1839: <pre>
1840: PCRE_NOTEMPTY
1841: </pre>
1842: An empty string is not considered to be a valid match if this option is set. If
1843: there are alternatives in the pattern, they are tried. If all the alternatives
1844: match the empty string, the entire match fails. For example, if the pattern
1845: <pre>
1846: a?b?
1847: </pre>
1848: is applied to a string not beginning with "a" or "b", it matches an empty
1849: string at the start of the subject. With PCRE_NOTEMPTY set, this match is not
1850: valid, so PCRE searches further into the string for occurrences of "a" or "b".
1851: <pre>
1852: PCRE_NOTEMPTY_ATSTART
1853: </pre>
1854: This is like PCRE_NOTEMPTY, except that an empty string match that is not at
1855: the start of the subject is permitted. If the pattern is anchored, such a match
1856: can occur only if the pattern contains \K.
1857: </P>
1858: <P>
1859: Perl has no direct equivalent of PCRE_NOTEMPTY or PCRE_NOTEMPTY_ATSTART, but it
1860: does make a special case of a pattern match of the empty string within its
1861: <b>split()</b> function, and when using the /g modifier. It is possible to
1862: emulate Perl's behaviour after matching a null string by first trying the match
1863: again at the same offset with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then
1864: if that fails, by advancing the starting offset (see below) and trying an
1865: ordinary match again. There is some code that demonstrates how to do this in
1866: the
1867: <a href="pcredemo.html"><b>pcredemo</b></a>
1868: sample program. In the most general case, you have to check to see if the
1869: newline convention recognizes CRLF as a newline, and if so, and the current
1870: character is CR followed by LF, advance the starting offset by two characters
1871: instead of one.
1872: <pre>
1873: PCRE_NO_START_OPTIMIZE
1874: </pre>
1875: There are a number of optimizations that <b>pcre_exec()</b> uses at the start of
1876: a match, in order to speed up the process. For example, if it is known that an
1877: unanchored match must start with a specific character, it searches the subject
1878: for that character, and fails immediately if it cannot find it, without
1879: actually running the main matching function. This means that a special item
1880: such as (*COMMIT) at the start of a pattern is not considered until after a
1.1.1.4 ! misho 1881: suitable starting point for the match has been found. Also, when callouts or
! 1882: (*MARK) items are in use, these "start-up" optimizations can cause them to be
! 1883: skipped if the pattern is never actually used. The start-up optimizations are
! 1884: in effect a pre-scan of the subject that takes place before the pattern is run.
1.1 misho 1885: </P>
1886: <P>
1887: The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations, possibly
1888: causing performance to suffer, but ensuring that in cases where the result is
1889: "no match", the callouts do occur, and that items such as (*COMMIT) and (*MARK)
1890: are considered at every possible starting position in the subject string. If
1891: PCRE_NO_START_OPTIMIZE is set at compile time, it cannot be unset at matching
1.1.1.4 ! misho 1892: time. The use of PCRE_NO_START_OPTIMIZE at matching time (that is, passing it
! 1893: to <b>pcre_exec()</b>) disables JIT execution; in this situation, matching is
! 1894: always done interpretively.
1.1 misho 1895: </P>
1896: <P>
1897: Setting PCRE_NO_START_OPTIMIZE can change the outcome of a matching operation.
1898: Consider the pattern
1899: <pre>
1900: (*COMMIT)ABC
1901: </pre>
1902: When this is compiled, PCRE records the fact that a match must start with the
1903: character "A". Suppose the subject string is "DEFABC". The start-up
1904: optimization scans along the subject, finds "A" and runs the first match
1905: attempt from there. The (*COMMIT) item means that the pattern must match at the
1906: current starting position, which in this case, it does. However, if the same
1907: match is run with PCRE_NO_START_OPTIMIZE set, the initial scan along the
1908: subject string does not happen. The first match attempt is run starting from
1909: "D" and when this fails, (*COMMIT) prevents any further matches being tried, so
1910: the overall result is "no match". If the pattern is studied, more start-up
1911: optimizations may be used. For example, a minimum length for the subject may be
1912: recorded. Consider the pattern
1913: <pre>
1914: (*MARK:A)(X|Y)
1915: </pre>
1916: The minimum length for a match is one character. If the subject is "ABC", there
1917: will be attempts to match "ABC", "BC", "C", and then finally an empty string.
1918: If the pattern is studied, the final attempt does not take place, because PCRE
1919: knows that the subject is too short, and so the (*MARK) is never encountered.
1920: In this case, studying the pattern does not affect the overall match result,
1921: which is still "no match", but it does affect the auxiliary information that is
1922: returned.
1923: <pre>
1924: PCRE_NO_UTF8_CHECK
1925: </pre>
1926: When PCRE_UTF8 is set at compile time, the validity of the subject as a UTF-8
1927: string is automatically checked when <b>pcre_exec()</b> is subsequently called.
1.1.1.3 misho 1928: The entire string is checked before any other processing takes place. The value
1929: of <i>startoffset</i> is also checked to ensure that it points to the start of a
1930: UTF-8 character. There is a discussion about the
1931: <a href="pcreunicode.html#utf8strings">validity of UTF-8 strings</a>
1932: in the
1.1.1.2 misho 1933: <a href="pcreunicode.html"><b>pcreunicode</b></a>
1934: page. If an invalid sequence of bytes is found, <b>pcre_exec()</b> returns the
1935: error PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
1936: truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In both
1937: cases, information about the precise nature of the error may also be returned
1938: (see the descriptions of these errors in the section entitled <i>Error return
1939: values from</i> <b>pcre_exec()</b>
1.1 misho 1940: <a href="#errorlist">below).</a>
1941: If <i>startoffset</i> contains a value that does not point to the start of a
1942: UTF-8 character (or to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is
1943: returned.
1944: </P>
1945: <P>
1946: If you already know that your subject is valid, and you want to skip these
1947: checks for performance reasons, you can set the PCRE_NO_UTF8_CHECK option when
1948: calling <b>pcre_exec()</b>. You might want to do this for the second and
1949: subsequent calls to <b>pcre_exec()</b> if you are making repeated calls to find
1950: all the matches in a single subject string. However, you should be sure that
1.1.1.2 misho 1951: the value of <i>startoffset</i> points to the start of a character (or the end
1952: of the subject). When PCRE_NO_UTF8_CHECK is set, the effect of passing an
1953: invalid string as a subject or an invalid value of <i>startoffset</i> is
1.1 misho 1954: undefined. Your program may crash.
1955: <pre>
1956: PCRE_PARTIAL_HARD
1957: PCRE_PARTIAL_SOFT
1958: </pre>
1959: These options turn on the partial matching feature. For backwards
1960: compatibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial match
1961: occurs if the end of the subject string is reached successfully, but there are
1962: not enough subject characters to complete the match. If this happens when
1963: PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set, matching continues by
1964: testing any remaining alternatives. Only if no complete match can be found is
1965: PCRE_ERROR_PARTIAL returned instead of PCRE_ERROR_NOMATCH. In other words,
1966: PCRE_PARTIAL_SOFT says that the caller is prepared to handle a partial match,
1967: but only if no complete match can be found.
1968: </P>
1969: <P>
1970: If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this case, if a
1971: partial match is found, <b>pcre_exec()</b> immediately returns
1972: PCRE_ERROR_PARTIAL, without considering any other alternatives. In other words,
1973: when PCRE_PARTIAL_HARD is set, a partial match is considered to be more
1974: important than an alternative complete match.
1975: </P>
1976: <P>
1977: In both cases, the portion of the string that was inspected when the partial
1978: match was found is set as the first matching string. There is a more detailed
1979: discussion of partial and multi-segment matching, with examples, in the
1980: <a href="pcrepartial.html"><b>pcrepartial</b></a>
1981: documentation.
1982: </P>
1983: <br><b>
1984: The string to be matched by <b>pcre_exec()</b>
1985: </b><br>
1986: <P>
1987: The subject string is passed to <b>pcre_exec()</b> as a pointer in
1.1.1.4 ! misho 1988: <i>subject</i>, a length in <i>length</i>, and a starting offset in
! 1989: <i>startoffset</i>. The units for <i>length</i> and <i>startoffset</i> are bytes
! 1990: for the 8-bit library, 16-bit data items for the 16-bit library, and 32-bit
! 1991: data items for the 32-bit library.
! 1992: </P>
! 1993: <P>
! 1994: If <i>startoffset</i> is negative or greater than the length of the subject,
! 1995: <b>pcre_exec()</b> returns PCRE_ERROR_BADOFFSET. When the starting offset is
! 1996: zero, the search for a match starts at the beginning of the subject, and this
! 1997: is by far the most common case. In UTF-8 or UTF-16 mode, the offset must point
! 1998: to the start of a character, or the end of the subject (in UTF-32 mode, one
! 1999: data unit equals one character, so all offsets are valid). Unlike the pattern
! 2000: string, the subject may contain binary zeroes.
1.1 misho 2001: </P>
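<P>
The offset rule for the 8-bit library in UTF-8 mode can be checked by the
caller before the call is made. The following is a minimal sketch, not part of
PCRE: the function name <i>is_valid_utf8_start</i> is hypothetical, and it
assumes that the subject is already known to be valid UTF-8, so that any byte
that is not of the form 10xxxxxx begins a character.
</P>

```c
#include <stddef.h>

/* Hypothetical helper (not a PCRE function). Returns 1 if `offset` is an
   acceptable startoffset for a valid UTF-8 subject of `length` bytes,
   otherwise 0. */
int is_valid_utf8_start(const char *subject, int length, int offset)
{
    if (subject == NULL || offset < 0 || offset > length)
        return 0;                 /* out of range for the subject */
    if (offset == length)
        return 1;                 /* the end of the subject is allowed */
    /* Continuation bytes have the bit pattern 10xxxxxx and never start a
       character. */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```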
2002: <P>
2003: A non-zero starting offset is useful when searching for another match in the
2004: same subject by calling <b>pcre_exec()</b> again after a previous success.
2005: Setting <i>startoffset</i> differs from just passing over a shortened string and
2006: setting PCRE_NOTBOL in the case of a pattern that begins with any kind of
2007: lookbehind. For example, consider the pattern
2008: <pre>
2009: \Biss\B
2010: </pre>
2011: which finds occurrences of "iss" in the middle of words. (\B matches only if
2012: the current position in the subject is not a word boundary.) When applied to
2013: the string "Mississippi" the first call to <b>pcre_exec()</b> finds the first
2014: occurrence. If <b>pcre_exec()</b> is called again with just the remainder of the
2015: subject, namely "issippi", it does not match, because \B is always false at the
2016: start of the subject, which is deemed to be a word boundary. However, if
2017: <b>pcre_exec()</b> is passed the entire string again, but with <i>startoffset</i>
2018: set to 4, it finds the second occurrence of "iss" because it is able to look
2019: behind the starting point to discover that it is preceded by a letter.
2020: </P>
2021: <P>
2022: Finding all the matches in a subject is tricky when the pattern can match an
2023: empty string. It is possible to emulate Perl's /g behaviour by first trying the
2024: match again at the same offset, with the PCRE_NOTEMPTY_ATSTART and
2025: PCRE_ANCHORED options, and then if that fails, advancing the starting offset
2026: and trying an ordinary match again. There is some code that demonstrates how to
2027: do this in the
2028: <a href="pcredemo.html"><b>pcredemo</b></a>
2029: sample program. In the most general case, you have to check to see if the
2030: newline convention recognizes CRLF as a newline, and if so, and the current
2031: character is CR followed by LF, advance the starting offset by two characters
2032: instead of one.
2033: </P>
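<P>
The "advance the starting offset" step described above can be sketched as a
small helper. This is an illustration only: the function name and its
<i>crlf_is_newline</i> and <i>utf8</i> parameters are hypothetical, the caller
is assumed to have worked out the newline convention already, and
<i>offset</i> is assumed to be less than <i>length</i>. The UTF-8 branch is an
extra assumption that skips continuation bytes, so that the new offset again
points to the start of a character.
</P>

```c
/* Hypothetical helper (not a PCRE function). Computes the next starting
   offset after an empty match at `offset` could not be retried. */
int next_offset_after_empty_match(const char *subject, int length,
                                  int offset, int crlf_is_newline, int utf8)
{
    int next = offset + 1;        /* the usual case: advance by one unit */

    if (crlf_is_newline && offset < length - 1 &&
        subject[offset] == '\r' && subject[offset + 1] == '\n') {
        next += 1;                /* step over the whole CRLF */
    } else if (utf8) {
        /* Skip continuation bytes (10xxxxxx) of a multi-byte character. */
        while (next < length &&
               ((unsigned char)subject[next] & 0xC0) == 0x80)
            next += 1;
    }
    return next;
}
```

<P>
The returned value would then be passed as <i>startoffset</i> in an ordinary
call to <b>pcre_exec()</b>.
</P>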
2034: <P>
2035: If a non-zero starting offset is passed when the pattern is anchored, one
2036: attempt to match at the given offset is made. This can only succeed if the
2037: pattern does not require the match to be at the start of the subject.
2038: </P>
2039: <br><b>
2040: How <b>pcre_exec()</b> returns captured substrings
2041: </b><br>
2042: <P>
2043: In general, a pattern matches a certain portion of the subject, and in
2044: addition, further substrings from the subject may be picked out by parts of the
2045: pattern. Following the usage in Jeffrey Friedl's book, this is called
2046: "capturing" in what follows, and the phrase "capturing subpattern" is used for
2047: a fragment of a pattern that picks out a substring. PCRE supports several other
2048: kinds of parenthesized subpattern that do not cause substrings to be captured.
2049: </P>
2050: <P>
2051: Captured substrings are returned to the caller via a vector of integers whose
2052: address is passed in <i>ovector</i>. The number of elements in the vector is
2053: passed in <i>ovecsize</i>, which must be a non-negative number. <b>Note</b>: this
2054: argument is NOT the size of <i>ovector</i> in bytes.
2055: </P>
2056: <P>
2057: The first two-thirds of the vector is used to pass back captured substrings,
2058: each substring using a pair of integers. The remaining third of the vector is
2059: used as workspace by <b>pcre_exec()</b> while matching capturing subpatterns,
2060: and is not available for passing back information. The number passed in
2061: <i>ovecsize</i> should always be a multiple of three. If it is not, it is
2062: rounded down.
2063: </P>
2064: <P>
2065: When a match is successful, information about captured substrings is returned
2066: in pairs of integers, starting at the beginning of <i>ovector</i>, and
2067: continuing up to two-thirds of its length at the most. The first element of
1.1.1.4 ! misho 2068: each pair is set to the offset of the first character in a substring, and the
! 2069: second is set to the offset of the first character after the end of a
! 2070: substring. These values are always data unit offsets, even in UTF mode. They
! 2071: are byte offsets in the 8-bit library, 16-bit data item offsets in the 16-bit
! 2072: library, and 32-bit data item offsets in the 32-bit library. <b>Note</b>: they
! 2073: are not character counts.
1.1 misho 2074: </P>
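<P>
Because each pair holds a start offset and an end offset in data units, a
substring can be copied out of the subject with plain pointer arithmetic. The
sketch below assumes the 8-bit library; <i>copy_capture</i> is a hypothetical
name, and the function treats a negative start offset as "nothing captured".
</P>

```c
#include <string.h>

/* Hypothetical helper (not a PCRE function). Copies the substring described
   by offset pair `n` of `ovector` into `buf` and zero-terminates it.
   Returns the substring length in bytes, or -1 if the pair is unset or the
   buffer is too small. */
int copy_capture(const char *subject, const int *ovector, int n,
                 char *buf, int bufsize)
{
    int start = ovector[2 * n];       /* offset of the first byte */
    int end = ovector[2 * n + 1];     /* offset just past the last byte */
    int len;

    if (start < 0 || end < start)
        return -1;                    /* pair was not set by the match */
    len = end - start;
    if (len + 1 > bufsize)
        return -1;                    /* no room for the terminating zero */
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```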
2075: <P>
2076: The first pair of integers, <i>ovector[0]</i> and <i>ovector[1]</i>, identify the
2077: portion of the subject string matched by the entire pattern. The next pair is
2078: used for the first capturing subpattern, and so on. The value returned by
2079: <b>pcre_exec()</b> is one more than the highest numbered pair that has been set.
2080: For example, if two substrings have been captured, the returned value is 3. If
2081: there are no capturing subpatterns, the return value from a successful match is
2082: 1, indicating that just the first pair of offsets has been set.
2083: </P>
2084: <P>
2085: If a capturing subpattern is matched repeatedly, it is the last portion of the
2086: string that it matched that is returned.
2087: </P>
2088: <P>
2089: If the vector is too small to hold all the captured substring offsets, it is
2090: used as far as possible (up to two-thirds of its length), and the function
1.1.1.3 misho 2091: returns a value of zero. If neither the actual string matched nor any captured
1.1 misho 2092: substrings are of interest, <b>pcre_exec()</b> may be called with <i>ovector</i>
2093: passed as NULL and <i>ovecsize</i> as zero. However, if the pattern contains
2094: back references and the <i>ovector</i> is not big enough to remember the related
2095: substrings, PCRE has to get additional memory for use during matching. Thus it
2096: is usually advisable to supply an <i>ovector</i> of reasonable size.
2097: </P>
2098: <P>
2099: There are some cases where zero is returned (indicating vector overflow) when
2100: in fact the vector is exactly the right size for the final match. For example,
2101: consider the pattern
2102: <pre>
2103: (a)(?:(b)c|bd)
2104: </pre>
2105: If a vector of 6 elements (allowing for only 1 captured substring) is given
2106: with subject string "abd", <b>pcre_exec()</b> will try to set the second
2107: captured string, thereby recording a vector overflow, before failing to match
2108: "c" and backing up to try the second alternative. The zero return, however,
2109: does correctly indicate that the maximum number of slots (namely 2) has been
2110: filled. In similar cases where there is temporary overflow, but the final
2111: number of used slots is actually less than the maximum, a non-zero value is
2112: returned.
2113: </P>
2114: <P>
2115: The <b>pcre_fullinfo()</b> function can be used to find out how many capturing
2116: subpatterns there are in a compiled pattern. The smallest size for
2117: <i>ovector</i> that will allow for <i>n</i> captured substrings, in addition to
2118: the offsets of the substring matched by the whole pattern, is (<i>n</i>+1)*3.
2119: </P>
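<P>
The sizing rule can be written down directly. In this sketch the two function
names are hypothetical; <i>n</i> is the capture count that would be obtained
from <b>pcre_fullinfo()</b>.
</P>

```c
/* Hypothetical helpers (not PCRE functions). */

/* Smallest ovecsize that allows for n captured substrings plus the
   whole-pattern match. */
int ovecsize_for_captures(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs that a vector of `ovecsize` elements can return:
   only the first two-thirds is used, and a size that is not a multiple of
   three is rounded down. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```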
2120: <P>
2121: It is possible for capturing subpattern number <i>n+1</i> to match some part of
2122: the subject when subpattern <i>n</i> has not been used at all. For example, if
2123: the string "abc" is matched against the pattern (a|(z))(bc) the return from the
2124: function is 4, and subpatterns 1 and 3 are matched, but 2 is not. When this
2125: happens, both values in the offset pairs corresponding to unused subpatterns
2126: are set to -1.
2127: </P>
2128: <P>
2129: Offset values that correspond to unused subpatterns at the end of the
2130: expression are also set to -1. For example, if the string "abc" is matched
2131: against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not matched. The
2132: return from the function is 2, because the highest used capturing subpattern
2133: number is 1, and the offsets for the second and third capturing subpatterns
2134: (assuming the vector is large enough, of course) are set to -1.
2135: </P>
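<P>
A caller can therefore test for -1 to find out whether a given subpattern took
part in the match. The following sketch uses a hypothetical function name. It
assumes that <i>n</i> does not exceed the number of capturing subpatterns in
the pattern, because elements beyond that are left unchanged by
<b>pcre_exec()</b> and may hold stale values.
</P>

```c
/* Hypothetical helper (not a PCRE function). Returns 1 if capturing
   subpattern `n` was set by the match, 0 if it is unset or if its pair does
   not fit in the first two-thirds of the vector. */
int group_is_set(const int *ovector, int ovecsize, int n)
{
    if (n < 0 || n >= ovecsize / 3)
        return 0;                 /* no pair available for this number */
    return ovector[2 * n] >= 0;   /* unset pairs hold -1 in both elements */
}
```

<P>
For the (a|(z))(bc) example above, with a 12-element vector, the function
reports subpatterns 1 and 3 as set and subpattern 2 as unset.
</P>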
2136: <P>
2137: <b>Note</b>: Elements in the first two-thirds of <i>ovector</i> that do not
2138: correspond to capturing parentheses in the pattern are never changed. That is,
2139: if a pattern contains <i>n</i> capturing parentheses, no more than
2140: <i>ovector[0]</i> to <i>ovector[2n+1]</i> are set by <b>pcre_exec()</b>. The other
2141: elements (in the first two-thirds) retain whatever values they previously had.
2142: </P>
2143: <P>
2144: Some convenience functions are provided for extracting the captured substrings
2145: as separate strings. These are described below.
2146: <a name="errorlist"></a></P>
2147: <br><b>
2148: Error return values from <b>pcre_exec()</b>
2149: </b><br>
2150: <P>
2151: If <b>pcre_exec()</b> fails, it returns a negative number. The following are
2152: defined in the header file:
2153: <pre>
2154: PCRE_ERROR_NOMATCH (-1)
2155: </pre>
2156: The subject string did not match the pattern.
2157: <pre>
2158: PCRE_ERROR_NULL (-2)
2159: </pre>
2160: Either <i>code</i> or <i>subject</i> was passed as NULL, or <i>ovector</i> was
2161: NULL and <i>ovecsize</i> was not zero.
2162: <pre>
2163: PCRE_ERROR_BADOPTION (-3)
2164: </pre>
2165: An unrecognized bit was set in the <i>options</i> argument.
2166: <pre>
2167: PCRE_ERROR_BADMAGIC (-4)
2168: </pre>
2169: PCRE stores a 4-byte "magic number" at the start of the compiled code, to catch
2170: the case when it is passed a junk pointer and to detect when a pattern that was
2171: compiled in an environment of one endianness is run in an environment with the
2172: other endianness. This is the error that PCRE gives when the magic number is
2173: not present.
2174: <pre>
2175: PCRE_ERROR_UNKNOWN_OPCODE (-5)
2176: </pre>
2177: While running the pattern match, an unknown item was encountered in the
2178: compiled pattern. This error could be caused by a bug in PCRE or by overwriting
2179: of the compiled pattern.
2180: <pre>
2181: PCRE_ERROR_NOMEMORY (-6)
2182: </pre>
2183: If a pattern contains back references, but the <i>ovector</i> that is passed to
2184: <b>pcre_exec()</b> is not big enough to remember the referenced substrings, PCRE
2185: gets a block of memory at the start of matching to use for this purpose. If the
2186: call via <b>pcre_malloc()</b> fails, this error is given. The memory is
2187: automatically freed at the end of matching.
2188: </P>
2189: <P>
2190: This error is also given if <b>pcre_stack_malloc()</b> fails in
2191: <b>pcre_exec()</b>. This can happen only when PCRE has been compiled with
2192: <b>--disable-stack-for-recursion</b>.
2193: <pre>
2194: PCRE_ERROR_NOSUBSTRING (-7)
2195: </pre>
2196: This error is used by the <b>pcre_copy_substring()</b>,
2197: <b>pcre_get_substring()</b>, and <b>pcre_get_substring_list()</b> functions (see
2198: below). It is never returned by <b>pcre_exec()</b>.
2199: <pre>
2200: PCRE_ERROR_MATCHLIMIT (-8)
2201: </pre>
2202: The backtracking limit, as specified by the <i>match_limit</i> field in a
2203: <b>pcre_extra</b> structure (or defaulted) was reached. See the description
2204: above.
2205: <pre>
2206: PCRE_ERROR_CALLOUT (-9)
2207: </pre>
2208: This error is never generated by <b>pcre_exec()</b> itself. It is provided for
2209: use by callout functions that want to yield a distinctive error code. See the
2210: <a href="pcrecallout.html"><b>pcrecallout</b></a>
2211: documentation for details.
2212: <pre>
2213: PCRE_ERROR_BADUTF8 (-10)
2214: </pre>
2215: A string that contains an invalid UTF-8 byte sequence was passed as a subject,
2216: and the PCRE_NO_UTF8_CHECK option was not set. If the size of the output vector
2217: (<i>ovecsize</i>) is at least 2, the byte offset to the start of the invalid
2218: UTF-8 character is placed in the first element, and a reason code is placed in
2219: the second element. The reason codes are listed in the
2220: <a href="#badutf8reasons">following section.</a>
2221: For backward compatibility, if PCRE_PARTIAL_HARD is set and the problem is a
2222: truncated UTF-8 character at the end of the subject (reason codes 1 to 5),
2223: PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
2224: <pre>
2225: PCRE_ERROR_BADUTF8_OFFSET (-11)
2226: </pre>
2227: The UTF-8 byte sequence that was passed as a subject was checked and found to
2228: be valid (the PCRE_NO_UTF8_CHECK option was not set), but the value of
2229: <i>startoffset</i> did not point to the beginning of a UTF-8 character or the
2230: end of the subject.
2231: <pre>
2232: PCRE_ERROR_PARTIAL (-12)
2233: </pre>
2234: The subject string did not match, but it did match partially. See the
2235: <a href="pcrepartial.html"><b>pcrepartial</b></a>
2236: documentation for details of partial matching.
2237: <pre>
2238: PCRE_ERROR_BADPARTIAL (-13)
2239: </pre>
2240: This code is no longer in use. It was formerly returned when the PCRE_PARTIAL
2241: option was used with a compiled pattern containing items that were not
2242: supported for partial matching. From release 8.00 onwards, there are no
2243: restrictions on partial matching.
2244: <pre>
2245: PCRE_ERROR_INTERNAL (-14)
2246: </pre>
2247: An unexpected internal error has occurred. This error could be caused by a bug
2248: in PCRE or by overwriting of the compiled pattern.
2249: <pre>
2250: PCRE_ERROR_BADCOUNT (-15)
2251: </pre>
2252: This error is given if the value of the <i>ovecsize</i> argument is negative.
2253: <pre>
2254: PCRE_ERROR_RECURSIONLIMIT (-21)
2255: </pre>
2256: The internal recursion limit, as specified by the <i>match_limit_recursion</i>
2257: field in a <b>pcre_extra</b> structure (or defaulted) was reached. See the
2258: description above.
2259: <pre>
2260: PCRE_ERROR_BADNEWLINE (-23)
2261: </pre>
2262: An invalid combination of PCRE_NEWLINE_<i>xxx</i> options was given.
2263: <pre>
2264: PCRE_ERROR_BADOFFSET (-24)
2265: </pre>
2266: The value of <i>startoffset</i> was negative or greater than the length of the
2267: subject, that is, the value in <i>length</i>.
2268: <pre>
2269: PCRE_ERROR_SHORTUTF8 (-25)
2270: </pre>
2271: This error is returned instead of PCRE_ERROR_BADUTF8 when the subject string
2272: ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD option is set.
2273: Information about the failure is returned as for PCRE_ERROR_BADUTF8. That
2274: information is in fact sufficient to detect this case, but this special error code for
2275: PCRE_PARTIAL_HARD precedes the implementation of returned information; it is
2276: retained for backwards compatibility.
2277: <pre>
2278: PCRE_ERROR_RECURSELOOP (-26)
2279: </pre>
2280: This error is returned when <b>pcre_exec()</b> detects a recursion loop within
2281: the pattern. Specifically, it means that either the whole pattern or a
2282: subpattern has been called recursively for the second time at the same position
2283: in the subject string. Some simple patterns that might do this are detected and
2284: faulted at compile time, but more complicated cases, in particular mutual
2285: recursions between two different subpatterns, cannot be detected until run
2286: time.
2287: <pre>
2288: PCRE_ERROR_JIT_STACKLIMIT (-27)
2289: </pre>
1.1.1.3 misho 2290: This error is returned when a pattern that was successfully studied using a
2291: JIT compile option is being matched, but the memory available for the
2292: just-in-time processing stack is not large enough. See the
1.1 misho 2293: <a href="pcrejit.html"><b>pcrejit</b></a>
2294: documentation for more details.
1.1.1.2 misho 2295: <pre>
1.1.1.3 misho 2296: PCRE_ERROR_BADMODE (-28)
1.1.1.2 misho 2297: </pre>
2298: This error is given if a pattern that was compiled by the 8-bit library is
1.1.1.4 ! misho 2299: passed to a 16-bit or 32-bit library function, or vice versa.
1.1.1.2 misho 2300: <pre>
1.1.1.3 misho 2301: PCRE_ERROR_BADENDIANNESS (-29)
1.1.1.2 misho 2302: </pre>
2303: This error is given if a pattern that was compiled and saved is reloaded on a
2304: host with different endianness. The utility function
2305: <b>pcre_pattern_to_host_byte_order()</b> can be used to convert such a pattern
2306: so that it runs on the new host.
1.1.1.4 ! misho 2307: <pre>
! 2308: PCRE_ERROR_JIT_BADOPTION (-31)
! 2309: </pre>
! 2310: This error is returned when a pattern that was successfully studied using a JIT
! 2311: compile option is being matched, but the matching mode (partial or complete
! 2312: match) does not correspond to any JIT compilation mode. When the JIT fast path
! 2313: function is used, this error may also be given for invalid options. See the
! 2314: <a href="pcrejit.html"><b>pcrejit</b></a>
! 2315: documentation for more details.
! 2316: <pre>
! 2317: PCRE_ERROR_BADLENGTH (-32)
! 2318: </pre>
! 2319: This error is given if <b>pcre_exec()</b> is called with a negative value for
! 2320: the <i>length</i> argument.
1.1 misho 2321: </P>
2322: <P>
1.1.1.4 ! misho 2323: Error numbers -16 to -20, -22, and -30 are not used by <b>pcre_exec()</b>.
1.1 misho 2324: <a name="badutf8reasons"></a></P>
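<P>
For diagnostics it can be handy to turn a negative return value into its
symbolic name. The sketch below is not part of PCRE: <i>exec_error_name</i> is
a hypothetical function, and it uses the literal values listed above rather
than the macros from the header file, so that it stands alone. Real code would
normally include <b>pcre.h</b> and switch on the macros instead.
</P>

```c
/* Hypothetical helper (not a PCRE function). Maps a pcre_exec() return
   value to the name of the corresponding error code. */
const char *exec_error_name(int rc)
{
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    case -24: return "PCRE_ERROR_BADOFFSET";
    case -25: return "PCRE_ERROR_SHORTUTF8";
    case -26: return "PCRE_ERROR_RECURSELOOP";
    case -27: return "PCRE_ERROR_JIT_STACKLIMIT";
    case -28: return "PCRE_ERROR_BADMODE";
    case -29: return "PCRE_ERROR_BADENDIANNESS";
    case -31: return "PCRE_ERROR_JIT_BADOPTION";  /* value as in pcre.h */
    case -32: return "PCRE_ERROR_BADLENGTH";
    default:
        /* Zero and positive values are not errors; other negative values
           are not used by pcre_exec(). */
        return rc >= 0 ? "(not an error)" : "(not used by pcre_exec)";
    }
}
```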
2325: <br><b>
2326: Reason codes for invalid UTF-8 strings
2327: </b><br>
2328: <P>
1.1.1.2 misho 2329: This section applies only to the 8-bit library. The corresponding information
1.1.1.4 ! misho 2330: for the 16-bit and 32-bit libraries is given in the
1.1.1.2 misho 2331: <a href="pcre16.html"><b>pcre16</b></a>
1.1.1.4 ! misho 2332: and
! 2333: <a href="pcre32.html"><b>pcre32</b></a>
! 2334: pages.
1.1.1.2 misho 2335: </P>
2336: <P>
1.1 misho 2337: When <b>pcre_exec()</b> returns either PCRE_ERROR_BADUTF8 or
2338: PCRE_ERROR_SHORTUTF8, and the size of the output vector (<i>ovecsize</i>) is at
2339: least 2, the offset of the start of the invalid UTF-8 character is placed in
2340: the first output vector element (<i>ovector[0]</i>) and a reason code is placed
2341: in the second element (<i>ovector[1]</i>). The reason codes are given names in
2342: the <b>pcre.h</b> header file:
2343: <pre>
2344: PCRE_UTF8_ERR1
2345: PCRE_UTF8_ERR2
2346: PCRE_UTF8_ERR3
2347: PCRE_UTF8_ERR4
2348: PCRE_UTF8_ERR5
2349: </pre>
2350: The string ends with a truncated UTF-8 character; the code specifies how many
2351: bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 characters to be
2352: no longer than 4 bytes, the encoding scheme (originally defined by RFC 2279)
2353: allows for up to 6 bytes, and this is checked first; hence the possibility of
2354: 4 or 5 missing bytes.
2355: <pre>
2356: PCRE_UTF8_ERR6
2357: PCRE_UTF8_ERR7
2358: PCRE_UTF8_ERR8
2359: PCRE_UTF8_ERR9
2360: PCRE_UTF8_ERR10
2361: </pre>
2362: The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of the
2363: character do not have the binary value 0b10 (that is, either the most
2364: significant bit is 0, or the next bit is 1).
2365: <pre>
2366: PCRE_UTF8_ERR11
2367: PCRE_UTF8_ERR12
2368: </pre>
2369: A character that is valid by the RFC 2279 rules is either 5 or 6 bytes long;
2370: these code points are excluded by RFC 3629.
2371: <pre>
2372: PCRE_UTF8_ERR13
2373: </pre>
2374: A 4-byte character has a value greater than 0x10ffff; these code points are
2375: excluded by RFC 3629.
2376: <pre>
2377: PCRE_UTF8_ERR14
2378: </pre>
2379: A 3-byte character has a value in the range 0xd800 to 0xdfff; this range of
2380: code points is reserved by RFC 3629 for use with UTF-16, and so is excluded
2381: from UTF-8.
2382: <pre>
2383: PCRE_UTF8_ERR15
2384: PCRE_UTF8_ERR16
2385: PCRE_UTF8_ERR17
2386: PCRE_UTF8_ERR18
2387: PCRE_UTF8_ERR19
2388: </pre>
2389: A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes for a
2390: value that can be represented by fewer bytes, which is invalid. For example,
2391: the two bytes 0xc0, 0xae give the value 0x2e, whose correct coding uses just
2392: one byte.
2393: <pre>
2394: PCRE_UTF8_ERR20
2395: </pre>
2396: The two most significant bits of the first byte of a character have the binary
2397: value 0b10 (that is, the most significant bit is 1 and the second is 0). Such a
2398: byte can only validly occur as the second or subsequent byte of a multi-byte
2399: character.
2400: <pre>
2401: PCRE_UTF8_ERR21
2402: </pre>
2403: The first byte of a character has the value 0xfe or 0xff. These values can
2404: never occur in a valid UTF-8 string.
1.1.1.4 ! misho 2405: <pre>
! 2406: PCRE_UTF8_ERR22
! 2407: </pre>
! 2408: This error code was formerly used when the presence of a so-called
! 2409: "non-character" caused an error. Unicode corrigendum #9 makes it clear that
! 2410: such characters should not cause a string to be rejected, and so this code is
! 2411: no longer in use and is never returned.
1.1 misho 2412: </P>
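<P>
The grouping of the reason codes can be summarized in a small lookup function.
This sketch is not part of PCRE: the function names are hypothetical, and it
assumes that PCRE_UTF8_ERR1 to PCRE_UTF8_ERR22 have the numeric values 1 to
22, which should be checked against <b>pcre.h</b> before relying on it.
</P>

```c
/* Hypothetical helpers (not PCRE functions). `reason` is the value found in
   ovector[1] after PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORTUTF8. The numeric
   ranges assume PCRE_UTF8_ERRn == n. */

/* Returns 1 for the truncated-character codes (1 to 5), otherwise 0. */
int utf8_reason_is_truncation(int reason)
{
    return reason >= 1 && reason <= 5;
}

/* Returns a short description of the group that `reason` belongs to. */
const char *utf8_reason_text(int reason)
{
    if (utf8_reason_is_truncation(reason))
        return "truncated character at end of subject";
    if (reason >= 6 && reason <= 10)
        return "continuation byte does not begin with binary 10";
    if (reason == 11 || reason == 12)
        return "5- or 6-byte character, excluded by RFC 3629";
    if (reason == 13)
        return "4-byte character greater than 0x10ffff";
    if (reason == 14)
        return "3-byte character in the range 0xd800 to 0xdfff";
    if (reason >= 15 && reason <= 19)
        return "overlong encoding";
    if (reason == 20)
        return "continuation byte used as first byte";
    if (reason == 21)
        return "first byte is 0xfe or 0xff";
    if (reason == 22)
        return "non-character (no longer returned)";
    return "unknown reason code";
}
```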
1.1.1.2 misho 2413: <br><a name="SEC18" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NUMBER</a><br>
1.1 misho 2414: <P>
2415: <b>int pcre_copy_substring(const char *<i>subject</i>, int *<i>ovector</i>,</b>
2416: <b>int <i>stringcount</i>, int <i>stringnumber</i>, char *<i>buffer</i>,</b>
2417: <b>int <i>buffersize</i>);</b>
2418: </P>
2419: <P>
2420: <b>int pcre_get_substring(const char *<i>subject</i>, int *<i>ovector</i>,</b>
2421: <b>int <i>stringcount</i>, int <i>stringnumber</i>,</b>
2422: <b>const char **<i>stringptr</i>);</b>
2423: </P>
2424: <P>
2425: <b>int pcre_get_substring_list(const char *<i>subject</i>,</b>
2426: <b>int *<i>ovector</i>, int <i>stringcount</i>, const char ***<i>listptr</i>);</b>
2427: </P>
2428: <P>
2429: Captured substrings can be accessed directly by using the offsets returned by
2430: <b>pcre_exec()</b> in <i>ovector</i>. For convenience, the functions
2431: <b>pcre_copy_substring()</b>, <b>pcre_get_substring()</b>, and
2432: <b>pcre_get_substring_list()</b> are provided for extracting captured substrings
2433: as new, separate, zero-terminated strings. These functions identify substrings
2434: by number. The next section describes functions for extracting named
2435: substrings.
2436: </P>
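<P>
As a rough illustration of the direct access mentioned above, the following
sketch copies one captured substring using only the offset pairs in
<i>ovector</i>. It assumes the usual layout in which substring <i>n</i> starts
at ovector[2*n] and ends just before ovector[2*n+1]. The function name
<b>copy_by_offsets</b> and its return values are hypothetical; they are not
part of PCRE and do not match the PCRE error codes.
</P>

```c
#include <string.h>

/* Copy captured substring n out of subject, using the offsets that
   pcre_exec() left in ovector. Returns the substring length, -1 if n is out
   of range, or -2 if the buffer is too small. These return values are
   illustrative only. */
static int copy_by_offsets(const char *subject, const int *ovector,
                           int stringcount, int n,
                           char *buffer, int buffersize)
{
    int start, length;

    if (n < 0 || n >= stringcount)
        return -1;
    start = ovector[2 * n];               /* offset of the first character */
    length = ovector[2 * n + 1] - start;  /* end offset is one past the last */
    if (length + 1 > buffersize)
        return -2;
    if (length > 0)                       /* unset substrings have length 0 */
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

<P>
Because the length is returned separately, a caller can still process a
substring that contains binary zeros.
</P>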
2437: <P>
2438: A substring that contains a binary zero is correctly extracted and has a
2439: further zero added on the end, but the result is not, of course, a C string.
2440: However, you can process such a string by referring to the length that is
2441: returned by <b>pcre_copy_substring()</b> and <b>pcre_get_substring()</b>.
2442: Unfortunately, the interface to <b>pcre_get_substring_list()</b> is not adequate
2443: for handling strings containing binary zeros, because the end of the final
2444: string is not independently indicated.
2445: </P>
2446: <P>
2447: The first three arguments are the same for all three of these functions:
2448: <i>subject</i> is the subject string that has just been successfully matched,
2449: <i>ovector</i> is a pointer to the vector of integer offsets that was passed to
2450: <b>pcre_exec()</b>, and <i>stringcount</i> is the number of substrings that were
2451: captured by the match, including the substring that matched the entire regular
2452: expression. This is the value returned by <b>pcre_exec()</b> if it is greater
2453: than zero. If <b>pcre_exec()</b> returned zero, indicating that it ran out of
2454: space in <i>ovector</i>, the value passed as <i>stringcount</i> should be the
2455: number of elements in the vector divided by three.
2456: </P>
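<P>
The rule for choosing <i>stringcount</i> can be written as a small helper.
This is a minimal sketch; <b>stringcount_for</b> is a hypothetical name, and
returning zero for a negative <b>pcre_exec()</b> result is a choice made for
this example to signal that there is nothing to extract.
</P>

```c
/* Pick the stringcount argument from the pcre_exec() return value rc and the
   number of elements in ovector (ovecsize). */
static int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;            /* number of substrings, whole match included */
    if (rc == 0)
        return ovecsize / 3;  /* ovector was too small for every substring */
    return 0;                 /* match failed: nothing to extract */
}
```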
2457: <P>
2458: The functions <b>pcre_copy_substring()</b> and <b>pcre_get_substring()</b>
2459: extract a single substring, whose number is given as <i>stringnumber</i>. A
2460: value of zero extracts the substring that matched the entire pattern, whereas
2461: higher values extract the captured substrings. For <b>pcre_copy_substring()</b>,
2462: the string is placed in <i>buffer</i>, whose length is given by
2463: <i>buffersize</i>, while for <b>pcre_get_substring()</b> a new block of memory is
2464: obtained via <b>pcre_malloc</b>, and its address is returned via
2465: <i>stringptr</i>. The yield of the function is the length of the string, not
2466: including the terminating zero, or one of these error codes:
2467: <pre>
2468: PCRE_ERROR_NOMEMORY (-6)
2469: </pre>
2470: The buffer was too small for <b>pcre_copy_substring()</b>, or the attempt to get
2471: memory failed for <b>pcre_get_substring()</b>.
2472: <pre>
2473: PCRE_ERROR_NOSUBSTRING (-7)
2474: </pre>
2475: There is no substring whose number is <i>stringnumber</i>.
2476: </P>
2477: <P>
2478: The <b>pcre_get_substring_list()</b> function extracts all available substrings
2479: and builds a list of pointers to them. All this is done in a single block of
2480: memory that is obtained via <b>pcre_malloc</b>. The address of the memory block
2481: is returned via <i>listptr</i>, which is also the start of the list of string
2482: pointers. The end of the list is marked by a NULL pointer. The yield of the
2483: function is zero if all went well, or the error code
2484: <pre>
2485: PCRE_ERROR_NOMEMORY (-6)
2486: </pre>
2487: if the attempt to get the memory block failed.
2488: </P>
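<P>
Since the list is terminated by a NULL pointer, it can be walked without
knowing its length in advance. The sketch below counts the entries;
<b>count_substrings</b> is a hypothetical helper, and the array passed to it
is assumed to have the same shape as the list returned via <i>listptr</i>.
</P>

```c
#include <stddef.h>

/* Count the strings in a NULL-terminated list of string pointers. */
static int count_substrings(const char **list)
{
    int count = 0;

    while (list[count] != NULL)
        count++;
    return count;
}
```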
2489: <P>
2490: When any of these functions encounters a substring that is unset, which can
2491: happen when capturing subpattern number <i>n+1</i> matches some part of the
2492: subject, but subpattern <i>n</i> has not been used at all, they return an empty
2493: string. This can be distinguished from a genuine zero-length substring by
2494: inspecting the appropriate offset in <i>ovector</i>, which is negative for unset
2495: substrings.
2496: </P>
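<P>
The distinction between an unset substring and a genuinely empty one can be
tested directly on <i>ovector</i>. This is a minimal sketch that assumes the
start offset of substring <i>n</i> is stored in ovector[2*n];
<b>substring_is_unset</b> is a hypothetical name.
</P>

```c
/* Return nonzero if captured substring n is unset, that is, if its start
   offset in ovector is negative. An empty substring has equal, non-negative
   start and end offsets instead. */
static int substring_is_unset(const int *ovector, int n)
{
    return ovector[2 * n] < 0;
}
```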
2497: <P>
2498: The two convenience functions <b>pcre_free_substring()</b> and
2499: <b>pcre_free_substring_list()</b> can be used to free the memory returned by
2500: a previous call of <b>pcre_get_substring()</b> or
2501: <b>pcre_get_substring_list()</b>, respectively. They do nothing more than call
2502: the function pointed to by <b>pcre_free</b>, which of course could be called
2503: directly from a C program. However, PCRE is used in some situations where it is
2504: linked via a special interface to another programming language that cannot use
2505: <b>pcre_free</b> directly; it is for these cases that the functions are
2506: provided.
2507: </P>
1.1.1.2 misho 2508: <br><a name="SEC19" href="#TOC1">EXTRACTING CAPTURED SUBSTRINGS BY NAME</a><br>
1.1 misho 2509: <P>
2510: <b>int pcre_get_stringnumber(const pcre *<i>code</i>,</b>
2511: <b>const char *<i>name</i>);</b>
2512: </P>
2513: <P>
2514: <b>int pcre_copy_named_substring(const pcre *<i>code</i>,</b>
2515: <b>const char *<i>subject</i>, int *<i>ovector</i>,</b>
2516: <b>int <i>stringcount</i>, const char *<i>stringname</i>,</b>
2517: <b>char *<i>buffer</i>, int <i>buffersize</i>);</b>
2518: </P>
2519: <P>
2520: <b>int pcre_get_named_substring(const pcre *<i>code</i>,</b>
2521: <b>const char *<i>subject</i>, int *<i>ovector</i>,</b>
2522: <b>int <i>stringcount</i>, const char *<i>stringname</i>,</b>
2523: <b>const char **<i>stringptr</i>);</b>
2524: </P>
2525: <P>
2526: To extract a substring by name, you first have to find the associated number.
2527: For example, for this pattern
2528: <pre>
2529: (a+)b(?<xxx>\d+)...
2530: </pre>
2531: the number of the subpattern called "xxx" is 2. If the name is known to be
2532: unique (PCRE_DUPNAMES was not set), you can find the number from the name by
2533: calling <b>pcre_get_stringnumber()</b>. The first argument is the compiled
2534: pattern, and the second is the name. The yield of the function is the
2535: subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no subpattern of
2536: that name.
2537: </P>
2538: <P>
2539: Given the number, you can extract the substring directly, or use one of the
2540: functions described in the previous section. For convenience, there are also
2541: two functions that do the whole job.
2542: </P>
2543: <P>
2544: Most of the arguments of <b>pcre_copy_named_substring()</b> and
2545: <b>pcre_get_named_substring()</b> are the same as those for the similarly named
2546: functions that extract by number. As these are described in the previous
2547: section, they are not re-described here. There are just two differences:
2548: </P>
2549: <P>
2550: First, instead of a substring number, a substring name is given. Second, there
2551: is an extra argument, given at the start, which is a pointer to the compiled
2552: pattern. This is needed in order to gain access to the name-to-number
2553: translation table.
2554: </P>
2555: <P>
2556: These functions call <b>pcre_get_stringnumber()</b>, and if it succeeds, they
2557: then call <b>pcre_copy_substring()</b> or <b>pcre_get_substring()</b>, as
2558: appropriate. <b>NOTE:</b> If PCRE_DUPNAMES is set and there are duplicate names,
2559: the behaviour may not be what you want (see the next section).
2560: </P>
2561: <P>
2562: <b>Warning:</b> If the pattern uses the (?| feature to set up multiple
2563: subpatterns with the same number, as described in the
2564: <a href="pcrepattern.html#dupsubpatternnumber">section on duplicate subpattern numbers</a>
2565: in the
2566: <a href="pcrepattern.html"><b>pcrepattern</b></a>
2567: page, you cannot use names to distinguish the different subpatterns, because
2568: names are not included in the compiled code. The matching process uses only
2569: numbers. For this reason, the use of different names for subpatterns of the
2570: same number causes an error at compile time.
2571: </P>
1.1.1.2 misho 2572: <br><a name="SEC20" href="#TOC1">DUPLICATE SUBPATTERN NAMES</a><br>
1.1 misho 2573: <P>
2574: <b>int pcre_get_stringtable_entries(const pcre *<i>code</i>,</b>
2575: <b>const char *<i>name</i>, char **<i>first</i>, char **<i>last</i>);</b>
2576: </P>
2577: <P>
2578: When a pattern is compiled with the PCRE_DUPNAMES option, names for subpatterns
2579: are not required to be unique. (Duplicate names are always allowed for
2580: subpatterns with the same number, created by using the (?| feature. Indeed, if
2581: such subpatterns are named, they are required to use the same names.)
2582: </P>
2583: <P>
2584: Normally, patterns with duplicate names are such that in any one match, only
2585: one of the named subpatterns participates. An example is shown in the
2586: <a href="pcrepattern.html"><b>pcrepattern</b></a>
2587: documentation.
2588: </P>
2589: <P>
2590: When duplicates are present, <b>pcre_copy_named_substring()</b> and
2591: <b>pcre_get_named_substring()</b> return the first substring corresponding to
2592: the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING (-7) is
2593: returned; no data is returned. The <b>pcre_get_stringnumber()</b> function
2594: returns one of the numbers that are associated with the name, but it is not
2595: defined which it is.
2596: </P>
2597: <P>
2598: If you want to get full details of all captured substrings for a given name,
2599: you must use the <b>pcre_get_stringtable_entries()</b> function. The first
2600: argument is the compiled pattern, and the second is the name. The third and
2601: fourth are pointers to variables which are updated by the function. After it
2602: has run, they point to the first and last entries in the name-to-number table
2603: for the given name. The function itself returns the length of each entry, or
2604: PCRE_ERROR_NOSUBSTRING (-7) if there are none. The format of the table is
2605: described in the section entitled <i>Information about a pattern</i>
2606: <a href="#infoaboutpattern">above.</a>
2607: Given all the relevant entries for the name, you can extract each of their
2608: numbers, and hence the captured data, if any.
2609: </P>
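<P>
Extracting the numbers from the returned range of entries might look like the
sketch below. It assumes the 8-bit table layout given in the section on
information about a pattern: each entry begins with the group number in two
bytes, most significant byte first, followed by the zero-terminated name. The
function name <b>collect_group_numbers</b> and its <i>numbers</i> and
<i>max</i> parameters are hypothetical; <i>entrysize</i> stands for the value
returned by <b>pcre_get_stringtable_entries()</b>.
</P>

```c
/* Walk the name table entries from first to last inclusive and store each
   entry's group number in numbers[], up to max of them. Returns how many
   numbers were stored. */
static int collect_group_numbers(const char *first, const char *last,
                                 int entrysize, int *numbers, int max)
{
    int count = 0;
    const char *entry;

    for (entry = first; entry <= last && count < max; entry += entrysize) {
        const unsigned char *p = (const unsigned char *)entry;
        numbers[count++] = (p[0] << 8) | p[1];  /* MSB first */
    }
    return count;
}
```

<P>
Each number can then be used with <i>ovector</i>, or with the functions that
extract by number, to find out which of the duplicates actually captured data.
</P>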
1.1.1.2 misho 2610: <br><a name="SEC21" href="#TOC1">FINDING ALL POSSIBLE MATCHES</a><br>
1.1 misho 2611: <P>
2612: The traditional matching function uses a similar algorithm to Perl, which stops
2613: when it finds the first match, starting at a given point in the subject. If you
2614: want to find all possible matches, or the longest possible match, consider
2615: using the alternative matching function (see below) instead. If you cannot use
2616: the alternative function, but still need to find all possible matches, you
2617: can kludge it up by making use of the callout facility, which is described in
2618: the
2619: <a href="pcrecallout.html"><b>pcrecallout</b></a>
2620: documentation.
2621: </P>
2622: <P>
2623: What you have to do is to insert a callout right at the end of the pattern.
2624: When your callout function is called, extract and save the current matched
2625: substring. Then return 1, which forces <b>pcre_exec()</b> to backtrack and try
2626: other alternatives. Ultimately, when it runs out of matches, <b>pcre_exec()</b>
2627: will yield PCRE_ERROR_NOMATCH.
1.1.1.2 misho 2628: </P>
2629: <br><a name="SEC22" href="#TOC1">OBTAINING AN ESTIMATE OF STACK USAGE</a><br>
2630: <P>
2631: Matching certain patterns using <b>pcre_exec()</b> can use a lot of process
2632: stack, which in certain environments can be rather limited in size. Some users
2633: find it helpful to have an estimate of the amount of stack that is used by
2634: <b>pcre_exec()</b>, to help them set recursion limits, as described in the
2635: <a href="pcrestack.html"><b>pcrestack</b></a>
2636: documentation. The estimate that is output by <b>pcretest</b> when called with
2637: the <b>-m</b> and <b>-C</b> options is obtained by calling <b>pcre_exec</b> with
2638: the values NULL, NULL, NULL, -999, and -999 for its first five arguments.
2639: </P>
2640: <P>
2641: Normally, if its first argument is NULL, <b>pcre_exec()</b> immediately returns
2642: the negative error code PCRE_ERROR_NULL, but with this special combination of
2643: arguments, it returns instead a negative number whose absolute value is the
2644: approximate stack frame size in bytes. (A negative number is used so that it is
2645: clear that no match has happened.) The value is approximate because in some
2646: cases, recursive calls to <b>pcre_exec()</b> occur when there are one or two
2647: additional variables on the stack.
2648: </P>
2649: <P>
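<P>
One possible use of the estimate is to derive a rough upper bound for a
recursion limit from the amount of stack available. The sketch below assumes
that <i>rc</i> is the negative value returned by the special call described
above, and that the caller knows how many bytes of stack it is willing to
spend; <b>estimate_max_depth</b> is a hypothetical helper, not a PCRE
function.
</P>

```c
/* Turn the negative frame-size estimate rc into a rough maximum recursion
   depth for a stack budget of stack_bytes. Returns -1 if rc is not negative,
   because then it cannot be a frame-size estimate. */
static long estimate_max_depth(int rc, long stack_bytes)
{
    long frame_size;

    if (rc >= 0)
        return -1;
    frame_size = -(long)rc;          /* absolute value is the frame size */
    return stack_bytes / frame_size;
}
```

<P>
The frame size is approximate, so a real limit should probably be set well
below the figure computed this way.
</P>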
2650: If PCRE has been compiled to use the heap instead of the stack for recursion,
2651: the value returned is the size of each block that is obtained from the heap.
1.1 misho 2652: <a name="dfamatch"></a></P>
1.1.1.2 misho 2653: <br><a name="SEC23" href="#TOC1">MATCHING A PATTERN: THE ALTERNATIVE FUNCTION</a><br>
1.1 misho 2654: <P>
2655: <b>int pcre_dfa_exec(const pcre *<i>code</i>, const pcre_extra *<i>extra</i>,</b>
2656: <b>const char *<i>subject</i>, int <i>length</i>, int <i>startoffset</i>,</b>
2657: <b>int <i>options</i>, int *<i>ovector</i>, int <i>ovecsize</i>,</b>
2658: <b>int *<i>workspace</i>, int <i>wscount</i>);</b>
2659: </P>
2660: <P>
2661: The function <b>pcre_dfa_exec()</b> is called to match a subject string against
2662: a compiled pattern, using a matching algorithm that scans the subject string
2663: just once, and does not backtrack. This has different characteristics to the
2664: normal algorithm, and is not compatible with Perl. Some of the features of PCRE
2665: patterns are not supported. Nevertheless, there are times when this kind of
2666: matching can be useful. For a discussion of the two matching algorithms, and a
2667: list of features that <b>pcre_dfa_exec()</b> does not support, see the
2668: <a href="pcrematching.html"><b>pcrematching</b></a>
2669: documentation.
2670: </P>
2671: <P>
2672: The arguments for the <b>pcre_dfa_exec()</b> function are the same as for
2673: <b>pcre_exec()</b>, plus two extras. The <i>ovector</i> argument is used in a
2674: different way, and this is described below. The other common arguments are used
2675: in the same way as for <b>pcre_exec()</b>, so their description is not repeated
2676: here.
2677: </P>
2678: <P>
2679: The two additional arguments provide workspace for the function. The workspace
2680: vector should contain at least 20 elements. It is used for keeping track of
2681: multiple paths through the pattern tree. More workspace will be needed for
2682: patterns and subjects where there are a lot of potential matches.
2683: </P>
2684: <P>
2685: Here is an example of a simple call to <b>pcre_dfa_exec()</b>:
2686: <pre>
2687:   int rc;
2688:   int ovector[10];
2689:   int wspace[20];
2690:   rc = pcre_dfa_exec(
2691:     re,             /* result of pcre_compile() */
2692:     NULL,           /* we didn't study the pattern */
2693:     "some string",  /* the subject string */
2694:     11,             /* the length of the subject string */
2695:     0,              /* start at offset 0 in the subject */
2696:     0,              /* default options */
2697:     ovector,        /* vector of integers for substring information */
2698:     10,             /* number of elements (NOT size in bytes) */
2699:     wspace,         /* working space vector */
2700:     20);            /* number of elements (NOT size in bytes) */
2701: </PRE>
2702: </P>
2703: <br><b>
2704: Option bits for <b>pcre_dfa_exec()</b>
2705: </b><br>
2706: <P>
2707: The unused bits of the <i>options</i> argument for <b>pcre_dfa_exec()</b> must be
2708: zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_<i>xxx</i>,
2709: PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
2710: PCRE_NO_UTF8_CHECK, PCRE_BSR_ANYCRLF, PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE,
2711: PCRE_PARTIAL_HARD, PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART.
2712: All but the last four of these are exactly the same as for <b>pcre_exec()</b>,
2713: so their description is not repeated here.
2714: <pre>
2715: PCRE_PARTIAL_HARD
2716: PCRE_PARTIAL_SOFT
2717: </pre>
2718: These have the same general effect as they do for <b>pcre_exec()</b>, but the
2719: details are slightly different. When PCRE_PARTIAL_HARD is set for
2720: <b>pcre_dfa_exec()</b>, it returns PCRE_ERROR_PARTIAL if the end of the subject
2721: is reached and there is still at least one matching possibility that requires
2722: additional characters. This happens even if some complete matches have also
2723: been found. When PCRE_PARTIAL_SOFT is set, the return code PCRE_ERROR_NOMATCH
2724: is converted into PCRE_ERROR_PARTIAL if the end of the subject is reached,
2725: there have been no complete matches, but there is still at least one matching
2726: possibility. The portion of the string that was inspected when the longest
2727: partial match was found is set as the first matching string in both cases.
2728: There is a more detailed discussion of partial and multi-segment matching, with
2729: examples, in the
2730: <a href="pcrepartial.html"><b>pcrepartial</b></a>
2731: documentation.
2732: <pre>
2733: PCRE_DFA_SHORTEST
2734: </pre>
2735: Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to stop as
2736: soon as it has found one match. Because of the way the alternative algorithm
2737: works, this is necessarily the shortest possible match at the first possible
2738: matching point in the subject string.
2739: <pre>
2740: PCRE_DFA_RESTART
2741: </pre>
2742: When <b>pcre_dfa_exec()</b> returns a partial match, it is possible to call it
2743: again, with additional subject characters, and have it continue with the same
2744: match. The PCRE_DFA_RESTART option requests this action; when it is set, the
2745: <i>workspace</i> and <i>wscount</i> arguments must reference the same vector as
2746: before because data about the match so far is left in them after a partial
2747: match. There is more discussion of this facility in the
2748: <a href="pcrepartial.html"><b>pcrepartial</b></a>
2749: documentation.
2750: </P>
2751: <br><b>
2752: Successful returns from <b>pcre_dfa_exec()</b>
2753: </b><br>
2754: <P>
2755: When <b>pcre_dfa_exec()</b> succeeds, it may have matched more than one
2756: substring in the subject. Note, however, that all the matches from one run of
2757: the function start at the same point in the subject. The shorter matches are
2758: all initial substrings of the longer matches. For example, if the pattern
2759: <pre>
2760: <.*>
2761: </pre>
2762: is matched against the string
2763: <pre>
2764: This is <something> <something else> <something further> no more
2765: </pre>
2766: the three matched strings are
2767: <pre>
2768: <something>
2769: <something> <something else>
2770: <something> <something else> <something further>
2771: </pre>
2772: On success, the yield of the function is a number greater than zero, which is
2773: the number of matched substrings. The substrings themselves are returned in
2774: <i>ovector</i>. Each string uses two elements; the first is the offset to the
2775: start, and the second is the offset to the end. In fact, all the strings have
2776: the same start offset. (Space could have been saved by giving this only once,
2777: but it was decided to retain some compatibility with the way <b>pcre_exec()</b>
2778: returns data, even though the meaning of the strings is different.)
2779: </P>
2780: <P>
2781: The strings are returned in reverse order of length; that is, the longest
2782: matching string is given first. If there were too many matches to fit into
2783: <i>ovector</i>, the yield of the function is zero, and the vector is filled with
2784: the longest matches. Unlike <b>pcre_exec()</b>, <b>pcre_dfa_exec()</b> can use
2785: the entire <i>ovector</i> for returning matched strings.
2786: </P>
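<P>
The layout of the results can be illustrated with two small helpers that work
on the return value and <i>ovector</i> alone. This is a sketch: the names
<b>dfa_match_count</b> and <b>dfa_match_length</b> are hypothetical, and the
treatment of a zero return (every pair in the vector is in use) is inferred
from the description above rather than taken from PCRE's code.
</P>

```c
/* Number of (start, end) pairs held in ovector after pcre_dfa_exec()
   returned rc, where ovecsize is the number of elements in ovector. */
static int dfa_match_count(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;            /* every match fitted */
    if (rc == 0)
        return ovecsize / 2;  /* overflow: vector holds the longest matches */
    return 0;                 /* error or no match */
}

/* Length of match i; match 0 is the longest one. */
static int dfa_match_length(const int *ovector, int i)
{
    return ovector[2 * i + 1] - ovector[2 * i];
}
```

<P>
For the example subject shown earlier, all three matches start at offset 8,
and the pairs appear in the order longest, middle, shortest.
</P>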
2787: <br><b>
2788: Error returns from <b>pcre_dfa_exec()</b>
2789: </b><br>
2790: <P>
2791: The <b>pcre_dfa_exec()</b> function returns a negative number when it fails.
2792: Many of the errors are the same as for <b>pcre_exec()</b>, and these are
2793: described
2794: <a href="#errorlist">above.</a>
2795: There are in addition the following errors that are specific to
2796: <b>pcre_dfa_exec()</b>:
2797: <pre>
2798: PCRE_ERROR_DFA_UITEM (-16)
2799: </pre>
2800: This return is given if <b>pcre_dfa_exec()</b> encounters an item in the pattern
2801: that it does not support, for instance, the use of \C or a back reference.
2802: <pre>
2803: PCRE_ERROR_DFA_UCOND (-17)
2804: </pre>
2805: This return is given if <b>pcre_dfa_exec()</b> encounters a condition item that
2806: uses a back reference for the condition, or a test for recursion in a specific
2807: group. These are not supported.
2808: <pre>
2809: PCRE_ERROR_DFA_UMLIMIT (-18)
2810: </pre>
2811: This return is given if <b>pcre_dfa_exec()</b> is called with an <i>extra</i>
2812: block that contains a setting of the <i>match_limit</i> or
2813: <i>match_limit_recursion</i> fields. This is not supported (these fields are
2814: meaningless for DFA matching).
2815: <pre>
2816: PCRE_ERROR_DFA_WSSIZE (-19)
2817: </pre>
2818: This return is given if <b>pcre_dfa_exec()</b> runs out of space in the
2819: <i>workspace</i> vector.
2820: <pre>
2821: PCRE_ERROR_DFA_RECURSE (-20)
2822: </pre>
2823: When a recursive subpattern is processed, the matching function calls itself
2824: recursively, using private vectors for <i>ovector</i> and <i>workspace</i>. This
2825: error is given if the output vector is not large enough. This should be
2826: extremely rare, as a vector of size 1000 is used.
1.1.1.3 misho 2827: <pre>
2828: PCRE_ERROR_DFA_BADRESTART (-30)
2829: </pre>
2830: When <b>pcre_dfa_exec()</b> is called with the <b>PCRE_DFA_RESTART</b> option,
2831: some plausibility checks are made on the contents of the workspace, which
2832: should contain data about the previous partial match. If any of these checks
2833: fail, this error is given.
1.1 misho 2834: </P>
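<P>
For diagnostics, the error codes listed above can be mapped to their names.
The sketch below uses the numeric values given in this section directly so
that it stands alone; a real program would normally use the constants from
<b>pcre.h</b>. The function name <b>dfa_error_name</b> is hypothetical.
</P>

```c
#include <stddef.h>

/* Return the name of an error code that is specific to pcre_dfa_exec(), or
   NULL if rc is not one of the codes listed in this section. */
static const char *dfa_error_name(int rc)
{
    switch (rc) {
    case -16: return "PCRE_ERROR_DFA_UITEM";
    case -17: return "PCRE_ERROR_DFA_UCOND";
    case -18: return "PCRE_ERROR_DFA_UMLIMIT";
    case -19: return "PCRE_ERROR_DFA_WSSIZE";
    case -20: return "PCRE_ERROR_DFA_RECURSE";
    case -30: return "PCRE_ERROR_DFA_BADRESTART";
    default:  return NULL;
    }
}
```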
1.1.1.2 misho 2835: <br><a name="SEC24" href="#TOC1">SEE ALSO</a><br>
1.1 misho 2836: <P>
1.1.1.4 ! misho 2837: <b>pcre16</b>(3), <b>pcre32</b>(3), <b>pcrebuild</b>(3), <b>pcrecallout</b>(3),
! 2838: <b>pcrecpp</b>(3), <b>pcrematching</b>(3), <b>pcrepartial</b>(3),
! 2839: <b>pcreposix</b>(3), <b>pcreprecompile</b>(3), <b>pcresample</b>(3),
! 2840: <b>pcrestack</b>(3).
1.1 misho 2841: </P>
1.1.1.2 misho 2842: <br><a name="SEC25" href="#TOC1">AUTHOR</a><br>
1.1 misho 2843: <P>
2844: Philip Hazel
2845: <br>
2846: University Computing Service
2847: <br>
2848: Cambridge CB2 3QH, England.
2849: <br>
2850: </P>
1.1.1.2 misho 2851: <br><a name="SEC26" href="#TOC1">REVISION</a><br>
1.1 misho 2852: <P>
1.1.1.4 ! misho 2853: Last updated: 12 May 2013
1.1 misho 2854: <br>
1.1.1.4 ! misho 2855: Copyright © 1997-2013 University of Cambridge.
1.1 misho 2856: <br>
2857: <p>
2858: Return to the <a href="index.html">PCRE index page</a>.
2859: </p>