Annotation of embedaddon/pcre/doc/html/pcrepartial.html, revision 1.1.1.2
1.1 misho 1: <html>
2: <head>
3: <title>pcrepartial specification</title>
4: </head>
5: <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6: <h1>pcrepartial man page</h1>
7: <p>
8: Return to the <a href="index.html">PCRE index page</a>.
9: </p>
10: <p>
11: This page is part of the PCRE HTML documentation. It was generated automatically
12: from the original man page. If there is any nonsense in it, please consult the
13: man page, in case the conversion went wrong.
14: <br>
15: <ul>
16: <li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a>
1.1.1.2 ! misho 17: <li><a name="TOC2" href="#SEC2">PARTIAL MATCHING USING pcre_exec() OR pcre16_exec()</a>
! 18: <li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre_dfa_exec() OR pcre16_dfa_exec()</a>
1.1 misho 19: <li><a name="TOC4" href="#SEC4">PARTIAL MATCHING AND WORD BOUNDARIES</a>
20: <li><a name="TOC5" href="#SEC5">FORMERLY RESTRICTED PATTERNS</a>
21: <li><a name="TOC6" href="#SEC6">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a>
1.1.1.2 ! misho 22: <li><a name="TOC7" href="#SEC7">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre16_dfa_exec()</a>
! 23: <li><a name="TOC8" href="#SEC8">MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre16_exec()</a>
1.1 misho 24: <li><a name="TOC9" href="#SEC9">ISSUES WITH MULTI-SEGMENT MATCHING</a>
25: <li><a name="TOC10" href="#SEC10">AUTHOR</a>
26: <li><a name="TOC11" href="#SEC11">REVISION</a>
27: </ul>
28: <br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br>
29: <P>
1.1.1.2 ! misho 30: In normal use of PCRE, if the subject string that is passed to a matching
! 31: function matches as far as it goes, but is too short to match the entire
! 32: pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances where it might
! 33: be helpful to distinguish this case from other cases in which there is no
! 34: match.
1.1 misho 35: </P>
36: <P>
37: Consider, for example, an application where a human is required to type in data
38: for a field with specific formatting requirements. An example might be a date
39: in the form <i>ddmmmyy</i>, defined by this pattern:
40: <pre>
41: ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
42: </pre>
43: If the application sees the user's keystrokes one by one, and can check that
44: what has been typed so far is potentially valid, it is able to raise an error
45: as soon as a mistake is made, by beeping and not reflecting the character that
46: has been typed, for example. This immediate feedback is likely to be a better
47: user interface than a check that is delayed until the entire string has been
48: entered. Partial matching can also be useful when the subject string is very
49: long and is not all available at once.
50: </P>
51: <P>
52: PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
1.1.1.2 ! misho 53: PCRE_PARTIAL_HARD options, which can be set when calling any of the matching
! 54: functions. For backwards compatibility, PCRE_PARTIAL is a synonym for
! 55: PCRE_PARTIAL_SOFT. The essential difference between the two options is whether
! 56: or not a partial match is preferred to an alternative complete match, though
! 57: the details differ between the two types of matching function. If both options
1.1 misho 58: are set, PCRE_PARTIAL_HARD takes precedence.
59: </P>
60: <P>
1.1.1.2 ! misho 61: Setting a partial matching option disables the use of any just-in-time code
! 62: that was set up by studying the compiled pattern with the
1.1 misho 63: PCRE_STUDY_JIT_COMPILE option. It also disables two of PCRE's standard
1.1.1.2 ! misho 64: optimizations. PCRE remembers the last literal data unit in a pattern, and
! 65: abandons matching immediately if it is not present in the subject string. This
1.1 misho 66: optimization cannot be used for a subject string that might match only
67: partially. If the pattern was studied, PCRE knows the minimum length of a
68: matching string, and does not bother to run the matching function on shorter
69: strings. This optimization is also disabled for partial matching.
70: </P>
1.1.1.2 ! misho 71: <br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec() OR pcre16_exec()</a><br>
1.1 misho 72: <P>
1.1.1.2 ! misho 73: A partial match occurs during a call to <b>pcre_exec()</b> or
! 74: <b>pcre16_exec()</b> when the end of the subject string is reached successfully,
! 75: but matching cannot continue because more characters are needed. However, at
! 76: least one character in the subject must have been inspected. This character
! 77: need not form part of the final matched string; lookbehind assertions and the
! 78: \K escape sequence provide ways of inspecting characters before the start of a
! 79: matched substring. The requirement for inspecting at least one character exists
! 80: because an empty string can always be matched; without such a restriction there
! 81: would always be a partial match of an empty string at the end of the subject.
! 82: </P>
! 83: <P>
! 84: If there are at least two slots in the offsets vector when a partial match is
! 85: returned, the first slot is set to the offset of the earliest character that
! 86: was inspected. For convenience, the second offset points to the end of the
! 87: subject so that a substring can easily be identified.
1.1 misho 88: </P>
89: <P>
90: For the majority of patterns, the first offset identifies the start of the
91: partially matched string. However, for patterns that contain lookbehind
92: assertions, or \K, or begin with \b or \B, earlier characters have been
93: inspected while carrying out the match. For example:
94: <pre>
95: /(?<=abc)123/
96: </pre>
97: This pattern matches "123", but only if it is preceded by "abc". If the subject
98: string is "xyzabc12", the offsets after a partial match are for the substring
99: "abc12", because all these characters are needed if another match is tried
100: with extra characters added to the subject.
101: </P>
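<P>
The offsets described above can be turned into a string with a small helper.
The following is a minimal sketch, not part of PCRE: the function name
<b>copy_inspected()</b> and its parameters are hypothetical, and it assumes the
two offsets have already been filled in by a matching function that returned
PCRE_ERROR_PARTIAL.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function).
   Copies subject[ovector[0] .. ovector[1]) into out as a NUL-terminated
   string. Returns the number of bytes copied, or -1 if the offsets do not
   form a valid range or out is too small. */
static int copy_inspected(const char *subject, const int *ovector,
                          char *out, size_t outsize)
{
    int start = ovector[0];   /* earliest character that was inspected */
    int end = ovector[1];     /* end of the subject */
    size_t n;

    if (start < 0 || end < start) return -1;
    n = (size_t)(end - start);
    if (n + 1 > outsize) return -1;

    memcpy(out, subject + start, n);
    out[n] = '\0';
    return (int)n;
}
```

<P>
With the subject "xyzabc12" and offsets 3 and 8, as in the lookbehind example,
the helper yields "abc12", which is the text an application would need to keep
before trying again with more data.
</P>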
102: <P>
103: What happens when a partial match is identified depends on which of the two
104: partial matching options is set.
105: </P>
106: <br><b>
1.1.1.2 ! misho 107: PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre16_exec()
1.1 misho 108: </b><br>
109: <P>
1.1.1.2 ! misho 110: If PCRE_PARTIAL_SOFT is set when <b>pcre_exec()</b> or <b>pcre16_exec()</b>
! 111: identifies a partial match, the partial match is remembered, but matching
! 112: continues as normal, and other alternatives in the pattern are tried. If no
! 113: complete match can be found, PCRE_ERROR_PARTIAL is returned instead of
! 114: PCRE_ERROR_NOMATCH.
1.1 misho 115: </P>
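<P>
For the keystroke-checking application described earlier, the return value of
the matching function might be mapped to an action along the following lines.
This is only a sketch: the enum and the function <b>classify_result()</b> are
hypothetical names, and the two error values are repeated from <b>pcre.h</b>
solely so that the fragment compiles without the PCRE headers.
</P>

```c
/* Values as defined in pcre.h; repeated here only so that this sketch
   compiles on its own. Include pcre.h in real code. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH  (-1)
#endif
#ifndef PCRE_ERROR_PARTIAL
#define PCRE_ERROR_PARTIAL  (-12)
#endif

/* Hypothetical application-level states for an input field. */
enum field_state {
    FIELD_COMPLETE,    /* the text typed so far matches the whole pattern */
    FIELD_INCOMPLETE,  /* valid so far, but more characters are needed */
    FIELD_INVALID,     /* cannot match, whatever is typed next */
    FIELD_ERROR        /* some other error code from the matching function */
};

/* Map the return value of a matching function to a field state. */
static enum field_state classify_result(int rc)
{
    if (rc >= 0) return FIELD_COMPLETE;  /* 0 means the offsets vector was too small */
    if (rc == PCRE_ERROR_PARTIAL) return FIELD_INCOMPLETE;
    if (rc == PCRE_ERROR_NOMATCH) return FIELD_INVALID;
    return FIELD_ERROR;
}
```

<P>
An application could beep and discard the latest keystroke when the state is
FIELD_INVALID, and accept it otherwise.
</P>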
116: <P>
117: This option is "soft" because it prefers a complete match over a partial match.
118: All the various matching items in a pattern behave as if the subject string is
119: potentially complete. For example, \z, \Z, and $ match at the end of the
120: subject, as normal, and for \b and \B the end of the subject is treated as a
121: non-alphanumeric.
122: </P>
123: <P>
124: If there is more than one partial match, the first one that was found provides
125: the data that is returned. Consider this pattern:
126: <pre>
127: /123\w+X|dogY/
128: </pre>
129: If this is matched against the subject string "abc123dog", both
130: alternatives fail to match, but the end of the subject is reached during
131: matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set to 3 and 9,
132: identifying "123dog" as the first partial match that was found. (In this
133: example, there are two partial matches, because "dog" on its own partially
134: matches the second alternative.)
135: </P>
136: <br><b>
1.1.1.2 ! misho 137: PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre16_exec()
1.1 misho 138: </b><br>
139: <P>
1.1.1.2 ! misho 140: If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b> or <b>pcre16_exec()</b>,
! 141: PCRE_ERROR_PARTIAL is returned as soon as a partial match is found, without
! 142: continuing to search for possible complete matches. This option is "hard"
! 143: because it prefers an earlier partial match over a later complete match. For
! 144: this reason, the assumption is made that the end of the supplied subject string
! 145: may not be the true end of the available data, and so, if \z, \Z, \b, \B,
! 146: or $ are encountered at the end of the subject, the result is
! 147: PCRE_ERROR_PARTIAL, provided that at least one character in the subject has
! 148: been inspected.
! 149: </P>
! 150: <P>
! 151: Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16
! 152: subject strings are checked for validity. Normally, an invalid sequence
! 153: causes the error PCRE_ERROR_BADUTF8 or PCRE_ERROR_BADUTF16. However, in the
! 154: special case of a truncated character at the end of the subject,
! 155: PCRE_ERROR_SHORTUTF8 or PCRE_ERROR_SHORTUTF16 is returned when
1.1 misho 156: PCRE_PARTIAL_HARD is set.
157: </P>
158: <br><b>
159: Comparing hard and soft partial matching
160: </b><br>
161: <P>
162: The difference between the two partial matching options can be illustrated by a
163: pattern such as:
164: <pre>
165: /dog(sbody)?/
166: </pre>
167: This matches either "dog" or "dogsbody", greedily (that is, it prefers the
168: longer string if possible). If it is matched against the string "dog" with
169: PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if
170: PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand,
171: if the pattern is made ungreedy the result is different:
172: <pre>
173: /dog(sbody)??/
174: </pre>
1.1.1.2 ! misho 175: In this case the result is always a complete match because that is found first,
! 176: and matching never continues after finding a complete match. It might be easier
! 177: to follow this explanation by thinking of the two patterns like this:
1.1 misho 178: <pre>
179: /dog(sbody)?/ is the same as /dogsbody|dog/
180: /dog(sbody)??/ is the same as /dog|dogsbody/
181: </pre>
1.1.1.2 ! misho 182: The second pattern will never match "dogsbody", because it will always find the
! 183: shorter match first.
1.1 misho 184: </P>
1.1.1.2 ! misho 185: <br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre_dfa_exec() OR pcre16_dfa_exec()</a><br>
1.1 misho 186: <P>
1.1.1.2 ! misho 187: The DFA functions move along the subject string character by character, without
! 188: backtracking, searching for all possible matches simultaneously. If the end of
! 189: the subject is reached before the end of the pattern, there is the possibility
! 190: of a partial match, again provided that at least one character has been
! 191: inspected.
1.1 misho 192: </P>
193: <P>
194: When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
195: have been no complete matches. Otherwise, the complete matches are returned.
196: However, if PCRE_PARTIAL_HARD is set, a partial match takes precedence over any
197: complete matches. The portion of the string that was inspected when the longest
198: partial match was found is set as the first matching string, provided there are
199: at least two slots in the offsets vector.
200: </P>
201: <P>
1.1.1.2 ! misho 202: Because the DFA functions always search for all possible matches, and there is
! 203: no difference between greedy and ungreedy repetition, their behaviour is
! 204: different from the standard functions when PCRE_PARTIAL_HARD is set. Consider
! 205: the string "dog" matched against the ungreedy pattern shown above:
1.1 misho 206: <pre>
207: /dog(sbody)??/
208: </pre>
1.1.1.2 ! misho 209: Whereas the standard functions stop as soon as they find the complete match for
! 210: "dog", the DFA functions also find the partial match for "dogsbody", and so
! 211: return that when PCRE_PARTIAL_HARD is set.
1.1 misho 212: </P>
213: <br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br>
214: <P>
215: If a pattern ends with one of the sequences \b or \B, which test for word
216: boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive
217: results. Consider this pattern:
218: <pre>
219: /\bcat\b/
220: </pre>
221: This matches "cat", provided there is a word boundary at either end. If the
222: subject string is "the cat", the comparison of the final "t" with a following
1.1.1.2 ! misho 223: character cannot take place, so a partial match is found. However, normal
! 224: matching carries on, and \b matches at the end of the subject when the last
! 225: character is a letter, so a complete match is found. The result, therefore, is
! 226: <i>not</i> PCRE_ERROR_PARTIAL. Using PCRE_PARTIAL_HARD in this case does yield
! 227: PCRE_ERROR_PARTIAL, because then the partial match takes precedence.
1.1 misho 228: </P>
229: <br><a name="SEC5" href="#TOC1">FORMERLY RESTRICTED PATTERNS</a><br>
230: <P>
231: For releases of PCRE prior to 8.00, because of the way certain internal
232: optimizations were implemented in the <b>pcre_exec()</b> function, the
233: PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be used with
234: all patterns. From release 8.00 onwards, the restrictions no longer apply, and
1.1.1.2 ! misho 235: partial matching can be requested for any pattern.
1.1 misho 236: </P>
237: <P>
238: Items that were formerly restricted were repeated single characters and
239: repeated metasequences. If PCRE_PARTIAL was set for a pattern that did not
240: conform to the restrictions, <b>pcre_exec()</b> returned the error code
241: PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
242: PCRE_INFO_OKPARTIAL call to <b>pcre_fullinfo()</b> to find out if a compiled
243: pattern can be used for partial matching now always returns 1.
244: </P>
245: <br><a name="SEC6" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a><br>
246: <P>
247: If the escape sequence \P is present in a <b>pcretest</b> data line, the
248: PCRE_PARTIAL_SOFT option is used for the match. Here is a run of <b>pcretest</b>
249: that uses the date example quoted above:
250: <pre>
251: re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
252: data> 25jun04\P
253: 0: 25jun04
254: 1: jun
255: data> 25dec3\P
256: Partial match: 25dec3
257: data> 3ju\P
258: Partial match: 3ju
259: data> 3juj\P
260: No match
261: data> j\P
262: No match
263: </pre>
264: The first data string is matched completely, so <b>pcretest</b> shows the
265: matched substrings. The remaining four strings do not match the complete
266: pattern, but the first two are partial matches. Similar output is obtained
1.1.1.2 ! misho 267: if DFA matching is used.
1.1 misho 268: </P>
269: <P>
270: If the escape sequence \P is present more than once in a <b>pcretest</b> data
271: line, the PCRE_PARTIAL_HARD option is set for the match.
272: </P>
1.1.1.2 ! misho 273: <br><a name="SEC7" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre16_dfa_exec()</a><br>
1.1 misho 274: <P>
1.1.1.2 ! misho 275: When a partial match has been found using a DFA matching function, it is
! 276: possible to continue the match by providing additional subject data and calling
! 277: the function again with the same compiled regular expression, this time setting
! 278: the PCRE_DFA_RESTART option. You must pass the same working space as before,
! 279: because this is where details of the previous partial match are stored. Here is
! 280: an example using <b>pcretest</b>, in which the \R escape sequence sets the
! 281: PCRE_DFA_RESTART option (\D specifies the use of the DFA matching function):
1.1 misho 282: <pre>
283: re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
284: data> 23ja\P\D
285: Partial match: 23ja
286: data> n05\R\D
287: 0: n05
288: </pre>
289: The first call has "23ja" as the subject, and requests partial matching; the
290: second call has "n05" as the subject for the continued (restarted) match.
291: Notice that when the match is complete, only the last part is shown; PCRE does
292: not retain the previously partially-matched string. It is up to the calling
293: program to do that if it needs to.
294: </P>
295: <P>
296: You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
297: PCRE_DFA_RESTART to continue partial matching over multiple segments. This
1.1.1.2 ! misho 298: facility can be used to pass very long subject strings to the DFA matching
! 299: functions.
! 300: </P>
! 301: <br><a name="SEC8" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre16_exec()</a><br>
! 302: <P>
! 303: From release 8.00, the standard matching functions can also be used to do
! 304: multi-segment matching. Unlike the DFA functions, they cannot restart the
! 305: previous match with a new segment of data. Instead, new data must
! 306: be added to the previous subject string, and the entire match re-run, starting
! 307: from the point where the partial match occurred. Earlier data can be discarded.
1.1 misho 308: </P>
309: <P>
1.1.1.2 ! misho 310: It is best to use PCRE_PARTIAL_HARD in this situation, because it does not
! 311: treat the end of a segment as the end of the subject when matching \z, \Z,
! 312: \b, \B, and $. Consider an unanchored pattern that matches dates:
1.1 misho 313: <pre>
314: re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
315: data> The date is 23ja\P\P
316: Partial match: 23ja
317: </pre>
318: At this stage, an application could discard the text preceding "23ja", add on
1.1.1.2 ! misho 319: text from the next segment, and call the matching function again. Unlike the
! 320: DFA matching functions, the entire matching string must always be available, and
1.1 misho 321: the complete matching process occurs for each call, so more memory and more
322: processing time are needed.
323: </P>
324: <P>
325: <b>Note:</b> If the pattern contains lookbehind assertions, or \K, or starts
1.1.1.2 ! misho 326: with \b or \B, the string that is returned for a partial match includes
1.1 misho 327: characters that precede the partially matched string itself, because these must
328: be retained when adding on more characters for a subsequent matching attempt.
329: </P>
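<P>
The buffer handling for this approach might look like the following sketch. It
does not call PCRE: the function <b>carry_over()</b> and its parameters are
hypothetical, and <i>keep_from</i> is assumed to be the first offset returned
for the partial match.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function).
   Discards buf[0 .. keep_from), slides the retained tail to the front of
   buf, and appends the next segment. Returns the new length of the data in
   buf, or (size_t)-1 if keep_from is beyond len or the result would not
   fit in cap bytes. The result is not NUL-terminated. */
static size_t carry_over(char *buf, size_t len, size_t cap, size_t keep_from,
                         const char *next, size_t next_len)
{
    size_t kept;

    if (keep_from > len) return (size_t)-1;
    kept = len - keep_from;
    if (kept > cap || next_len > cap - kept) return (size_t)-1;

    memmove(buf, buf + keep_from, kept);   /* regions may overlap */
    memcpy(buf + kept, next, next_len);
    return kept + next_len;
}
```

<P>
For the subject "The date is 23ja" with a partial match starting at offset 12,
appending the segment "n05" leaves "23jan05" in the buffer, ready for the next
call to the matching function.
</P>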
330: <br><a name="SEC9" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br>
331: <P>
332: Certain types of pattern may give problems with multi-segment matching,
333: whichever matching function is used.
334: </P>
335: <P>
336: 1. If the pattern contains a test for the beginning of a line, you need to pass
337: the PCRE_NOTBOL option when the subject string for any call does not start at the
338: beginning of a line. There is also a PCRE_NOTEOL option, but in practice when
339: doing multi-segment matching you should be using PCRE_PARTIAL_HARD, which
340: includes the effect of PCRE_NOTEOL.
341: </P>
342: <P>
343: 2. Lookbehind assertions at the start of a pattern are catered for in the
344: offsets that are returned for a partial match. However, in theory, a lookbehind
345: assertion later in the pattern could require even earlier characters to be
346: inspected, and it might not have been reached when a partial match occurs. This
347: is probably an extremely unlikely case; you could guard against it to a certain
348: extent by always including extra characters at the start.
349: </P>
350: <P>
351: 3. Matching a subject string that is split into multiple segments may not
352: always produce exactly the same result as matching over one single long string,
353: especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
354: Word Boundaries" above describes an issue that arises if the pattern ends with
355: \b or \B. Another kind of difference may occur when there are multiple
356: matching possibilities, because (for PCRE_PARTIAL_SOFT) a partial match result
357: is given only when there are no completed matches. This means that as soon as
358: the shortest match has been found, continuation to a new subject segment is no
359: longer possible. Consider again this <b>pcretest</b> example:
360: <pre>
361: re> /dog(sbody)?/
362: data> dogsb\P
363: 0: dog
364: data> do\P\D
365: Partial match: do
366: data> gsb\R\P\D
367: 0: g
368: data> dogsbody\D
369: 0: dogsbody
370: 1: dog
371: </pre>
1.1.1.2 ! misho 372: The first data line passes the string "dogsb" to a standard matching function,
! 373: setting the PCRE_PARTIAL_SOFT option. Although the string is a partial match
! 374: for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter
! 375: string "dog" is a complete match. Similarly, when the subject is presented to
! 376: a DFA matching function in several parts ("do" and "gsb" being the first two)
! 377: the match stops when "dog" has been found, and it is not possible to continue.
! 378: On the other hand, if "dogsbody" is presented as a single string, a DFA
! 379: matching function finds both matches.
1.1 misho 380: </P>
381: <P>
382: Because of these problems, it is best to use PCRE_PARTIAL_HARD when matching
383: multi-segment data. The example above then behaves differently:
384: <pre>
385: re> /dog(sbody)?/
386: data> dogsb\P\P
387: Partial match: dogsb
388: data> do\P\D
389: Partial match: do
390: data> gsb\R\P\P\D
391: Partial match: gsb
392: </pre>
</P>
<P>
1.1.1.2 ! misho 393: 4. Patterns that contain alternatives at the top level which do not all start
! 394: with the same pattern item may not work as expected when PCRE_DFA_RESTART is
! 395: used. For example, consider this pattern:
1.1 misho 396: <pre>
397: 1234|3789
398: </pre>
399: If the first part of the subject is "ABC123", a partial match of the first
400: alternative is found at offset 3. There is no partial match for the second
401: alternative, because such a match does not start at the same point in the
402: subject string. Attempting to continue with the string "7890" does not yield a
403: match because only those alternatives that match at one point in the subject
404: are remembered. The problem arises because the start of the second alternative
405: matches within the first alternative. There is no problem with anchored
406: patterns or patterns such as:
407: <pre>
408: 1234|ABCD
409: </pre>
410: where no string can be a partial match for both alternatives. This is not a
1.1.1.2 ! misho 411: problem if a standard matching function is used, because the entire match has
! 412: to be rerun each time:
1.1 misho 413: <pre>
414: re> /1234|3789/
415: data> ABC123\P\P
416: Partial match: 123
417: data> 1237890
418: 0: 3789
419: </pre>
420: Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running
1.1.1.2 ! misho 421: the entire match can also be used with the DFA matching functions. Another
1.1 misho 422: possibility is to work with two buffers. If a partial match at offset <i>n</i>
423: in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on
424: the second buffer, you can then try a new match starting at offset <i>n+1</i> in
425: the first buffer.
426: </P>
427: <br><a name="SEC10" href="#TOC1">AUTHOR</a><br>
428: <P>
429: Philip Hazel
430: <br>
431: University Computing Service
432: <br>
433: Cambridge CB2 3QH, England.
434: <br>
435: </P>
436: <br><a name="SEC11" href="#TOC1">REVISION</a><br>
437: <P>
1.1.1.2 ! misho 438: Last updated: 21 January 2012
1.1 misho 439: <br>
1.1.1.2 ! misho 440: Copyright © 1997-2012 University of Cambridge.
1.1 misho 441: <br>
442: <p>
443: Return to the <a href="index.html">PCRE index page</a>.
444: </p>