! 1: <html>
! 2: <head>
! 3: <title>pcrepartial specification</title>
! 4: </head>
! 5: <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
! 6: <h1>pcrepartial man page</h1>
! 7: <p>
! 8: Return to the <a href="index.html">PCRE index page</a>.
! 9: </p>
! 10: <p>
! 11: This page is part of the PCRE HTML documentation. It was generated automatically
! 12: from the original man page. If there is any nonsense in it, please consult the
! 13: man page, in case the conversion went wrong.
! 14: <br>
! 15: <ul>
! 16: <li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a>
! 17: <li><a name="TOC2" href="#SEC2">PARTIAL MATCHING USING pcre_exec()</a>
! 18: <li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre_dfa_exec()</a>
! 19: <li><a name="TOC4" href="#SEC4">PARTIAL MATCHING AND WORD BOUNDARIES</a>
! 20: <li><a name="TOC5" href="#SEC5">FORMERLY RESTRICTED PATTERNS</a>
! 21: <li><a name="TOC6" href="#SEC6">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a>
! 22: <li><a name="TOC7" href="#SEC7">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a>
! 23: <li><a name="TOC8" href="#SEC8">MULTI-SEGMENT MATCHING WITH pcre_exec()</a>
! 24: <li><a name="TOC9" href="#SEC9">ISSUES WITH MULTI-SEGMENT MATCHING</a>
! 25: <li><a name="TOC10" href="#SEC10">AUTHOR</a>
! 26: <li><a name="TOC11" href="#SEC11">REVISION</a>
! 27: </ul>
! 28: <br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br>
! 29: <P>
! 30: In normal use of PCRE, if the subject string that is passed to
! 31: <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b> matches as far as it goes, but is
! 32: too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
! 33: are circumstances where it might be helpful to distinguish this case from other
! 34: cases in which there is no match.
! 35: </P>
! 36: <P>
! 37: Consider, for example, an application where a human is required to type in data
! 38: for a field with specific formatting requirements. An example might be a date
! 39: in the form <i>ddmmmyy</i>, defined by this pattern:
! 40: <pre>
! 41: ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
! 42: </pre>
! 43: If the application sees the user's keystrokes one by one, and can check that
! 44: what has been typed so far is potentially valid, it is able to raise an error
! 45: as soon as a mistake is made, for example by beeping and not echoing the
! 46: character that has been typed. This immediate feedback is likely to be a better
! 47: user interface than a check that is delayed until the entire string has been
! 48: entered. Partial matching can also be useful when the subject string is very
! 49: long and is not all available at once.
! 50: </P>
! 51: <P>
! 52: PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
! 53: PCRE_PARTIAL_HARD options, which can be set when calling <b>pcre_exec()</b> or
! 54: <b>pcre_dfa_exec()</b>. For backwards compatibility, PCRE_PARTIAL is a synonym
! 55: for PCRE_PARTIAL_SOFT. The essential difference between the two options is
! 56: whether or not a partial match is preferred to an alternative complete match,
! 57: though the details differ between the two matching functions. If both options
! 58: are set, PCRE_PARTIAL_HARD takes precedence.
! 59: </P>
! 60: <P>
! 61: Setting a partial matching option for <b>pcre_exec()</b> disables the use of any
! 62: just-in-time code that was set up by calling <b>pcre_study()</b> with the
! 63: PCRE_STUDY_JIT_COMPILE option. It also disables two of PCRE's standard
! 64: optimizations. PCRE remembers the last literal byte in a pattern, and abandons
! 65: matching immediately if such a byte is not present in the subject string. This
! 66: optimization cannot be used for a subject string that might match only
! 67: partially. If the pattern was studied, PCRE knows the minimum length of a
! 68: matching string, and does not bother to run the matching function on shorter
! 69: strings. This optimization is also disabled for partial matching.
! 70: </P>
! 71: <br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec()</a><br>
! 72: <P>
! 73: A partial match occurs during a call to <b>pcre_exec()</b> when the end of the
! 74: subject string is reached successfully, but matching cannot continue because
! 75: more characters are needed. However, at least one character in the subject must
! 76: have been inspected. This character need not form part of the final matched
! 77: string; lookbehind assertions and the \K escape sequence provide ways of
! 78: inspecting characters before the start of a matched substring. The requirement
! 79: for inspecting at least one character exists because an empty string can always
! 80: be matched; without such a restriction there would always be a partial match of
! 81: an empty string at the end of the subject.
! 82: </P>
! 83: <P>
! 84: If there are at least two slots in the offsets vector when <b>pcre_exec()</b>
! 85: returns with a partial match, the first slot is set to the offset of the
! 86: earliest character that was inspected when the partial match was found. For
! 87: convenience, the second offset points to the end of the subject so that a
! 88: substring can easily be identified.
! 89: </P>
! 90: <P>
! 91: For the majority of patterns, the first offset identifies the start of the
! 92: partially matched string. However, for patterns that contain lookbehind
! 93: assertions, or \K, or begin with \b or \B, earlier characters have been
! 94: inspected while carrying out the match. For example:
! 95: <pre>
! 96: /(?<=abc)123/
! 97: </pre>
! 98: This pattern matches "123", but only if it is preceded by "abc". If the subject
! 99: string is "xyzabc12", the offsets after a partial match are for the substring
! 100: "abc12", because all these characters are needed if another match is tried
! 101: with extra characters added to the subject.
! 102: </P>
! 103: <P>
! 104: What happens when a partial match is identified depends on which of the two
! 105: partial matching options is set.
! 106: </P>
! 107: <br><b>
! 108: PCRE_PARTIAL_SOFT with pcre_exec()
! 109: </b><br>
! 110: <P>
! 111: If PCRE_PARTIAL_SOFT is set when <b>pcre_exec()</b> identifies a partial match,
! 112: the partial match is remembered, but matching continues as normal, and other
! 113: alternatives in the pattern are tried. If no complete match can be found,
! 114: <b>pcre_exec()</b> returns PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH.
! 115: </P>
! 116: <P>
! 117: This option is "soft" because it prefers a complete match over a partial match.
! 118: All the various matching items in a pattern behave as if the subject string is
! 119: potentially complete. For example, \z, \Z, and $ match at the end of the
! 120: subject, as normal, and for \b and \B the end of the subject is treated as a
! 121: non-alphanumeric.
! 122: </P>
! 123: <P>
! 124: If there is more than one partial match, the first one that was found provides
! 125: the data that is returned. Consider this pattern:
! 126: <pre>
! 127: /123\w+X|dogY/
! 128: </pre>
! 129: If this is matched against the subject string "abc123dog", both
! 130: alternatives fail to match, but the end of the subject is reached during
! 131: matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set to 3 and 9,
! 132: identifying "123dog" as the first partial match that was found. (In this
! 133: example, there are two partial matches, because "dog" on its own partially
! 134: matches the second alternative.)
! 135: </P>
! 136: <br><b>
! 137: PCRE_PARTIAL_HARD with pcre_exec()
! 138: </b><br>
! 139: <P>
! 140: If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b>, it returns
! 141: PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
! 142: search for possible complete matches. This option is "hard" because it prefers
! 143: an earlier partial match over a later complete match. For this reason, the
! 144: assumption is made that the end of the supplied subject string may not be the
! 145: true end of the available data, and so, if \z, \Z, \b, \B, or $ are
! 146: encountered at the end of the subject, the result is PCRE_ERROR_PARTIAL.
! 147: </P>
! 148: <P>
! 149: Setting PCRE_PARTIAL_HARD also affects the way <b>pcre_exec()</b> checks UTF-8
! 150: subject strings for validity. Normally, an invalid UTF-8 sequence causes the
! 151: error PCRE_ERROR_BADUTF8. However, in the special case of a truncated UTF-8
! 152: character at the end of the subject, PCRE_ERROR_SHORTUTF8 is returned when
! 153: PCRE_PARTIAL_HARD is set.
! 154: </P>
! 155: <br><b>
! 156: Comparing hard and soft partial matching
! 157: </b><br>
! 158: <P>
! 159: The difference between the two partial matching options can be illustrated by a
! 160: pattern such as:
! 161: <pre>
! 162: /dog(sbody)?/
! 163: </pre>
! 164: This matches either "dog" or "dogsbody", greedily (that is, it prefers the
! 165: longer string if possible). If it is matched against the string "dog" with
! 166: PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if
! 167: PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand,
! 168: if the pattern is made ungreedy the result is different:
! 169: <pre>
! 170: /dog(sbody)??/
! 171: </pre>
! 172: In this case the result is always a complete match because <b>pcre_exec()</b>
! 173: finds that first, and it never continues after finding a match. It might be
! 174: easier to follow this explanation by thinking of the two patterns like this:
! 175: <pre>
! 176: /dog(sbody)?/ is the same as /dogsbody|dog/
! 177: /dog(sbody)??/ is the same as /dog|dogsbody/
! 178: </pre>
! 179: The second pattern will never match "dogsbody" when <b>pcre_exec()</b> is
! 180: used, because it will always find the shorter match first.
! 181: </P>
! 182: <br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre_dfa_exec()</a><br>
! 183: <P>
! 184: The <b>pcre_dfa_exec()</b> function moves along the subject string character by
! 185: character, without backtracking, searching for all possible matches
! 186: simultaneously. If the end of the subject is reached before the end of the
! 187: pattern, there is the possibility of a partial match, again provided that at
! 188: least one character has been inspected.
! 189: </P>
! 190: <P>
! 191: When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
! 192: have been no complete matches. Otherwise, the complete matches are returned.
! 193: However, if PCRE_PARTIAL_HARD is set, a partial match takes precedence over any
! 194: complete matches. The portion of the string that was inspected when the longest
! 195: partial match was found is set as the first matching string, provided there are
! 196: at least two slots in the offsets vector.
! 197: </P>
! 198: <P>
! 199: Because <b>pcre_dfa_exec()</b> always searches for all possible matches, and
! 200: there is no difference between greedy and ungreedy repetition, its behaviour is
! 201: different from <b>pcre_exec()</b> when PCRE_PARTIAL_HARD is set. Consider the
! 202: string "dog" matched against the ungreedy pattern shown above:
! 203: <pre>
! 204: /dog(sbody)??/
! 205: </pre>
! 206: Whereas <b>pcre_exec()</b> stops as soon as it finds the complete match for
! 207: "dog", <b>pcre_dfa_exec()</b> also finds the partial match for "dogsbody", and
! 208: so returns that when PCRE_PARTIAL_HARD is set.
! 209: </P>
! 210: <br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br>
! 211: <P>
! 212: If a pattern ends with one of the sequences \b or \B, which test for word
! 213: boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive
! 214: results. Consider this pattern:
! 215: <pre>
! 216: /\bcat\b/
! 217: </pre>
! 218: This matches "cat", provided there is a word boundary at either end. If the
! 219: subject string is "the cat", the comparison of the final "t" with a following
! 220: character cannot take place, so a partial match is found. However,
! 221: <b>pcre_exec()</b> carries on with normal matching, which matches \b at the end
! 222: of the subject when the last character is a letter, thus finding a complete
! 223: match. The result, therefore, is <i>not</i> PCRE_ERROR_PARTIAL. The same thing
! 224: happens with <b>pcre_dfa_exec()</b>, because it also finds the complete match.
! 225: </P>
! 226: <P>
! 227: Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
! 228: then the partial match takes precedence.
! 229: </P>
! 230: <br><a name="SEC5" href="#TOC1">FORMERLY RESTRICTED PATTERNS</a><br>
! 231: <P>
! 232: For releases of PCRE prior to 8.00, because of the way certain internal
! 233: optimizations were implemented in the <b>pcre_exec()</b> function, the
! 234: PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be used with
! 235: all patterns. From release 8.00 onwards, the restrictions no longer apply, and
! 236: partial matching with <b>pcre_exec()</b> can be requested for any pattern.
! 237: </P>
! 238: <P>
! 239: Items that were formerly restricted were repeated single characters and
! 240: repeated metasequences. If PCRE_PARTIAL was set for a pattern that did not
! 241: conform to the restrictions, <b>pcre_exec()</b> returned the error code
! 242: PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
! 243: PCRE_INFO_OKPARTIAL call to <b>pcre_fullinfo()</b> to find out if a compiled
! 244: pattern can be used for partial matching now always returns 1.
! 245: </P>
! 246: <br><a name="SEC6" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a><br>
! 247: <P>
! 248: If the escape sequence \P is present in a <b>pcretest</b> data line, the
! 249: PCRE_PARTIAL_SOFT option is used for the match. Here is a run of <b>pcretest</b>
! 250: that uses the date example quoted above:
! 251: <pre>
! 252: re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
! 253: data> 25jun04\P
! 254: 0: 25jun04
! 255: 1: jun
! 256: data> 25dec3\P
! 257: Partial match: 25dec3
! 258: data> 3ju\P
! 259: Partial match: 3ju
! 260: data> 3juj\P
! 261: No match
! 262: data> j\P
! 263: No match
! 264: </pre>
! 265: The first data string is matched completely, so <b>pcretest</b> shows the
! 266: matched substrings. The remaining four strings do not match the complete
! 267: pattern, but the first two are partial matches. Similar output is obtained
! 268: when <b>pcre_dfa_exec()</b> is used.
! 269: </P>
! 270: <P>
! 271: If the escape sequence \P is present more than once in a <b>pcretest</b> data
! 272: line, the PCRE_PARTIAL_HARD option is set for the match.
! 273: </P>
! 274: <br><a name="SEC7" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a><br>
! 275: <P>
! 276: When a partial match has been found using <b>pcre_dfa_exec()</b>, it is possible
! 277: to continue the match by providing additional subject data and calling
! 278: <b>pcre_dfa_exec()</b> again with the same compiled regular expression, this
! 279: time setting the PCRE_DFA_RESTART option. You must pass the same working
! 280: space as before, because this is where details of the previous partial match
! 281: are stored. Here is an example using <b>pcretest</b>, with the \R escape
! 282: sequence to set the PCRE_DFA_RESTART option (\D specifies the use of
! 283: <b>pcre_dfa_exec()</b>):
! 284: <pre>
! 285: re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
! 286: data> 23ja\P\D
! 287: Partial match: 23ja
! 288: data> n05\R\D
! 289: 0: n05
! 290: </pre>
! 291: The first call has "23ja" as the subject, and requests partial matching; the
! 292: second call has "n05" as the subject for the continued (restarted) match.
! 293: Notice that when the match is complete, only the last part is shown; PCRE does
! 294: not retain the previously partially-matched string. It is up to the calling
! 295: program to do that if it needs to.
! 296: </P>
! 297: <P>
! 298: You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
! 299: PCRE_DFA_RESTART to continue partial matching over multiple segments. This
! 300: facility can be used to pass very long subject strings to
! 301: <b>pcre_dfa_exec()</b>.
! 302: </P>
! 303: <br><a name="SEC8" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_exec()</a><br>
! 304: <P>
! 305: From release 8.00, <b>pcre_exec()</b> can also be used to do multi-segment
! 306: matching. Unlike <b>pcre_dfa_exec()</b>, it is not possible to restart the
! 307: previous match with a new segment of data. Instead, new data must be added to
! 308: the previous subject string, and the entire match re-run, starting from the
! 309: point where the partial match occurred. Earlier data can be discarded. It is
! 310: best to use PCRE_PARTIAL_HARD in this situation, because it does not treat the
! 311: end of a segment as the end of the subject when matching \z, \Z, \b, \B,
! 312: and $. Consider an unanchored pattern that matches dates:
! 313: <pre>
! 314: re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
! 315: data> The date is 23ja\P\P
! 316: Partial match: 23ja
! 317: </pre>
! 318: At this stage, an application could discard the text preceding "23ja", add on
! 319: text from the next segment, and call <b>pcre_exec()</b> again. Unlike
! 320: <b>pcre_dfa_exec()</b>, the entire matching string must always be available, and
! 321: the complete matching process occurs for each call, so more memory and more
! 322: processing time are needed.
! 323: </P>
! 324: <P>
! 325: <b>Note:</b> If the pattern contains lookbehind assertions, or \K, or starts
! 326: with \b or \B, the string that is returned for a partial match will include
! 327: characters that precede the partially matched string itself, because these must
! 328: be retained when adding on more characters for a subsequent matching attempt.
! 329: </P>
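<P>
The buffer handling described above (discard the text before the partial
match, then append the next segment) needs no PCRE calls at all. The helper
below is a sketch of one way to do it; <b>carry_over()</b> is a hypothetical
name, and <i>keep_from</i> is assumed to be the first offset returned by
<b>pcre_exec()</b> for the partial match.
</P>

```c
#include <string.h>

/* Keep buf[keep_from .. len), move it to the front of buf, and append the
   next segment after it. Returns the new length of the data in buf, or -1
   if the arguments are inconsistent or the result would not fit in cap
   bytes. The segment must not overlap buf. */
int carry_over(char *buf, int len, int keep_from, int cap,
               const char *segment, int seglen)
{
    int kept;

    if (keep_from < 0 || keep_from > len || seglen < 0)
        return -1;

    kept = len - keep_from;
    if (kept + seglen > cap)
        return -1;

    /* memmove is used because source and destination may overlap. */
    memmove(buf, buf + keep_from, (size_t)kept);
    memcpy(buf + kept, segment, (size_t)seglen);
    return kept + seglen;
}
```

<P>
For the example above, a buffer holding "The date is 23ja" with
<i>keep_from</i> equal to 12 keeps only "23ja" before the new text is added.
Because the first offset already accounts for lookbehind assertions, \K, and
a leading \b or \B, using it as <i>keep_from</i> retains the characters
mentioned in the note.
</P>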
! 330: <br><a name="SEC9" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br>
! 331: <P>
! 332: Certain types of pattern may cause problems with multi-segment matching,
! 333: whichever matching function is used.
! 334: </P>
! 335: <P>
! 336: 1. If the pattern contains a test for the beginning of a line, you need to pass
! 337: the PCRE_NOTBOL option when the subject string for any call does not start at the
! 338: beginning of a line. There is also a PCRE_NOTEOL option, but in practice when
! 339: doing multi-segment matching you should be using PCRE_PARTIAL_HARD, which
! 340: includes the effect of PCRE_NOTEOL.
! 341: </P>
! 342: <P>
! 343: 2. Lookbehind assertions at the start of a pattern are catered for in the
! 344: offsets that are returned for a partial match. However, in theory, a lookbehind
! 345: assertion later in the pattern could require even earlier characters to be
! 346: inspected, and it might not have been reached when a partial match occurs. This
! 347: is probably an extremely unlikely case; you could guard against it to a certain
! 348: extent by always including extra characters at the start.
! 349: </P>
! 350: <P>
! 351: 3. Matching a subject string that is split into multiple segments may not
! 352: always produce exactly the same result as matching over one single long string,
! 353: especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
! 354: Word Boundaries" above describes an issue that arises if the pattern ends with
! 355: \b or \B. Another kind of difference may occur when there are multiple
! 356: matching possibilities, because (for PCRE_PARTIAL_SOFT) a partial match result
! 357: is given only when there are no completed matches. This means that as soon as
! 358: the shortest match has been found, continuation to a new subject segment is no
! 359: longer possible. Consider again this <b>pcretest</b> example:
! 360: <pre>
! 361: re> /dog(sbody)?/
! 362: data> dogsb\P
! 363: 0: dog
! 364: data> do\P\D
! 365: Partial match: do
! 366: data> gsb\R\P\D
! 367: 0: g
! 368: data> dogsbody\D
! 369: 0: dogsbody
! 370: 1: dog
! 371: </pre>
! 372: The first data line passes the string "dogsb" to <b>pcre_exec()</b>, setting the
! 373: PCRE_PARTIAL_SOFT option. Although the string is a partial match for
! 374: "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter string
! 375: "dog" is a complete match. Similarly, when the subject is presented to
! 376: <b>pcre_dfa_exec()</b> in several parts ("do" and "gsb" being the first two) the
! 377: match stops when "dog" has been found, and it is not possible to continue. On
! 378: the other hand, if "dogsbody" is presented as a single string,
! 379: <b>pcre_dfa_exec()</b> finds both matches.
! 380: </P>
! 381: <P>
! 382: Because of these problems, it is best to use PCRE_PARTIAL_HARD when matching
! 383: multi-segment data. The example above then behaves differently:
! 384: <pre>
! 385: re> /dog(sbody)?/
! 386: data> dogsb\P\P
! 387: Partial match: dogsb
! 388: data> do\P\D
! 389: Partial match: do
! 390: data> gsb\R\P\P\D
! 391: Partial match: gsb
! 392: </pre>
</P>
<P>
! 393: 4. Patterns that contain alternatives at the top level which do not all
! 394: start with the same pattern item may not work as expected when
! 395: PCRE_DFA_RESTART is used with <b>pcre_dfa_exec()</b>. For example, consider this
! 396: pattern:
! 397: <pre>
! 398: 1234|3789
! 399: </pre>
! 400: If the first part of the subject is "ABC123", a partial match of the first
! 401: alternative is found at offset 3. There is no partial match for the second
! 402: alternative, because such a match does not start at the same point in the
! 403: subject string. Attempting to continue with the string "7890" does not yield a
! 404: match because only those alternatives that match at one point in the subject
! 405: are remembered. The problem arises because the start of the second alternative
! 406: matches within the first alternative. There is no problem with anchored
! 407: patterns or patterns such as:
! 408: <pre>
! 409: 1234|ABCD
! 410: </pre>
! 411: where no string can be a partial match for both alternatives. This is not a
! 412: problem if <b>pcre_exec()</b> is used, because the entire match has to be rerun
! 413: each time:
! 414: <pre>
! 415: re> /1234|3789/
! 416: data> ABC123\P\P
! 417: Partial match: 123
! 418: data> 1237890
! 419: 0: 3789
! 420: </pre>
! 421: Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running
! 422: the entire match can also be used with <b>pcre_dfa_exec()</b>. Another
! 423: possibility is to work with two buffers. If a partial match at offset <i>n</i>
! 424: in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on
! 425: the second buffer, you can then try a new match starting at offset <i>n+1</i> in
! 426: the first buffer.
! 427: </P>
! 428: <br><a name="SEC10" href="#TOC1">AUTHOR</a><br>
! 429: <P>
! 430: Philip Hazel
! 431: <br>
! 432: University Computing Service
! 433: <br>
! 434: Cambridge CB2 3QH, England.
! 435: <br>
! 436: </P>
! 437: <br><a name="SEC11" href="#TOC1">REVISION</a><br>
! 438: <P>
! 439: Last updated: 26 August 2011
! 440: <br>
! 441: Copyright © 1997-2011 University of Cambridge.
! 442: <br>
! 443: <p>
! 444: Return to the <a href="index.html">PCRE index page</a>.
! 445: </p>