Annotation of embedaddon/pcre/doc/html/pcrepartial.html, revision 1.1.1.1

1.1       misho       1: <html>
                      2: <head>
                      3: <title>pcrepartial specification</title>
                      4: </head>
                      5: <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
                      6: <h1>pcrepartial man page</h1>
                      7: <p>
                      8: Return to the <a href="index.html">PCRE index page</a>.
                      9: </p>
                     10: <p>
                     11: This page is part of the PCRE HTML documentation. It was generated automatically
                     12: from the original man page. If there is any nonsense in it, please consult the
                     13: man page, in case the conversion went wrong.
                     14: <br>
                     15: <ul>
                     16: <li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a>
                     17: <li><a name="TOC2" href="#SEC2">PARTIAL MATCHING USING pcre_exec()</a>
                     18: <li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre_dfa_exec()</a>
                     19: <li><a name="TOC4" href="#SEC4">PARTIAL MATCHING AND WORD BOUNDARIES</a>
                     20: <li><a name="TOC5" href="#SEC5">FORMERLY RESTRICTED PATTERNS</a>
                     21: <li><a name="TOC6" href="#SEC6">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a>
                     22: <li><a name="TOC7" href="#SEC7">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a>
                     23: <li><a name="TOC8" href="#SEC8">MULTI-SEGMENT MATCHING WITH pcre_exec()</a>
                     24: <li><a name="TOC9" href="#SEC9">ISSUES WITH MULTI-SEGMENT MATCHING</a>
                     25: <li><a name="TOC10" href="#SEC10">AUTHOR</a>
                     26: <li><a name="TOC11" href="#SEC11">REVISION</a>
                     27: </ul>
                     28: <br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br>
                     29: <P>
                     30: In normal use of PCRE, if the subject string that is passed to
                     31: <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b> matches as far as it goes, but is
                     32: too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
                     33: are circumstances where it might be helpful to distinguish this case from other
                     34: cases in which there is no match.
                     35: </P>
                     36: <P>
                     37: Consider, for example, an application where a human is required to type in data
                     38: for a field with specific formatting requirements. An example might be a date
                     39: in the form <i>ddmmmyy</i>, defined by this pattern:
                     40: <pre>
                     41:   ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
                     42: </pre>
                     43: If the application sees the user's keystrokes one by one, and can check that
                     44: what has been typed so far is potentially valid, it is able to raise an error
                     45: as soon as a mistake is made, by beeping and not reflecting the character that
                     46: has been typed, for example. This immediate feedback is likely to be a better
                     47: user interface than a check that is delayed until the entire string has been
                     48: entered. Partial matching can also be useful when the subject string is very
                     49: long and is not all available at once.
                     50: </P>
                     51: <P>
                     52: PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
                     53: PCRE_PARTIAL_HARD options, which can be set when calling <b>pcre_exec()</b> or
                     54: <b>pcre_dfa_exec()</b>. For backwards compatibility, PCRE_PARTIAL is a synonym
                     55: for PCRE_PARTIAL_SOFT. The essential difference between the two options is
                     56: whether or not a partial match is preferred to an alternative complete match,
                     57: though the details differ between the two matching functions. If both options
                     58: are set, PCRE_PARTIAL_HARD takes precedence.
                     59: </P>
                     60: <P>
                     61: Setting a partial matching option for <b>pcre_exec()</b> disables the use of any
                     62: just-in-time code that was set up by calling <b>pcre_study()</b> with the
                     63: PCRE_STUDY_JIT_COMPILE option. It also disables two of PCRE's standard
                     64: optimizations. PCRE remembers the last literal byte in a pattern, and abandons
                     65: matching immediately if such a byte is not present in the subject string. This
                     66: optimization cannot be used for a subject string that might match only
                     67: partially. If the pattern was studied, PCRE knows the minimum length of a
                     68: matching string, and does not bother to run the matching function on shorter
                     69: strings. This optimization is also disabled for partial matching.
                     70: </P>
                     71: <br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec()</a><br>
                     72: <P>
                     73: A partial match occurs during a call to <b>pcre_exec()</b> when the end of the
                     74: subject string is reached successfully, but matching cannot continue because
                     75: more characters are needed. However, at least one character in the subject must
                     76: have been inspected. This character need not form part of the final matched
                     77: string; lookbehind assertions and the \K escape sequence provide ways of
                     78: inspecting characters before the start of a matched substring. The requirement
                     79: for inspecting at least one character exists because an empty string can always
                     80: be matched; without such a restriction there would always be a partial match of
                     81: an empty string at the end of the subject.
                     82: </P>
                     83: <P>
                     84: If there are at least two slots in the offsets vector when <b>pcre_exec()</b>
                     85: returns with a partial match, the first slot is set to the offset of the
                     86: earliest character that was inspected when the partial match was found. For
                     87: convenience, the second offset points to the end of the subject so that a
                     88: substring can easily be identified.
                     89: </P>
                     90: <P>
                     91: For the majority of patterns, the first offset identifies the start of the
                     92: partially matched string. However, for patterns that contain lookbehind
                     93: assertions, or \K, or begin with \b or \B, earlier characters have been
                     94: inspected while carrying out the match. For example:
                     95: <pre>
                     96:   /(?&#60;=abc)123/
                     97: </pre>
                     98: This pattern matches "123", but only if it is preceded by "abc". If the subject
                     99: string is "xyzabc12", the offsets after a partial match are for the substring
                    100: "abc12", because all these characters are needed if another match is tried
                    101: with extra characters added to the subject.
                    102: </P>
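<P>
As an illustration of how the two offsets can be used, here is a minimal C
sketch that copies the identified substring out of the subject. The helper name
<b>partial_substring()</b> is hypothetical and is not part of PCRE. It assumes
that the caller has already received PCRE_ERROR_PARTIAL and that the first two
slots of the offsets vector were filled in as described above.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a PCRE function): copy the substring described by
 * the first two offsets of a partial match into out, NUL-terminated.
 * Returns the number of bytes copied, or -1 if the offsets are inconsistent
 * or out is too small.
 */
static int partial_substring(const char *subject, const int *ovector,
                             char *out, size_t outsize)
{
    assert(subject != NULL && ovector != NULL && out != NULL);

    int start = ovector[0];   /* earliest character that was inspected */
    int end = ovector[1];     /* end of the subject */
    if (start < 0 || end < start)
        return -1;

    size_t len = (size_t)(end - start);
    if (len + 1 > outsize)
        return -1;

    memcpy(out, subject + start, len);
    out[len] = '\0';
    return (int)len;
}
```

<P>
For the subject "xyzabc12", offsets of 3 and 8 (worked out by hand from the
text above, not taken from a real run) select "abc12", which is the part of the
subject that has to be kept for a later matching attempt.
</P>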
                    103: <P>
                    104: What happens when a partial match is identified depends on which of the two
                    105: partial matching options is set.
                    106: </P>
                    107: <br><b>
                    108: PCRE_PARTIAL_SOFT with pcre_exec()
                    109: </b><br>
                    110: <P>
                    111: If PCRE_PARTIAL_SOFT is set when <b>pcre_exec()</b> identifies a partial match,
                    112: the partial match is remembered, but matching continues as normal, and other
                    113: alternatives in the pattern are tried. If no complete match can be found,
                    114: <b>pcre_exec()</b> returns PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH.
                    115: </P>
                    116: <P>
                    117: This option is "soft" because it prefers a complete match over a partial match.
                    118: All the various matching items in a pattern behave as if the subject string is
                    119: potentially complete. For example, \z, \Z, and $ match at the end of the
                    120: subject, as normal, and for \b and \B the end of the subject is treated as a
                    121: non-alphanumeric.
                    122: </P>
                    123: <P>
                    124: If there is more than one partial match, the first one that was found provides
                    125: the data that is returned. Consider this pattern:
                    126: <pre>
                    127:   /123\w+X|dogY/
                    128: </pre>
                    129: If this is matched against the subject string "abc123dog", both
                    130: alternatives fail to match, but the end of the subject is reached during
                    131: matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set to 3 and 9,
                    132: identifying "123dog" as the first partial match that was found. (In this
                    133: example, there are two partial matches, because "dog" on its own partially
                    134: matches the second alternative.)
                    135: </P>
                    136: <br><b>
                    137: PCRE_PARTIAL_HARD with pcre_exec()
                    138: </b><br>
                    139: <P>
                    140: If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b>, it returns
                    141: PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
                    142: search for possible complete matches. This option is "hard" because it prefers
                    143: an earlier partial match over a later complete match. For this reason, the
                    144: assumption is made that the end of the supplied subject string may not be the
                    145: true end of the available data, and so, if \z, \Z, \b, \B, or $ are
                    146: encountered at the end of the subject, the result is PCRE_ERROR_PARTIAL.
                    147: </P>
                    148: <P>
                    149: Setting PCRE_PARTIAL_HARD also affects the way <b>pcre_exec()</b> checks UTF-8
                    150: subject strings for validity. Normally, an invalid UTF-8 sequence causes the
                    151: error PCRE_ERROR_BADUTF8. However, in the special case of a truncated UTF-8
                    152: character at the end of the subject, PCRE_ERROR_SHORTUTF8 is returned when
                    153: PCRE_PARTIAL_HARD is set.
                    154: </P>
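<P>
To make "a truncated UTF-8 character at the end of the subject" concrete, the
following C sketch reports how many bytes are still missing from the last
character of a buffer. It is an application-side check under simplifying
assumptions, not PCRE's own validator: the function name
<b>utf8_missing_tail()</b> is hypothetical, only sequences of up to four bytes
are considered, and nothing before the final character is validated.
</P>

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not a PCRE function): return the number of bytes
 * missing from a truncated UTF-8 character at the end of buf, or 0 if the
 * buffer does not end in the middle of a character. Only the final
 * character is examined.
 */
static int utf8_missing_tail(const unsigned char *buf, size_t len)
{
    assert(buf != NULL);

    size_t i = len;
    int seen = 0;

    /* Walk back over at most three continuation bytes (10xxxxxx). */
    while (i > 0 && seen < 3 && (buf[i - 1] & 0xC0) == 0x80) {
        i--;
        seen++;
    }
    if (i == 0)
        return 0;                      /* no lead byte in the buffer */

    unsigned char lead = buf[i - 1];
    int need;
    if ((lead & 0x80) == 0x00)
        need = 1;                      /* 0xxxxxxx: single byte */
    else if ((lead & 0xE0) == 0xC0)
        need = 2;                      /* 110xxxxx */
    else if ((lead & 0xF0) == 0xE0)
        need = 3;                      /* 1110xxxx */
    else if ((lead & 0xF8) == 0xF0)
        need = 4;                      /* 11110xxx */
    else
        return 0;                      /* not a lead byte this sketch handles */

    int have = seen + 1;               /* lead byte plus continuation bytes */
    return have < need ? need - have : 0;
}
```

<P>
A program that feeds data in segments might use such a check to hold back an
incomplete final character until the next segment arrives.
</P>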
                    155: <br><b>
                    156: Comparing hard and soft partial matching
                    157: </b><br>
                    158: <P>
                    159: The difference between the two partial matching options can be illustrated by a
                    160: pattern such as:
                    161: <pre>
                    162:   /dog(sbody)?/
                    163: </pre>
                    164: This matches either "dog" or "dogsbody", greedily (that is, it prefers the
                    165: longer string if possible). If it is matched against the string "dog" with
                    166: PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if
                    167: PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand,
                    168: if the pattern is made ungreedy the result is different:
                    169: <pre>
                    170:   /dog(sbody)??/
                    171: </pre>
                    172: In this case the result is always a complete match because <b>pcre_exec()</b>
                    173: finds that first, and it never continues after finding a match. It might be
                    174: easier to follow this explanation by thinking of the two patterns like this:
                    175: <pre>
                    176:   /dog(sbody)?/    is the same as  /dogsbody|dog/
                    177:   /dog(sbody)??/   is the same as  /dog|dogsbody/
                    178: </pre>
                    179: The second pattern will never match "dogsbody" when <b>pcre_exec()</b> is
                    180: used, because it will always find the shorter match first.
                    181: </P>
                    182: <br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre_dfa_exec()</a><br>
                    183: <P>
                    184: The <b>pcre_dfa_exec()</b> function moves along the subject string character by
                    185: character, without backtracking, searching for all possible matches
                    186: simultaneously. If the end of the subject is reached before the end of the
                    187: pattern, there is the possibility of a partial match, again provided that at
                    188: least one character has been inspected.
                    189: </P>
                    190: <P>
                    191: When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
                    192: have been no complete matches. Otherwise, the complete matches are returned.
                    193: However, if PCRE_PARTIAL_HARD is set, a partial match takes precedence over any
                    194: complete matches. The portion of the string that was inspected when the longest
                    195: partial match was found is set as the first matching string, provided there are
                    196: at least two slots in the offsets vector.
                    197: </P>
                    198: <P>
                    199: Because <b>pcre_dfa_exec()</b> always searches for all possible matches, and
                    200: there is no difference between greedy and ungreedy repetition, its behaviour is
                    201: different from <b>pcre_exec()</b> when PCRE_PARTIAL_HARD is set. Consider the
                    202: string "dog" matched against the ungreedy pattern shown above:
                    203: <pre>
                    204:   /dog(sbody)??/
                    205: </pre>
                    206: Whereas <b>pcre_exec()</b> stops as soon as it finds the complete match for
                    207: "dog", <b>pcre_dfa_exec()</b> also finds the partial match for "dogsbody", and
                    208: so returns that when PCRE_PARTIAL_HARD is set.
                    209: </P>
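<P>
The contrast between the two functions can be mimicked with a toy model in C.
This is a sketch under strong assumptions and not PCRE's implementation: each
pattern is reduced to an ordered list of literal alternatives (as in the "is
the same as" rewriting shown earlier), matching is anchored at the start of the
subject, and all names are hypothetical. <b>exec_model()</b> tries the
alternatives in order and stops early, whereas <b>dfa_model()</b> looks at all
of them before deciding.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum result { NO_MATCH, PARTIAL, COMPLETE };
enum mode { SOFT, HARD };

/* How one literal alternative fares against the subject, anchored at 0. */
static enum result try_alt(const char *alt, const char *subject)
{
    size_t alen = strlen(alt), slen = strlen(subject);

    if (slen >= alen)
        return strncmp(alt, subject, alen) == 0 ? COMPLETE : NO_MATCH;

    /* The subject ran out first: this is a partial match only if at least
       one character was inspected and everything so far agreed. */
    return (slen > 0 && strncmp(alt, subject, slen) == 0) ? PARTIAL : NO_MATCH;
}

/* Model of pcre_exec(): alternatives are tried strictly in order. */
static enum result exec_model(const char *const *alts, size_t n,
                              const char *subject, enum mode mode)
{
    assert(alts != NULL && subject != NULL);
    int saw_partial = 0;

    for (size_t i = 0; i < n; i++) {
        enum result r = try_alt(alts[i], subject);
        if (r == COMPLETE)
            return COMPLETE;           /* never continues after a match */
        if (r == PARTIAL) {
            if (mode == HARD)
                return PARTIAL;        /* hard: stop at the first partial */
            saw_partial = 1;           /* soft: remember it and carry on */
        }
    }
    return saw_partial ? PARTIAL : NO_MATCH;
}

/* Model of pcre_dfa_exec(): every alternative is examined; order is ignored. */
static enum result dfa_model(const char *const *alts, size_t n,
                             const char *subject, enum mode mode)
{
    assert(alts != NULL && subject != NULL);
    int saw_partial = 0, saw_complete = 0;

    for (size_t i = 0; i < n; i++) {
        enum result r = try_alt(alts[i], subject);
        if (r == PARTIAL)
            saw_partial = 1;
        else if (r == COMPLETE)
            saw_complete = 1;
    }
    if (mode == HARD && saw_partial)
        return PARTIAL;                /* hard: partial takes precedence */
    if (saw_complete)
        return COMPLETE;
    return saw_partial ? PARTIAL : NO_MATCH;
}
```

<P>
With the alternatives ordered as "dog", "dogsbody" (the ungreedy pattern) and
the subject "dog", <b>exec_model()</b> gives COMPLETE in both modes, while
<b>dfa_model()</b> gives PARTIAL in HARD mode, which is the behaviour described
above.
</P>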
                    210: <br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br>
                    211: <P>
                    212: If a pattern ends with one of the sequences \b or \B, which test for word
                    213: boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive
                    214: results. Consider this pattern:
                    215: <pre>
                    216:   /\bcat\b/
                    217: </pre>
                    218: This matches "cat", provided there is a word boundary at either end. If the
                    219: subject string is "the cat", the comparison of the final "t" with a following
                    220: character cannot take place, so a partial match is found. However,
                    221: <b>pcre_exec()</b> carries on with normal matching, which matches \b at the end
                    222: of the subject when the last character is a letter, thus finding a complete
                    223: match. The result, therefore, is <i>not</i> PCRE_ERROR_PARTIAL. The same thing
                    224: happens with <b>pcre_dfa_exec()</b>, because it also finds the complete match.
                    225: </P>
                    226: <P>
                    227: Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
                    228: then the partial match takes precedence.
                    229: </P>
                    230: <br><a name="SEC5" href="#TOC1">FORMERLY RESTRICTED PATTERNS</a><br>
                    231: <P>
                    232: For releases of PCRE prior to 8.00, because of the way certain internal
                    233: optimizations were implemented in the <b>pcre_exec()</b> function, the
                    234: PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be used with
                    235: all patterns. From release 8.00 onwards, the restrictions no longer apply, and
                    236: partial matching with <b>pcre_exec()</b> can be requested for any pattern.
                    237: </P>
                    238: <P>
                    239: Items that were formerly restricted were repeated single characters and
                    240: repeated metasequences. If PCRE_PARTIAL was set for a pattern that did not
                    241: conform to the restrictions, <b>pcre_exec()</b> returned the error code
                    242: PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
                    243: PCRE_INFO_OKPARTIAL call to <b>pcre_fullinfo()</b> to find out if a compiled
                    244: pattern can be used for partial matching now always returns 1.
                    245: </P>
                    246: <br><a name="SEC6" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a><br>
                    247: <P>
                    248: If the escape sequence \P is present in a <b>pcretest</b> data line, the
                    249: PCRE_PARTIAL_SOFT option is used for the match. Here is a run of <b>pcretest</b>
                    250: that uses the date example quoted above:
                    251: <pre>
                    252:     re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
                    253:   data&#62; 25jun04\P
                    254:    0: 25jun04
                    255:    1: jun
                    256:   data&#62; 25dec3\P
                    257:   Partial match: 25dec3
                    258:   data&#62; 3ju\P
                    259:   Partial match: 3ju
                    260:   data&#62; 3juj\P
                    261:   No match
                    262:   data&#62; j\P
                    263:   No match
                    264: </pre>
                    265: The first data string is matched completely, so <b>pcretest</b> shows the
                    266: matched substrings. The remaining four strings do not match the complete
                    267: pattern, but the first two are partial matches. Similar output is obtained
                    268: when <b>pcre_dfa_exec()</b> is used.
                    269: </P>
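<P>
The three outcomes shown above can be mimicked without PCRE by a hand-written
check for this one date format. The C sketch below is not how PCRE works and is
no substitute for calling <b>pcre_exec()</b> with PCRE_PARTIAL_SOFT; it only
shows how an application might act on a three-way result. All names are
hypothetical, and the sketch ignores the fact that $ can also match before a
final newline.
</P>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum verdict { NO_MATCH, PARTIAL_MATCH, COMPLETE_MATCH };

static const char *const MONTHS[12] = {
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Are the first n characters of s (1 <= n <= 3) a prefix of a month name? */
static int month_prefix(const char *s, size_t n)
{
    for (size_t m = 0; m < 12; m++)
        if (strncmp(MONTHS[m], s, n) == 0)
            return 1;
    return 0;
}

/* Classify s against: day_digits digits, a month name, then two digits. */
static enum verdict classify_with_day(const char *s, size_t len,
                                      size_t day_digits)
{
    size_t total = day_digits + 5;

    /* An empty string inspects no character; an overlong one cannot match. */
    if (len == 0 || len > total)
        return NO_MATCH;

    for (size_t i = 0; i < len; i++) {
        if (i >= day_digits && i < day_digits + 3) {
            /* Inside the month: the letters typed so far must fit a month. */
            if (!month_prefix(s + day_digits, i - day_digits + 1))
                return NO_MATCH;
        } else if (!isdigit((unsigned char)s[i])) {
            return NO_MATCH;
        }
    }
    return len == total ? COMPLETE_MATCH : PARTIAL_MATCH;
}

/* Try a one-digit and a two-digit day; prefer complete over partial. */
static enum verdict classify_date(const char *s)
{
    assert(s != NULL);
    size_t len = strlen(s);
    enum verdict best = NO_MATCH;

    for (size_t day_digits = 1; day_digits <= 2; day_digits++) {
        enum verdict v = classify_with_day(s, len, day_digits);
        if (v > best)
            best = v;
    }
    return best;
}
```

<P>
An interactive application could call <b>classify_date()</b> on the text typed
so far plus the new character, and refuse the keystroke whenever the result is
NO_MATCH.
</P>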
                    270: <P>
                    271: If the escape sequence \P is present more than once in a <b>pcretest</b> data
                    272: line, the PCRE_PARTIAL_HARD option is set for the match.
                    273: </P>
                    274: <br><a name="SEC7" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a><br>
                    275: <P>
                    276: When a partial match has been found using <b>pcre_dfa_exec()</b>, it is possible
                    277: to continue the match by providing additional subject data and calling
                    278: <b>pcre_dfa_exec()</b> again with the same compiled regular expression, this
                    279: time setting the PCRE_DFA_RESTART option. You must pass the same working
                    280: space as before, because this is where details of the previous partial match
                    281: are stored. Here is an example using <b>pcretest</b>, in which the \R escape
                    282: sequence sets the PCRE_DFA_RESTART option (\D specifies the use of
                    283: <b>pcre_dfa_exec()</b>):
                    284: <pre>
                    285:     re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
                    286:   data&#62; 23ja\P\D
                    287:   Partial match: 23ja
                    288:   data&#62; n05\R\D
                    289:    0: n05
                    290: </pre>
                    291: The first call has "23ja" as the subject, and requests partial matching; the
                    292: second call has "n05" as the subject for the continued (restarted) match.
                    293: Notice that when the match is complete, only the last part is shown; PCRE does
                    294: not retain the previously partially-matched string. It is up to the calling
                    295: program to do that if it needs to.
                    296: </P>
                    297: <P>
                    298: You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
                    299: PCRE_DFA_RESTART to continue partial matching over multiple segments. This
                    300: facility can be used to pass very long subject strings to
                    301: <b>pcre_dfa_exec()</b>.
                    302: </P>
                    303: <br><a name="SEC8" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_exec()</a><br>
                    304: <P>
                    305: From release 8.00, <b>pcre_exec()</b> can also be used to do multi-segment
                    306: matching. Unlike <b>pcre_dfa_exec()</b>, it is not possible to restart the
                    307: previous match with a new segment of data. Instead, new data must be added to
                    308: the previous subject string, and the entire match re-run, starting from the
                    309: point where the partial match occurred. Earlier data can be discarded. It is
                    310: best to use PCRE_PARTIAL_HARD in this situation, because it does not treat the
                    311: end of a segment as the end of the subject when matching \z, \Z, \b, \B,
                    312: and $. Consider an unanchored pattern that matches dates:
                    313: <pre>
                    314:     re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
                    315:   data&#62; The date is 23ja\P\P
                    316:   Partial match: 23ja
                    317: </pre>
                    318: At this stage, an application could discard the text preceding "23ja", add on
                    319: text from the next segment, and call <b>pcre_exec()</b> again. Unlike
                    320: <b>pcre_dfa_exec()</b>, the entire matching string must always be available, and
                    321: the complete matching process occurs for each call, so more memory and more
                    322: processing time are needed.
                    323: </P>
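<P>
The buffer handling for this approach might look like the following C sketch.
It covers only the step of discarding earlier data and adding the next segment;
the call to <b>pcre_exec()</b> itself is left out. The helper name
<b>carry_over()</b> and its parameters are hypothetical, and <i>keep_from</i>
stands for the first offset returned with the partial match.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a PCRE function): after a partial match whose
 * first offset is keep_from, drop the bytes before that offset and append
 * the next segment. buf holds len bytes and has room for cap bytes in total.
 * Returns the new length, or (size_t)-1 if the result (plus a terminating
 * NUL) would not fit.
 */
static size_t carry_over(char *buf, size_t len, size_t cap,
                         size_t keep_from, const char *segment)
{
    assert(buf != NULL && segment != NULL);
    assert(keep_from <= len && len < cap);

    size_t kept = len - keep_from;
    size_t seg_len = strlen(segment);
    if (kept + seg_len + 1 > cap)
        return (size_t)-1;

    memmove(buf, buf + keep_from, kept);    /* discard the earlier data */
    memcpy(buf + kept, segment, seg_len);   /* add the new data */
    buf[kept + seg_len] = '\0';
    return kept + seg_len;
}
```

<P>
For the subject "The date is 23ja", the partial match "23ja" starts at offset
12 (counted by hand). Carrying over from that offset and appending "n05" leaves
"23jan05" in the buffer, which is then matched again from the beginning.
</P>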
                    324: <P>
                    325: <b>Note:</b> If the pattern contains lookbehind assertions, or \K, or starts
                    326: with \b or \B, the string that is returned for a partial match will include
                    327: characters that precede the partially matched string itself, because these must
                    328: be retained when adding on more characters for a subsequent matching attempt.
                    329: </P>
                    330: <br><a name="SEC9" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br>
                    331: <P>
                    332: Certain types of pattern may give problems with multi-segment matching,
                    333: whichever matching function is used.
                    334: </P>
                    335: <P>
                    336: 1. If the pattern contains a test for the beginning of a line, you need to pass
                    337: the PCRE_NOTBOL option when the subject string for any call does not start at the
                    338: beginning of a line. There is also a PCRE_NOTEOL option, but in practice when
                    339: doing multi-segment matching you should be using PCRE_PARTIAL_HARD, which
                    340: includes the effect of PCRE_NOTEOL.
                    341: </P>
                    342: <P>
                    343: 2. Lookbehind assertions at the start of a pattern are catered for in the
                    344: offsets that are returned for a partial match. However, in theory, a lookbehind
                    345: assertion later in the pattern could require even earlier characters to be
                    346: inspected, and it might not have been reached when a partial match occurs. This
                    347: is probably an extremely unlikely case; you could guard against it to a certain
                    348: extent by always including extra characters at the start.
                    349: </P>
                    350: <P>
                    351: 3. Matching a subject string that is split into multiple segments may not
                    352: always produce exactly the same result as matching over one single long string,
                    353: especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
                    354: Word Boundaries" above describes an issue that arises if the pattern ends with
                    355: \b or \B. Another kind of difference may occur when there are multiple
                    356: matching possibilities, because (for PCRE_PARTIAL_SOFT) a partial match result
                    357: is given only when there are no completed matches. This means that as soon as
                    358: the shortest match has been found, continuation to a new subject segment is no
                    359: longer possible. Consider again this <b>pcretest</b> example:
                    360: <pre>
                    361:     re&#62; /dog(sbody)?/
                    362:   data&#62; dogsb\P
                    363:    0: dog
                    364:   data&#62; do\P\D
                    365:   Partial match: do
                    366:   data&#62; gsb\R\P\D
                    367:    0: g
                    368:   data&#62; dogsbody\D
                    369:    0: dogsbody
                    370:    1: dog
                    371: </pre>
                    372: The first data line passes the string "dogsb" to <b>pcre_exec()</b>, setting the
                    373: PCRE_PARTIAL_SOFT option. Although the string is a partial match for
                    374: "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter string
                    375: "dog" is a complete match. Similarly, when the subject is presented to
                    376: <b>pcre_dfa_exec()</b> in several parts ("do" and "gsb" being the first two) the
                    377: match stops when "dog" has been found, and it is not possible to continue. On
                    378: the other hand, if "dogsbody" is presented as a single string,
                    379: <b>pcre_dfa_exec()</b> finds both matches.
                    380: </P>
                    381: <P>
                    382: Because of these problems, it is best to use PCRE_PARTIAL_HARD when matching
                    383: multi-segment data. The example above then behaves differently:
                    384: <pre>
                    385:     re&#62; /dog(sbody)?/
                    386:   data&#62; dogsb\P\P
                    387:   Partial match: dogsb
                    388:   data&#62; do\P\D
                    389:   Partial match: do
                    390:   data&#62; gsb\R\P\P\D
                    391:   Partial match: gsb
                    392: </pre>
                         </P>
                         <P>
                    393: 4. Patterns that contain alternatives at the top level which do not all
                    394: start with the same pattern item may not work as expected when
                    395: PCRE_DFA_RESTART is used with <b>pcre_dfa_exec()</b>. For example, consider this
                    396: pattern:
                    397: <pre>
                    398:   1234|3789
                    399: </pre>
                    400: If the first part of the subject is "ABC123", a partial match of the first
                    401: alternative is found at offset 3. There is no partial match for the second
                    402: alternative, because such a match does not start at the same point in the
                    403: subject string. Attempting to continue with the string "7890" does not yield a
                    404: match because only those alternatives that match at one point in the subject
                    405: are remembered. The problem arises because the start of the second alternative
                    406: matches within the first alternative. There is no problem with anchored
                    407: patterns or patterns such as:
                    408: <pre>
                    409:   1234|ABCD
                    410: </pre>
                    411: where no string can be a partial match for both alternatives. This is not a
                    412: problem if <b>pcre_exec()</b> is used, because the entire match has to be rerun
                    413: each time:
                    414: <pre>
                    415:     re&#62; /1234|3789/
                    416:   data&#62; ABC123\P\P
                    417:   Partial match: 123
                    418:   data&#62; 1237890
                    419:    0: 3789
                    420: </pre>
                    421: Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running
                    422: the entire match can also be used with <b>pcre_dfa_exec()</b>. Another
                    423: possibility is to work with two buffers. If a partial match at offset <i>n</i>
                    424: in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on
                    425: the second buffer, you can then try a new match starting at offset <i>n+1</i> in
                    426: the first buffer.
                    427: </P>
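<P>
The retry techniques described in item 4 can be illustrated with a toy
search over literal alternatives. This C sketch is a model only: it handles
literal strings rather than real patterns, it reports the first complete or
partial match it meets (as <b>pcre_exec()</b> does with PCRE_PARTIAL_HARD), and
the names <b>search_from()</b> and <b>struct hit</b> are hypothetical.
</P>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum outcome { NO_MATCH, PARTIAL, COMPLETE };

struct hit {
    enum outcome outcome;
    size_t start;   /* offset where the complete or partial match begins */
    size_t end;     /* offset just past the matched characters */
};

/*
 * Search subject from offset 'from' for the leftmost position at which one
 * of the literal alternatives matches completely, or matches as far as the
 * subject goes. At each position the alternatives are tried in order.
 */
static struct hit search_from(const char *const *alts, size_t n_alts,
                              const char *subject, size_t from)
{
    assert(alts != NULL && subject != NULL);
    size_t slen = strlen(subject);
    struct hit none = { NO_MATCH, 0, 0 };

    for (size_t pos = from; pos < slen; pos++) {
        size_t left = slen - pos;
        for (size_t a = 0; a < n_alts; a++) {
            size_t alen = strlen(alts[a]);
            size_t cmp = alen < left ? alen : left;
            if (strncmp(alts[a], subject + pos, cmp) != 0)
                continue;              /* mismatch before the subject ended */
            struct hit h = { alen <= left ? COMPLETE : PARTIAL,
                             pos, pos + cmp };
            return h;
        }
    }
    return none;
}
```

<P>
With the alternatives "1234" and "3789", searching "ABC123" from offset 0
reports a partial match at offset 3. Searching the same buffer again from
offset 4 (that is, <i>n+1</i>) reports a partial match of the second
alternative at offset 5, and re-running the whole search on "1237890" finds
"3789" at offset 2.
</P>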
                    428: <br><a name="SEC10" href="#TOC1">AUTHOR</a><br>
                    429: <P>
                    430: Philip Hazel
                    431: <br>
                    432: University Computing Service
                    433: <br>
                    434: Cambridge CB2 3QH, England.
                    435: <br>
                    436: </P>
                    437: <br><a name="SEC11" href="#TOC1">REVISION</a><br>
                    438: <P>
                    439: Last updated: 26 August 2011
                    440: <br>
                    441: Copyright &copy; 1997-2011 University of Cambridge.
                    442: <br>
                    443: <p>
                    444: Return to the <a href="index.html">PCRE index page</a>.
                    445: </p>
