embedaddon/pcre/doc/html/pcrematching.html - view

File: [ELWIX - Embedded LightWeight unIX -] / embedaddon / pcre / doc / html / pcrematching.html
Revision 1.1.1.4 (vendor branch): download - view: text, annotated - select for diffs - revision graph
Sun Jun 15 19:46:05 2014 UTC (11 years ago) by misho
Branches: pcre, MAIN
CVS tags: v8_34, HEAD

pcre 8.34

1: <html> 2: <head> 3: <title>pcrematching specification</title> 4: </head> 5: <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB"> 6: <h1>pcrematching man page</h1> 7:  8: Return to the <a href="index.html">PCRE index page</a>. 9:  10:  11: This page is part of the PCRE HTML documentation. It was generated automatically 12: from the original man page. If there is any nonsense in it, please consult the 13: man page, in case the conversion went wrong. 14:   15: <ul> 16: <li><a name="TOC1" href="#SEC1">PCRE MATCHING ALGORITHMS</a> 17: <li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a> 18: <li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a> 19: <li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a> 20: <li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a> 21: <li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a> 22: <li><a name="TOC7" href="#SEC7">AUTHOR</a> 23: <li><a name="TOC8" href="#SEC8">REVISION</a> 24: </ul> 25:  <a name="SEC1" href="#TOC1">PCRE MATCHING ALGORITHMS</a>  26:  27: This document describes the two different algorithms that are available in PCRE 28: for matching a compiled regular expression against a given subject string. The 29: "standard" algorithm is the one provided by the pcre_exec(), 30: pcre16_exec() and pcre32_exec() functions. These work in the same 31: as as Perl's matching function, and provide a Perl-compatible matching operation. 32: The just-in-time (JIT) optimization that is described in the 33: <a href="pcrejit.html">pcrejit</a> 34: documentation is compatible with these functions. 35:  36:  37: An alternative algorithm is provided by the pcre_dfa_exec(), 38: pcre16_dfa_exec() and pcre32_dfa_exec() functions; they operate in 39: a different way, and are not Perl-compatible. This alternative has advantages 40: and disadvantages compared with the standard algorithm, and these are described 41: below. 42:  43:  44: When there is only one possible way in which a given subject string can match a 45: pattern, the two algorithms give the same answer. A difference arises, however, 46: when there are multiple possibilities. For example, if the pattern 47: <pre> 48: ^<.*> 49: </pre> 50: is matched against the string 51: <pre> 52: <something> <something else> <something further> 53: </pre> 54: there are three possible answers. The standard algorithm finds only one of 55: them, whereas the alternative algorithm finds all three. 56:  57:  <a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a>  58:  59: The set of strings that are matched by a regular expression can be represented 60: as a tree structure. An unlimited repetition in the pattern makes the tree of 61: infinite size, but it is still a tree. Matching the pattern to a given subject 62: string (from a given starting point) can be thought of as a search of the tree. 63: There are two ways to search a tree: depth-first and breadth-first, and these 64: correspond to the two matching algorithms provided by PCRE. 65:  66:  <a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a>  67:  68: In the terminology of Jeffrey Friedl's book "Mastering Regular 69: Expressions", the standard algorithm is an "NFA algorithm". It conducts a 70: depth-first search of the pattern tree. That is, it proceeds along a single 71: path through the tree, checking that the subject matches what is required. When 72: there is a mismatch, the algorithm tries any alternatives at the current point, 73: and if they all fail, it backs up to the previous branch point in the tree, and 74: tries the next alternative branch at that level. This often involves backing up 75: (moving to the left) in the subject string as well. The order in which 76: repetition branches are tried is controlled by the greedy or ungreedy nature of 77: the quantifier. 78:  79:  80: If a leaf node is reached, a matching string has been found, and at that point 81: the algorithm stops. Thus, if there is more than one possible match, this 82: algorithm returns the first one that it finds. Whether this is the shortest, 83: the longest, or some intermediate length depends on the way the greedy and 84: ungreedy repetition quantifiers are specified in the pattern. 85:  86:  87: Because it ends up with a single path through the tree, it is relatively 88: straightforward for this algorithm to keep track of the substrings that are 89: matched by portions of the pattern in parentheses. This provides support for 90: capturing parentheses and back references. 91:  92:  <a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a>  93:  94: This algorithm conducts a breadth-first search of the tree. Starting from the 95: first matching point in the subject, it scans the subject string from left to 96: right, once, character by character, and as it does this, it remembers all the 97: paths through the tree that represent valid matches. In Friedl's terminology, 98: this is a kind of "DFA algorithm", though it is not implemented as a 99: traditional finite state machine (it keeps multiple states active 100: simultaneously). 101:  102:  103: Although the general principle of this matching algorithm is that it scans the 104: subject string only once, without backtracking, there is one exception: when a 105: lookaround assertion is encountered, the characters following or preceding the 106: current point have to be independently inspected. 107:  108:  109: The scan continues until either the end of the subject is reached, or there are 110: no more unterminated paths. At this point, terminated paths represent the 111: different matching possibilities (if there are none, the match has failed). 112: Thus, if there is more than one possible match, this algorithm finds all of 113: them, and in particular, it finds the longest. The matches are returned in 114: decreasing order of length. There is an option to stop the algorithm after the 115: first match (which is necessarily the shortest) is found. 116:  117:  118: Note that all the matches that are found start at the same point in the 119: subject. If the pattern 120: <pre> 121: cat(er(pillar)?)? 122: </pre> 123: is matched against the string "the caterpillar catchment", the result will be 124: the three strings "caterpillar", "cater", and "cat" that start at the fifth 125: character of the subject. The algorithm does not automatically move on to find 126: matches that start at later positions. 127:  128:  129: PCRE's "auto-possessification" optimization usually applies to character 130: repeats at the end of a pattern (as well as internally). For example, the 131: pattern "a\d+" is compiled as if it were "a\d++" because there is no point 132: even considering the possibility of backtracking into the repeated digits. For 133: DFA matching, this means that only one possible match is found. If you really 134: do want multiple matches in such cases, either use an ungreedy repeat 135: ("a\d+?") or set the PCRE_NO_AUTO_POSSESS option when compiling. 136:  137:  138: There are a number of features of PCRE regular expressions that are not 139: supported by the alternative matching algorithm. They are as follows: 140:  141:  142: 1. Because the algorithm finds all possible matches, the greedy or ungreedy 143: nature of repetition quantifiers is not relevant. Greedy and ungreedy 144: quantifiers are treated in exactly the same way. However, possessive 145: quantifiers can make a difference when what follows could also match what is 146: quantified, for example in a pattern like this: 147: <pre> 148: ^a++\w! 149: </pre> 150: This pattern matches "aaab!" but not "aaa!", which would be matched by a 151: non-possessive quantifier. Similarly, if an atomic group is present, it is 152: matched as if it were a standalone pattern at the current point, and the 153: longest match is then "locked in" for the rest of the overall pattern. 154:  155:  156: 2. When dealing with multiple paths through the tree simultaneously, it is not 157: straightforward to keep track of captured substrings for the different matching 158: possibilities, and PCRE's implementation of this algorithm does not attempt to 159: do this. This means that no captured substrings are available. 160:  161:  162: 3. Because no substrings are captured, back references within the pattern are 163: not supported, and cause errors if encountered. 164:  165:  166: 4. For the same reason, conditional expressions that use a backreference as the 167: condition or test for a specific group recursion are not supported. 168:  169:  170: 5. Because many paths through the tree may be active, the \K escape sequence, 171: which resets the start of the match when encountered (but may be on some paths 172: and not on others), is not supported. It causes an error if encountered. 173:  174:  175: 6. Callouts are supported, but the value of the capture_top field is 176: always 1, and the value of the capture_last field is always -1. 177:  178:  179: 7. The \C escape sequence, which (in the standard algorithm) always matches a 180: single data unit, even in UTF-8, UTF-16 or UTF-32 modes, is not supported in 181: these modes, because the alternative algorithm moves through the subject string 182: one character (not data unit) at a time, for all active paths through the tree. 183:  184:  185: 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not 186: supported. (*FAIL) is supported, and behaves like a failing negative assertion. 187:  188:  <a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>  189:  190: Using the alternative matching algorithm provides the following advantages: 191:  192:  193: 1. All possible matches (at a single point in the subject) are automatically 194: found, and in particular, the longest match is found. To find more than one 195: match using the standard algorithm, you have to do kludgy things with 196: callouts. 197:  198:  199: 2. Because the alternative algorithm scans the subject string just once, and 200: never needs to backtrack (except for lookbehinds), it is possible to pass very 201: long subject strings to the matching function in several pieces, checking for 202: partial matching each time. Although it is possible to do multi-segment 203: matching using the standard algorithm by retaining partially matched 204: substrings, it is more complicated. The 205: <a href="pcrepartial.html">pcrepartial</a> 206: documentation gives details of partial matching and discusses multi-segment 207: matching. 208:  209:  <a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>  210:  211: The alternative algorithm suffers from a number of disadvantages: 212:  213:  214: 1. It is substantially slower than the standard algorithm. This is partly 215: because it has to search for all possible matches, but is also because it is 216: less susceptible to optimization. 217:  218:  219: 2. Capturing parentheses and back references are not supported. 220:  221:  222: 3. Although atomic groups are supported, their use does not provide the 223: performance advantage that it does for the standard algorithm. 224:  225:  <a name="SEC7" href="#TOC1">AUTHOR</a>  226:  227: Philip Hazel 228:   229: University Computing Service 230:   231: Cambridge CB2 3QH, England. 232:   233:  234:  <a name="SEC8" href="#TOC1">REVISION</a>  235:  236: Last updated: 12 November 2013 237:   238: Copyright © 1997-2012 University of Cambridge. 239:   240:  241: Return to the <a href="index.html">PCRE index page</a>. 242: