Sun Jun 15 19:46:04 2014 UTC (9 years, 11 months ago) by misho
pcre 8.34

Technical Notes about PCRE
--------------------------

These are very rough technical notes that record potentially useful information
about PCRE internals. For information about testing PCRE, see the pcretest
documentation and the comment at the head of the RunTest file.


Historical note 1
-----------------

Many years ago I implemented some regular expression functions to an algorithm
suggested by Martin Richards. These were not Unix-like in form, and were quite
restricted in what they could do by comparison with Perl. The interesting part
about the algorithm was that the amount of space required to hold the compiled
form of an expression was known in advance. The code to apply an expression did
not operate by backtracking, as the original Henry Spencer code and current
Perl code do, but instead checked all possibilities simultaneously by keeping
a list of current states and checking all of them as it advanced through the
subject string. In the terminology of Jeffrey Friedl's book, it was a "DFA
algorithm", though it was not a traditional Finite State Machine (FSM). When
the pattern was all used up, all remaining states were possible matches, and
the one matching the longest substring of the subject string was chosen. This
did not necessarily maximize the individual wild portions of the pattern, as is
expected in Unix and Perl-style regular expressions.


Historical note 2
-----------------

By contrast, the code originally written by Henry Spencer (which was
subsequently heavily modified for Perl) compiles the expression twice: once in
a dummy mode in order to find out how much store will be needed, and then for
real. (The Perl version probably doesn't do this any more; I'm talking about
the original library.)
The execution function operates by backtracking and
maximizing (or, optionally, minimizing in Perl) the amount of the subject that
matches individual wild portions of the pattern. This is an "NFA algorithm" in
Friedl's terminology.


OK, here's the real stuff
-------------------------

For the set of functions that form the "basic" PCRE library (which are
unrelated to those mentioned above), I tried at first to invent an algorithm
that used an amount of store bounded by a multiple of the number of characters
in the pattern, to save on compiling time. However, because of the greater
complexity in Perl regular expressions, I couldn't do this. In any case, a
first pass through the pattern is helpful for other reasons.


Support for 16-bit and 32-bit data strings
------------------------------------------

From release 8.30, PCRE supports 16-bit as well as 8-bit data strings; and from
release 8.32, PCRE supports 32-bit data strings. The library can be compiled
in any combination of 8-bit, 16-bit or 32-bit modes, creating up to three
different libraries. In the description that follows, the word "short" is used
for a 16-bit data quantity, and the word "unit" is used for a quantity that is
a byte in 8-bit mode, a short in 16-bit mode and a 32-bit word in 32-bit mode.
However, so as not to over-complicate the text, the names of PCRE functions are
given in 8-bit form only.


Computing the memory requirement: how it was
--------------------------------------------

Up to and including release 6.7, PCRE worked by running a very degenerate first
pass to calculate a maximum store size, and then a second pass to do the real
compile - which might use a bit less than the predicted amount of memory.
The idea was that this would turn out faster than the Henry Spencer code
because the first pass is degenerate and the second pass can just store stuff
straight into the vector, which it knows is big enough.


Computing the memory requirement: how it is
-------------------------------------------

By the time I was working on a potential 6.8 release, the degenerate first pass
had become very complicated and hard to maintain. Indeed one of the early
things I did for 6.8 was to fix Yet Another Bug in the memory computation. Then
I had a flash of inspiration as to how I could run the real compile function in
a "fake" mode that enables it to compute how much memory it would need, while
actually only ever using a few hundred bytes of working memory, and without too
many tests of the mode that might slow it down. So I refactored the compiling
functions to work this way. This got rid of about 600 lines of source. It
should make future maintenance and development easier. As this was such a major
change, I never released 6.8, instead upping the number to 7.0 (other quite
major changes were also present in the 7.0 release).

A side effect of this work was that the previous limit of 200 on the nesting
depth of parentheses was removed. However, there is a downside: pcre_compile()
runs more slowly than before (30% or more, depending on the pattern) because it
is doing a full analysis of the pattern. My hope was that this would not be a
big issue, and in the event, nobody has commented on it.

At release 8.34, a limit on the nesting depth of parentheses was re-introduced
(default 250, settable at build time) so as to put a limit on the amount of
system stack used by pcre_compile(). This is a safety feature for environments
with small stacks where the patterns are provided by users.
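
The "fake" mode idea from this section can be pictured with a minimal sketch.
This is not PCRE's actual code: the emitter structure and the functions
emit_unit(), compile_literal() and compile_two_pass() are hypothetical names,
and the toy "compiler" only handles literal characters with made-up opcode
values. The point is that one routine serves both passes, and the first pass
stores nothing.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical output sink: when buf is NULL only the length is counted. */
typedef struct {
  unsigned char *buf;   /* NULL in the length-computing ("fake") pass */
  size_t length;        /* number of units emitted so far */
} emitter;

/* Store one unit if there is a buffer; always advance the length. */
static void emit_unit(emitter *e, unsigned char unit)
{
  if (e->buf != NULL) e->buf[e->length] = unit;
  e->length++;
}

/* Toy "compiler": a made-up opcode (1) before each pattern character,
   then a made-up end opcode (0). */
static void compile_literal(emitter *e, const char *pattern)
{
  for (; *pattern != '\0'; pattern++) {
    emit_unit(e, 1);
    emit_unit(e, (unsigned char)*pattern);
  }
  emit_unit(e, 0);
}

/* Run the same routine twice: first to measure, then to store. */
static unsigned char *compile_two_pass(const char *pattern, size_t *length)
{
  emitter e = { NULL, 0 };
  unsigned char *code;

  compile_literal(&e, pattern);        /* pass 1: count only */
  code = malloc(e.length);
  if (code == NULL) return NULL;

  *length = e.length;
  e.buf = code;
  e.length = 0;
  compile_literal(&e, pattern);        /* pass 2: store for real */
  return code;
}
```

Because both passes share one code path, the computed length cannot drift away
from what the real compile stores, which is the maintenance problem the
separate degenerate pass suffered from.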


Traditional matching function
-----------------------------

The "traditional", and original, matching function is called pcre_exec(), and
it implements an NFA algorithm, similar to the original Henry Spencer algorithm
and the way that Perl works. This is not surprising, since it is intended to be
as compatible with Perl as possible. This is the function most users of PCRE
will use most of the time. From release 8.20, if PCRE is compiled with
just-in-time (JIT) support, and studying a compiled pattern with JIT is
successful, the JIT code is run instead of the normal pcre_exec() code, but the
result is the same.


Supplementary matching function
-------------------------------

From PCRE 6.0, there is also a supplementary matching function called
pcre_dfa_exec(). This implements a DFA matching algorithm that searches
simultaneously for all possible matches that start at one point in the subject
string. (Going back to my roots: see Historical Note 1 above.) This function
interprets the same compiled pattern data as pcre_exec(); however, not all the
facilities are available, and those that are do not always work in quite the
same way. See the user documentation for details.

The algorithm that is used for pcre_dfa_exec() is not a traditional FSM,
because it may have a number of states active at one time. More work would be
needed at compile time to produce a traditional FSM where only one state is
ever active at once. I believe some other regex matchers work this way. JIT
support is not available for this kind of matching.


Changeable options
------------------

The /i, /m, or /s options (PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and some
others) may change in the middle of patterns.
From PCRE 8.13, their processing
is handled entirely at compile time by generating different opcodes for the
different settings. The runtime functions do not need to keep track of an
options state any more.


Format of compiled patterns
---------------------------

The compiled form of a pattern is a vector of unsigned units (bytes in 8-bit
mode, shorts in 16-bit mode, 32-bit words in 32-bit mode), containing items of
variable length. The first unit in an item contains an opcode, and the length
of the item is either implicit in the opcode or contained in the data that
follows it.

In many cases listed below, LINK_SIZE data values are specified for offsets
within the compiled pattern. LINK_SIZE always specifies a number of bytes. The
default value for LINK_SIZE is 2, but PCRE can be compiled to use 3-byte or
4-byte values for these offsets, although this impairs the performance. (3-byte
LINK_SIZE values are available only in 8-bit mode.) Specifying a LINK_SIZE
larger than 2 is necessary only when patterns whose compiled length is greater
than 64K are going to be processed. In this description, we assume the "normal"
compilation options. Data values that are counts (e.g. quantifiers) are two
bytes long in 8-bit mode (most significant byte first), or one unit in 16-bit
and 32-bit modes.
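
As an illustration of the 8-bit layout just described, here is a small sketch
of storing and reading a two-byte count, most significant byte first. The
helper names put_count(), get_count() and get_link() are hypothetical and are
not PCRE's own macros. Using the same byte order for LINK_SIZE offsets is an
assumption made for this sketch; the text above states it only for counts.

```c
#include <stddef.h>
#include <stdint.h>

/* Store a 16-bit count as two bytes, most significant byte first. */
static void put_count(uint8_t *code, size_t pos, unsigned value)
{
  code[pos]     = (uint8_t)(value >> 8);
  code[pos + 1] = (uint8_t)(value & 0xff);
}

/* Read a two-byte count back from the compiled vector. */
static unsigned get_count(const uint8_t *code, size_t pos)
{
  return ((unsigned)code[pos] << 8) | code[pos + 1];
}

/* Read an offset of link_size bytes (2, 3 or 4). The byte order is assumed
   to match that of counts. */
static unsigned long get_link(const uint8_t *code, size_t pos, int link_size)
{
  unsigned long value = 0;
  int i;
  for (i = 0; i < link_size; i++)
    value = (value << 8) | code[pos + i];
  return value;
}
```

A two-byte value cannot exceed 65535, which is why compiled patterns longer
than 64K need a LINK_SIZE of 3 or 4.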


Opcodes with no following data
------------------------------

These items are all just one unit long:

  OP_END                 end of pattern
  OP_ANY                 match any one character other than newline
  OP_ALLANY              match any one character, including newline
  OP_ANYBYTE             match any single unit, even in UTF-8/16 mode
  OP_SOD                 match start of data: \A
  OP_SOM                 start of match (subject + offset): \G
  OP_SET_SOM             set start of match (\K)
  OP_CIRC                ^ (start of data)
  OP_CIRCM               ^ multiline mode (start of data or after newline)
  OP_NOT_WORD_BOUNDARY   \B
  OP_WORD_BOUNDARY       \b
  OP_NOT_DIGIT           \D
  OP_DIGIT               \d
  OP_NOT_HSPACE          \H
  OP_HSPACE              \h
  OP_NOT_WHITESPACE      \S
  OP_WHITESPACE          \s
  OP_NOT_VSPACE          \V
  OP_VSPACE              \v
  OP_NOT_WORDCHAR        \W
  OP_WORDCHAR            \w
  OP_EODN                match end of data or newline at end: \Z
  OP_EOD                 match end of data: \z
  OP_DOLL                $ (end of data, or before final newline)
  OP_DOLLM               $ multiline mode (end of data or before newline)
  OP_EXTUNI              match an extended Unicode grapheme cluster
  OP_ANYNL               match any Unicode newline sequence

  OP_ASSERT_ACCEPT       )
  OP_ACCEPT              ) These are Perl 5.10's "backtracking control
  OP_COMMIT              ) verbs". If OP_ACCEPT is inside capturing
  OP_FAIL                ) parentheses, it may be preceded by one or more
  OP_PRUNE               ) OP_CLOSE, each followed by a count that
  OP_SKIP                ) indicates which parentheses must be closed.
  OP_THEN                )

OP_ASSERT_ACCEPT is used when (*ACCEPT) is encountered within an assertion.
This ends the assertion, not the entire pattern match.


Backtracking control verbs with optional data
---------------------------------------------

(*THEN) without an argument generates the opcode OP_THEN and no following data.
OP_MARK is followed by the mark name, preceded by a one-unit length, and
followed by a binary zero.
For (*PRUNE), (*SKIP), and (*THEN) with arguments,
the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used, with the name
following in the same format as OP_MARK.


Matching literal characters
---------------------------

The OP_CHAR opcode is followed by a single character that is to be matched
casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
the character may be more than one unit long. In UTF-32 mode, characters
are always exactly one unit long.

If there is only one character in a character class, OP_CHAR or OP_CHARI is
used for a positive class, and OP_NOT or OP_NOTI for a negative one (that is,
for something like [^a]).


Repeating single characters
---------------------------

The common repeats (*, +, ?), when applied to a single character, use the
following opcodes, which come in caseful and caseless versions:

  Caseful         Caseless
  OP_STAR         OP_STARI
  OP_MINSTAR      OP_MINSTARI
  OP_POSSTAR      OP_POSSTARI
  OP_PLUS         OP_PLUSI
  OP_MINPLUS      OP_MINPLUSI
  OP_POSPLUS      OP_POSPLUSI
  OP_QUERY        OP_QUERYI
  OP_MINQUERY     OP_MINQUERYI
  OP_POSQUERY     OP_POSQUERYI

Each opcode is followed by the character that is to be repeated. In ASCII mode,
these are two-unit items; in UTF-8 or UTF-16 modes, the length is variable; in
UTF-32 mode these are one-unit items. Those with "MIN" in their names are the
minimizing versions. Those with "POS" in their names are possessive versions.
Other repeats make use of these opcodes:

  Caseful         Caseless
  OP_UPTO         OP_UPTOI
  OP_MINUPTO      OP_MINUPTOI
  OP_POSUPTO      OP_POSUPTOI
  OP_EXACT        OP_EXACTI

Each of these is followed by a count and then the repeated character. OP_UPTO
matches from 0 to the given number.
A repeat with a non-zero minimum and a
fixed maximum is coded as an OP_EXACT followed by an OP_UPTO (or OP_MINUPTO or
OP_POSUPTO).

Another set of matching repeating opcodes (called OP_NOTSTAR, OP_NOTSTARI,
etc.) are used for repeated, negated, single-character classes such as [^a]*.
The normal single-character opcodes (OP_STAR, etc.) are used for repeated
positive single-character classes.


Repeating character types
-------------------------

Repeats of things like \d are done exactly as for single characters, except
that instead of a character, the opcode for the type is stored in the data
unit. The opcodes are:

  OP_TYPESTAR
  OP_TYPEMINSTAR
  OP_TYPEPOSSTAR
  OP_TYPEPLUS
  OP_TYPEMINPLUS
  OP_TYPEPOSPLUS
  OP_TYPEQUERY
  OP_TYPEMINQUERY
  OP_TYPEPOSQUERY
  OP_TYPEUPTO
  OP_TYPEMINUPTO
  OP_TYPEPOSUPTO
  OP_TYPEEXACT


Match by Unicode property
-------------------------

OP_PROP and OP_NOTPROP are used for positive and negative matches of a
character by testing its Unicode property (the \p and \P escape sequences).
Each is followed by two units that encode the desired property as a type and a
value. The types are a set of #defines of the form PT_xxx, and the values are
enumerations of the form ucp_xx, defined in the ucp.h source file. The value is
relevant only for PT_GC (General Category), PT_PC (Particular Category), and
PT_SC (Script).

Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
three units: OP_PROP or OP_NOTPROP, and then the desired property type and
value.


Character classes
-----------------

If there is only one character in a class, OP_CHAR or OP_CHARI is used for a
positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
something like [^a]).

A set of repeating opcodes (called OP_NOTSTAR etc.) are used for repeated,
negated, single-character classes. The normal single-character opcodes
(OP_STAR, etc.) are used for repeated positive single-character classes.

When there is more than one character in a class, and all the code points are
less than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a
negative one. In either case, the opcode is followed by a 32-byte (16-short,
8-word) bit map containing a 1 bit for every character that is acceptable. The
bits are counted from the least significant end of each unit. In caseless mode,
bits for both cases are set.

The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8/16/32
mode, subject characters with values greater than 255 can be handled correctly.
For OP_CLASS they do not match, whereas for OP_NCLASS they do.

For classes containing characters with values greater than 255 or that contain
\p or \P, OP_XCLASS is used. It optionally uses a bit map if any code points
are less than 256, followed by a list of pairs (for a range) and single
characters. In caseless mode, both cases are explicitly listed.

OP_XCLASS is followed by a unit containing flag bits: XCL_NOT indicates that
this is a negative class, and XCL_MAP indicates that a bit map is present.
There follows the bit map, if XCL_MAP is set, and then a sequence of items
coded as follows:

  XCL_END      marks the end of the list
  XCL_SINGLE   one character follows
  XCL_RANGE    two characters follow
  XCL_PROP     a Unicode property (type, value) follows
  XCL_NOTPROP  a negated Unicode property (type, value) follows

If a range starts with a code point less than 256 and ends with one greater
than 255, an XCL_RANGE item is used, without setting any bits in the bit map.
This means that if no other items in the class set bits in the map, a map is
not needed.


Back references
---------------

OP_REF (caseful) or OP_REFI (caseless) is followed by a count containing the
reference number if the reference is to a unique capturing group (either by
number or by name). When named groups are used, there may be more than one
group with the same name. In this case, a reference by name generates OP_DNREF
or OP_DNREFI. These are followed by two counts: the index (not the byte offset)
in the group name table of the first entry for the required name, followed by
the number of groups with the same name.


Repeating character classes and back references
-----------------------------------------------

Single-character classes are handled specially (see above). This section
applies to other classes and also to back references. In both cases, the repeat
information follows the base item. The matching code looks at the following
opcode to see if it is one of

  OP_CRSTAR
  OP_CRMINSTAR
  OP_CRPOSSTAR
  OP_CRPLUS
  OP_CRMINPLUS
  OP_CRPOSPLUS
  OP_CRQUERY
  OP_CRMINQUERY
  OP_CRPOSQUERY
  OP_CRRANGE
  OP_CRMINRANGE
  OP_CRPOSRANGE

All but the last three are single-unit items, with no data. The others are
followed by the minimum and maximum repeat counts.


Brackets and alternation
------------------------

A pair of non-capturing round brackets is wrapped round each expression at
compile time, so alternation always happens in the context of brackets.

[Note for North Americans: "bracket" to some English speakers, including
myself, can be round, square, curly, or pointy. Hence this usage rather than
"parentheses".]

Non-capturing brackets use the opcode OP_BRA.
Originally PCRE was limited to 99
capturing brackets and it used a different opcode for each one. From release
3.5, the limit was removed by putting the bracket number into the data for
higher-numbered brackets. From release 7.0 all capturing brackets are handled
this way, using the single opcode OP_CBRA.

A bracket opcode is followed by LINK_SIZE bytes which give the offset to the
next alternative OP_ALT or, if there aren't any branches, to the matching
OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to
the next one, or to the OP_KET opcode. For capturing brackets, the bracket
number is a count that immediately follows the offset.

OP_KET is used for subpatterns that do not repeat indefinitely, and OP_KETRMIN
and OP_KETRMAX are used for indefinite repetitions, minimally or maximally
respectively (see below for possessive repetitions). All three are followed by
LINK_SIZE bytes giving (as a positive number) the offset back to the matching
bracket opcode.

If a subpattern is quantified such that it is permitted to match zero times, it
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
single-unit opcodes that tell the matcher that skipping the following
subpattern entirely is a valid branch. In the case of the first two, not
skipping the pattern is also valid (greedy and non-greedy). The third is used
when a pattern has the quantifier {0,0}. It cannot be entirely discarded,
because it may be called as a subroutine from elsewhere in the regex.

A subpattern with an indefinite maximum repetition is replicated in the
compiled data its minimum number of times (or once with OP_BRAZERO if the
minimum is zero), with the final copy terminating with OP_KETRMIN or OP_KETRMAX
as appropriate.
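
The chain of offsets that links a bracket to its alternatives and closing ket,
described earlier in this section, can be walked as in the sketch below. It
uses 8-bit units and a LINK_SIZE of 2. The X_ opcode values, the helper names
and the hand-assembled example are made up for illustration (real opcode
numbers differ between releases), and the sketch assumes that each forward
offset is measured from the opcode that carries it.

```c
#include <stddef.h>
#include <stdint.h>

/* Made-up opcode values, standing in for OP_END, OP_CHAR, OP_ALT, OP_KET
   and OP_BRA. */
enum { X_END = 0, X_CHAR = 1, X_ALT = 2, X_KET = 3, X_BRA = 4 };

/* Read a 2-byte offset, most significant byte first. */
static size_t get_link(const uint8_t *code, size_t pos)
{
  return ((size_t)code[pos] << 8) | code[pos + 1];
}

/* Hand-assembled layout for (?:a|b):
     0: X_BRA  link=5    (to the X_ALT at 5)
     3: X_CHAR 'a'
     5: X_ALT  link=5    (to the X_KET at 10)
     8: X_CHAR 'b'
    10: X_KET  link=10   (back to the X_BRA at 0)
    13: X_END                                                        */
static const uint8_t example[] = {
  X_BRA, 0, 5, X_CHAR, 'a', X_ALT, 0, 5, X_CHAR, 'b', X_KET, 0, 10, X_END
};

/* Count the branches of the group whose bracket opcode is at 'start',
   and report where its closing ket was found. */
static int count_branches(const uint8_t *code, size_t start, size_t *ket_pos)
{
  size_t pos = start;
  int branches = 0;
  do {
    branches++;
    pos += get_link(code, pos + 1);   /* hop to the next X_ALT or the X_KET */
  } while (code[pos] == X_ALT);
  *ket_pos = pos;
  return branches;
}
```

Subtracting the offset stored after the ket from the ket's own position gives
the position of the opening bracket again, which is how a matcher can find the
start of a group when it reaches the end.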

A subpattern with a bounded maximum repetition is replicated in a nested
fashion up to the maximum number of times, with OP_BRAZERO or OP_BRAMINZERO
before each replication after the minimum, so that, for example, (abc){2,5} is
compiled as (abc)(abc)((abc)((abc)(abc)?)?)?, except that each bracketed group
has the same number.

When a repeated subpattern has an unbounded upper limit, it is checked to see
whether it could match an empty string. If this is the case, the opcode in the
final replication is changed to OP_SBRA or OP_SCBRA. This tells the matcher
that it needs to check for matching an empty string when it hits OP_KETRMIN or
OP_KETRMAX, and if so, to break the loop.


Possessive brackets
-------------------

When a repeated group (capturing or non-capturing) is marked as possessive by
the "+" notation, e.g. (abc)++, different opcodes are used. Their names all
have POS on the end, e.g. OP_BRAPOS instead of OP_BRA and OP_SCBRAPOS instead
of OP_SCBRA. The end of such a group is marked by OP_KETRPOS. If the minimum
repetition is zero, the group is preceded by OP_BRAPOSZERO.


Once-only (atomic) groups
-------------------------

These are just like other subpatterns, but they start with the opcode
OP_ONCE or OP_ONCE_NC. The former is used when there are no capturing brackets
within the atomic group; the latter when there are. The distinction is needed
for when there is a backtrack to before the group - any captures within the
group must be reset, so it is necessary to retain backtracking points inside
the group even after it is complete in order to do this. When there are no
captures in an atomic group, all the backtracking can be discarded when it is
complete. This is more efficient, and also uses less stack.

The check for matching an empty string in an unbounded repeat is handled
entirely at runtime, so there are just these two opcodes for atomic groups.


Assertions
----------

Forward assertions are also just like other subpatterns, but starting with one
of the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
is OP_REVERSE, followed by a count of the number of characters to move back the
pointer in the subject string. In ASCII mode, the count is a number of units,
but in UTF-8/16 mode each character may occupy more than one unit; in UTF-32
mode each character occupies exactly one unit. A separate count is present in
each alternative of a lookbehind assertion, allowing them to have different
fixed lengths.


Conditional subpatterns
-----------------------

These are like other subpatterns, but they start with the opcode OP_COND, or
OP_SCOND for one that might match an empty string in an unbounded repeat. If
the condition is a back reference, this is stored at the start of the
subpattern using the opcode OP_CREF followed by a count containing the
reference number, provided that the reference is to a unique capturing group.
If the reference was by name and there is more than one group with that name,
OP_DNCREF is used instead. It is followed by two counts: the index in the group
names table, and the number of groups with the same name.

If the condition is "in recursion" (coded as "(?(R)"), or "in recursion of
group x" (coded as "(?(Rx)"), the group number is stored at the start of the
subpattern using the opcode OP_RREF (with a value of zero for "the whole
pattern") or OP_DNRREF (with data as for OP_DNCREF). For a DEFINE condition,
just the single unit OP_DEF is used (it has no associated data).
Otherwise, a
conditional subpattern always starts with one of the assertions.


Recursion
---------

Recursion either matches the current regex, or some subexpression. The opcode
OP_RECURSE is followed by a LINK_SIZE value that is the offset to the starting
bracket from the start of the whole pattern. From release 6.5, OP_RECURSE is
automatically wrapped inside OP_ONCE brackets, because otherwise some patterns
broke it. OP_RECURSE is also used for "subroutine" calls, even though they are
not strictly a recursion.


Callout
-------

OP_CALLOUT is followed by one unit of data that holds a callout number in the
range 0 to 254 for manual callouts, or 255 for an automatic callout. In both
cases there follows a count giving the offset in the pattern string to the
start of the following item, and another count giving the length of this item.
These values make it possible for pcretest to output useful tracing information
using automatic callouts.

Philip Hazel
November 2013