Diff for /embedaddon/pcre/HACKING between versions 1.1.1.2 and 1.1.1.4

version 1.1.1.2, 2012/02/21 23:50:25 version 1.1.1.4, 2013/07/22 08:25:56
Line 49  complexity in Perl regular expressions, I couldn't do  Line 49  complexity in Perl regular expressions, I couldn't do 
 first pass through the pattern is helpful for other reasons.   first pass through the pattern is helpful for other reasons. 
   
   
Support for 16-bit data strings  Support for 16-bit and 32-bit data strings
-------------------------------  ------------------------------------------
   
From release 8.30, PCRE supports 16-bit as well as 8-bit data strings, by being  From release 8.30, PCRE supports 16-bit as well as 8-bit data strings; and from
compilable in either 8-bit or 16-bit modes, or both. Thus, two different  release 8.32, PCRE supports 32-bit data strings. The library can be compiled
libraries can be created. In the description that follows, the word "short" is  in any combination of 8-bit, 16-bit or 32-bit modes, creating different
 libraries. In the description that follows, the word "short" is 
 used for a 16-bit data quantity, and the word "unit" is used for a quantity  used for a 16-bit data quantity, and the word "unit" is used for a quantity
that is a byte in 8-bit mode and a short in 16-bit mode. However, so as not to  that is a byte in 8-bit mode, a short in 16-bit mode and a 32-bit unsigned
over-complicate the text, the names of PCRE functions are given in 8-bit form  integer in 32-bit mode. However, so as not to over-complicate the text, the
only.  names of PCRE functions are given in 8-bit form only.
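
The "unit" terminology above can be illustrated with a small C sketch. It assumes a hypothetical build-time switch named UNIT_WIDTH and a hypothetical type name unit_t; neither is taken from the PCRE sources, and the real library may select its unit type differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* UNIT_WIDTH is a hypothetical build-time switch, not a PCRE macro. */
#ifndef UNIT_WIDTH
#define UNIT_WIDTH 8
#endif

#if UNIT_WIDTH == 8
typedef uint8_t unit_t;     /* a byte in 8-bit mode */
#elif UNIT_WIDTH == 16
typedef uint16_t unit_t;    /* a short in 16-bit mode */
#elif UNIT_WIDTH == 32
typedef uint32_t unit_t;    /* a 32-bit unsigned integer in 32-bit mode */
#else
#error "UNIT_WIDTH must be 8, 16 or 32"
#endif

/* Length of a zero-terminated string, counted in units rather than bytes. */
static size_t unit_strlen(const unit_t *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}
```

Compiling the same file with UNIT_WIDTH set to 8, 16 or 32 would give three variants of unit_strlen, which mirrors the idea of building different libraries from one source.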
   
   
 Computing the memory requirement: how it was  Computing the memory requirement: how it was
Line 138  Format of compiled patterns Line 139  Format of compiled patterns
 ---------------------------  ---------------------------
   
 The compiled form of a pattern is a vector of units (bytes in 8-bit mode, or  The compiled form of a pattern is a vector of units (bytes in 8-bit mode, or
shorts in 16-bit mode), containing items of variable length. The first unit in  shorts in 16-bit mode, 32-bit unsigned integers in 32-bit mode), containing
an item contains an opcode, and the length of the item is either implicit in  items of variable length. The first unit in an item contains an opcode, and
the opcode or contained in the data that follows it.  the length of the item is either implicit in the opcode or contained in the
 data that follows it.
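
A minimal sketch of walking such a vector of variable-length items follows. The opcodes TOY_END, TOY_CHAR and TOY_DATA are invented for illustration and are not PCRE opcodes; 8-bit units are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t unit_t;   /* assumption: 8-bit mode */

/* Hypothetical opcodes, for illustration only. */
enum { TOY_END = 0, TOY_CHAR = 1, TOY_DATA = 2 };

/* Number of units occupied by the item at p, including the opcode unit. */
static size_t toy_item_length(const unit_t *p)
{
    switch (p[0]) {
    case TOY_END:
        return 1;
    case TOY_CHAR:
        return 2;                    /* length implicit in the opcode */
    case TOY_DATA:
        return 2 + (size_t)p[1];     /* length contained in the data */
    default:
        assert(!"unknown toy opcode");
        return 1;
    }
}

/* Count the items that precede the terminating TOY_END. */
static size_t toy_count_items(const unit_t *code)
{
    size_t items = 0;
    while (*code != TOY_END) {
        code += toy_item_length(code);
        items++;
    }
    return items;
}
```

The point of the sketch is only that a reader of the vector never needs a separate index: each opcode tells it how far to step to reach the next item.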
   
 In many cases listed below, LINK_SIZE data values are specified for offsets  In many cases listed below, LINK_SIZE data values are specified for offsets
 within the compiled pattern. LINK_SIZE always specifies a number of bytes. The  within the compiled pattern. LINK_SIZE always specifies a number of bytes. The
Line 207  Matching literal characters Line 209  Matching literal characters
   
 The OP_CHAR opcode is followed by a single character that is to be matched   The OP_CHAR opcode is followed by a single character that is to be matched 
 casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,  casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
the character may be more than one unit long.  the character may be more than one unit long. In UTF-32 mode, characters
 are always exactly one unit long.
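
The difference in character length between the three UTF modes can be sketched as below. The function name units_for_code_point is hypothetical; the thresholds are the standard UTF-8 and UTF-16 encoding boundaries, not values read from the PCRE sources.

```c
#include <assert.h>
#include <stdint.h>

/* Number of units needed to hold code point cp when units are
   width bits wide (8, 16 or 32). Hypothetical helper. */
static unsigned units_for_code_point(uint32_t cp, int width)
{
    assert(cp <= 0x10FFFF);
    switch (width) {
    case 8:                          /* UTF-8: one to four bytes */
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    case 16:                         /* UTF-16: one short or a surrogate pair */
        return cp < 0x10000 ? 1 : 2;
    default:                         /* UTF-32: always exactly one unit */
        return 1;
    }
}
```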
   
   
 Repeating single characters  Repeating single characters
Line 228  following opcodes, which come in caseful and caseless  Line 231  following opcodes, which come in caseful and caseless 
   OP_POSQUERY     OP_POSQUERYI      OP_POSQUERY     OP_POSQUERYI  
   
 Each opcode is followed by the character that is to be repeated. In ASCII mode,  Each opcode is followed by the character that is to be repeated. In ASCII mode,
these are two-unit items; in UTF-8 or UTF-16 modes, the length is variable.  these are two-unit items; in UTF-8 or UTF-16 modes, the length is variable; in
 UTF-32 mode these are one-unit items.
 Those with "MIN" in their names are the minimizing versions. Those with "POS"  Those with "MIN" in their names are the minimizing versions. Those with "POS"
 in their names are possessive versions. Other repeats make use of these  in their names are possessive versions. Other repeats make use of these
 opcodes:  opcodes:
Line 285  Character classes Line 289  Character classes
   
 If there is only one character in the class, OP_CHAR or OP_CHARI is used for a  If there is only one character in the class, OP_CHAR or OP_CHARI is used for a
 positive class, and OP_NOT or OP_NOTI for a negative one (that is, for  positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
something like [^a]). However, OP_NOT[I] can be used only with single-unit  something like [^a]).
characters, so in UTF-8 (UTF-16) mode, the use of OP_NOT[I] applies only to 
characters whose code points are no greater than 127 (0xffff). 
   
 Another set of 13 repeating opcodes (called OP_NOTSTAR etc.) are used for  Another set of 13 repeating opcodes (called OP_NOTSTAR etc.) are used for
 repeated, negated, single-character classes. The normal single-character  repeated, negated, single-character classes. The normal single-character
Line 301  bit map containing a 1 bit for every character that is Line 303  bit map containing a 1 bit for every character that is
 counted from the least significant end of each unit. In caseless mode, bits for  counted from the least significant end of each unit. In caseless mode, bits for
 both cases are set.  both cases are set.
   
The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8/16 mode,  The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8/16/32 mode,
 subject characters with values greater than 255 can be handled correctly. For  subject characters with values greater than 255 can be handled correctly. For
 OP_CLASS they do not match, whereas for OP_NCLASS they do.  OP_CLASS they do not match, whereas for OP_NCLASS they do.
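
The behaviour described above can be sketched as follows, assuming a 32-byte bit map (8-bit units) with bits counted from the least significant end of each byte. The function name class_matches and its is_nclass flag are hypothetical and do not correspond to PCRE's own matching code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Test character c against a 256-bit class map.
   is_nclass is non-zero for an OP_NCLASS-style item, zero for OP_CLASS. */
static int class_matches(const uint8_t map[32], uint32_t c, int is_nclass)
{
    assert(map != NULL);
    if (c > 255)
        return is_nclass != 0;       /* outside the map: only NCLASS matches */
    return (map[c / 8] >> (c % 8)) & 1;
}
```

For a class such as [^a], every bit except the one for 'a' would be set, and a subject character above 255 matches only when the item is of the OP_NCLASS kind.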
   
Line 414  OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opc Line 416  OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opc
 is OP_REVERSE, followed by a two byte (one short) count of the number of  is OP_REVERSE, followed by a two byte (one short) count of the number of
 characters to move back the pointer in the subject string. In ASCII mode, the   characters to move back the pointer in the subject string. In ASCII mode, the 
 count is a number of units, but in UTF-8/16 mode each character may occupy more  count is a number of units, but in UTF-8/16 mode each character may occupy more
than one unit. A separate count is present in each alternative of a lookbehind  than one unit; in UTF-32 mode each character occupies exactly one unit.
 A separate count is present in each alternative of a lookbehind
 assertion, allowing them to have different fixed lengths.  assertion, allowing them to have different fixed lengths.
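
Why a character count differs from a unit count in UTF-8 mode can be sketched as below. The helper back_up_chars is hypothetical and is not PCRE's implementation of OP_REVERSE; it assumes valid UTF-8 and treats bytes of the form 10xxxxxx as continuation bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Move p back by count characters within [start, p) in a UTF-8 subject.
   Returns the new position, or NULL if there are too few characters. */
static const uint8_t *back_up_chars(const uint8_t *start,
                                    const uint8_t *p, unsigned count)
{
    assert(start != NULL && p != NULL && p >= start);
    while (count > 0) {
        if (p == start)
            return NULL;             /* not enough characters behind p */
        p--;
        /* Skip continuation bytes to reach the first byte of the character. */
        while (p > start && (*p & 0xC0) == 0x80)
            p--;
        count--;
    }
    return p;
}
```

In ASCII mode or UTF-32 mode the same operation reduces to subtracting the count from the pointer, since each character is exactly one unit.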
   
   
Line 467  item giving the length of the next item. Line 470  item giving the length of the next item.
   
   
 Philip Hazel  Philip Hazel
December 2011  February 2012
