embedaddon/pcre/HACKING - diff

Diff for /embedaddon/pcre/HACKING between versions 1.1.1.3 and 1.1.1.4

version 1.1.1.3, 2012/10/09 09:19:17	version 1.1.1.4, 2013/07/22 08:25:56
Line 49 complexity in Perl regular expressions, I couldn't do	Line 49 complexity in Perl regular expressions, I couldn't do
first pass through the pattern is helpful for other reasons.	first pass through the pattern is helpful for other reasons.


Support for 16-bit data strings	Support for 16-bit and 32-bit data strings
-------------------------------	-------------------------------------------

From release 8.30, PCRE supports 16-bit as well as 8-bit data strings, by being	From release 8.30, PCRE supports 16-bit as well as 8-bit data strings; and from
compilable in either 8-bit or 16-bit modes, or both. Thus, two different	release 8.32, PCRE supports 32-bit data strings. The library can be compiled
libraries can be created. In the description that follows, the word "short" is	in any combination of 8-bit, 16-bit or 32-bit modes, creating different
	libraries. In the description that follows, the word "short" is
used for a 16-bit data quantity, and the word "unit" is used for a quantity	used for a 16-bit data quantity, and the word "unit" is used for a quantity
that is a byte in 8-bit mode and a short in 16-bit mode. However, so as not to	that is a byte in 8-bit mode, a short in 16-bit mode and a 32-bit unsigned
over-complicate the text, the names of PCRE functions are given in 8-bit form	integer in 32-bit mode. However, so as not to over-complicate the text, the
only.	names of PCRE functions are given in 8-bit form only.


Computing the memory requirement: how it was	Computing the memory requirement: how it was
Line 138 Format of compiled patterns	Line 139 Format of compiled patterns
---------------------------	---------------------------

The compiled form of a pattern is a vector of units (bytes in 8-bit mode, or	The compiled form of a pattern is a vector of units (bytes in 8-bit mode, or
shorts in 16-bit mode), containing items of variable length. The first unit in	shorts in 16-bit mode, 32-bit unsigned integers in 32-bit mode), containing
an item contains an opcode, and the length of the item is either implicit in	items of variable length. The first unit in an item contains an opcode, and
the opcode or contained in the data that follows it.	the length of the item is either implicit in the opcode or contained in the
	data that follows it.

In many cases listed below, LINK_SIZE data values are specified for offsets	In many cases listed below, LINK_SIZE data values are specified for offsets
within the compiled pattern. LINK_SIZE always specifies a number of bytes. The	within the compiled pattern. LINK_SIZE always specifies a number of bytes. The
Line 207 Matching literal characters	Line 209 Matching literal characters

The OP_CHAR opcode is followed by a single character that is to be matched	The OP_CHAR opcode is followed by a single character that is to be matched
casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,	casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
the character may be more than one unit long.	the character may be more than one unit long. In UTF-32 mode, characters
	are always exactly one unit long.
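
The unit counts mentioned above can be illustrated with a small sketch. This is not PCRE code: the function name units_per_char and its unit_bits parameter are hypothetical, and the sketch assumes a valid, non-surrogate code point no greater than 0x10FFFF.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE: number of code units needed to
   hold the code point cp when the unit width is 8, 16 or 32 bits.
   Assumes cp is a valid, non-surrogate code point <= 0x10FFFF. */
unsigned units_per_char(uint32_t cp, unsigned unit_bits)
{
    assert(unit_bits == 8 || unit_bits == 16 || unit_bits == 32);

    if (unit_bits == 32)
        return 1;                       /* UTF-32: always one unit */
    if (unit_bits == 16)
        return cp > 0xFFFF ? 2 : 1;     /* UTF-16: surrogate pair above the BMP */

    /* UTF-8: one to four bytes depending on magnitude */
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}
```

For U+20AC the sketch gives three units in 8-bit mode and one unit in both 16-bit and 32-bit modes.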


Repeating single characters	Repeating single characters
Line 228 following opcodes, which come in caseful and caseless	Line 231 following opcodes, which come in caseful and caseless
OP_POSQUERY OP_POSQUERYI	OP_POSQUERY OP_POSQUERYI

Each opcode is followed by the character that is to be repeated. In ASCII mode,	Each opcode is followed by the character that is to be repeated. In ASCII mode,
these are two-unit items; in UTF-8 or UTF-16 modes, the length is variable.	these are two-unit items; in UTF-8 or UTF-16 modes, the length is variable; in
	UTF-32 mode these are one-unit items.
Those with "MIN" in their names are the minimizing versions. Those with "POS"	Those with "MIN" in their names are the minimizing versions. Those with "POS"
in their names are possessive versions. Other repeats make use of these	in their names are possessive versions. Other repeats make use of these
opcodes:	opcodes:
Line 299 bit map containing a 1 bit for every character that is	Line 303 bit map containing a 1 bit for every character that is
counted from the least significant end of each unit. In caseless mode, bits for	counted from the least significant end of each unit. In caseless mode, bits for
both cases are set.	both cases are set.

The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8/16 mode,	The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8/16/32 mode,
subject characters with values greater than 255 can be handled correctly. For	subject characters with values greater than 255 can be handled correctly. For
OP_CLASS they do not match, whereas for OP_NCLASS they do.	OP_CLASS they do not match, whereas for OP_NCLASS they do.
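
The difference between the two opcodes can be sketched as a bit map lookup. The following is only an illustration in 8-bit form, not PCRE's matcher: the names CLASS_MAP_BYTES, class_map_set and class_map_matches are hypothetical, and the 32-byte map size (one bit for each value 0 to 255) is an assumption, since the excerpt above does not show it.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed size: 256 bits, one per character value 0..255. */
enum { CLASS_MAP_BYTES = 32 };

/* Hypothetical helper: mark character c as a member of the class.
   Bits are counted from the least significant end of each unit. */
void class_map_set(uint8_t map[CLASS_MAP_BYTES], unsigned c)
{
    assert(c < 256);
    map[c / 8] |= (uint8_t)(1u << (c % 8));
}

/* Hypothetical helper: returns 1 if c matches, 0 otherwise.
   is_nclass is nonzero for OP_NCLASS behaviour, zero for OP_CLASS. */
int class_map_matches(const uint8_t map[CLASS_MAP_BYTES], uint32_t c,
                      int is_nclass)
{
    if (c > 255)
        return is_nclass ? 1 : 0;   /* outside the map: only OP_NCLASS matches */
    return (map[c / 8] >> (c % 8)) & 1;
}
```

For a caseless class, a caller would set the bits for both cases, for example 'a' and 'A'.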

Line 412 OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opc	Line 416 OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opc
is OP_REVERSE, followed by a two byte (one short) count of the number of	is OP_REVERSE, followed by a two byte (one short) count of the number of
characters to move back the pointer in the subject string. In ASCII mode, the	characters to move back the pointer in the subject string. In ASCII mode, the
count is a number of units, but in UTF-8/16 mode each character may occupy more	count is a number of units, but in UTF-8/16 mode each character may occupy more
than one unit. A separate count is present in each alternative of a lookbehind	than one unit; in UTF-32 mode each character occupies exactly one unit.
	A separate count is present in each alternative of a lookbehind
assertion, allowing them to have different fixed lengths.	assertion, allowing them to have different fixed lengths.
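
Moving the subject pointer back by a count of characters rather than units might look like the sketch below for the UTF-8 case. It is an illustration under stated assumptions, not the code PCRE uses: the function name back_up_utf8 is hypothetical, and the subject is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not PCRE's implementation: step p back by nchars
   UTF-8 characters without going before start. Returns the new pointer,
   or NULL if fewer than nchars characters precede p. Assumes the subject
   is valid UTF-8, so continuation bytes have the form 10xxxxxx. */
const uint8_t *back_up_utf8(const uint8_t *start, const uint8_t *p,
                            unsigned nchars)
{
    assert(start != NULL && p != NULL && p >= start);

    while (nchars-- > 0) {
        if (p <= start)
            return NULL;                /* not enough characters to move back */
        p--;
        while (p > start && (*p & 0xC0) == 0x80)
            p--;                        /* skip continuation bytes */
    }
    return p;
}
```

In UTF-32 mode, or in ASCII mode, the same operation reduces to subtracting the count from the pointer, because every character occupies exactly one unit.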
