embedaddon/pcre/HACKING - diff

Diff for /embedaddon/pcre/HACKING between versions 1.1 and 1.1.1.4

version 1.1, 2012/02/21 23:05:51	version 1.1.1.4, 2013/07/22 08:25:56
Line 49 complexity in Perl regular expressions, I couldn't do	Line 49 complexity in Perl regular expressions, I couldn't do
first pass through the pattern is helpful for other reasons.	first pass through the pattern is helpful for other reasons.


	Support for 16-bit and 32-bit data strings
	-------------------------------------------

	From release 8.30, PCRE supports 16-bit as well as 8-bit data strings; and from
	release 8.32, PCRE supports 32-bit data strings. The library can be compiled
	in any combination of 8-bit, 16-bit or 32-bit modes, creating different
	libraries. In the description that follows, the word "short" is
	used for a 16-bit data quantity, and the word "unit" is used for a quantity
	that is a byte in 8-bit mode, a short in 16-bit mode and a 32-bit unsigned
	integer in 32-bit mode. However, so as not to over-complicate the text, the
	names of PCRE functions are given in 8-bit form only.


Computing the memory requirement: how it was	Computing the memory requirement: how it was
--------------------------------------------	--------------------------------------------

Line 125 any more.	Line 138 any more.
Format of compiled patterns	Format of compiled patterns
---------------------------	---------------------------

The compiled form of a pattern is a vector of bytes, containing items of	The compiled form of a pattern is a vector of units (bytes in 8-bit mode, or
variable length. The first byte in an item is an opcode, and the length of the	shorts in 16-bit mode, 32-bit unsigned integers in 32-bit mode), containing
item is either implicit in the opcode or contained in the data bytes that	items of variable length. The first unit in an item contains an opcode, and
follow it.	the length of the item is either implicit in the opcode or contained in the
	data that follows it.

In many cases below LINK_SIZE data values are specified for offsets within the	In many cases listed below, LINK_SIZE data values are specified for offsets
compiled pattern. The default value for LINK_SIZE is 2, but PCRE can be	within the compiled pattern. LINK_SIZE always specifies a number of bytes. The
compiled to use 3-byte or 4-byte values for these offsets (impairing the	default value for LINK_SIZE is 2, but PCRE can be compiled to use 3-byte or
performance). This is necessary only when patterns whose compiled length is	4-byte values for these offsets, although this impairs the performance. (3-byte
greater than 64K are going to be processed. In this description, we assume the	LINK_SIZE values are available only in 8-bit mode.) Specifying a LINK_SIZE
"normal" compilation options. Data values that are counts (e.g. for	larger than 2 is necessary only when patterns whose compiled length is greater
quantifiers) are always just two bytes long.	than 64K are going to be processed. In this description, we assume the "normal"
	compilation options. Data values that are counts (e.g. for quantifiers) are
	always just two bytes long (one short in 16-bit mode).
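
As an illustration of the LINK_SIZE paragraph above, here is a minimal C sketch
of reading and writing an offset in 8-bit mode. The helper names read_link and
write_link are hypothetical (they are not PCRE functions), and the
most-significant-byte-first ordering is an assumption carried over from the way
two-byte counts are described elsewhere in this file. With link_size equal to 2
the largest offset that fits is 65535, which is why longer compiled patterns
need a larger LINK_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a PCRE function: read a link_size-byte offset
   stored most significant byte first (assumed ordering) from an 8-bit-mode
   compiled pattern. */
static uint32_t read_link(const uint8_t *code, size_t pos, int link_size)
{
  uint32_t value = 0;
  assert(link_size >= 2 && link_size <= 4);
  for (int i = 0; i < link_size; i++)
    value = (value << 8) | code[pos + i];
  return value;
}

/* Hypothetical helper: store an offset in the same layout. */
static void write_link(uint8_t *code, size_t pos, int link_size, uint32_t value)
{
  assert(link_size >= 2 && link_size <= 4);
  /* The value must fit in link_size bytes. */
  assert(link_size == 4 || value < ((uint32_t)1 << (8 * link_size)));
  for (int i = link_size - 1; i >= 0; i--) {
    code[pos + i] = (uint8_t)(value & 0xff);
    value >>= 8;
  }
}
```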

Opcodes with no following data	Opcodes with no following data
------------------------------	------------------------------

These items are all just one byte long	These items are all just one unit long

OP_END end of pattern	OP_END end of pattern
OP_ANY match any one character other than newline	OP_ANY match any one character other than newline
Line 182 Backtracking control verbs with (optional) data	Line 198 Backtracking control verbs with (optional) data
-----------------------------------------------	-----------------------------------------------

(*THEN) without an argument generates the opcode OP_THEN and no following data.	(*THEN) without an argument generates the opcode OP_THEN and no following data.
OP_MARK is followed by the mark name, preceded by a one-byte length, and	OP_MARK is followed by the mark name, preceded by a one-unit length, and
followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments,	followed by a binary zero. For (*PRUNE), (*SKIP), and (*THEN) with arguments,
the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used, with the name	the opcodes OP_PRUNE_ARG, OP_SKIP_ARG, and OP_THEN_ARG are used, with the name
following in the same format.	following in the same format.
Line 192 Matching literal characters	Line 208 Matching literal characters
---------------------------	---------------------------

The OP_CHAR opcode is followed by a single character that is to be matched	The OP_CHAR opcode is followed by a single character that is to be matched
casefully. For caseless matching, OP_CHARI is used. In UTF-8 mode, the	casefully. For caseless matching, OP_CHARI is used. In UTF-8 or UTF-16 modes,
character may be more than one byte long. (Earlier versions of PCRE used	the character may be more than one unit long. In UTF-32 mode, characters
multi-character strings, but this was changed to allow some new features to be	are always exactly one unit long.
added.)


Repeating single characters	Repeating single characters
---------------------------	---------------------------

The common repeats (*, +, ?) when applied to a single character use the	The common repeats (*, +, ?), when applied to a single character, use the
following opcodes, which come in caseful and caseless versions:	following opcodes, which come in caseful and caseless versions:

Caseful Caseless	Caseful Caseless
Line 215 following opcodes, which come in caseful and caseless	Line 230 following opcodes, which come in caseful and caseless
OP_MINQUERY OP_MINQUERYI	OP_MINQUERY OP_MINQUERYI
OP_POSQUERY OP_POSQUERYI	OP_POSQUERY OP_POSQUERYI

In ASCII mode, these are two-byte items; in UTF-8 mode, the length is variable.	Each opcode is followed by the character that is to be repeated. In ASCII mode,
Those with "MIN" in their name are the minimizing versions. Those with "POS" in	these are two-unit items; in UTF-8 or UTF-16 modes, the length is variable; in
their names are possessive versions. Each is followed by the character that is	UTF-32 mode these are one-unit items.
to be repeated. Other repeats make use of these opcodes:	Those with "MIN" in their names are the minimizing versions. Those with "POS"
	in their names are possessive versions. Other repeats make use of these
	opcodes:

Caseful Caseless	Caseful Caseless
OP_UPTO OP_UPTOI	OP_UPTO OP_UPTOI
Line 226 to be repeated. Other repeats make use of these opcode	Line 243 to be repeated. Other repeats make use of these opcode
OP_POSUPTO OP_POSUPTOI	OP_POSUPTO OP_POSUPTOI
OP_EXACT OP_EXACTI	OP_EXACT OP_EXACTI

Each of these is followed by a two-byte count (most significant first) and the	Each of these is followed by a two-byte (one short) count (most significant
repeated character. OP_UPTO matches from 0 to the given number. A repeat with a	byte first in 8-bit mode) and then the repeated character. OP_UPTO matches from
non-zero minimum and a fixed maximum is coded as an OP_EXACT followed by an	0 to the given number. A repeat with a non-zero minimum and a fixed maximum is
OP_UPTO (or OP_MINUPTO or OP_POSUPTO).	coded as an OP_EXACT followed by an OP_UPTO (or OP_MINUPTO or OP_POSUPTO).
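
To make the coding of a bounded repeat concrete, the following C sketch splits
x{min,max} into the count for the OP_EXACT item and the count for the OP_UPTO
item that follows it. The type repeat_split and the function split_repeat are
hypothetical names used only for illustration. Treating a zero count as "no
such item" is also an assumption of this sketch. For example, x{2,5} would
become an exact count of 2 followed by an "up to" count of 3.

```c
#include <assert.h>

/* Hypothetical representation, not PCRE's: the counts that a bounded repeat
   x{min,max} carries after being split into an exact part followed by an
   "up to" part. */
typedef struct {
  unsigned exact_count;  /* count for the OP_EXACT item; 0 means no item */
  unsigned upto_count;   /* count for the OP_UPTO item; 0 means no item */
} repeat_split;

static repeat_split split_repeat(unsigned min, unsigned max)
{
  repeat_split r;
  /* Counts are stored in two bytes, so they cannot exceed 65535. */
  assert(min <= max && max <= 65535);
  r.exact_count = min;        /* the part that must always match */
  r.upto_count = max - min;   /* OP_UPTO matches from 0 to this number */
  return r;
}
```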


Repeating character types	Repeating character types
Line 237 Repeating character types	Line 254 Repeating character types

Repeats of things like \d are done exactly as for single characters, except	Repeats of things like \d are done exactly as for single characters, except
that instead of a character, the opcode for the type is stored in the data	that instead of a character, the opcode for the type is stored in the data
byte. The opcodes are:	unit. The opcodes are:

OP_TYPESTAR	OP_TYPESTAR
OP_TYPEMINSTAR	OP_TYPEMINSTAR
Line 259 Match by Unicode property	Line 276 Match by Unicode property

OP_PROP and OP_NOTPROP are used for positive and negative matches of a	OP_PROP and OP_NOTPROP are used for positive and negative matches of a
character by testing its Unicode property (the \p and \P escape sequences).	character by testing its Unicode property (the \p and \P escape sequences).
Each is followed by two bytes that encode the desired property as a type and a	Each is followed by two units that encode the desired property as a type and a
value.	value.

Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by	Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by
three bytes: OP_PROP or OP_NOTPROP and then the desired property type and	three units: OP_PROP or OP_NOTPROP, and then the desired property type and
value.	value.


Character classes	Character classes
-----------------	-----------------

If there is only one character, OP_CHAR or OP_CHARI is used for a positive	If there is only one character in the class, OP_CHAR or OP_CHARI is used for a
class, and OP_NOT or OP_NOTI for a negative one (that is, for something like	positive class, and OP_NOT or OP_NOTI for a negative one (that is, for
[^a]). However, in UTF-8 mode, the use of OP_NOT[I] applies only to characters	something like [^a]).
with values < 128, because OP_NOT[I] is confined to single bytes.

Another set of 13 repeating opcodes (called OP_NOTSTAR etc.) are used for a	Another set of 13 repeating opcodes (called OP_NOTSTAR etc.) are used for
repeated, negated, single-character class. The normal single-character opcodes	repeated, negated, single-character classes. The normal single-character
(OP_STAR, etc.) are used for a repeated positive single-character class.	opcodes (OP_STAR, etc.) are used for repeated positive single-character
	classes.

When there is more than one character in a class and all the characters are	When there is more than one character in a class and all the characters are
less than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a	less than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a
negative one. In either case, the opcode is followed by a 32-byte bit map	negative one. In either case, the opcode is followed by a 32-byte (16-short)
containing a 1 bit for every character that is acceptable. The bits are counted	bit map containing a 1 bit for every character that is acceptable. The bits are
from the least significant end of each byte. In caseless mode, bits for both	counted from the least significant end of each unit. In caseless mode, bits for
cases are set.	both cases are set.

The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode,	The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8/16/32 mode,
subject characters with values greater than 256 can be handled correctly. For	subject characters with values greater than 255 can be handled correctly. For
OP_CLASS they do not match, whereas for OP_NCLASS they do.	OP_CLASS they do not match, whereas for OP_NCLASS they do.
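
The bit map layout and the OP_CLASS/OP_NCLASS distinction can be sketched in C
as follows, for 8-bit mode. The names class_map, class_clear, class_add and
class_matches are hypothetical and do not appear in PCRE. The sketch assumes,
as the text says, that the map already holds a 1 bit for every acceptable
character, so a negative class needs special handling only for values above
255.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* 32 bytes = 256 bits, one per character value 0..255. */
typedef struct { uint8_t bits[32]; } class_map;

static void class_clear(class_map *m)
{
  memset(m->bits, 0, sizeof m->bits);
}

/* Mark character c as acceptable. Bits are counted from the least
   significant end of each byte. */
static void class_add(class_map *m, unsigned c)
{
  assert(c < 256);
  m->bits[c / 8] |= (uint8_t)(1u << (c % 8));
}

/* is_nclass is non-zero for OP_NCLASS and zero for OP_CLASS. Characters
   above 255 are outside the map: they never match OP_CLASS and always
   match OP_NCLASS. */
static int class_matches(const class_map *m, unsigned c, int is_nclass)
{
  if (c > 255) return is_nclass != 0;
  return (m->bits[c / 8] >> (c % 8)) & 1;
}
```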

For classes containing characters with values > 255, OP_XCLASS is used. It	For classes containing characters with values greater than 255, OP_XCLASS is
optionally uses a bit map (if any characters lie within it), followed by a list	used. It optionally uses a bit map (if any characters lie within it), followed
of pairs (for a range) and single characters. In caseless mode, both cases are	by a list of pairs (for a range) and single characters. In caseless mode, both
explicitly listed. There is a flag character that indicates whether it is a	cases are explicitly listed. There is a flag character that indicates whether
positive or a negative class.	it is a positive or a negative class.


Back references	Back references
---------------	---------------

OP_REF (caseful) or OP_REFI (caseless) is followed by two bytes containing the	OP_REF (caseful) or OP_REFI (caseless) is followed by two bytes (one short)
reference number.	containing the reference number.


Repeating character classes and back references	Repeating character classes and back references
Line 321 if it is one of	Line 338 if it is one of
OP_CRRANGE	OP_CRRANGE
OP_CRMINRANGE	OP_CRMINRANGE

All but the last two are just single-byte items. The others are followed by	All but the last two are just single-unit items. The others are followed by
four bytes of data, comprising the minimum and maximum repeat counts. There are	four bytes (two shorts) of data, comprising the minimum and maximum repeat
no special possessive opcodes for these repeats; a possessive repeat is	counts. There are no special possessive opcodes for these repeats; a possessive
compiled into an atomic group.	repeat is compiled into an atomic group.


Brackets and alternation	Brackets and alternation
Line 334 A pair of non-capturing (round) brackets is wrapped ro	Line 351 A pair of non-capturing (round) brackets is wrapped ro
compile time, so alternation always happens in the context of brackets.	compile time, so alternation always happens in the context of brackets.

[Note for North Americans: "bracket" to some English speakers, including	[Note for North Americans: "bracket" to some English speakers, including
myself, can be round, square, curly, or pointy. Hence this usage.]	myself, can be round, square, curly, or pointy. Hence this usage rather than
	"parentheses".]

Non-capturing brackets use the opcode OP_BRA. Originally PCRE was limited to 99	Non-capturing brackets use the opcode OP_BRA. Originally PCRE was limited to 99
capturing brackets and it used a different opcode for each one. From release	capturing brackets and it used a different opcode for each one. From release
Line 346 A bracket opcode is followed by LINK_SIZE bytes which	Line 364 A bracket opcode is followed by LINK_SIZE bytes which
next alternative OP_ALT or, if there aren't any branches, to the matching	next alternative OP_ALT or, if there aren't any branches, to the matching
OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to	OP_KET opcode. Each OP_ALT is followed by LINK_SIZE bytes giving the offset to
the next one, or to the OP_KET opcode. For capturing brackets, the bracket	the next one, or to the OP_KET opcode. For capturing brackets, the bracket
number immediately follows the offset, always as a 2-byte item.	number immediately follows the offset, always as a 2-byte (one short) item.
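
The chain of offsets that links a bracket to its alternatives can be walked as
in the C sketch below, written for 8-bit mode with LINK_SIZE set to 2. Several
things here are assumptions made for illustration only: the opcode values X_BRA,
X_ALT and X_KET are invented, the helpers get2 and count_branches are not PCRE
functions, offsets are taken to be stored most significant byte first, and each
forward offset is taken to be relative to the opcode that carries it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Invented opcode values for this sketch; PCRE's real values differ. */
enum { X_BRA = 1, X_ALT = 2, X_KET = 3 };

/* Read a two-byte value, most significant byte first (assumed ordering). */
static unsigned get2(const uint8_t *code, size_t pos)
{
  return ((unsigned)code[pos] << 8) | code[pos + 1];
}

/* Count the branches of the group that starts at 'start' by following the
   offset after the bracket opcode and after each X_ALT until the closing
   X_KET is reached. The position of the X_KET is stored in *ket_pos. */
static int count_branches(const uint8_t *code, size_t start, size_t *ket_pos)
{
  size_t pos = start;
  int branches = 1;
  assert(code[pos] == X_BRA);
  pos += get2(code, pos + 1);       /* to the first X_ALT, or to X_KET */
  while (code[pos] == X_ALT) {
    branches++;
    pos += get2(code, pos + 1);     /* to the next X_ALT, or to X_KET */
  }
  assert(code[pos] == X_KET);
  *ket_pos = pos;
  return branches;
}
```

Subtracting the value stored after the closing opcode from its own position
should lead back to the bracket opcode, since that offset is described as a
positive number pointing backwards.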

OP_KET is used for subpatterns that do not repeat indefinitely, while	OP_KET is used for subpatterns that do not repeat indefinitely, and
OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or	OP_KETRMIN and OP_KETRMAX are used for indefinite repetitions, minimally or
maximally respectively (see below for possessive repetitions). All three are	maximally respectively (see below for possessive repetitions). All three are
followed by LINK_SIZE bytes giving (as a positive number) the offset back to	followed by LINK_SIZE bytes giving (as a positive number) the offset back to
Line 356 the matching bracket opcode.	Line 374 the matching bracket opcode.

If a subpattern is quantified such that it is permitted to match zero times, it	If a subpattern is quantified such that it is permitted to match zero times, it
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are	is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
single-byte opcodes that tell the matcher that skipping the following	single-unit opcodes that tell the matcher that skipping the following
subpattern entirely is a valid branch. In the case of the first two, not	subpattern entirely is a valid branch. In the case of the first two, not
skipping the pattern is also valid (greedy and non-greedy). The third is used	skipping the pattern is also valid (greedy and non-greedy). The third is used
when a pattern has the quantifier {0,0}. It cannot be entirely discarded,	when a pattern has the quantifier {0,0}. It cannot be entirely discarded,
Line 395 Assertions	Line 413 Assertions
Forward assertions are just like other subpatterns, but starting with one of	Forward assertions are just like other subpatterns, but starting with one of
the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes	the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion	OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion
is OP_REVERSE, followed by a two byte count of the number of characters to move	is OP_REVERSE, followed by a two byte (one short) count of the number of
back the pointer in the subject string. When operating in UTF-8 mode, the count	characters to move back the pointer in the subject string. In ASCII mode, the
is a character count rather than a byte count. A separate count is present in	count is a number of units, but in UTF-8/16 mode each character may occupy more
each alternative of a lookbehind assertion, allowing them to have different	than one unit; in UTF-32 mode each character occupies exactly one unit.
fixed lengths.	A separate count is present in each alternative of a lookbehind
	assertion, allowing them to have different fixed lengths.


Once-only (atomic) subpatterns	Once-only (atomic) subpatterns
Line 416 Conditional subpatterns	Line 435 Conditional subpatterns
These are like other subpatterns, but they start with the opcode OP_COND, or	These are like other subpatterns, but they start with the opcode OP_COND, or
OP_SCOND for one that might match an empty string in an unbounded repeat. If	OP_SCOND for one that might match an empty string in an unbounded repeat. If
the condition is a back reference, this is stored at the start of the	the condition is a back reference, this is stored at the start of the
subpattern using the opcode OP_CREF followed by two bytes containing the	subpattern using the opcode OP_CREF followed by two bytes (one short)
reference number. OP_NCREF is used instead if the reference was generated by	containing the reference number. OP_NCREF is used instead if the reference was
name (so that the runtime code knows to check for duplicate names).	generated by name (so that the runtime code knows to check for duplicate
	names).

If the condition is "in recursion" (coded as "(?(R)"), or "in recursion of	If the condition is "in recursion" (coded as "(?(R)"), or "in recursion of
group x" (coded as "(?(Rx)"), the group number is stored at the start of the	group x" (coded as "(?(Rx)"), the group number is stored at the start of the
subpattern using the opcode OP_RREF or OP_NRREF (cf OP_NCREF), and a value of	subpattern using the opcode OP_RREF or OP_NRREF (cf OP_NCREF), and a value of
zero for "the whole pattern". For a DEFINE condition, just the single byte	zero for "the whole pattern". For a DEFINE condition, just the single unit
OP_DEF is used (it has no associated data). Otherwise, a conditional subpattern	OP_DEF is used (it has no associated data). Otherwise, a conditional subpattern
always starts with one of the assertions.	always starts with one of the assertions.

Line 442 are not strictly a recursion.	Line 462 are not strictly a recursion.
Callout	Callout
-------	-------

OP_CALLOUT is followed by one byte of data that holds a callout number in the	OP_CALLOUT is followed by one unit of data that holds a callout number in the
range 0 to 254 for manual callouts, or 255 for an automatic callout. In both	range 0 to 254 for manual callouts, or 255 for an automatic callout. In both
cases there follows a two-byte value giving the offset in the pattern to the	cases there follows a two-byte (one short) value giving the offset in the
start of the following item, and another two-byte item giving the length of the	pattern to the start of the following item, and another two-byte (one short)
next item.	item giving the length of the next item.
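
A callout item as described above could be decoded in 8-bit mode along the
lines of this C sketch. The struct callout_info and the functions
decode_callout and callout_is_automatic are hypothetical names, and the
most-significant-byte-first ordering of the two-byte values is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded form of a callout item. */
typedef struct {
  unsigned number;            /* 0..254 manual, 255 automatic */
  unsigned pattern_offset;    /* offset in the pattern of the following item */
  unsigned next_item_length;  /* length of the next item */
} callout_info;

/* 'item' points at the callout opcode itself. Assumed layout in 8-bit mode:
   opcode, one byte of number, two bytes of offset, two bytes of length. */
static callout_info decode_callout(const uint8_t *item)
{
  callout_info info;
  assert(item != NULL);
  info.number = item[1];
  info.pattern_offset = ((unsigned)item[2] << 8) | item[3];
  info.next_item_length = ((unsigned)item[4] << 8) | item[5];
  return info;
}

static int callout_is_automatic(const callout_info *info)
{
  return info->number == 255;
}
```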


Philip Hazel	Philip Hazel
October 2011	February 2012
