embedaddon/pcre/doc/html/pcrepattern.html - diff


Diff for /embedaddon/pcre/doc/html/pcrepattern.html between versions 1.1.1.2 and 1.1.1.3

version 1.1.1.2, 2012/02/21 23:50:25	version 1.1.1.3, 2012/10/09 09:19:18
Line 227 backslash. All other characters (in particular, those	Line 227 backslash. All other characters (in particular, those
greater than 127) are treated as literals.	greater than 127) are treated as literals.
</P>	</P>
<P>	<P>
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the	If a pattern is compiled with the PCRE_EXTENDED option, white space in the
pattern (other than in a character class) and characters between a # outside	pattern (other than in a character class) and characters between a # outside
a character class and the next newline are ignored. An escaping backslash can	a character class and the next newline are ignored. An escaping backslash can
be used to include a whitespace or # character as part of the pattern.	be used to include a white space or # character as part of the pattern.
</P>	</P>
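As a rough illustration of the extended-mode rules in the paragraph above, the sketch below uses Python's `re.VERBOSE` flag as a stand-in for PCRE_EXTENDED. This is an assumption made for the sake of a runnable example: Python's `re` module is not PCRE, and the names `spaced` and `matches_extended` are hypothetical.

```python
import re

# re.VERBOSE is used here only as an analogue of PCRE_EXTENDED.
spaced = re.compile(r"""
    \d+      # one or more digits; this comment is ignored
    \        # a backslash-escaped space is a literal space
    [ab #]   # inside a character class, space and # stay literal
""", re.VERBOSE)

def matches_extended(subject):
    """Return True if the whole subject matches the extended-mode pattern."""
    return spaced.fullmatch(subject) is not None
```

The pattern requires digits, one literal space, and then one of `a`, `b`, space, or `#`, so a subject such as "42 a" matches while "42a" does not.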
<P>	<P>
If you want to remove the special meaning from a sequence of characters, you	If you want to remove the special meaning from a sequence of characters, you
Line 264 one of the following escape sequences than the binary	Line 264 one of the following escape sequences than the binary
\a alarm, that is, the BEL character (hex 07)	\a alarm, that is, the BEL character (hex 07)
\cx "control-x", where x is any ASCII character	\cx "control-x", where x is any ASCII character
\e escape (hex 1B)	\e escape (hex 1B)
\f formfeed (hex 0C)	\f form feed (hex 0C)
\n linefeed (hex 0A)	\n linefeed (hex 0A)
\r carriage return (hex 0D)	\r carriage return (hex 0D)
\t tab (hex 09)	\t tab (hex 09)
Line 307 as just described only when it is followed by two hexa	Line 307 as just described only when it is followed by two hexa
Otherwise, it matches a literal "x" character. In JavaScript mode, support for	Otherwise, it matches a literal "x" character. In JavaScript mode, support for
code points greater than 256 is provided by \u, which must be followed by	code points greater than 256 is provided by \u, which must be followed by
four hexadecimal digits; otherwise it matches a literal "u" character.	four hexadecimal digits; otherwise it matches a literal "u" character.
	Character codes specified by \u in JavaScript mode are constrained in the same
	way as those specified by \x in non-JavaScript mode.
</P>	</P>
<P>	<P>
Characters whose value is less than 256 can be defined by either of the two	Characters whose value is less than 256 can be defined by either of the two
Line 406 Another use of backslash is for specifying generic cha	Line 408 Another use of backslash is for specifying generic cha
<pre>	<pre>
\d any decimal digit	\d any decimal digit
\D any character that is not a decimal digit	\D any character that is not a decimal digit
\h any horizontal whitespace character	\h any horizontal white space character
\H any character that is not a horizontal whitespace character	\H any character that is not a horizontal white space character
\s any whitespace character	\s any white space character
\S any character that is not a whitespace character	\S any character that is not a white space character
\v any vertical whitespace character	\v any vertical white space character
\V any character that is not a vertical whitespace character	\V any character that is not a vertical white space character
\w any "word" character	\w any "word" character
\W any "non-word" character	\W any "non-word" character
</pre>	</pre>
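The generic character types listed above can be tried out with a small sketch. It uses Python's `re` module, which is an assumption of this example rather than PCRE itself: `re` supports \d, \D, \s, \S, \w and \W but has no \h or \v, and its exact character sets may differ from PCRE's. The variable names are hypothetical.

```python
import re

sample = "id 42\tname_x"

digits = re.findall(r"\d+", sample)       # runs of decimal digits
words = re.findall(r"\w+", sample)        # runs of "word" characters
fields = re.split(r"\s+", sample)         # split on runs of white space
non_digits_removed = re.sub(r"\D", "", sample)  # drop every non-digit
```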
Line 497 The vertical space characters are:	Line 499 The vertical space characters are:
<pre>	<pre>
U+000A Linefeed	U+000A Linefeed
U+000B Vertical tab	U+000B Vertical tab
U+000C Formfeed	U+000C Form feed
U+000D Carriage return	U+000D Carriage return
U+0085 Next line	U+0085 Next line
U+2028 Line separator	U+2028 Line separator
Line 520 This is an example of an "atomic group", details of wh	Line 522 This is an example of an "atomic group", details of wh
<a href="#atomicgroup">below.</a>	<a href="#atomicgroup">below.</a>
This particular group matches either the two-character sequence CR followed by	This particular group matches either the two-character sequence CR followed by
LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,	LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next	U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
line, U+0085). The two-character sequence is treated as a single unit that	line, U+0085). The two-character sequence is treated as a single unit that
cannot be split.	cannot be split.
</P>	</P>
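The idea that CR followed by LF counts as a single newline can be sketched as follows. This is a non-atomic approximation written for Python's `re` module (an assumption of this example, not PCRE): listing the two-character sequence first means it is consumed as one unit when splitting, but it does not give the no-backtracking guarantee of a true atomic group. `NEWLINE_SEQ` and `split_lines` are hypothetical names.

```python
import re

# CRLF is tried first, then the single characters LF, VT, FF, CR and NEL.
NEWLINE_SEQ = re.compile(r"\r\n|[\n\x0b\f\r\x85]")

def split_lines(text):
    """Split text on any of the newline sequences described above."""
    return NEWLINE_SEQ.split(text)
```

Because CRLF is matched as a pair, a subject containing "\r\n" produces one split point rather than two.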
Line 596 Armenian,	Line 598 Armenian,
Avestan,	Avestan,
Balinese,	Balinese,
Bamum,	Bamum,
	Batak,
Bengali,	Bengali,
Bopomofo,	Bopomofo,
	Brahmi,
Braille,	Braille,
Buginese,	Buginese,
Buhid,	Buhid,
Canadian_Aboriginal,	Canadian_Aboriginal,
Carian,	Carian,
	Chakma,
Cham,	Cham,
Cherokee,	Cherokee,
Common,	Common,
Line 645 Lisu,	Line 650 Lisu,
Lycian,	Lycian,
Lydian,	Lydian,
Malayalam,	Malayalam,
	Mandaic,
Meetei_Mayek,	Meetei_Mayek,
	Meroitic_Cursive,
	Meroitic_Hieroglyphs,
	Miao,
Mongolian,	Mongolian,
Myanmar,	Myanmar,
New_Tai_Lue,	New_Tai_Lue,
Line 664 Rejang,	Line 673 Rejang,
Runic,	Runic,
Samaritan,	Samaritan,
Saurashtra,	Saurashtra,
	Sharada,
Shavian,	Shavian,
Sinhala,	Sinhala,
	Sora_Sompeng,
Sundanese,	Sundanese,
Syloti_Nagri,	Syloti_Nagri,
Syriac,	Syriac,
Line 674 Tagbanwa,	Line 685 Tagbanwa,
Tai_Le,	Tai_Le,
Tai_Tham,	Tai_Tham,
Tai_Viet,	Tai_Viet,
	Takri,
Tamil,	Tamil,
Telugu,	Telugu,
Thaana,	Thaana,
Line 812 PCRE_UCP is set. They are:	Line 824 PCRE_UCP is set. They are:
Xwd Any Perl "word" character	Xwd Any Perl "word" character
</pre>	</pre>
Xan matches characters that have either the L (letter) or the N (number)	Xan matches characters that have either the L (letter) or the N (number)
property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or	property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
carriage return, and any other character that has the Z (separator) property.	carriage return, and any other character that has the Z (separator) property.
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the	Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
same characters as Xan, plus underscore.	same characters as Xan, plus underscore.
Line 1008 used. Because \C breaks up characters into individual	Line 1020 used. Because \C breaks up characters into individual
unit with \C in a UTF mode means that the rest of the string may start with a	unit with \C in a UTF mode means that the rest of the string may start with a
malformed UTF character. This has undefined results, because PCRE assumes that	malformed UTF character. This has undefined results, because PCRE assumes that
it is dealing with valid UTF strings (and by default it checks this at the	it is dealing with valid UTF strings (and by default it checks this at the
start of processing unless the PCRE_NO_UTF8_CHECK option is used).	start of processing unless the PCRE_NO_UTF8_CHECK or PCRE_NO_UTF16_CHECK option
	is used).
</P>	</P>
<P>	<P>
PCRE does not allow \C to appear in lookbehind assertions	PCRE does not allow \C to appear in lookbehind assertions
Line 1818 Because there may be many capturing parentheses in a p	Line 1831 Because there may be many capturing parentheses in a p
following a backslash are taken as part of a potential back reference number.	following a backslash are taken as part of a potential back reference number.
If the pattern continues with a digit character, some delimiter must be used to	If the pattern continues with a digit character, some delimiter must be used to
terminate the back reference. If the PCRE_EXTENDED option is set, this can be	terminate the back reference. If the PCRE_EXTENDED option is set, this can be
whitespace. Otherwise, the \g{ syntax or an empty comment (see	white space. Otherwise, the \g{ syntax or an empty comment (see
<a href="#comments">"Comments"</a>	<a href="#comments">"Comments"</a>
below) can be used.	below) can be used.
</P>	</P>
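The two delimiters mentioned above that have a direct counterpart in Python's `re` module, white space in extended mode and an empty comment, can be sketched like this. Python is an assumption of this example: it has no \g{ syntax inside patterns, and its handling of an undelimited \10 is its own behaviour, not necessarily PCRE's. All names are hypothetical.

```python
import re

# Empty comment as delimiter: \1, then (?#), then a literal "0".
with_comment = re.compile(r"(a)\1(?#)0")

# Extended (verbose) mode: white space terminates the back reference.
with_space = re.compile(r"(a)\1 0", re.VERBOSE)

def undelimited_form_rejected():
    """Python reads \\10 as a reference to group 10, which is absent here."""
    try:
        re.compile(r"(a)\10")
    except re.error:
        return True
    return False
```

Both delimited patterns match the subject "aa0", that is, group 1, a back reference to it, and a literal digit.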
Line 2160 point in the pattern; the idea of DEFINE is that it ca	Line 2173 point in the pattern; the idea of DEFINE is that it ca
subroutines that can be referenced from elsewhere. (The use of	subroutines that can be referenced from elsewhere. (The use of
<a href="#subpatternsassubroutines">subroutines</a>	<a href="#subpatternsassubroutines">subroutines</a>
is described below.) For example, a pattern to match an IPv4 address such as	is described below.) For example, a pattern to match an IPv4 address such as
"192.168.23.245" could be written like this (ignore whitespace and line	"192.168.23.245" could be written like this (ignore white space and line
breaks):	breaks):
<pre>	<pre>
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )	(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
Line 2554 exception: the name from a (*MARK), (*PRUNE), or (*THE	Line 2567 exception: the name from a (*MARK), (*PRUNE), or (*THE
a successful positive assertion <i>is</i> passed back when a match succeeds	a successful positive assertion <i>is</i> passed back when a match succeeds
(compare capturing parentheses in assertions). Note that such subpatterns are	(compare capturing parentheses in assertions). Note that such subpatterns are
processed as anchored at the point where they are tested. Note also that Perl's	processed as anchored at the point where they are tested. Note also that Perl's
treatment of subroutines is different in some cases.	treatment of subroutines and assertions is different in some cases.
</P>	</P>
<P>	<P>
The new verbs make use of what was previously invalid syntax: an opening	The new verbs make use of what was previously invalid syntax: an opening
parenthesis followed by an asterisk. They are generally of the form	parenthesis followed by an asterisk. They are generally of the form
(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,	(*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
depending on whether or not an argument is present. A name is any sequence of	depending on whether or not an argument is present. A name is any sequence of
characters that does not include a closing parenthesis. If the name is empty,	characters that does not include a closing parenthesis. The maximum length of
that is, if the closing parenthesis immediately follows the colon, the effect	name is 255 in the 8-bit library and 65535 in the 16-bit library. If the name
is as if the colon were not there. Any number of these verbs may occur in a	is empty, that is, if the closing parenthesis immediately follows the colon,
pattern.	the effect is as if the colon were not there. Any number of these verbs may
</P>	occur in a pattern.
	<a name="nooptimize"></a></P>
	<br><b>
	Optimizations that affect backtracking verbs
	</b><br>
<P>	<P>
PCRE contains some optimizations that are used to speed up matching by running	PCRE contains some optimizations that are used to speed up matching by running
some checks at the start of each match attempt. For example, it may know the	some checks at the start of each match attempt. For example, it may know the
Line 2574 present. When one of these optimizations suppresses th	Line 2591 present. When one of these optimizations suppresses th
included backtracking verbs will not, of course, be processed. You can suppress	included backtracking verbs will not, of course, be processed. You can suppress
the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option	the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option
when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the	when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the
pattern with (*NO_START_OPT).	pattern with (*NO_START_OPT). There is more discussion of this option in the
	section entitled
	<a href="pcreapi.html#execoptions">"Option bits for <b>pcre_exec()</b>"</a>
	in the
	<a href="pcreapi.html"><b>pcreapi</b></a>
	documentation.
</P>	</P>
<P>	<P>
Experiments with Perl suggest that it too has similar optimizations, sometimes	Experiments with Perl suggest that it too has similar optimizations, sometimes
Line 2662 After a partial match or a failed match, the name of t	Line 2684 After a partial match or a failed match, the name of t
No match, mark = B	No match, mark = B
</pre>	</pre>
Note that in this unanchored example the mark is retained from the match	Note that in this unanchored example the mark is retained from the match
attempt that started at the letter "X". Subsequent match attempts starting at	attempt that started at the letter "X" in the subject. Subsequent match
"P" and then with an empty string do not get as far as the (*MARK) item, but	attempts starting at "P" and then with an empty string do not get as far as the
nevertheless do not reset it.	(*MARK) item, but nevertheless do not reset it.
</P>	</P>
	<P>
	If you are interested in (*MARK) values after failed matches, you should
	probably set the PCRE_NO_START_OPTIMIZE option
	<a href="#nooptimize">(see above)</a>
	to ensure that the match is always attempted.
	</P>
<br><b>	<br><b>
Verbs that act after backtracking	Verbs that act after backtracking
</b><br>	</b><br>
Line 2843 Cambridge CB2 3QH, England.	Line 2871 Cambridge CB2 3QH, England.
</P>	</P>
<br><a name="SEC28" href="#TOC1">REVISION</a><br>	<br><a name="SEC28" href="#TOC1">REVISION</a><br>
<P>	<P>
Last updated: 09 January 2012	Last updated: 17 June 2012
<br>	<br>
Copyright © 1997-2012 University of Cambridge.	Copyright © 1997-2012 University of Cambridge.
<br>	<br>
