<html>
<head>
<title>pcresyntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcresyntax man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
<li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
<li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
<li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
<li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation. This document contains a quick-reference summary of the syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>
</P>
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
<P>
<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
</PRE>
</P>
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one data unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal whitespace character
  \H         a character that is not a horizontal whitespace character
  \N         a character that is not a newline
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a whitespace character
  \S         a character that is not a whitespace character
  \v         a vertical whitespace character
  \V         a character that is not a vertical whitespace character
  \w         a "word" character
  \W         a "non-word" character
  \X         an extended Unicode sequence
</pre>
In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
characters, even in a UTF mode. However, this can be changed by setting the
PCRE_UCP option.
</P>
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&amp;         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, FF, CR
  Xwd        Perl word: property Xan or underscore
</PRE>
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bengali,
Bopomofo,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Meetei_Mayek,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Shavian,
Sinhala,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       whitespace
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\Q...\E inside a character class.
</P>
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
</PRE>
</P>
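<P>
The difference between greedy and lazy quantifiers is easiest to see on a short
subject. The sketch below uses Python's <b>re</b> module rather than PCRE itself,
on the assumption that it treats these particular quantifiers the same way; the
subject strings are made up for illustration, and possessive quantifiers are left
out because older Python versions do not accept them.
</P>

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then gives back only what is
# needed for the final ">" to match.
greedy = re.match(r"<.+>", subject).group(0)

# Lazy: .+? takes as little as it can, so the match stops at the first ">".
lazy = re.match(r"<.+?>", subject).group(0)

# Bounded repetition: {2,3} prefers the upper bound, {2,3}? the lower bound.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group(0)
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group(0)

print(greedy, lazy, bounded_greedy, bounded_lazy)
```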
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
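<P>
As a rough illustration of how multiline mode changes ^ and $, and of what \b
asserts, here is a sketch using Python's <b>re</b> module. It is an assumption
that Python matches PCRE for these cases; \Z and \z are deliberately not used
because Python gives \Z a different meaning. The subject strings are invented
for the example.
</P>

```python
import re

subject = "one\ntwo\n"

# Default mode: ^ matches only at the start of the subject, and $ only at
# the end or before a final newline, so no single line is matched here.
default_mode = re.findall(r"^\w+$", subject)

# Multiline mode: ^ and $ also match around internal newlines.
multiline = re.findall(r"^\w+$", subject, re.MULTILINE)

# \A is the start of the subject regardless of multiline mode.
at_start = re.findall(r"\A\w+", subject, re.MULTILINE)

# \b requires a word boundary on each side, so "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(default_mode, multiline, at_start, whole_words)
```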
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
  \K          reset start of match
</PRE>
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capturing group
  (?&lt;name&gt;...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P&lt;name&gt;...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>
</P>
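<P>
The following sketch shows how named, numbered, and non-capturing groups affect
what is captured. It uses Python's <b>re</b> module with the Python spelling
(?P&lt;name&gt;...), on the assumption that group numbering works as in PCRE;
the pattern and the date string are illustrative only.
</P>

```python
import re

# Group 1 is named "year", the middle group is non-capturing,
# so the day becomes group 2.
m = re.match(r"(?P<year>\d{4})-(?:\d{2})-(\d{2})", "2012-01-10")

year = m.group("year")   # access by name
day = m.group(2)         # access by number
captured = m.groups()    # only capturing groups appear here

print(year, day, captured)
```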
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?&gt;...)         atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
</pre>
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
<pre>
  (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
</PRE>
</P>
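<P>
A small sketch of the (?i), (?s), and (?x) inline options, written with Python's
<b>re</b> module as a stand-in for PCRE. This assumes the three options behave
alike in both engines; (?J), (?U), and the (*...) items above are PCRE-specific
and are not shown. The options are placed at the very start of each pattern.
</P>

```python
import re

# (?i): letters match regardless of case.
caseless = re.match(r"(?i)pcre", "PCRE") is not None

# Without (?s) the dot does not match a newline; with it, it does.
dot_default = re.match(r"a.b", "a\nb") is not None
dot_all = re.match(r"(?s)a.b", "a\nb") is not None

# (?x): white space in the pattern is ignored, and # starts a comment
# that runs to the end of the line.
verbose = re.match(r"""(?x)
    \d+     # one or more digits
    \s*     # optional white space
    [a-z]+  # lower case letters
""", "42 abc")

print(caseless, dot_default, dot_all, verbose.group(0))
```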
<br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?&lt;=...)        positive look behind
  (?&lt;!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
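<P>
The four assertions can be compared side by side on small subjects. This is a
sketch using Python's <b>re</b> module, assuming it agrees with PCRE for these
simple fixed-length cases; the subject strings are invented for the example.
</P>

```python
import re

prices = "10 USD, 20 EUR, 30 USD"

# Positive look ahead: digits that are followed by " USD".
usd = re.findall(r"\d+(?= USD)", prices)

# Negative look ahead: whole numbers that are not followed by " USD".
# The \b stops the engine from backtracking into part of a number.
not_usd = re.findall(r"\d+\b(?! USD)", prices)

amounts = "$5 and 7"

# Positive look behind: digits directly preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", amounts)

# Negative look behind: numbers not preceded by a dollar sign.
no_dollar = re.findall(r"(?<!\$)\b\d+", amounts)

print(usd, not_usd, after_dollar, no_dollar)
```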
<br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k&lt;name&gt;        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
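<P>
A backreference matches the same text that the referenced group captured, not
the group's pattern. The sketch below shows the numbered form \n and the Python
named form (?P=name) using Python's <b>re</b> module; the \g and \k forms listed
above are not part of that module and are not shown. The subject strings are
illustrative.
</P>

```python
import re

# \1 must repeat exactly what group 1 captured, so this finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(0)

# (?P=q) must match the same quote character that opened the string.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group(0)

# A mismatched closing quote does not satisfy the backreference.
mismatched = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi\" now")

print(doubled, quoted, mismatched)
```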
<br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&amp;name)        call subpattern by name (Perl)
  (?P&gt;name)       call subpattern by name (Python)
  \g&lt;name&gt;        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g&lt;n&gt;           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g&lt;+n&gt;          call subpattern by relative number (PCRE extension)
  \g'+n'          call subpattern by relative number (PCRE extension)
  \g&lt;-n&gt;          call subpattern by relative number (PCRE extension)
  \g'-n'          call subpattern by relative number (PCRE extension)
</PRE>
</P>
<br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(&lt;name&gt;)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&amp;name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
</PRE>
</P>
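<P>
The absolute reference condition (?(n)...) can be illustrated with a pattern
that requires a closing parenthesis only if an opening one was matched. This
sketch uses Python's <b>re</b> module, which supports this one form of
condition; the recursion, DEFINE, and assertion conditions are not available
there. The pattern is a made-up example.
</P>

```python
import re

# Group 1 optionally matches "(". The condition (?(1)\)) then demands ")"
# only when group 1 took part in the match.
pattern = re.compile(r"(\()?\d+(?(1)\))")

with_parens = pattern.fullmatch("(12)") is not None
without_parens = pattern.fullmatch("12") is not None
missing_close = pattern.fullmatch("(12") is not None
stray_close = pattern.fullmatch("12)") is not None

print(with_parens, without_parens, missing_close, stray_close)
```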
<br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act as soon as they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*BSR_...), (*UTF8), (*UTF16) or (*UCP) option.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or a UTF or UCP mode.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)      callout
  (?Cn)     callout with data n
</PRE>
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
<b>pcrematching</b>(3), <b>pcre</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 10 January 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>