Annotation of embedaddon/pcre/doc/html/pcresyntax.html, revision 1.1.1.1
1.1 misho 1: <html>
2: <head>
3: <title>pcresyntax specification</title>
4: </head>
5: <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6: <h1>pcresyntax man page</h1>
7: <p>
8: Return to the <a href="index.html">PCRE index page</a>.
9: </p>
10: <p>
11: This page is part of the PCRE HTML documentation. It was generated automatically
12: from the original man page. If there is any nonsense in it, please consult the
13: man page, in case the conversion went wrong.
14: <br>
15: <ul>
16: <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
17: <li><a name="TOC2" href="#SEC2">QUOTING</a>
18: <li><a name="TOC3" href="#SEC3">CHARACTERS</a>
19: <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
20: <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
21: <li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
22: <li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
23: <li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
24: <li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
25: <li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
26: <li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
27: <li><a name="TOC12" href="#SEC12">ALTERNATION</a>
28: <li><a name="TOC13" href="#SEC13">CAPTURING</a>
29: <li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
30: <li><a name="TOC15" href="#SEC15">COMMENT</a>
31: <li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
32: <li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
33: <li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
34: <li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
35: <li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
36: <li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
37: <li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
38: <li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
39: <li><a name="TOC24" href="#SEC24">CALLOUTS</a>
40: <li><a name="TOC25" href="#SEC25">SEE ALSO</a>
41: <li><a name="TOC26" href="#SEC26">AUTHOR</a>
42: <li><a name="TOC27" href="#SEC27">REVISION</a>
43: </ul>
44: <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
45: <P>
46: The full syntax and semantics of the regular expressions that are supported by
47: PCRE are described in the
48: <a href="pcrepattern.html"><b>pcrepattern</b></a>
49: documentation. This document contains just a quick-reference summary of the
50: syntax.
51: </P>
52: <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
53: <P>
54: <pre>
55: \x where x is non-alphanumeric is a literal x
56: \Q...\E treat enclosed characters as literal
57: </PRE>
58: </P>
59: <br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
60: <P>
61: <pre>
62: \a alarm, that is, the BEL character (hex 07)
63: \cx "control-x", where x is any ASCII character
64: \e escape (hex 1B)
65: \f formfeed (hex 0C)
66: \n newline (hex 0A)
67: \r carriage return (hex 0D)
68: \t tab (hex 09)
69: \ddd character with octal code ddd, or backreference
70: \xhh character with hex code hh
71: \x{hhh..} character with hex code hhh..
72: </PRE>
73: </P>
74: <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
75: <P>
76: <pre>
77: . any character except newline;
78:     in dotall mode, any character whatsoever
79: \C one byte, even in UTF-8 mode (best avoided)
80: \d a decimal digit
81: \D a character that is not a decimal digit
82: \h a horizontal whitespace character
83: \H a character that is not a horizontal whitespace character
84: \N a character that is not a newline
85: \p{<i>xx</i>} a character with the <i>xx</i> property
86: \P{<i>xx</i>} a character without the <i>xx</i> property
87: \R a newline sequence
88: \s a whitespace character
89: \S a character that is not a whitespace character
90: \v a vertical whitespace character
91: \V a character that is not a vertical whitespace character
92: \w a "word" character
93: \W a "non-word" character
94: \X an extended Unicode sequence
95: </pre>
96: In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
97: characters, even in UTF-8 mode. However, this can be changed by setting the
98: PCRE_UCP option.
99: </P>
100: <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
101: <P>
102: <pre>
103: C Other
104: Cc Control
105: Cf Format
106: Cn Unassigned
107: Co Private use
108: Cs Surrogate
109:
110: L Letter
111: Ll Lower case letter
112: Lm Modifier letter
113: Lo Other letter
114: Lt Title case letter
115: Lu Upper case letter
116: L&amp; Ll, Lu, or Lt
117:
118: M Mark
119: Mc Spacing mark
120: Me Enclosing mark
121: Mn Non-spacing mark
122:
123: N Number
124: Nd Decimal number
125: Nl Letter number
126: No Other number
127:
128: P Punctuation
129: Pc Connector punctuation
130: Pd Dash punctuation
131: Pe Close punctuation
132: Pf Final punctuation
133: Pi Initial punctuation
134: Po Other punctuation
135: Ps Open punctuation
136:
137: S Symbol
138: Sc Currency symbol
139: Sk Modifier symbol
140: Sm Mathematical symbol
141: So Other symbol
142:
143: Z Separator
144: Zl Line separator
145: Zp Paragraph separator
146: Zs Space separator
147: </PRE>
148: </P>
149: <br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
150: <P>
151: <pre>
152: Xan Alphanumeric: union of properties L and N
153: Xps POSIX space: property Z or tab, NL, VT, FF, CR
154: Xsp Perl space: property Z or tab, NL, FF, CR
155: Xwd Perl word: property Xan or underscore
156: </PRE>
157: </P>
158: <br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
159: <P>
160: Arabic,
161: Armenian,
162: Avestan,
163: Balinese,
164: Bamum,
165: Bengali,
166: Bopomofo,
167: Braille,
168: Buginese,
169: Buhid,
170: Canadian_Aboriginal,
171: Carian,
172: Cham,
173: Cherokee,
174: Common,
175: Coptic,
176: Cuneiform,
177: Cypriot,
178: Cyrillic,
179: Deseret,
180: Devanagari,
181: Egyptian_Hieroglyphs,
182: Ethiopic,
183: Georgian,
184: Glagolitic,
185: Gothic,
186: Greek,
187: Gujarati,
188: Gurmukhi,
189: Han,
190: Hangul,
191: Hanunoo,
192: Hebrew,
193: Hiragana,
194: Imperial_Aramaic,
195: Inherited,
196: Inscriptional_Pahlavi,
197: Inscriptional_Parthian,
198: Javanese,
199: Kaithi,
200: Kannada,
201: Katakana,
202: Kayah_Li,
203: Kharoshthi,
204: Khmer,
205: Lao,
206: Latin,
207: Lepcha,
208: Limbu,
209: Linear_B,
210: Lisu,
211: Lycian,
212: Lydian,
213: Malayalam,
214: Meetei_Mayek,
215: Mongolian,
216: Myanmar,
217: New_Tai_Lue,
218: Nko,
219: Ogham,
220: Old_Italic,
221: Old_Persian,
222: Old_South_Arabian,
223: Old_Turkic,
224: Ol_Chiki,
225: Oriya,
226: Osmanya,
227: Phags_Pa,
228: Phoenician,
229: Rejang,
230: Runic,
231: Samaritan,
232: Saurashtra,
233: Shavian,
234: Sinhala,
235: Sundanese,
236: Syloti_Nagri,
237: Syriac,
238: Tagalog,
239: Tagbanwa,
240: Tai_Le,
241: Tai_Tham,
242: Tai_Viet,
243: Tamil,
244: Telugu,
245: Thaana,
246: Thai,
247: Tibetan,
248: Tifinagh,
249: Ugaritic,
250: Vai,
251: Yi.
252: </P>
253: <br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
254: <P>
255: <pre>
256: [...] positive character class
257: [^...] negative character class
258: [x-y] range (can be used for hex characters)
259: [[:xxx:]] positive POSIX named set
260: [[:^xxx:]] negative POSIX named set
261:
262: alnum alphanumeric
263: alpha alphabetic
264: ascii 0-127
265: blank space or tab
266: cntrl control character
267: digit decimal digit
268: graph printing, excluding space
269: lower lower case letter
270: print printing, including space
271: punct printing, excluding alphanumeric
272: space whitespace
273: upper upper case letter
274: word same as \w
275: xdigit hexadecimal digit
276: </pre>
277: In PCRE, POSIX character set names recognize only ASCII characters by default,
278: but some of them use Unicode properties if PCRE_UCP is set. You can use
279: \Q...\E inside a character class.
280: </P>
281: <br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
282: <P>
283: <pre>
284: ? 0 or 1, greedy
285: ?+ 0 or 1, possessive
286: ?? 0 or 1, lazy
287: * 0 or more, greedy
288: *+ 0 or more, possessive
289: *? 0 or more, lazy
290: + 1 or more, greedy
291: ++ 1 or more, possessive
292: +? 1 or more, lazy
293: {n} exactly n
294: {n,m} at least n, no more than m, greedy
295: {n,m}+ at least n, no more than m, possessive
296: {n,m}? at least n, no more than m, lazy
297: {n,} n or more, greedy
298: {n,}+ n or more, possessive
299: {n,}? n or more, lazy
300: </PRE>
301: </P>
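<P>
The difference between greedy and lazy repetition is easiest to see on a small
subject string. The following is a minimal sketch using Python's <b>re</b>
module rather than PCRE itself; it is assumed here only because it shares the
greedy, lazy, and {n,m} syntax listed above. The subject strings are made up
for illustration.
</P>

```python
import re

subject = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much as it can, then backtracks to the last '>'.
greedy = re.search(r"<.*>", subject).group(0)

# Lazy: .*? takes as little as it can, so it stops at the first '>'.
lazy = re.search(r"<.*?>", subject).group(0)

# Bounded repetition: {2,3} matches two or three digits, greedy by default.
bounded = re.findall(r"\d{2,3}", "7 42 12345")

print(greedy)
print(lazy)
print(bounded)
```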
302: <br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
303: <P>
304: <pre>
305: \b word boundary
306: \B not a word boundary
307: ^ start of subject
308:     also after internal newline in multiline mode
309: \A start of subject
310: $ end of subject
311:     also before newline at end of subject
312:     also before internal newline in multiline mode
313: \Z end of subject
314:     also before newline at end of subject
315: \z end of subject
316: \G first matching position in subject
317: </PRE>
318: </P>
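<P>
As a rough illustration of how ^, $, and \b behave, here is a short sketch in
Python's <b>re</b> module, which is assumed to agree with PCRE for these three
items. \Z is deliberately left out because Python's <b>re</b> gives it a
different meaning from the one listed above. The subject strings are
illustrative only.
</P>

```python
import re

subject = "first\nsecond\n"

# Default mode: ^ matches only at the very start of the subject.
start_only = re.findall(r"^\w+", subject)

# Multiline mode: ^ also matches after each internal newline.
each_line = re.findall(r"(?m)^\w+", subject)

# $ matches before a newline that is the last character of the subject.
dollar_before_final_newline = re.search(r"second$", subject) is not None

# \b matches only where a word character meets a non-word character,
# so the "cat" inside "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(start_only, each_line, dollar_before_final_newline, whole_words)
```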
319: <br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
320: <P>
321: <pre>
322: \K reset start of match
323: </PRE>
324: </P>
325: <br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
326: <P>
327: <pre>
328: expr|expr|expr...
329: </PRE>
330: </P>
331: <br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
332: <P>
333: <pre>
334: (...) capturing group
335: (?&lt;name&gt;...) named capturing group (Perl)
336: (?'name'...) named capturing group (Perl)
337: (?P&lt;name&gt;...) named capturing group (Python)
338: (?:...) non-capturing group
339: (?|...) non-capturing group; reset group numbers for
340:     capturing groups in each alternative
341: </PRE>
342: </P>
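<P>
The sketch below shows numbered, named, and non-capturing groups side by side.
It uses Python's <b>re</b> module as a stand-in, so only the Python-style
(?P&lt;name&gt;...) spelling is exercised; the group names "year" and "month"
are invented for the example.
</P>

```python
import re

# Named groups can be read back by name or by position.
m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2010-11")
year = m.group("year")
month = m.group(2)

# (?:...) groups without capturing, so only (c) receives a number.
m2 = re.match(r"(?:ab)+(c)", "ababc")
captured = m2.groups()

print(year, month, captured)
```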
343: <br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
344: <P>
345: <pre>
346: (?&gt;...) atomic, non-capturing group
347: </PRE>
348: </P>
349: <br><a name="SEC15" href="#TOC1">COMMENT</a><br>
350: <P>
351: <pre>
352: (?#....) comment (not nestable)
353: </PRE>
354: </P>
355: <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
356: <P>
357: <pre>
358: (?i) caseless
359: (?J) allow duplicate names
360: (?m) multiline
361: (?s) single line (dotall)
362: (?U) default ungreedy (lazy)
363: (?x) extended (ignore white space)
364: (?-...) unset option(s)
365: </pre>
366: The following are recognized only at the start of a pattern or after one of the
367: newline-setting options with similar syntax:
368: <pre>
369: (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
370: (*UTF8) set UTF-8 mode (PCRE_UTF8)
371: (*UCP) set PCRE_UCP (use Unicode properties for \d etc)
372: </PRE>
373: </P>
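<P>
To make the inline option letters concrete, here is a small sketch using
Python's <b>re</b> module, which is assumed to treat (?i), (?s), and (?x) at
the start of a pattern in the same way. The (*...) items above are specific to
PCRE and are not shown. Subject strings are illustrative.
</P>

```python
import re

# (?i): caseless matching.
caseless = re.search(r"(?i)pcre", "The PCRE library").group(0)

# (?s): dot also matches a newline; without it, the match fails.
dotall_on = re.search(r"(?s)a.b", "a\nb") is not None
dotall_off = re.search(r"a.b", "a\nb") is not None

# (?x): white space in the pattern is ignored, so it can be laid out freely.
extended = re.search(r"(?x) \d+ \s* kg", "12 kg").group(0)

print(caseless, dotall_on, dotall_off, extended)
```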
374: <br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
375: <P>
376: <pre>
377: (?=...) positive look ahead
378: (?!...) negative look ahead
379: (?&lt;=...) positive look behind
380: (?&lt;!...) negative look behind
381: </pre>
382: Each top-level branch of a look behind must be of a fixed length.
383: </P>
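<P>
The four assertions can be compared with the following sketch, written for
Python's <b>re</b> module on the assumption that it behaves like PCRE for
these simple fixed-length cases. None of the assertions consumes any of the
subject; the strings are made up for illustration.
</P>

```python
import re

# Positive look ahead: digits that are followed by "px".
ahead = re.findall(r"\d+(?=px)", "10px 20em 30px")

# Negative look ahead: "foo" that is not followed by "bar".
not_ahead = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# Positive look behind: digits that are preceded by a dollar sign.
behind = re.findall(r"(?<=\$)\d+", "cost $15, qty 3")

# Negative look behind: whole numbers that are not preceded by a dollar sign.
not_behind = re.findall(r"(?<!\$)\b\d+", "cost $15, qty 3")

print(ahead, not_ahead, behind, not_behind)
```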
384: <br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
385: <P>
386: <pre>
387: \n reference by number (can be ambiguous)
388: \gn reference by number
389: \g{n} reference by number
390: \g{-n} relative reference by number
391: \k&lt;name&gt; reference by name (Perl)
392: \k'name' reference by name (Perl)
393: \g{name} reference by name (Perl)
394: \k{name} reference by name (.NET)
395: (?P=name) reference by name (Python)
396: </PRE>
397: </P>
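<P>
A backreference matches the same text that the referenced group matched, not
the group's pattern again. The sketch below uses Python's <b>re</b> module,
which is assumed here because it accepts the \n and (?P=name) forms; the other
spellings in the table are not available there. The group name "q" is
invented for the example.
</P>

```python
import re

# \1 must repeat exactly what group 1 captured: finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best").group(0)

# (?P=q) must repeat the quote character that opened the string.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group(0)

print(doubled)
print(quoted)
```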
398: <br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
399: <P>
400: <pre>
401: (?R) recurse whole pattern
402: (?n) call subpattern by absolute number
403: (?+n) call subpattern by relative number
404: (?-n) call subpattern by relative number
405: (?&amp;name) call subpattern by name (Perl)
406: (?P&gt;name) call subpattern by name (Python)
407: \g&lt;name&gt; call subpattern by name (Oniguruma)
408: \g'name' call subpattern by name (Oniguruma)
409: \g&lt;n&gt; call subpattern by absolute number (Oniguruma)
410: \g'n' call subpattern by absolute number (Oniguruma)
411: \g&lt;+n&gt; call subpattern by relative number (PCRE extension)
412: \g'+n' call subpattern by relative number (PCRE extension)
413: \g&lt;-n&gt; call subpattern by relative number (PCRE extension)
414: \g'-n' call subpattern by relative number (PCRE extension)
415: </PRE>
416: </P>
417: <br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
418: <P>
419: <pre>
420: (?(condition)yes-pattern)
421: (?(condition)yes-pattern|no-pattern)
422:
423: (?(n)... absolute reference condition
424: (?(+n)... relative reference condition
425: (?(-n)... relative reference condition
426: (?(&lt;name&gt;)... named reference condition (Perl)
427: (?('name')... named reference condition (Perl)
428: (?(name)... named reference condition (PCRE)
429: (?(R)... overall recursion condition
430: (?(Rn)... specific group recursion condition
431: (?(R&amp;name)... specific recursion condition
432: (?(DEFINE)... define subpattern for reference
433: (?(assert)... assertion condition
434: </PRE>
435: </P>
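<P>
The absolute reference condition (?(n)... can be sketched as follows. The
example uses Python's <b>re</b> module, which is assumed to support only the
numbered and named reference conditions, not the recursion, DEFINE, or
assertion forms. The pattern requires a closing parenthesis only if an opening
one was matched.
</P>

```python
import re

# Group 1 optionally matches "("; the condition (?(1)\)) then demands ")"
# only when group 1 took part in the match.
pattern = re.compile(r"^(\()?\d+(?(1)\))$")

results = {s: pattern.match(s) is not None for s in ["(42)", "42", "(42", "42)"]}

print(results)
```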
436: <br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
437: <P>
438: The following act as soon as they are reached:
439: <pre>
440: (*ACCEPT) force successful match
441: (*FAIL) force backtrack; synonym (*F)
442: </pre>
443: The following act only when a subsequent match failure causes a backtrack to
444: reach them. They all force a match failure, but they differ in what happens
445: afterwards. Those that advance the start-of-match point do so only if the
446: pattern is not anchored.
447: <pre>
448: (*COMMIT) overall failure, no advance of starting point
449: (*PRUNE) advance to next starting character
450: (*SKIP) advance start to current matching position
451: (*THEN) local failure, backtrack to next alternation
452: </PRE>
453: </P>
454: <br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
455: <P>
456: These are recognized only at the very start of the pattern or after a
457: (*BSR_...) or (*UTF8) or (*UCP) option.
458: <pre>
459: (*CR) carriage return only
460: (*LF) linefeed only
461: (*CRLF) carriage return followed by linefeed
462: (*ANYCRLF) all three of the above
463: (*ANY) any Unicode newline sequence
464: </PRE>
465: </P>
466: <br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
467: <P>
468: These are recognized only at the very start of the pattern or after a
469: (*...) option that sets the newline convention or UTF-8 or UCP mode.
470: <pre>
471: (*BSR_ANYCRLF) CR, LF, or CRLF
472: (*BSR_UNICODE) any Unicode newline sequence
473: </PRE>
474: </P>
475: <br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
476: <P>
477: <pre>
478: (?C) callout
479: (?Cn) callout with data n
480: </PRE>
481: </P>
482: <br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
483: <P>
484: <b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
485: <b>pcrematching</b>(3), <b>pcre</b>(3).
486: </P>
487: <br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
488: <P>
489: Philip Hazel
490: <br>
491: University Computing Service
492: <br>
493: Cambridge CB2 3QH, England.
494: <br>
495: </P>
496: <br><a name="SEC27" href="#TOC1">REVISION</a><br>
497: <P>
498: Last updated: 21 November 2010
499: <br>
500: Copyright © 1997-2010 University of Cambridge.
501: <br>
502: <p>
503: Return to the <a href="index.html">PCRE index page</a>.
504: </p>