1: <html>
2: <head>
3: <title>pcresyntax specification</title>
4: </head>
5: <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6: <h1>pcresyntax man page</h1>
7: <p>
8: Return to the <a href="index.html">PCRE index page</a>.
9: </p>
10: <p>
11: This page is part of the PCRE HTML documentation. It was generated automatically
12: from the original man page. If there is any nonsense in it, please consult the
13: man page, in case the conversion went wrong.
14: <br>
15: <ul>
16: <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
17: <li><a name="TOC2" href="#SEC2">QUOTING</a>
18: <li><a name="TOC3" href="#SEC3">CHARACTERS</a>
19: <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
20: <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
21: <li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
22: <li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
23: <li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
24: <li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
25: <li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
26: <li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
27: <li><a name="TOC12" href="#SEC12">ALTERNATION</a>
28: <li><a name="TOC13" href="#SEC13">CAPTURING</a>
29: <li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
30: <li><a name="TOC15" href="#SEC15">COMMENT</a>
31: <li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
32: <li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
33: <li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
34: <li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
35: <li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
36: <li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
37: <li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
38: <li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
39: <li><a name="TOC24" href="#SEC24">CALLOUTS</a>
40: <li><a name="TOC25" href="#SEC25">SEE ALSO</a>
41: <li><a name="TOC26" href="#SEC26">AUTHOR</a>
42: <li><a name="TOC27" href="#SEC27">REVISION</a>
43: </ul>
44: <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
45: <P>
46: The full syntax and semantics of the regular expressions that are supported by
47: PCRE are described in the
48: <a href="pcrepattern.html"><b>pcrepattern</b></a>
49: documentation. This document contains a quick-reference summary of the syntax.
50: </P>
51: <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
52: <P>
53: <pre>
54: \x where x is non-alphanumeric is a literal x
55: \Q...\E treat enclosed characters as literal
56: </PRE>
57: </P>
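<P>
As a rough illustration of backslash quoting, the sketch below uses Python's
standard <b>re</b> module. This is an assumption of the example: <b>re</b> is
not PCRE, but it treats a backslash before a non-alphanumeric character the
same way. Python has no \Q...\E, so <b>re.escape()</b> is used as an
approximate stand-in, and the variable names are purely illustrative.
</P>

```python
import re

# Assumption: Python's re stands in for PCRE where the syntax coincides.

# "\." quotes the dot, so it matches only a literal "."
quoted = re.compile(r"a\.b")
literal_hit = quoted.fullmatch("a.b") is not None
wildcard_hit = quoted.fullmatch("axb") is not None

# re.escape() quotes a whole string, roughly what \Q...\E does in PCRE
escaped = re.compile(re.escape("1+1=2"))
escaped_hit = escaped.fullmatch("1+1=2") is not None

print(literal_hit, wildcard_hit, escaped_hit)
```

<P>
Only the quoted dot and the escaped string match; "axb" is rejected because
the dot is no longer a wildcard.
</P>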
58: <br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
59: <P>
60: <pre>
61: \a alarm, that is, the BEL character (hex 07)
62: \cx "control-x", where x is any ASCII character
63: \e escape (hex 1B)
64: \f form feed (hex 0C)
65: \n newline (hex 0A)
66: \r carriage return (hex 0D)
67: \t tab (hex 09)
68: \0dd character with octal code 0dd
69: \ddd character with octal code ddd, or backreference
70: \o{ddd..} character with octal code ddd..
71: \xhh character with hex code hh
72: \x{hhh..} character with hex code hhh..
73: </pre>
74: Note that \0dd is always an octal code, and that \8 and \9 are the literal
75: characters "8" and "9".
76: </P>
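<P>
The hex and three-digit octal escapes can be tried out with Python's
<b>re</b> module, used here only as a stand-in that happens to share this part
of the syntax. The example is a sketch under that assumption, not a PCRE
program.
</P>

```python
import re

# Assumption: Python's re stands in for PCRE where the syntax coincides.

# \x41 (hex) and \101 (three-digit octal) both denote "A"; \t is a tab
pattern = re.compile(r"\x41\101\t")
matched = pattern.fullmatch("AA\t") is not None

# \e, \o{...} and \x{...} from the table above are not accepted by
# Python's re, so they are not exercised here.
print(matched)
```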
77: <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
78: <P>
79: <pre>
80: . any character except newline;
81:     in dotall mode, any character whatsoever
82: \C one data unit, even in UTF mode (best avoided)
83: \d a decimal digit
84: \D a character that is not a decimal digit
85: \h a horizontal white space character
86: \H a character that is not a horizontal white space character
87: \N a character that is not a newline
88: \p{<i>xx</i>} a character with the <i>xx</i> property
89: \P{<i>xx</i>} a character without the <i>xx</i> property
90: \R a newline sequence
91: \s a white space character
92: \S a character that is not a white space character
93: \v a vertical white space character
94: \V a character that is not a vertical white space character
95: \w a "word" character
96: \W a "non-word" character
97: \X a Unicode extended grapheme cluster
98: </pre>
99: By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
100: or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
101: happening, \s and \w may also match characters with code points in the range
102: 128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
103: is changed to use Unicode properties and they match many more characters.
104: </P>
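<P>
The difference between ASCII-only and Unicode-property matching of \d can be
sketched with Python's <b>re</b> module. The mapping is an assumption made for
illustration: <b>re.ASCII</b> behaves roughly like PCRE's default, while
Python's default for text patterns is closer to what PCRE does with PCRE_UCP
set.
</P>

```python
import re

# Assumption: Python's re stands in for PCRE; re.ASCII approximates the
# PCRE default and Python's Unicode default approximates PCRE_UCP.

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, general category Nd

# With re.ASCII, \d is restricted to the ASCII digits 0-9
ascii_only = re.fullmatch(r"\d", arabic_three, re.ASCII) is not None

# Without it, \d matches any Unicode decimal digit
unicode_aware = re.fullmatch(r"\d", arabic_three) is not None

print(ascii_only, unicode_aware)
```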
105: <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
106: <P>
107: <pre>
108: C Other
109: Cc Control
110: Cf Format
111: Cn Unassigned
112: Co Private use
113: Cs Surrogate
114:
115: L Letter
116: Ll Lower case letter
117: Lm Modifier letter
118: Lo Other letter
119: Lt Title case letter
120: Lu Upper case letter
121: L&#38; Ll, Lu, or Lt
122:
123: M Mark
124: Mc Spacing mark
125: Me Enclosing mark
126: Mn Non-spacing mark
127:
128: N Number
129: Nd Decimal number
130: Nl Letter number
131: No Other number
132:
133: P Punctuation
134: Pc Connector punctuation
135: Pd Dash punctuation
136: Pe Close punctuation
137: Pf Final punctuation
138: Pi Initial punctuation
139: Po Other punctuation
140: Ps Open punctuation
141:
142: S Symbol
143: Sc Currency symbol
144: Sk Modifier symbol
145: Sm Mathematical symbol
146: So Other symbol
147:
148: Z Separator
149: Zl Line separator
150: Zp Paragraph separator
151: Zs Space separator
152: </PRE>
153: </P>
154: <br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
155: <P>
156: <pre>
157: Xan Alphanumeric: union of properties L and N
158: Xps POSIX space: property Z or tab, NL, VT, FF, CR
159: Xsp Perl space: property Z or tab, NL, VT, FF, CR
160: Xuc Univerally-named character: one that can be
161: represented by a Universal Character Name
162: Xwd Perl word: property Xan or underscore
163: </pre>
164: Perl and POSIX space are now the same. Perl added VT to its space character set
165: at release 5.18 and PCRE changed at release 8.34.
166: </P>
167: <br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
168: <P>
169: Arabic,
170: Armenian,
171: Avestan,
172: Balinese,
173: Bamum,
174: Batak,
175: Bengali,
176: Bopomofo,
177: Brahmi,
178: Braille,
179: Buginese,
180: Buhid,
181: Canadian_Aboriginal,
182: Carian,
183: Chakma,
184: Cham,
185: Cherokee,
186: Common,
187: Coptic,
188: Cuneiform,
189: Cypriot,
190: Cyrillic,
191: Deseret,
192: Devanagari,
193: Egyptian_Hieroglyphs,
194: Ethiopic,
195: Georgian,
196: Glagolitic,
197: Gothic,
198: Greek,
199: Gujarati,
200: Gurmukhi,
201: Han,
202: Hangul,
203: Hanunoo,
204: Hebrew,
205: Hiragana,
206: Imperial_Aramaic,
207: Inherited,
208: Inscriptional_Pahlavi,
209: Inscriptional_Parthian,
210: Javanese,
211: Kaithi,
212: Kannada,
213: Katakana,
214: Kayah_Li,
215: Kharoshthi,
216: Khmer,
217: Lao,
218: Latin,
219: Lepcha,
220: Limbu,
221: Linear_B,
222: Lisu,
223: Lycian,
224: Lydian,
225: Malayalam,
226: Mandaic,
227: Meetei_Mayek,
228: Meroitic_Cursive,
229: Meroitic_Hieroglyphs,
230: Miao,
231: Mongolian,
232: Myanmar,
233: New_Tai_Lue,
234: Nko,
235: Ogham,
236: Old_Italic,
237: Old_Persian,
238: Old_South_Arabian,
239: Old_Turkic,
240: Ol_Chiki,
241: Oriya,
242: Osmanya,
243: Phags_Pa,
244: Phoenician,
245: Rejang,
246: Runic,
247: Samaritan,
248: Saurashtra,
249: Sharada,
250: Shavian,
251: Sinhala,
252: Sora_Sompeng,
253: Sundanese,
254: Syloti_Nagri,
255: Syriac,
256: Tagalog,
257: Tagbanwa,
258: Tai_Le,
259: Tai_Tham,
260: Tai_Viet,
261: Takri,
262: Tamil,
263: Telugu,
264: Thaana,
265: Thai,
266: Tibetan,
267: Tifinagh,
268: Ugaritic,
269: Vai,
270: Yi.
271: </P>
272: <br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
273: <P>
274: <pre>
275: [...] positive character class
276: [^...] negative character class
277: [x-y] range (can be used for hex characters)
278: [[:xxx:]] positive POSIX named set
279: [[:^xxx:]] negative POSIX named set
280:
281: alnum alphanumeric
282: alpha alphabetic
283: ascii 0-127
284: blank space or tab
285: cntrl control character
286: digit decimal digit
287: graph printing, excluding space
288: lower lower case letter
289: print printing, including space
290: punct printing, excluding alphanumeric
291: space white space
292: upper upper case letter
293: word same as \w
294: xdigit hexadecimal digit
295: </pre>
296: In PCRE, POSIX character set names recognize only ASCII characters by default,
297: but some of them use Unicode properties if PCRE_UCP is set. You can use
298: \Q...\E inside a character class.
299: </P>
300: <br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
301: <P>
302: <pre>
303: ? 0 or 1, greedy
304: ?+ 0 or 1, possessive
305: ?? 0 or 1, lazy
306: * 0 or more, greedy
307: *+ 0 or more, possessive
308: *? 0 or more, lazy
309: + 1 or more, greedy
310: ++ 1 or more, possessive
311: +? 1 or more, lazy
312: {n} exactly n
313: {n,m} at least n, no more than m, greedy
314: {n,m}+ at least n, no more than m, possessive
315: {n,m}? at least n, no more than m, lazy
316: {n,} n or more, greedy
317: {n,}+ n or more, possessive
318: {n,}? n or more, lazy
319: </PRE>
320: </P>
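<P>
A small sketch of greedy versus lazy repetition, written with Python's
<b>re</b> module on the assumption that it behaves like PCRE for these forms.
The possessive forms are left out because older Python versions reject them.
</P>

```python
import re

# Assumption: Python's re stands in for PCRE where the syntax coincides.

subject = "<a><b>"

greedy = re.match(r"<.+>", subject).group()   # takes as much as it can
lazy = re.match(r"<.+?>", subject).group()    # takes as little as it can
bounded = re.fullmatch(r"\d{2,4}", "12345")   # five digits exceed {2,4}

print(greedy, lazy, bounded)
```

<P>
The greedy form consumes the whole subject, the lazy form stops at the first
closing bracket, and the bounded quantifier fails on a five-digit string.
</P>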
321: <br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
322: <P>
323: <pre>
324: \b word boundary
325: \B not a word boundary
326: ^ start of subject
327:     also after internal newline in multiline mode
328: \A start of subject
329: $ end of subject
330:     also before newline at end of subject
331:     also before internal newline in multiline mode
332: \Z end of subject
333:     also before newline at end of subject
334: \z end of subject
335: \G first matching position in subject
336: </PRE>
337: </P>
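<P>
The effect of multiline mode on ^ and the treatment of a final newline by $
can be sketched with Python's <b>re</b> module, assumed here to agree with
PCRE for these anchors. One naming difference matters: what this page calls \z
(absolute end of subject) is spelled \Z in Python.
</P>

```python
import re

# Assumption: Python's re stands in for PCRE where the syntax coincides.

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the very start of the subject
default_hits = re.findall(r"^\w+", subject)
# In multiline mode, ^ also matches after each internal newline
multiline_hits = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches before a newline that ends the subject
dollar_hit = re.search(r"two$", subject) is not None
# Python's \Z is the absolute end of subject (PCRE's \z), so it fails here
absolute_end_hit = re.search(r"two\Z", subject) is not None

print(default_hits, multiline_hits, dollar_hit, absolute_end_hit)
```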
338: <br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
339: <P>
340: <pre>
341: \K reset start of match
342: </PRE>
343: </P>
344: <br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
345: <P>
346: <pre>
347: expr|expr|expr...
348: </PRE>
349: </P>
350: <br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
351: <P>
352: <pre>
353: (...) capturing group
354: (?&#60;name&#62;...) named capturing group (Perl)
355: (?'name'...) named capturing group (Perl)
356: (?P&#60;name&#62;...) named capturing group (Python)
357: (?:...) non-capturing group
358: (?|...) non-capturing group; reset group numbers for
359:     capturing groups in each alternative
360: </PRE>
361: </P>
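<P>
The Python-style named group (?P&#60;name&#62;...) and the non-capturing group
(?:...) can be tried directly in Python's <b>re</b> module. The sketch below
assumes that module as a stand-in for PCRE; the group names and the sample date
are illustrative.
</P>

```python
import re

# Assumption: Python's re stands in for PCRE where the syntax coincides.

# Two named capturing groups and one non-capturing group for the month
pattern = re.compile(r"(?P<year>\d{4})-(?:\d{2})-(?P<day>\d{2})")
m = pattern.fullmatch("2013-11-12")

numbered = m.groups()                      # only capturing groups are numbered
by_name = (m.group("year"), m.group("day"))

print(numbered, by_name)
```

<P>
The month is matched but never numbered, so the day is group 2 rather than
group 3.
</P>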
362: <br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
363: <P>
364: <pre>
365: (?&#62;...) atomic, non-capturing group
366: </PRE>
367: </P>
368: <br><a name="SEC15" href="#TOC1">COMMENT</a><br>
369: <P>
370: <pre>
371: (?#....) comment (not nestable)
372: </PRE>
373: </P>
374: <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
375: <P>
376: <pre>
377: (?i) caseless
378: (?J) allow duplicate names
379: (?m) multiline
380: (?s) single line (dotall)
381: (?U) default ungreedy (lazy)
382: (?x) extended (ignore white space)
383: (?-...) unset option(s)
384: </pre>
385: The following are recognized only at the start of a pattern or after one of the
386: newline-setting options with similar syntax:
387: <pre>
388: (*LIMIT_MATCH=d) set the match limit to d (decimal number)
389: (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
390: (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
391: (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
392: (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
393: (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32)
394: (*UTF) set appropriate UTF mode for the library in use
395: (*UCP) set PCRE_UCP (use Unicode properties for \d etc)
396: </pre>
397: Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
398: limits set by the caller of pcre_exec(), not increase them.
399: </P>
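<P>
A sketch of the (?i), (?s) and (?x) settings using Python's <b>re</b> module,
which is assumed to treat them like PCRE. Python only accepts these unscoped
options at the very start of the pattern, so the example places them there;
(?J), (?U) and the (*...) items are not available in Python and are not shown.
</P>

```python
import re

# Assumption: Python's re stands in for PCRE where the syntax coincides.

caseless = re.fullmatch(r"(?i)pcre", "PCRE") is not None
dotall = re.fullmatch(r"(?s)a.b", "a\nb") is not None
no_dotall = re.fullmatch(r"a.b", "a\nb") is not None
# In extended mode, unescaped white space in the pattern is ignored
extended = re.fullmatch(r"(?x) a b c ", "abc") is not None

print(caseless, dotall, no_dotall, extended)
```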
400: <br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
401: <P>
402: <pre>
403: (?=...) positive look ahead
404: (?!...) negative look ahead
405: (?&#60;=...) positive look behind
406: (?&#60;!...) negative look behind
407: </pre>
408: Each top-level branch of a look behind must be of a fixed length.
409: </P>
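<P>
The four assertions can be sketched with Python's <b>re</b> module, assumed
here as a stand-in for PCRE. Python is stricter about look behind: it requires
the whole assertion to have one fixed width, whereas the rule above applies to
each top-level branch separately.
</P>

```python
import re

# Assumption: Python's re stands in for PCRE where the syntax coincides.

prices = "cost $42, qty 17"

# Positive look behind: digits preceded by "$"; the "$" is not consumed
after_dollar = re.findall(r"(?<=\$)\d+", prices)
# Negative look behind: digit runs not preceded by "$" or another digit
not_after_dollar = re.findall(r"(?<![$\d])\d+", prices)
# Positive look ahead: digits followed by a comma
before_comma = re.findall(r"\d+(?=,)", prices)
# Negative look ahead: "foo" not followed by "bar"
foo_not_bar = re.findall(r"foo(?!bar)\w*", "foobar foobaz")

# A variable-length look behind is rejected when the pattern is compiled
try:
    re.compile(r"(?<=a+)b")
    variable_rejected = False
except re.error:
    variable_rejected = True

print(after_dollar, not_after_dollar, before_comma, foo_not_bar,
      variable_rejected)
```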
410: <br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
411: <P>
412: <pre>
413: \n reference by number (can be ambiguous)
414: \gn reference by number
415: \g{n} reference by number
416: \g{-n} relative reference by number
417: \k&#60;name&#62; reference by name (Perl)
418: \k'name' reference by name (Perl)
419: \g{name} reference by name (Perl)
420: \k{name} reference by name (.NET)
421: (?P=name) reference by name (Python)
422: </PRE>
423: </P>
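<P>
A sketch of a numbered and a Python-style named backreference, used to find a
doubled word. It relies on Python's <b>re</b> module as an assumed stand-in
for PCRE; the \g and \k forms listed above are not accepted in Python patterns
and are omitted.
</P>

```python
import re

# Assumption: Python's re stands in for PCRE where the syntax coincides.

text = "this is is a test"

# \1 must repeat exactly what group 1 matched
by_number = re.search(r"\b(\w+) \1\b", text)
# (?P=word) does the same for the group named "word"
by_name = re.search(r"\b(?P<word>\w+) (?P=word)\b", text)

print(by_number.group(), by_name.group("word"))
```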
424: <br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
425: <P>
426: <pre>
427: (?R) recurse whole pattern
428: (?n) call subpattern by absolute number
429: (?+n) call subpattern by relative number
430: (?-n) call subpattern by relative number
431: (?&#38;name) call subpattern by name (Perl)
432: (?P&#62;name) call subpattern by name (Python)
433: \g&#60;name&#62; call subpattern by name (Oniguruma)
434: \g'name' call subpattern by name (Oniguruma)
435: \g&#60;n&#62; call subpattern by absolute number (Oniguruma)
436: \g'n' call subpattern by absolute number (Oniguruma)
437: \g&#60;+n&#62; call subpattern by relative number (PCRE extension)
438: \g'+n' call subpattern by relative number (PCRE extension)
439: \g&#60;-n&#62; call subpattern by relative number (PCRE extension)
440: \g'-n' call subpattern by relative number (PCRE extension)
441: </PRE>
442: </P>
443: <br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
444: <P>
445: <pre>
446: (?(condition)yes-pattern)
447: (?(condition)yes-pattern|no-pattern)
448:
449: (?(n)... absolute reference condition
450: (?(+n)... relative reference condition
451: (?(-n)... relative reference condition
452: (?(&#60;name&#62;)... named reference condition (Perl)
453: (?('name')... named reference condition (Perl)
454: (?(name)... named reference condition (PCRE)
455: (?(R)... overall recursion condition
456: (?(Rn)... specific group recursion condition
457: (?(R&#38;name)... specific recursion condition
458: (?(DEFINE)... define subpattern for reference
459: (?(assert)... assertion condition
460: </PRE>
461: </P>
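<P>
The absolute reference condition (?(n)...) can be sketched with Python's
<b>re</b> module, which is assumed here to handle this form like PCRE. The
pattern and sample strings are illustrative; the recursion, DEFINE and
assertion conditions are not available in Python and are not shown.
</P>

```python
import re

# Assumption: Python's re stands in for PCRE where the syntax coincides.

# If group 1 (an opening parenthesis) took part in the match, a closing
# parenthesis is required; otherwise nothing more is expected.
balanced = re.compile(r"(\()?\d+(?(1)\))")

results = {s: balanced.fullmatch(s) is not None
           for s in ["42", "(42)", "(42", "42)"]}

print(results)
```

<P>
Only "42" and "(42)" are accepted; an unpaired parenthesis on either side
makes the match fail.
</P>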
462: <br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
463: <P>
464: The following act as soon as they are reached:
465: <pre>
466: (*ACCEPT) force successful match
467: (*FAIL) force backtrack; synonym (*F)
468: (*MARK:NAME) set name to be passed back; synonym (*:NAME)
469: </pre>
470: The following act only when a subsequent match failure causes a backtrack to
471: reach them. They all force a match failure, but they differ in what happens
472: afterwards. Those that advance the start-of-match point do so only if the
473: pattern is not anchored.
474: <pre>
475: (*COMMIT) overall failure, no advance of starting point
476: (*PRUNE) advance to next starting character
477: (*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
478: (*SKIP) advance to current matching position
479: (*SKIP:NAME) advance to position corresponding to an earlier
480:     (*MARK:NAME); if not found, the (*SKIP) is ignored
481: (*THEN) local failure, backtrack to next alternation
482: (*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
483: </PRE>
484: </P>
485: <br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
486: <P>
487: These are recognized only at the very start of the pattern or after a
488: (*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
489: <pre>
490: (*CR) carriage return only
491: (*LF) linefeed only
492: (*CRLF) carriage return followed by linefeed
493: (*ANYCRLF) all three of the above
494: (*ANY) any Unicode newline sequence
495: </PRE>
496: </P>
497: <br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
498: <P>
499: These are recognized only at the very start of the pattern or after a
500: (*...) option that sets the newline convention or a UTF or UCP mode.
501: <pre>
502: (*BSR_ANYCRLF) CR, LF, or CRLF
503: (*BSR_UNICODE) any Unicode newline sequence
504: </PRE>
505: </P>
506: <br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
507: <P>
508: <pre>
509: (?C) callout
510: (?Cn) callout with data n
511: </PRE>
512: </P>
513: <br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
514: <P>
515: <b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
516: <b>pcrematching</b>(3), <b>pcre</b>(3).
517: </P>
518: <br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
519: <P>
520: Philip Hazel
521: <br>
522: University Computing Service
523: <br>
524: Cambridge CB2 3QH, England.
525: <br>
526: </P>
527: <br><a name="SEC27" href="#TOC1">REVISION</a><br>
528: <P>
529: Last updated: 12 November 2013
530: <br>
531: Copyright © 1997-2013 University of Cambridge.
532: <br>
533: <p>
534: Return to the <a href="index.html">PCRE index page</a>.
535: </p>