1: .TH PCRESYNTAX 3 "12 November 2013" "PCRE 8.34"
2: .SH NAME
3: PCRE - Perl-compatible regular expressions
4: .SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
5: .rs
6: .sp
7: The full syntax and semantics of the regular expressions that are supported by
8: PCRE are described in the
9: .\" HREF
10: \fBpcrepattern\fP
11: .\"
12: documentation. This document contains a quick-reference summary of the syntax.
13: .
14: .
15: .SH "QUOTING"
16: .rs
17: .sp
18: \ex where x is non-alphanumeric is a literal x
19: \eQ...\eE treat enclosed characters as literal
20: .
21: .
22: .SH "CHARACTERS"
23: .rs
24: .sp
25: \ea alarm, that is, the BEL character (hex 07)
26: \ecx "control-x", where x is any ASCII character
27: \ee escape (hex 1B)
28: \ef form feed (hex 0C)
29: \en newline (hex 0A)
30: \er carriage return (hex 0D)
31: \et tab (hex 09)
32: \e0dd character with octal code 0dd
33: \eddd character with octal code ddd, or backreference
34: \eo{ddd..} character with octal code ddd..
35: \exhh character with hex code hh
36: \ex{hhh..} character with hex code hhh..
37: .sp
38: Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
39: characters "8" and "9".
40: .
41: .
42: .SH "CHARACTER TYPES"
43: .rs
44: .sp
45: . any character except newline;
46: in dotall mode, any character whatsoever
47: \eC one data unit, even in UTF mode (best avoided)
48: \ed a decimal digit
49: \eD a character that is not a decimal digit
50: \eh a horizontal white space character
51: \eH a character that is not a horizontal white space character
52: \eN a character that is not a newline
53: \ep{\fIxx\fP} a character with the \fIxx\fP property
54: \eP{\fIxx\fP} a character without the \fIxx\fP property
55: \eR a newline sequence
56: \es a white space character
57: \eS a character that is not a white space character
58: \ev a vertical white space character
59: \eV a character that is not a vertical white space character
60: \ew a "word" character
61: \eW a "non-word" character
62: \eX a Unicode extended grapheme cluster
63: .sp
64: By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
65: or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
66: happening, \es and \ew may also match characters with code points in the range
67: 128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
68: is changed to use Unicode properties and they match many more characters.
69: .
70: .
71: .SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
72: .rs
73: .sp
74: C Other
75: Cc Control
76: Cf Format
77: Cn Unassigned
78: Co Private use
79: Cs Surrogate
80: .sp
81: L Letter
82: Ll Lower case letter
83: Lm Modifier letter
84: Lo Other letter
85: Lt Title case letter
86: Lu Upper case letter
87: L& Ll, Lu, or Lt
88: .sp
89: M Mark
90: Mc Spacing mark
91: Me Enclosing mark
92: Mn Non-spacing mark
93: .sp
94: N Number
95: Nd Decimal number
96: Nl Letter number
97: No Other number
98: .sp
99: P Punctuation
100: Pc Connector punctuation
101: Pd Dash punctuation
102: Pe Close punctuation
103: Pf Final punctuation
104: Pi Initial punctuation
105: Po Other punctuation
106: Ps Open punctuation
107: .sp
108: S Symbol
109: Sc Currency symbol
110: Sk Modifier symbol
111: Sm Mathematical symbol
112: So Other symbol
113: .sp
114: Z Separator
115: Zl Line separator
116: Zp Paragraph separator
117: Zs Space separator
118: .
119: .
120: .SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
121: .rs
122: .sp
123: Xan Alphanumeric: union of properties L and N
124: Xps POSIX space: property Z or tab, NL, VT, FF, CR
125: Xsp Perl space: property Z or tab, NL, VT, FF, CR
126: Xuc Universally-named character: one that can be
127: represented by a Universal Character Name
128: Xwd Perl word: property Xan or underscore
129: .sp
130: Perl and POSIX space are now the same. Perl added VT to its space character set
131: at release 5.18 and PCRE changed at release 8.34.
132: .
133: .
134: .SH "SCRIPT NAMES FOR \ep AND \eP"
135: .rs
136: .sp
137: Arabic,
138: Armenian,
139: Avestan,
140: Balinese,
141: Bamum,
142: Batak,
143: Bengali,
144: Bopomofo,
145: Brahmi,
146: Braille,
147: Buginese,
148: Buhid,
149: Canadian_Aboriginal,
150: Carian,
151: Chakma,
152: Cham,
153: Cherokee,
154: Common,
155: Coptic,
156: Cuneiform,
157: Cypriot,
158: Cyrillic,
159: Deseret,
160: Devanagari,
161: Egyptian_Hieroglyphs,
162: Ethiopic,
163: Georgian,
164: Glagolitic,
165: Gothic,
166: Greek,
167: Gujarati,
168: Gurmukhi,
169: Han,
170: Hangul,
171: Hanunoo,
172: Hebrew,
173: Hiragana,
174: Imperial_Aramaic,
175: Inherited,
176: Inscriptional_Pahlavi,
177: Inscriptional_Parthian,
178: Javanese,
179: Kaithi,
180: Kannada,
181: Katakana,
182: Kayah_Li,
183: Kharoshthi,
184: Khmer,
185: Lao,
186: Latin,
187: Lepcha,
188: Limbu,
189: Linear_B,
190: Lisu,
191: Lycian,
192: Lydian,
193: Malayalam,
194: Mandaic,
195: Meetei_Mayek,
196: Meroitic_Cursive,
197: Meroitic_Hieroglyphs,
198: Miao,
199: Mongolian,
200: Myanmar,
201: New_Tai_Lue,
202: Nko,
203: Ogham,
204: Old_Italic,
205: Old_Persian,
206: Old_South_Arabian,
207: Old_Turkic,
208: Ol_Chiki,
209: Oriya,
210: Osmanya,
211: Phags_Pa,
212: Phoenician,
213: Rejang,
214: Runic,
215: Samaritan,
216: Saurashtra,
217: Sharada,
218: Shavian,
219: Sinhala,
220: Sora_Sompeng,
221: Sundanese,
222: Syloti_Nagri,
223: Syriac,
224: Tagalog,
225: Tagbanwa,
226: Tai_Le,
227: Tai_Tham,
228: Tai_Viet,
229: Takri,
230: Tamil,
231: Telugu,
232: Thaana,
233: Thai,
234: Tibetan,
235: Tifinagh,
236: Ugaritic,
237: Vai,
238: Yi.
239: .
240: .
241: .SH "CHARACTER CLASSES"
242: .rs
243: .sp
244: [...] positive character class
245: [^...] negative character class
246: [x-y] range (can be used for hex characters)
247: [[:xxx:]] positive POSIX named set
248: [[:^xxx:]] negative POSIX named set
249: .sp
250: alnum alphanumeric
251: alpha alphabetic
252: ascii 0-127
253: blank space or tab
254: cntrl control character
255: digit decimal digit
256: graph printing, excluding space
257: lower lower case letter
258: print printing, including space
259: punct printing, excluding alphanumeric
260: space white space
261: upper upper case letter
262: word same as \ew
263: xdigit hexadecimal digit
264: .sp
265: In PCRE, POSIX character set names recognize only ASCII characters by default,
266: but some of them use Unicode properties if PCRE_UCP is set. You can use
267: \eQ...\eE inside a character class.
268: .
269: .
270: .SH "QUANTIFIERS"
271: .rs
272: .sp
273: ? 0 or 1, greedy
274: ?+ 0 or 1, possessive
275: ?? 0 or 1, lazy
276: * 0 or more, greedy
277: *+ 0 or more, possessive
278: *? 0 or more, lazy
279: + 1 or more, greedy
280: ++ 1 or more, possessive
281: +? 1 or more, lazy
282: {n} exactly n
283: {n,m} at least n, no more than m, greedy
284: {n,m}+ at least n, no more than m, possessive
285: {n,m}? at least n, no more than m, lazy
286: {n,} n or more, greedy
287: {n,}+ n or more, possessive
288: {n,}? n or more, lazy
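The difference between greedy and lazy quantifiers can be sketched as follows. This example uses Python's re module rather than PCRE itself; that choice is an assumption made for illustration, relying on the fact that both engines share the greedy and lazy quantifier syntax. The subject string is invented for the example.

```python
import re

# Hypothetical subject string, chosen only to show the contrast.
subject = "<a><b>"

# Greedy "+" takes as much as it can, so the match runs to the last ">".
greedy = re.search("<.+>", subject).group(0)

# Lazy "+?" takes as little as it can, so the match stops at the first ">".
lazy = re.search("<.+?>", subject).group(0)

print(greedy)  # <a><b>
print(lazy)    # <a>
```

Possessive quantifiers are left out of the sketch because older versions of Python's re module do not accept them.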
289: .
290: .
291: .SH "ANCHORS AND SIMPLE ASSERTIONS"
292: .rs
293: .sp
294: \eb word boundary
295: \eB not a word boundary
296: ^ start of subject
297: also after internal newline in multiline mode
298: \eA start of subject
299: $ end of subject
300: also before newline at end of subject
301: also before internal newline in multiline mode
302: \eZ end of subject
303: also before newline at end of subject
304: \ez end of subject
305: \eG first matching position in subject
306: .
307: .
308: .SH "MATCH POINT RESET"
309: .rs
310: .sp
311: \eK reset start of match
312: .
313: .
314: .SH "ALTERNATION"
315: .rs
316: .sp
317: expr|expr|expr...
318: .
319: .
320: .SH "CAPTURING"
321: .rs
322: .sp
323: (...) capturing group
324: (?<name>...) named capturing group (Perl)
325: (?'name'...) named capturing group (Perl)
326: (?P<name>...) named capturing group (Python)
327: (?:...) non-capturing group
328: (?|...) non-capturing group; reset group numbers for
329: capturing groups in each alternative
330: .
331: .
332: .SH "ATOMIC GROUPS"
333: .rs
334: .sp
335: (?>...) atomic, non-capturing group
336: .
337: .
338: .
339: .
340: .SH "COMMENT"
341: .rs
342: .sp
343: (?#....) comment (not nestable)
344: .
345: .
346: .SH "OPTION SETTING"
347: .rs
348: .sp
349: (?i) caseless
350: (?J) allow duplicate names
351: (?m) multiline
352: (?s) single line (dotall)
353: (?U) default ungreedy (lazy)
354: (?x) extended (ignore white space)
355: (?-...) unset option(s)
356: .sp
357: The following are recognized only at the start of a pattern or after one of the
358: newline-setting options with similar syntax:
359: .sp
360: (*LIMIT_MATCH=d) set the match limit to d (decimal number)
361: (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
362: (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
363: (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
364: (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
365: (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32)
366: (*UTF) set appropriate UTF mode for the library in use
367: (*UCP) set PCRE_UCP (use Unicode properties for \ed etc)
368: .sp
369: Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
370: limits set by the caller of pcre_exec(), not increase them.
371: .
372: .
373: .SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
374: .rs
375: .sp
376: (?=...) positive look ahead
377: (?!...) negative look ahead
378: (?<=...) positive look behind
379: (?<!...) negative look behind
380: .sp
381: Each top-level branch of a look behind must be of a fixed length.
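A minimal sketch of fixed-length look behind assertions follows. It uses Python's re module, not PCRE; this is an assumption made for illustration, since both accept the same assertion syntax for the simple cases shown. The sample text is invented.

```python
import re

# Hypothetical sample text.
text = "cost: $42, tax: $7, qty: 3"

# Positive look behind: digits preceded by "$". The "$" is tested but
# is not part of the matched text.
after_dollar = re.findall("(?<=[$])[0-9]+", text)

# Negative look behind: digits preceded by neither "$" nor another digit.
# The assertion has a fixed length of one character.
plain_numbers = re.findall("(?<![$0-9])[0-9]+", text)

print(after_dollar)   # ['42', '7']
print(plain_numbers)  # ['3']
```

Both assertions here are one character long, so they satisfy the fixed-length rule in either engine.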
382: .
383: .
384: .SH "BACKREFERENCES"
385: .rs
386: .sp
387: \en reference by number (can be ambiguous)
388: \egn reference by number
389: \eg{n} reference by number
390: \eg{-n} relative reference by number
391: \ek<name> reference by name (Perl)
392: \ek'name' reference by name (Perl)
393: \eg{name} reference by name (Perl)
394: \ek{name} reference by name (.NET)
395: (?P=name) reference by name (Python)
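The Python-style named group and named back reference from the tables above can be sketched like this. The example runs under Python's re module, where this syntax originated; using it in place of PCRE is an assumption made for illustration. The pattern and subject are invented.

```python
import re

# Find a lower case word that is immediately repeated after one space.
# (?P<word>...) names the group; (?P=word) must match the same text again.
doubled = re.compile("(?P<word>[a-z]+) (?P=word)")

match = doubled.search("it was was late")

print(match.group("word"))  # was
print(match.group(0))       # was was
```

A reference by name avoids the ambiguity noted for references by plain number.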
396: .
397: .
398: .SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
399: .rs
400: .sp
401: (?R) recurse whole pattern
402: (?n) call subpattern by absolute number
403: (?+n) call subpattern by relative number
404: (?-n) call subpattern by relative number
405: (?&name) call subpattern by name (Perl)
406: (?P>name) call subpattern by name (Python)
407: \eg<name> call subpattern by name (Oniguruma)
408: \eg'name' call subpattern by name (Oniguruma)
409: \eg<n> call subpattern by absolute number (Oniguruma)
410: \eg'n' call subpattern by absolute number (Oniguruma)
411: \eg<+n> call subpattern by relative number (PCRE extension)
412: \eg'+n' call subpattern by relative number (PCRE extension)
413: \eg<-n> call subpattern by relative number (PCRE extension)
414: \eg'-n' call subpattern by relative number (PCRE extension)
415: .
416: .
417: .SH "CONDITIONAL PATTERNS"
418: .rs
419: .sp
420: (?(condition)yes-pattern)
421: (?(condition)yes-pattern|no-pattern)
422: .sp
423: (?(n)... absolute reference condition
424: (?(+n)... relative reference condition
425: (?(-n)... relative reference condition
426: (?(<name>)... named reference condition (Perl)
427: (?('name')... named reference condition (Perl)
428: (?(name)... named reference condition (PCRE)
429: (?(R)... overall recursion condition
430: (?(Rn)... specific group recursion condition
431: (?(R&name)... specific recursion condition
432: (?(DEFINE)... define subpattern for reference
433: (?(assert)... assertion condition
434: .
435: .
436: .SH "BACKTRACKING CONTROL"
437: .rs
438: .sp
439: The following act as soon as they are reached:
440: .sp
441: (*ACCEPT) force successful match
442: (*FAIL) force backtrack; synonym (*F)
443: (*MARK:NAME) set name to be passed back; synonym (*:NAME)
444: .sp
445: The following act only when a subsequent match failure causes a backtrack to
446: reach them. They all force a match failure, but they differ in what happens
447: afterwards. Those that advance the start-of-match point do so only if the
448: pattern is not anchored.
449: .sp
450: (*COMMIT) overall failure, no advance of starting point
451: (*PRUNE) advance to next starting character
452: (*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
453: (*SKIP) advance to current matching position
454: (*SKIP:NAME) advance to position corresponding to an earlier
455: (*MARK:NAME); if not found, the (*SKIP) is ignored
456: (*THEN) local failure, backtrack to next alternation
457: (*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
458: .
459: .
460: .SH "NEWLINE CONVENTIONS"
461: .rs
462: .sp
463: These are recognized only at the very start of the pattern or after a
464: (*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
465: .sp
466: (*CR) carriage return only
467: (*LF) linefeed only
468: (*CRLF) carriage return followed by linefeed
469: (*ANYCRLF) all three of the above
470: (*ANY) any Unicode newline sequence
471: .
472: .
473: .SH "WHAT \eR MATCHES"
474: .rs
475: .sp
476: These are recognized only at the very start of the pattern or after a
477: (*...) option that sets the newline convention or a UTF or UCP mode.
478: .sp
479: (*BSR_ANYCRLF) CR, LF, or CRLF
480: (*BSR_UNICODE) any Unicode newline sequence
481: .
482: .
483: .SH "CALLOUTS"
484: .rs
485: .sp
486: (?C) callout
487: (?Cn) callout with data n
488: .
489: .
490: .SH "SEE ALSO"
491: .rs
492: .sp
493: \fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
494: \fBpcrematching\fP(3), \fBpcre\fP(3).
495: .
496: .
497: .SH AUTHOR
498: .rs
499: .sp
500: .nf
501: Philip Hazel
502: University Computing Service
503: Cambridge CB2 3QH, England.
504: .fi
505: .
506: .
507: .SH REVISION
508: .rs
509: .sp
510: .nf
511: Last updated: 12 November 2013
512: Copyright (c) 1997-2013 University of Cambridge.
513: .fi