Annotation of embedaddon/pcre/doc/pcresyntax.3, revision 1.1.1.1
1: .TH PCRESYNTAX 3
2: .SH NAME
3: PCRE - Perl-compatible regular expressions
4: .SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
5: .rs
6: .sp
7: The full syntax and semantics of the regular expressions that are supported by
8: PCRE are described in the
9: .\" HREF
10: \fBpcrepattern\fP
11: .\"
12: documentation. This document contains just a quick-reference summary of the
13: syntax.
14: .
15: .
16: .SH "QUOTING"
17: .rs
18: .sp
19: \ex where x is non-alphanumeric is a literal x
20: \eQ...\eE treat enclosed characters as literal
21: .
22: .
23: .SH "CHARACTERS"
24: .rs
25: .sp
26: \ea alarm, that is, the BEL character (hex 07)
27: \ecx "control-x", where x is any ASCII character
28: \ee escape (hex 1B)
29: \ef formfeed (hex 0C)
30: \en newline (hex 0A)
31: \er carriage return (hex 0D)
32: \et tab (hex 09)
33: \eddd character with octal code ddd, or backreference
34: \exhh character with hex code hh
35: \ex{hhh..} character with hex code hhh..
36: .
37: .
38: .SH "CHARACTER TYPES"
39: .rs
40: .sp
41:   .          any character except newline;
42:                in dotall mode, any character whatsoever
43:   \eC         one byte, even in UTF-8 mode (best avoided)
44:   \ed         a decimal digit
45:   \eD         a character that is not a decimal digit
46:   \eh         a horizontal whitespace character
47:   \eH         a character that is not a horizontal whitespace character
48:   \eN         a character that is not a newline
49:   \ep{\fIxx\fP}   a character with the \fIxx\fP property
50:   \eP{\fIxx\fP}   a character without the \fIxx\fP property
51:   \eR         a newline sequence
52:   \es         a whitespace character
53:   \eS         a character that is not a whitespace character
54:   \ev         a vertical whitespace character
55:   \eV         a character that is not a vertical whitespace character
56:   \ew         a "word" character
57:   \eW         a "non-word" character
58:   \eX         an extended Unicode sequence
59: .sp
60: In PCRE, by default, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII
61: characters, even in UTF-8 mode. However, this can be changed by setting the
62: PCRE_UCP option.
63: .
64: .
65: .SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
66: .rs
67: .sp
68: C Other
69: Cc Control
70: Cf Format
71: Cn Unassigned
72: Co Private use
73: Cs Surrogate
74: .sp
75: L Letter
76: Ll Lower case letter
77: Lm Modifier letter
78: Lo Other letter
79: Lt Title case letter
80: Lu Upper case letter
81: L& Ll, Lu, or Lt
82: .sp
83: M Mark
84: Mc Spacing mark
85: Me Enclosing mark
86: Mn Non-spacing mark
87: .sp
88: N Number
89: Nd Decimal number
90: Nl Letter number
91: No Other number
92: .sp
93: P Punctuation
94: Pc Connector punctuation
95: Pd Dash punctuation
96: Pe Close punctuation
97: Pf Final punctuation
98: Pi Initial punctuation
99: Po Other punctuation
100: Ps Open punctuation
101: .sp
102: S Symbol
103: Sc Currency symbol
104: Sk Modifier symbol
105: Sm Mathematical symbol
106: So Other symbol
107: .sp
108: Z Separator
109: Zl Line separator
110: Zp Paragraph separator
111: Zs Space separator
112: .
113: .
114: .SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
115: .rs
116: .sp
117: Xan Alphanumeric: union of properties L and N
118: Xps POSIX space: property Z or tab, NL, VT, FF, CR
119: Xsp Perl space: property Z or tab, NL, FF, CR
120: Xwd Perl word: property Xan or underscore
121: .
122: .
123: .SH "SCRIPT NAMES FOR \ep AND \eP"
124: .rs
125: .sp
126: Arabic,
127: Armenian,
128: Avestan,
129: Balinese,
130: Bamum,
131: Bengali,
132: Bopomofo,
133: Braille,
134: Buginese,
135: Buhid,
136: Canadian_Aboriginal,
137: Carian,
138: Cham,
139: Cherokee,
140: Common,
141: Coptic,
142: Cuneiform,
143: Cypriot,
144: Cyrillic,
145: Deseret,
146: Devanagari,
147: Egyptian_Hieroglyphs,
148: Ethiopic,
149: Georgian,
150: Glagolitic,
151: Gothic,
152: Greek,
153: Gujarati,
154: Gurmukhi,
155: Han,
156: Hangul,
157: Hanunoo,
158: Hebrew,
159: Hiragana,
160: Imperial_Aramaic,
161: Inherited,
162: Inscriptional_Pahlavi,
163: Inscriptional_Parthian,
164: Javanese,
165: Kaithi,
166: Kannada,
167: Katakana,
168: Kayah_Li,
169: Kharoshthi,
170: Khmer,
171: Lao,
172: Latin,
173: Lepcha,
174: Limbu,
175: Linear_B,
176: Lisu,
177: Lycian,
178: Lydian,
179: Malayalam,
180: Meetei_Mayek,
181: Mongolian,
182: Myanmar,
183: New_Tai_Lue,
184: Nko,
185: Ogham,
186: Old_Italic,
187: Old_Persian,
188: Old_South_Arabian,
189: Old_Turkic,
190: Ol_Chiki,
191: Oriya,
192: Osmanya,
193: Phags_Pa,
194: Phoenician,
195: Rejang,
196: Runic,
197: Samaritan,
198: Saurashtra,
199: Shavian,
200: Sinhala,
201: Sundanese,
202: Syloti_Nagri,
203: Syriac,
204: Tagalog,
205: Tagbanwa,
206: Tai_Le,
207: Tai_Tham,
208: Tai_Viet,
209: Tamil,
210: Telugu,
211: Thaana,
212: Thai,
213: Tibetan,
214: Tifinagh,
215: Ugaritic,
216: Vai,
217: Yi.
218: .
219: .
220: .SH "CHARACTER CLASSES"
221: .rs
222: .sp
223: [...] positive character class
224: [^...] negative character class
225: [x-y] range (can be used for hex characters)
226: [[:xxx:]] positive POSIX named set
227: [[:^xxx:]] negative POSIX named set
228: .sp
229: alnum alphanumeric
230: alpha alphabetic
231: ascii 0-127
232: blank space or tab
233: cntrl control character
234: digit decimal digit
235: graph printing, excluding space
236: lower lower case letter
237: print printing, including space
238: punct printing, excluding alphanumeric
239: space whitespace
240: upper upper case letter
241: word same as \ew
242: xdigit hexadecimal digit
243: .sp
244: In PCRE, POSIX character set names recognize only ASCII characters by default,
245: but some of them use Unicode properties if PCRE_UCP is set. You can use
246: \eQ...\eE inside a character class.
247: .
248: .
249: .SH "QUANTIFIERS"
250: .rs
251: .sp
252: ? 0 or 1, greedy
253: ?+ 0 or 1, possessive
254: ?? 0 or 1, lazy
255: * 0 or more, greedy
256: *+ 0 or more, possessive
257: *? 0 or more, lazy
258: + 1 or more, greedy
259: ++ 1 or more, possessive
260: +? 1 or more, lazy
261: {n} exactly n
262: {n,m} at least n, no more than m, greedy
263: {n,m}+ at least n, no more than m, possessive
264: {n,m}? at least n, no more than m, lazy
265: {n,} n or more, greedy
266: {n,}+ n or more, possessive
267: {n,}? n or more, lazy
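The difference between the greedy and lazy forms above can be sketched as follows. This is a minimal illustration that assumes Python's standard re module as a stand-in; it is not PCRE itself, and only the greedy and lazy forms are exercised. The subject strings and variable names are invented for the example.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as possible, then backs off to the last ">".
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as possible, stopping at the first ">".
lazy = re.match(r"<.+?>", subject).group()

# Bounded repetition: {2,3} prefers three repeats, {2,3}? prefers two.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()

print(greedy, lazy, bounded_greedy, bounded_lazy)
```

The possessive forms are left out of this sketch because not every version of the stand-in engine accepts them.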
268: .
269: .
270: .SH "ANCHORS AND SIMPLE ASSERTIONS"
271: .rs
272: .sp
273:   \eb          word boundary
274:   \eB          not a word boundary
275:   ^           start of subject
276:                also after internal newline in multiline mode
277:   \eA          start of subject
278:   $           end of subject
279:                also before newline at end of subject
280:                also before internal newline in multiline mode
281:   \eZ          end of subject
282:                also before newline at end of subject
283:   \ez          end of subject
284:   \eG          first matching position in subject
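The effect of multiline mode on ^ and $ can be sketched as follows. This is a minimal illustration that assumes Python's standard re module as a stand-in for the anchors it shares with PCRE. The subject strings and variable names are invented for the example.

```python
import re

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", subject)
# In multiline mode, ^ also matches after each internal newline.
starts_multiline = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches at the very end and also before a newline that ends the subject.
end_before_final_newline = re.search(r"two$", subject) is not None
# Before an internal newline, $ matches only in multiline mode.
end_internal_default = re.search(r"one$", subject) is not None
end_internal_multiline = re.search(r"one$", subject, re.MULTILINE) is not None

# A word boundary on both sides rejects the "cat" inside "concat".
whole_words = re.findall(r"\bcat\b", "cat concat cat")

print(starts_default, starts_multiline, whole_words)
```

\eZ and \ez are left out of this sketch because Python's \eZ matches only at the very end of the string, which corresponds to \ez in the list above.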
285: .
286: .
287: .SH "MATCH POINT RESET"
288: .rs
289: .sp
290: \eK reset start of match
291: .
292: .
293: .SH "ALTERNATION"
294: .rs
295: .sp
296: expr|expr|expr...
297: .
298: .
299: .SH "CAPTURING"
300: .rs
301: .sp
302:   (...)           capturing group
303:   (?<name>...)    named capturing group (Perl)
304:   (?'name'...)    named capturing group (Perl)
305:   (?P<name>...)   named capturing group (Python)
306:   (?:...)         non-capturing group
307:   (?|...)         non-capturing group; reset group numbers for
308:                    capturing groups in each alternative
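How named, numbered, and non-capturing groups interact can be sketched as follows. This is a minimal illustration that assumes Python's standard re module as a stand-in, so only the Python-style named group and the plain non-capturing group are exercised. The date pattern and variable names are invented for the example.

```python
import re

# Two named groups, then an optional non-capturing wrapper around a
# plain numbered group for the day.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})(?:-(\d{2}))?")

full = pattern.fullmatch("2010-11-21")
year = full.group("year")   # access by name
month = full.group(2)       # named groups also receive numbers
day = full.group(3)         # the (?:...) wrapper does not consume a number

# When the optional part does not take part in the match, its group is unset.
partial = pattern.fullmatch("2010-11")
missing_day = partial.group(3)

group_count = pattern.groups
print(year, month, day, missing_day, group_count)
```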
309: .
310: .
311: .SH "ATOMIC GROUPS"
312: .rs
313: .sp
314: (?>...) atomic, non-capturing group
315: .
316: .
317: .
318: .
319: .SH "COMMENT"
320: .rs
321: .sp
322: (?#....) comment (not nestable)
323: .
324: .
325: .SH "OPTION SETTING"
326: .rs
327: .sp
328: (?i) caseless
329: (?J) allow duplicate names
330: (?m) multiline
331: (?s) single line (dotall)
332: (?U) default ungreedy (lazy)
333: (?x) extended (ignore white space)
334: (?-...) unset option(s)
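The inline options above can be sketched as follows. This is a minimal illustration that assumes Python's standard re module as a stand-in; it exercises only (?i), (?m), (?s), and (?x), each placed at the start of the pattern. The subject strings and variable names are invented for the example.

```python
import re

# (?i): caseless matching.
caseless = re.fullmatch(r"(?i)pcre", "PCRE") is not None

# (?s): dot also matches a newline; without it, the match fails.
dot_default = re.fullmatch(r"a.b", "a\nb") is not None
dotall = re.fullmatch(r"(?s)a.b", "a\nb") is not None

# (?x): white space in the pattern is ignored and # starts a comment.
extended = re.fullmatch(r"(?x) \d+ \s* kg  # a weight", "12 kg") is not None

# (?m): ^ also matches after an internal newline.
multiline = re.findall(r"(?m)^\w+", "one\ntwo")

print(caseless, dot_default, dotall, extended, multiline)
```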
335: .sp
336: The following are recognized only at the start of a pattern or after one of the
337: newline-setting options with similar syntax:
338: .sp
339: (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
340: (*UTF8) set UTF-8 mode (PCRE_UTF8)
341: (*UCP) set PCRE_UCP (use Unicode properties for \ed etc)
342: .
343: .
344: .SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
345: .rs
346: .sp
347: (?=...) positive look ahead
348: (?!...) negative look ahead
349: (?<=...) positive look behind
350: (?<!...) negative look behind
351: .sp
352: Each top-level branch of a look behind must be of a fixed length.
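The four assertions can be sketched as follows. This is a minimal illustration that assumes Python's standard re module as a stand-in for the assertion syntax it shares with PCRE. Both look behinds used here are a single character long, so they meet the fixed-length rule. The subject strings and variable names are invented for the example.

```python
import re

prices = "10 USD, 20 EUR"

# Positive look ahead: digits that are followed by " USD".
followed_by_usd = re.findall(r"\d+(?= USD)", prices)
# Negative look ahead: whole numbers that are not followed by " USD".
not_followed_by_usd = re.findall(r"\d+\b(?! USD)", prices)

text = "cost $30 or 40"

# Positive look behind: digits immediately preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)
# Negative look behind: whole numbers not preceded by a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)

print(followed_by_usd, not_followed_by_usd, after_dollar, not_after_dollar)
```

In every case the asserted text is tested but not included in the match.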
353: .
354: .
355: .SH "BACKREFERENCES"
356: .rs
357: .sp
358: \en reference by number (can be ambiguous)
359: \egn reference by number
360: \eg{n} reference by number
361: \eg{-n} relative reference by number
362: \ek<name> reference by name (Perl)
363: \ek'name' reference by name (Perl)
364: \eg{name} reference by name (Perl)
365: \ek{name} reference by name (.NET)
366: (?P=name) reference by name (Python)
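Backreferences by number and by name can be sketched as follows. This is a minimal illustration that assumes Python's standard re module as a stand-in, so only the plain numeric form and the Python-style named form are exercised. The subject strings and variable names are invented for the example.

```python
import re

# By number: \1 must repeat exactly what group 1 captured.
doubled_word = re.search(r"\b(\w+) \1\b", "this is is a test").group()

# By name: the closing quote must be the same character as the opening one.
quoted = re.compile(r"""(?P<q>['"])(.*?)(?P=q)""")
contents = [m.group(2) for m in quoted.finditer("say \"hi\" and 'bye'")]

print(doubled_word, contents)
```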
367: .
368: .
369: .SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
370: .rs
371: .sp
372: (?R) recurse whole pattern
373: (?n) call subpattern by absolute number
374: (?+n) call subpattern by relative number
375: (?-n) call subpattern by relative number
376: (?&name) call subpattern by name (Perl)
377: (?P>name) call subpattern by name (Python)
378: \eg<name> call subpattern by name (Oniguruma)
379: \eg'name' call subpattern by name (Oniguruma)
380: \eg<n> call subpattern by absolute number (Oniguruma)
381: \eg'n' call subpattern by absolute number (Oniguruma)
382: \eg<+n> call subpattern by relative number (PCRE extension)
383: \eg'+n' call subpattern by relative number (PCRE extension)
384: \eg<-n> call subpattern by relative number (PCRE extension)
385: \eg'-n' call subpattern by relative number (PCRE extension)
386: .
387: .
388: .SH "CONDITIONAL PATTERNS"
389: .rs
390: .sp
391: (?(condition)yes-pattern)
392: (?(condition)yes-pattern|no-pattern)
393: .sp
394: (?(n)... absolute reference condition
395: (?(+n)... relative reference condition
396: (?(-n)... relative reference condition
397: (?(<name>)... named reference condition (Perl)
398: (?('name')... named reference condition (Perl)
399: (?(name)... named reference condition (PCRE)
400: (?(R)... overall recursion condition
401: (?(Rn)... specific group recursion condition
402: (?(R&name)... specific recursion condition
403: (?(DEFINE)... define subpattern for reference
404: (?(assert)... assertion condition
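An absolute reference condition can be sketched as follows. This is a minimal illustration that assumes Python's standard re module as a stand-in; it exercises only the (?(n)yes-pattern) form. The pattern requires a closing parenthesis only if an opening one was matched. The subject strings and variable names are invented for the example.

```python
import re

# Group 1 optionally matches "("; the condition (?(1)...) then demands ")"
# only when group 1 took part in the match.
pattern = re.compile(r"^(\()?\d+(?(1)\))$")

subjects = ["123", "(123)", "(123", "123)"]
results = {s: pattern.match(s) is not None for s in subjects}

print(results)
```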
405: .
406: .
407: .SH "BACKTRACKING CONTROL"
408: .rs
409: .sp
410: The following act as soon as they are reached:
411: .sp
412: (*ACCEPT) force successful match
413: (*FAIL) force backtrack; synonym (*F)
414: .sp
415: The following act only when a subsequent match failure causes a backtrack to
416: reach them. They all force a match failure, but they differ in what happens
417: afterwards. Those that advance the start-of-match point do so only if the
418: pattern is not anchored.
419: .sp
420: (*COMMIT) overall failure, no advance of starting point
421: (*PRUNE) advance to next starting character
422: (*SKIP) advance start to current matching position
423: (*THEN) local failure, backtrack to next alternation
424: .
425: .
426: .SH "NEWLINE CONVENTIONS"
427: .rs
428: .sp
429: These are recognized only at the very start of the pattern or after a
430: (*BSR_...) or (*UTF8) or (*UCP) option.
431: .sp
432: (*CR) carriage return only
433: (*LF) linefeed only
434: (*CRLF) carriage return followed by linefeed
435: (*ANYCRLF) all three of the above
436: (*ANY) any Unicode newline sequence
437: .
438: .
439: .SH "WHAT \eR MATCHES"
440: .rs
441: .sp
442: These are recognized only at the very start of the pattern or after a
443: (*...) option that sets the newline convention or UTF-8 or UCP mode.
444: .sp
445: (*BSR_ANYCRLF) CR, LF, or CRLF
446: (*BSR_UNICODE) any Unicode newline sequence
447: .
448: .
449: .SH "CALLOUTS"
450: .rs
451: .sp
452: (?C) callout
453: (?Cn) callout with data n
454: .
455: .
456: .SH "SEE ALSO"
457: .rs
458: .sp
459: \fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
460: \fBpcrematching\fP(3), \fBpcre\fP(3).
461: .
462: .
463: .SH AUTHOR
464: .rs
465: .sp
466: .nf
467: Philip Hazel
468: University Computing Service
469: Cambridge CB2 3QH, England.
470: .fi
471: .
472: .
473: .SH REVISION
474: .rs
475: .sp
476: .nf
477: Last updated: 21 November 2010
478: Copyright (c) 1997-2010 University of Cambridge.
479: .fi