Annotation of embedaddon/php/ext/ereg/regex/regex.3, revision 1.1.1.1
1: .TH REGEX 3 "17 May 1993"
2: .BY "Henry Spencer"
3: .de ZR
4: .\" one other place knows this name: the SEE ALSO section
5: .IR regex (7) \\$1
6: ..
7: .SH NAME
8: regcomp, regexec, regerror, regfree \- regular-expression library
9: .SH SYNOPSIS
10: .ft B
11: .\".na
12: #include <sys/types.h>
13: .br
14: #include <regex.h>
15: .HP 10
16: int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
17: .HP
18: int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
19: size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
20: .HP
21: size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
22: char\ *errbuf, size_t\ errbuf_size);
23: .HP
24: void\ regfree(regex_t\ *preg);
25: .\".ad
26: .ft
27: .SH DESCRIPTION
28: These routines implement POSIX 1003.2 regular expressions (``RE''s);
29: see
30: .ZR .
31: .I Regcomp
32: compiles an RE written as a string into an internal form,
33: .I regexec
34: matches that internal form against a string and reports results,
35: .I regerror
36: transforms error codes from either into human-readable messages,
37: and
38: .I regfree
39: frees any dynamically-allocated storage used by the internal form
40: of an RE.
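.PP
As a minimal sketch of how the four routines fit together,
the fragment below compiles an RE, matches one line against it,
turns any error code into a message, and frees the compiled form.
The helper name
.I line_matches
and its return convention are hypothetical, chosen only for illustration;
the sketch also assumes that a failed
.I regcomp
leaves nothing for the caller to free.

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * line_matches() is a hypothetical helper, not part of the library.
 * Returns 1 if `line` matches `pattern`, 0 if it does not, and -1 if
 * the pattern fails to compile or regexec() reports an error.  In the
 * -1 case a message is left in errbuf.
 */
static int
line_matches(const char *pattern, const char *line,
    char *errbuf, size_t errbuf_size)
{
    regex_t re;
    int rc;

    /* Compile as an extended RE; only success or failure is wanted. */
    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        regerror(rc, &re, errbuf, errbuf_size);
        return -1;      /* assumed: nothing to free after a failure */
    }

    /* nmatch is 0, so the pmatch argument is ignored. */
    rc = regexec(&re, line, 0, NULL, 0);
    if (rc != 0 && rc != REG_NOMATCH)
        regerror(rc, &re, errbuf, errbuf_size);

    regfree(&re);       /* release the compiled form */

    if (rc == 0)
        return 1;
    return (rc == REG_NOMATCH) ? 0 : -1;
}
```

A caller would pass a buffer for the message,
for example a 128-byte array,
and inspect it only when the result is negative.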
41: .PP
42: The header
43: .I <regex.h>
44: declares two structure types,
45: .I regex_t
46: and
47: .IR regmatch_t ,
48: the former for compiled internal forms and the latter for match reporting.
49: It also declares the four functions,
50: a type
51: .IR regoff_t ,
52: and a number of constants with names starting with ``REG_''.
53: .PP
54: .I Regcomp
55: compiles the regular expression contained in the
56: .I pattern
57: string,
58: subject to the flags in
59: .IR cflags ,
60: and places the results in the
61: .I regex_t
62: structure pointed to by
63: .IR preg .
64: .I Cflags
65: is the bitwise OR of zero or more of the following flags:
66: .IP REG_EXTENDED \w'REG_EXTENDED'u+2n
67: Compile modern (``extended'') REs,
68: rather than the obsolete (``basic'') REs that
69: are the default.
70: .IP REG_BASIC
71: This is a synonym for 0,
72: provided as a counterpart to REG_EXTENDED to improve readability.
73: .IP REG_NOSPEC
74: Compile with recognition of all special characters turned off.
75: All characters are thus considered ordinary,
76: so the ``RE'' is a literal string.
77: This is an extension,
78: compatible with but not specified by POSIX 1003.2,
79: and should be used with
80: caution in software intended to be portable to other systems.
81: REG_EXTENDED and REG_NOSPEC may not be used
82: in the same call to
83: .IR regcomp .
84: .IP REG_ICASE
85: Compile for matching that ignores upper/lower case distinctions.
86: See
87: .ZR .
88: .IP REG_NOSUB
89: Compile for matching that need only report success or failure,
90: not what was matched.
91: .IP REG_NEWLINE
92: Compile for newline-sensitive matching.
93: By default, newline is a completely ordinary character with no special
94: meaning in either REs or strings.
95: With this flag,
96: `[^' bracket expressions and `.' never match newline,
97: a `^' anchor matches the null string after any newline in the string
98: in addition to its normal function,
99: and the `$' anchor matches the null string before any newline in the
100: string in addition to its normal function.
101: .IP REG_PEND
102: The regular expression ends,
103: not at the first NUL,
104: but just before the character pointed to by the
105: .I re_endp
106: member of the structure pointed to by
107: .IR preg .
108: The
109: .I re_endp
110: member is of type
111: .IR const\ char\ * .
112: This flag permits inclusion of NULs in the RE;
113: they are considered ordinary characters.
114: This is an extension,
115: compatible with but not specified by POSIX 1003.2,
116: and should be used with
117: caution in software intended to be portable to other systems.
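.PP
The effect of the portable compilation flags can be sketched as follows.
The helper
.I match_with_flags
is hypothetical;
it uses only flags specified by POSIX 1003.2
and deliberately avoids the REG_NOSPEC and REG_PEND extensions,
which may not exist on other systems.

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * match_with_flags() is a hypothetical helper.  It compiles `pattern`
 * with the given cflags (REG_NOSUB is always added, since only
 * success or failure is reported) and returns 1 on a match, 0 on
 * REG_NOMATCH, and -1 on a compile error or any other failure.
 */
static int
match_with_flags(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return (rc == REG_NOMATCH) ? 0 : -1;
}

/*
 * Examples of what the flags change:
 *
 *   match_with_flags("hello", "Say HELLO", REG_EXTENDED)              is 0
 *   match_with_flags("hello", "Say HELLO", REG_EXTENDED | REG_ICASE)  is 1
 *
 *   match_with_flags("^second", "first\nsecond", REG_EXTENDED)        is 0
 *   match_with_flags("^second", "first\nsecond",
 *       REG_EXTENDED | REG_NEWLINE)                                   is 1
 *
 *   match_with_flags("a.b", "a\nb", REG_EXTENDED)                     is 1
 *   match_with_flags("a.b", "a\nb", REG_EXTENDED | REG_NEWLINE)       is 0
 */
```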
118: .PP
119: When successful,
120: .I regcomp
121: returns 0 and fills in the structure pointed to by
122: .IR preg .
123: One member of that structure
124: (other than
125: .IR re_endp )
126: is publicized:
127: .IR re_nsub ,
128: of type
129: .IR size_t ,
130: contains the number of parenthesized subexpressions within the RE
131: (except that the value of this member is undefined if the
132: REG_NOSUB flag was used).
133: If
134: .I regcomp
135: fails, it returns a non-zero error code;
136: see DIAGNOSTICS.
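.PP
One use of
.I re_nsub
is to size the match array before calling
.IR regexec :
an array of
.I re_nsub
+ 1 entries holds the whole match plus every subexpression.
The sketch below only reads the member;
the helper name
.I count_subexpressions
is hypothetical.

```c
#include <sys/types.h>
#include <regex.h>

/*
 * count_subexpressions() is a hypothetical helper.  It returns the
 * number of parenthesized subexpressions in an extended RE, or -1 if
 * the RE does not compile.  REG_NOSUB must not be used here, because
 * re_nsub is undefined when that flag is given.
 */
static long
count_subexpressions(const char *pattern)
{
    regex_t re;
    long n;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    n = (long)re.re_nsub;   /* read before the structure is freed */
    regfree(&re);
    return n;
}
```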
137: .PP
138: .I Regexec
139: matches the compiled RE pointed to by
140: .I preg
141: against the
142: .IR string ,
143: subject to the flags in
144: .IR eflags ,
145: and reports results using
146: .IR nmatch ,
147: .IR pmatch ,
148: and the returned value.
149: The RE must have been compiled by a previous invocation of
150: .IR regcomp .
151: The compiled form is not altered during execution of
152: .IR regexec ,
153: so a single compiled RE can be used simultaneously by multiple threads.
154: .PP
155: By default,
156: the NUL-terminated string pointed to by
157: .I string
158: is considered to be the text of an entire line, minus any terminating
159: newline.
160: The
161: .I eflags
162: argument is the bitwise OR of zero or more of the following flags:
163: .IP REG_NOTBOL \w'REG_STARTEND'u+2n
164: The first character of
165: the string
166: is not the beginning of a line, so the `^' anchor should not match before it.
167: This does not affect the behavior of newlines under REG_NEWLINE.
168: .IP REG_NOTEOL
169: The NUL terminating
170: the string
171: does not end a line, so the `$' anchor should not match before it.
172: This does not affect the behavior of newlines under REG_NEWLINE.
173: .IP REG_STARTEND
174: The string is considered to start at
175: \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
176: and to have a terminating NUL located at
177: \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
178: (there need not actually be a NUL at that location),
179: regardless of the value of
180: .IR nmatch .
181: See below for the definition of
182: .IR pmatch
183: and
184: .IR nmatch .
185: This is an extension,
186: compatible with but not specified by POSIX 1003.2,
187: and should be used with
188: caution in software intended to be portable to other systems.
189: Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
190: REG_STARTEND affects only the location of the string,
191: not how it is matched.
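.PP
A common use of REG_NOTBOL is resuming a search partway through a line.
The sketch below counts successive matches by advancing the string pointer
past each match and setting REG_NOTBOL on every call after the first,
so that a `^' anchor is honored only at the true start of the line.
The helper name
.I count_matches
is hypothetical,
and the sketch assumes the RE was compiled without REG_NOSUB
and without REG_NEWLINE.

```c
#include <sys/types.h>
#include <regex.h>

/*
 * count_matches() is a hypothetical helper.  It returns the number of
 * non-overlapping matches of the compiled RE in `line`.
 */
static int
count_matches(const regex_t *preg, const char *line)
{
    const char *p = line;
    regmatch_t m;
    int eflags = 0;
    int count = 0;

    /* Offsets in m are relative to p, the string given to this call. */
    while (regexec(preg, p, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo == m.rm_so) {
            /* Empty match: step over one character to make progress. */
            if (p[m.rm_eo] == '\0')
                break;
            p += m.rm_eo + 1;
        } else {
            p += m.rm_eo;
        }
        /* p no longer points at the beginning of the line. */
        eflags = REG_NOTBOL;
    }
    return count;
}
```

With the RE `ab' and the line `ababab' this counts three matches;
with `^ab' it counts one,
because the later calls are told that their string does not begin a line.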
192: .PP
193: See
194: .ZR
195: for a discussion of what is matched in situations where an RE or a
196: portion thereof could match any of several substrings of
197: .IR string .
198: .PP
199: Normally,
200: .I regexec
201: returns 0 for success and the non-zero code REG_NOMATCH for failure.
202: Other non-zero error codes may be returned in exceptional situations;
203: see DIAGNOSTICS.
204: .PP
205: If REG_NOSUB was specified in the compilation of the RE,
206: or if
207: .I nmatch
208: is 0,
209: .I regexec
210: ignores the
211: .I pmatch
212: argument (but see below for the case where REG_STARTEND is specified).
213: Otherwise,
214: .I pmatch
215: points to an array of
216: .I nmatch
217: structures of type
218: .IR regmatch_t .
219: Such a structure has at least the members
220: .I rm_so
221: and
222: .IR rm_eo ,
223: both of type
224: .I regoff_t
225: (a signed arithmetic type at least as large as an
226: .I off_t
227: and a
228: .IR ssize_t ),
229: containing respectively the offset of the first character of a substring
230: and the offset of the first character after the end of the substring.
231: Offsets are measured from the beginning of the
232: .I string
233: argument given to
234: .IR regexec .
235: An empty substring is denoted by equal offsets,
236: both indicating the character following the empty substring.
237: .PP
238: The 0th member of the
239: .I pmatch
240: array is filled in to indicate what substring of
241: .I string
242: was matched by the entire RE.
243: Remaining members report what substring was matched by parenthesized
244: subexpressions within the RE;
245: member
246: .I i
247: reports subexpression
248: .IR i ,
249: with subexpressions counted (starting at 1) by the order of their opening
250: parentheses in the RE, left to right.
251: Unused entries in the array\(emcorresponding either to subexpressions that
252: did not participate in the match at all, or to subexpressions that do not
253: exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
254: .I rm_so
255: and
256: .I rm_eo
257: set to \-1.
258: If a subexpression participated in the match several times,
259: the reported substring is the last one it matched.
260: (Note, in particular, that when the RE `(b*)+' matches `bbb',
261: the parenthesized subexpression matches each of the three `b's and then
262: an infinite number of empty strings following the last `b',
263: so the reported substring is one of the empties.)
264: .PP
265: If REG_STARTEND is specified,
266: .I pmatch
267: must point to at least one
268: .I regmatch_t
269: (even if
270: .I nmatch
271: is 0 or REG_NOSUB was specified),
272: to hold the input offsets for REG_STARTEND.
273: Use for output is still entirely controlled by
274: .IR nmatch ;
275: if
276: .I nmatch
277: is 0 or REG_NOSUB was specified,
278: the value of
279: .IR pmatch [0]
280: will not be changed by a successful
281: .IR regexec .
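.PP
A hedged sketch of REG_STARTEND follows.
Because the flag is an extension,
the code is guarded with a preprocessor test
and reports failure where the flag is not defined.
The helper name
.I search_range
and its \-1 return value are hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * search_range() is a hypothetical helper.  It matches the compiled
 * RE against the bytes buf[start] .. buf[end - 1]; no NUL is needed
 * at buf[end].  On success *m describes the match, with offsets
 * measured from buf.  Returns the regexec() result, or -1 where
 * REG_STARTEND is not available.
 */
static int
search_range(const regex_t *preg, const char *buf,
    size_t start, size_t end, regmatch_t *m)
{
#ifdef REG_STARTEND
    m->rm_so = (regoff_t)start;     /* input: start of the string */
    m->rm_eo = (regoff_t)end;       /* input: position of the notional NUL */
    return regexec(preg, buf, 1, m, REG_STARTEND);
#else
    (void)preg; (void)buf; (void)start; (void)end; (void)m;
    return -1;                      /* extension not provided here */
#endif
}
```

Where the flag is supported,
searching `aabbbcc' for `b+' between offsets 3 and 5
should report a match from 3 to 5,
since the RE cannot see the `b' at offset 2 or anything from offset 5 on.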
282: .PP
283: .I Regerror
284: maps a non-zero
285: .I errcode
286: from either
287: .I regcomp
288: or
289: .I regexec
290: to a human-readable, printable message.
291: If
292: .I preg
293: is non-NULL,
294: the error code should have arisen from use of
295: the
296: .I regex_t
297: pointed to by
298: .IR preg ,
299: and if the error code came from
300: .IR regcomp ,
301: it should have been the result from the most recent
302: .I regcomp
303: using that
304: .IR regex_t .
305: .RI ( Regerror
306: may be able to supply a more detailed message using information
307: from the
308: .IR regex_t .)
309: .I Regerror
310: places the NUL-terminated message into the buffer pointed to by
311: .IR errbuf ,
312: limiting the length (including the NUL) to at most
313: .I errbuf_size
314: bytes.
315: If the whole message won't fit,
316: as much of it as will fit before the terminating NUL is supplied.
317: In any case,
318: the returned value is the size of buffer needed to hold the whole
319: message (including terminating NUL).
320: If
321: .I errbuf_size
322: is 0,
323: .I errbuf
324: is ignored but the return value is still correct.
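.PP
Because the return value is always the size needed for the whole message,
a caller can ask for the size first and then allocate exactly that much.
The sketch below does so;
the helper name
.I regex_error_string
is hypothetical.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

/*
 * regex_error_string() is a hypothetical helper.  It returns a
 * malloc()ed copy of the message for errcode, or NULL if memory is
 * exhausted.  The caller frees the result.
 */
static char *
regex_error_string(int errcode, const regex_t *preg)
{
    size_t need;
    char *msg;

    /* With errbuf_size 0, errbuf is ignored and only the size returns. */
    need = regerror(errcode, preg, NULL, 0);

    msg = malloc(need);
    if (msg != NULL)
        regerror(errcode, preg, msg, need);     /* fits exactly */
    return msg;
}
```

A fixed buffer also works;
in that case a return value larger than the buffer size
tells the caller that the message was truncated.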
325: .PP
326: If the
327: .I errcode
328: given to
329: .I regerror
330: is first ORed with REG_ITOA,
331: the ``message'' that results is the printable name of the error code,
332: e.g. ``REG_NOMATCH'',
333: rather than an explanation thereof.
334: If
335: .I errcode
336: is REG_ATOI,
337: then
338: .I preg
339: shall be non-NULL and the
340: .I re_endp
341: member of the structure it points to
342: must point to the printable name of an error code;
343: in this case, the result in
344: .I errbuf
345: is the decimal digits of
346: the numeric value of the error code
347: (0 if the name is not recognized).
348: REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
349: they are extensions,
350: compatible with but not specified by POSIX 1003.2,
351: and should be used with
352: caution in software intended to be portable to other systems.
353: Be warned also that they are considered experimental and changes are possible.
354: .PP
355: .I Regfree
356: frees any dynamically-allocated storage associated with the compiled RE
357: pointed to by
358: .IR preg .
359: The remaining
360: .I regex_t
361: is no longer a valid compiled RE
362: and the effect of supplying it to
363: .I regexec
364: or
365: .I regerror
366: is undefined.
367: .PP
368: None of these functions references global variables except for tables
369: of constants;
370: all are safe for use from multiple threads if the arguments are safe.
371: .SH IMPLEMENTATION CHOICES
372: There are a number of decisions that 1003.2 leaves up to the implementor,
373: either by explicitly saying ``undefined'' or because the RE grammar
374: forbids them.
375: This implementation treats them as follows.
376: .PP
377: See
378: .ZR
379: for a discussion of the definition of case-independent matching.
380: .PP
381: There is no particular limit on the length of REs,
382: except insofar as memory is limited.
383: Memory usage is approximately linear in RE size, and largely insensitive
384: to RE complexity, except for bounded repetitions.
385: See BUGS for one short RE using them
386: that will run almost any system out of memory.
387: .PP
388: A backslashed character other than one specifically given a magic meaning
389: by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
390: is taken as an ordinary character.
391: .PP
392: Any unmatched [ is a REG_EBRACK error.
393: .PP
394: Equivalence classes cannot begin or end bracket-expression ranges.
395: The endpoint of one range cannot begin another.
396: .PP
397: RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
398: .PP
399: A repetition operator (?, *, +, or bounds) cannot follow another
400: repetition operator.
401: A repetition operator cannot begin an expression or subexpression
402: or follow `^' or `|'.
403: .PP
404: `|' cannot appear first or last in a (sub)expression or after another `|',
405: i.e. an operand of `|' cannot be an empty subexpression.
406: An empty parenthesized subexpression, `()', is legal and matches an
407: empty (sub)string.
408: An empty string is not a legal RE.
409: .PP
410: A `{' followed by a digit is considered the beginning of bounds for a
411: bounded repetition, which must then follow the syntax for bounds.
412: A `{' \fInot\fR followed by a digit is considered an ordinary character.
413: .PP
414: `^' and `$' beginning and ending subexpressions in obsolete (``basic'')
415: REs are anchors, not ordinary characters.
416: .SH SEE ALSO
417: grep(1), regex(7)
418: .PP
419: POSIX 1003.2, sections 2.8 (Regular Expression Notation)
420: and
421: B.5 (C Binding for Regular Expression Matching).
422: .SH DIAGNOSTICS
423: Non-zero error codes from
424: .I regcomp
425: and
426: .I regexec
427: include the following:
428: .PP
429: .nf
430: .ta \w'REG_ECOLLATE'u+3n
431: REG_NOMATCH	regexec() failed to match
432: REG_BADPAT	invalid regular expression
433: REG_ECOLLATE	invalid collating element
434: REG_ECTYPE	invalid character class
435: REG_EESCAPE	\e applied to unescapable character
436: REG_ESUBREG	invalid backreference number
437: REG_EBRACK	brackets [ ] not balanced
438: REG_EPAREN	parentheses ( ) not balanced
439: REG_EBRACE	braces { } not balanced
440: REG_BADBR	invalid repetition count(s) in { }
441: REG_ERANGE	invalid character range in [ ]
442: REG_ESPACE	ran out of memory
443: REG_BADRPT	?, *, or + operand invalid
444: REG_EMPTY	empty (sub)expression
445: REG_ASSERT	``can't happen''\(emyou found a bug
446: REG_INVARG	invalid argument, e.g. negative-length string
447: .fi
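.PP
Where the name of an error code is wanted without relying on an extension,
a plain switch over the codes is one option.
The sketch below covers only the codes that POSIX 1003.2 specifies;
REG_EMPTY, REG_ASSERT, and REG_INVARG are left out
because other systems may not define them.
The helper name
.I errcode_name
is hypothetical.

```c
#include <sys/types.h>
#include <regex.h>

/*
 * errcode_name() is a hypothetical helper.  It maps an error code
 * from regcomp() or regexec() to its symbolic name.  Codes outside
 * the POSIX set fall through to the default case.
 */
static const char *
errcode_name(int errcode)
{
    switch (errcode) {
    case 0:             return "0 (success)";
    case REG_NOMATCH:   return "REG_NOMATCH";
    case REG_BADPAT:    return "REG_BADPAT";
    case REG_ECOLLATE:  return "REG_ECOLLATE";
    case REG_ECTYPE:    return "REG_ECTYPE";
    case REG_EESCAPE:   return "REG_EESCAPE";
    case REG_ESUBREG:   return "REG_ESUBREG";
    case REG_EBRACK:    return "REG_EBRACK";
    case REG_EPAREN:    return "REG_EPAREN";
    case REG_EBRACE:    return "REG_EBRACE";
    case REG_BADBR:     return "REG_BADBR";
    case REG_ERANGE:    return "REG_ERANGE";
    case REG_ESPACE:    return "REG_ESPACE";
    case REG_BADRPT:    return "REG_BADRPT";
    default:            return "(unknown or implementation-specific)";
    }
}
```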
448: .SH HISTORY
449: Written by Henry Spencer at University of Toronto,
450: henry@zoo.toronto.edu.
451: .SH BUGS
452: This is an alpha release with known defects.
453: Please report problems.
454: .PP
455: There is one known functionality bug.
456: The implementation of internationalization is incomplete:
457: the locale is always assumed to be the default one of 1003.2,
458: and only the collating elements etc. of that locale are available.
459: .PP
460: The back-reference code is subtle and doubts linger about its correctness
461: in complex cases.
462: .PP
463: .I Regexec
464: performance is poor.
465: This will improve with later releases.
466: .I Nmatch
467: exceeding 0 is expensive;
468: .I nmatch
469: exceeding 1 is worse.
470: .I Regexec
471: is largely insensitive to RE complexity \fIexcept\fR that back
472: references are massively expensive.
473: RE length does matter; in particular, there is a strong speed bonus
474: for keeping RE length under about 30 characters,
475: with most special characters counting roughly double.
476: .PP
477: .I Regcomp
478: implements bounded repetitions by macro expansion,
479: which is costly in time and space if counts are large
480: or bounded repetitions are nested.
481: An RE like, say,
482: `((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
483: will (eventually) run almost any existing machine out of swap space.
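.PP
The scale of the problem can be estimated with rough arithmetic.
The sketch below assumes, purely for illustration,
that a bound with upper limit N expands its operand into about N copies,
so that nested bounds multiply;
the actual expansion scheme may differ in detail.
The helper name
.I expansion_upper_bound
is hypothetical.

```c
#include <stddef.h>

/*
 * expansion_upper_bound() is a hypothetical helper.  Given the upper
 * limits of n nested bounded repetitions, it returns the product of
 * the limits: a rough estimate of how many copies of the innermost
 * operand a macro-expanding compiler would generate.  This is an
 * assumption-based estimate, not a measurement.
 */
static unsigned long long
expansion_upper_bound(const unsigned *upper, size_t n)
{
    unsigned long long copies = 1;
    size_t i;

    for (i = 0; i < n; i++)
        copies *= upper[i];
    return copies;
}
```

For the five nested {1,100} bounds in the RE above,
the estimate is 100 multiplied by itself five times,
or 10,000,000,000 copies of `a'.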
484: .PP
485: There are suspected problems with response to obscure error conditions.
486: Notably,
487: certain kinds of internal overflow,
488: produced only by truly enormous REs or by multiply nested bounded repetitions,
489: are probably not handled well.
490: .PP
491: Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
492: a special character only in the presence of a previous unmatched `('.
493: This can't be fixed until the spec is fixed.
494: .PP
495: The standard's definition of back references is vague.
496: For example, does
497: `a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
498: Until the standard is clarified,
499: behavior in such cases should not be relied on.
500: .PP
501: The implementation of word-boundary matching is a bit of a kludge,
502: and bugs may lurk in combinations of word-boundary matching and anchoring.