Annotation of embedaddon/php/ext/ereg/regex/regex.3, revision 1.1
1.1 ! misho 1: .TH REGEX 3 "17 May 1993"
! 2: .BY "Henry Spencer"
! 3: .de ZR
! 4: .\" one other place knows this name: the SEE ALSO section
! 5: .IR regex (7) \\$1
! 6: ..
! 7: .SH NAME
! 8: regcomp, regexec, regerror, regfree \- regular-expression library
! 9: .SH SYNOPSIS
! 10: .ft B
! 11: .\".na
! 12: #include <sys/types.h>
! 13: .br
! 14: #include <regex.h>
! 15: .HP 10
! 16: int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
! 17: .HP
! 18: int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
! 19: size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
! 20: .HP
! 21: size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
! 22: char\ *errbuf, size_t\ errbuf_size);
! 23: .HP
! 24: void\ regfree(regex_t\ *preg);
! 25: .\".ad
! 26: .ft
! 27: .SH DESCRIPTION
! 28: These routines implement POSIX 1003.2 regular expressions (``RE''s);
! 29: see
! 30: .ZR .
! 31: .I Regcomp
! 32: compiles an RE written as a string into an internal form,
! 33: .I regexec
! 34: matches that internal form against a string and reports results,
! 35: .I regerror
! 36: transforms error codes from either into human-readable messages,
! 37: and
! 38: .I regfree
! 39: frees any dynamically-allocated storage used by the internal form
! 40: of an RE.
! 41: .PP
! 42: The header
! 43: .I <regex.h>
! 44: declares two structure types,
! 45: .I regex_t
! 46: and
! 47: .IR regmatch_t ,
! 48: the former for compiled internal forms and the latter for match reporting.
! 49: It also declares the four functions,
! 50: a type
! 51: .IR regoff_t ,
! 52: and a number of constants with names starting with ``REG_''.
! 53: .PP
! 54: .I Regcomp
! 55: compiles the regular expression contained in the
! 56: .I pattern
! 57: string,
! 58: subject to the flags in
! 59: .IR cflags ,
! 60: and places the results in the
! 61: .I regex_t
! 62: structure pointed to by
! 63: .IR preg .
! 64: .I Cflags
! 65: is the bitwise OR of zero or more of the following flags:
! 66: .IP REG_EXTENDED \w'REG_EXTENDED'u+2n
! 67: Compile modern (``extended'') REs,
! 68: rather than the obsolete (``basic'') REs that
! 69: are the default.
! 70: .IP REG_BASIC
! 71: This is a synonym for 0,
! 72: provided as a counterpart to REG_EXTENDED to improve readability.
! 73: .IP REG_NOSPEC
! 74: Compile with recognition of all special characters turned off.
! 75: All characters are thus considered ordinary,
! 76: so the ``RE'' is a literal string.
! 77: This is an extension,
! 78: compatible with but not specified by POSIX 1003.2,
! 79: and should be used with
! 80: caution in software intended to be portable to other systems.
! 81: REG_EXTENDED and REG_NOSPEC may not be used
! 82: in the same call to
! 83: .IR regcomp .
! 84: .IP REG_ICASE
! 85: Compile for matching that ignores upper/lower case distinctions.
! 86: See
! 87: .ZR .
! 88: .IP REG_NOSUB
! 89: Compile for matching that need only report success or failure,
! 90: not what was matched.
! 91: .IP REG_NEWLINE
! 92: Compile for newline-sensitive matching.
! 93: By default, newline is a completely ordinary character with no special
! 94: meaning in either REs or strings.
! 95: With this flag,
! 96: `[^' bracket expressions and `.' never match newline,
! 97: a `^' anchor matches the null string after any newline in the string
! 98: in addition to its normal function,
! 99: and the `$' anchor matches the null string before any newline in the
! 100: string in addition to its normal function.
! 101: .IP REG_PEND
! 102: The regular expression ends,
! 103: not at the first NUL,
! 104: but just before the character pointed to by the
! 105: .I re_endp
! 106: member of the structure pointed to by
! 107: .IR preg .
! 108: The
! 109: .I re_endp
! 110: member is of type
! 111: .IR const\ char\ * .
! 112: This flag permits inclusion of NULs in the RE;
! 113: they are considered ordinary characters.
! 114: This is an extension,
! 115: compatible with but not specified by POSIX 1003.2,
! 116: and should be used with
! 117: caution in software intended to be portable to other systems.
! 118: .PP
! 119: When successful,
! 120: .I regcomp
! 121: returns 0 and fills in the structure pointed to by
! 122: .IR preg .
! 123: One member of that structure
! 124: (other than
! 125: .IR re_endp )
! 126: is publicized:
! 127: .IR re_nsub ,
! 128: of type
! 129: .IR size_t ,
! 130: contains the number of parenthesized subexpressions within the RE
! 131: (except that the value of this member is undefined if the
! 132: REG_NOSUB flag was used).
! 133: If
! 134: .I regcomp
! 135: fails, it returns a non-zero error code;
! 136: see DIAGNOSTICS.
! 137: .PP
! 138: .I Regexec
! 139: matches the compiled RE pointed to by
! 140: .I preg
! 141: against the
! 142: .IR string ,
! 143: subject to the flags in
! 144: .IR eflags ,
! 145: and reports results using
! 146: .IR nmatch ,
! 147: .IR pmatch ,
! 148: and the returned value.
! 149: The RE must have been compiled by a previous invocation of
! 150: .IR regcomp .
! 151: The compiled form is not altered during execution of
! 152: .IR regexec ,
! 153: so a single compiled RE can be used simultaneously by multiple threads.
! 154: .PP
! 155: By default,
! 156: the NUL-terminated string pointed to by
! 157: .I string
! 158: is considered to be the text of an entire line, minus any terminating
! 159: newline.
! 160: The
! 161: .I eflags
! 162: argument is the bitwise OR of zero or more of the following flags:
! 163: .IP REG_NOTBOL \w'REG_STARTEND'u+2n
! 164: The first character of
! 165: the string
! 166: is not the beginning of a line, so the `^' anchor should not match before it.
! 167: This does not affect the behavior of newlines under REG_NEWLINE.
! 168: .IP REG_NOTEOL
! 169: The NUL terminating
! 170: the string
! 171: does not end a line, so the `$' anchor should not match before it.
! 172: This does not affect the behavior of newlines under REG_NEWLINE.
! 173: .IP REG_STARTEND
! 174: The string is considered to start at
! 175: \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
! 176: and to have a terminating NUL located at
! 177: \fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
! 178: (there need not actually be a NUL at that location),
! 179: regardless of the value of
! 180: .IR nmatch .
! 181: See below for the definition of
! 182: .IR pmatch
! 183: and
! 184: .IR nmatch .
! 185: This is an extension,
! 186: compatible with but not specified by POSIX 1003.2,
! 187: and should be used with
! 188: caution in software intended to be portable to other systems.
! 189: Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
! 190: REG_STARTEND affects only the location of the string,
! 191: not how it is matched.
! 192: .PP
! 193: See
! 194: .ZR
! 195: for a discussion of what is matched in situations where an RE or a
! 196: portion thereof could match any of several substrings of
! 197: .IR string .
! 198: .PP
! 199: Normally,
! 200: .I regexec
! 201: returns 0 for success and the non-zero code REG_NOMATCH for failure.
! 202: Other non-zero error codes may be returned in exceptional situations;
! 203: see DIAGNOSTICS.
! 204: .PP
! 205: If REG_NOSUB was specified in the compilation of the RE,
! 206: or if
! 207: .I nmatch
! 208: is 0,
! 209: .I regexec
! 210: ignores the
! 211: .I pmatch
! 212: argument (but see below for the case where REG_STARTEND is specified).
! 213: Otherwise,
! 214: .I pmatch
! 215: points to an array of
! 216: .I nmatch
! 217: structures of type
! 218: .IR regmatch_t .
! 219: Such a structure has at least the members
! 220: .I rm_so
! 221: and
! 222: .IR rm_eo ,
! 223: both of type
! 224: .I regoff_t
! 225: (a signed arithmetic type at least as large as an
! 226: .I off_t
! 227: and an
! 228: .IR ssize_t ),
! 229: containing respectively the offset of the first character of a substring
! 230: and the offset of the first character after the end of the substring.
! 231: Offsets are measured from the beginning of the
! 232: .I string
! 233: argument given to
! 234: .IR regexec .
! 235: An empty substring is denoted by equal offsets,
! 236: both indicating the character following the empty substring.
! 237: .PP
! 238: The 0th member of the
! 239: .I pmatch
! 240: array is filled in to indicate what substring of
! 241: .I string
! 242: was matched by the entire RE.
! 243: Remaining members report what substring was matched by parenthesized
! 244: subexpressions within the RE;
! 245: member
! 246: .I i
! 247: reports subexpression
! 248: .IR i ,
! 249: with subexpressions counted (starting at 1) by the order of their opening
! 250: parentheses in the RE, left to right.
! 251: Unused entries in the array\(emcorresponding either to subexpressions that
! 252: did not participate in the match at all, or to subexpressions that do not
! 253: exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
! 254: .I rm_so
! 255: and
! 256: .I rm_eo
! 257: set to \-1.
! 258: If a subexpression participated in the match several times,
! 259: the reported substring is the last one it matched.
! 260: (Note, as a particular example, that when the RE `(b*)+' matches `bbb',
! 261: the parenthesized subexpression matches each of the three `b's and then
! 262: an infinite number of empty strings following the last `b',
! 263: so the reported substring is one of the empties.)
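The use of the pmatch array can be sketched as follows. The helper capture, the MAX_GROUPS limit, and the return convention are assumptions made for this illustration; the offsets rm_so and rm_eo and the member re_nsub are used as described above.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <regex.h>

#define MAX_GROUPS 10   /* arbitrary limit chosen for this sketch */

/*
 * Hypothetical helper: match an extended RE against `text` and copy the
 * substring reported for subexpression `group` (0 = whole match) into
 * `out`.  Returns the substring length, -1 if nothing matched or the
 * subexpression did not participate (rm_so == -1), and -2 on a compile
 * error, a nonexistent group, or a buffer that is too small.
 */
long
capture(const char *pattern, const char *text, size_t group,
    char *out, size_t outsz)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    long len = -1;

    assert(pattern != NULL && text != NULL && out != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -2;

    if (group >= MAX_GROUPS || group > re.re_nsub) {
        len = -2;       /* no such subexpression in this RE */
    } else if (regexec(&re, text, MAX_GROUPS, pm, 0) == 0 &&
        pm[group].rm_so != -1) {
        /* rm_so: first character; rm_eo: first character after the end. */
        size_t n = (size_t)(pm[group].rm_eo - pm[group].rm_so);

        if (n < outsz) {
            memcpy(out, text + pm[group].rm_so, n);
            out[n] = '\0';
            len = (long)n;
        } else {
            len = -2;
        }
    }
    regfree(&re);
    return len;
}
```

Subexpressions are numbered by their opening parentheses, so with the RE "([a-z]+)@([a-z.]+)" group 1 is the text before the `@' and group 2 the text after it.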
! 264: .PP
! 265: If REG_STARTEND is specified,
! 266: .I pmatch
! 267: must point to at least one
! 268: .I regmatch_t
! 269: (even if
! 270: .I nmatch
! 271: is 0 or REG_NOSUB was specified),
! 272: to hold the input offsets for REG_STARTEND.
! 273: Use for output is still entirely controlled by
! 274: .IR nmatch ;
! 275: if
! 276: .I nmatch
! 277: is 0 or REG_NOSUB was specified,
! 278: the value of
! 279: .IR pmatch [0]
! 280: will not be changed by a successful
! 281: .IR regexec .
! 282: .PP
! 283: .I Regerror
! 284: maps a non-zero
! 285: .I errcode
! 286: from either
! 287: .I regcomp
! 288: or
! 289: .I regexec
! 290: to a human-readable, printable message.
! 291: If
! 292: .I preg
! 293: is non-NULL,
! 294: the error code should have arisen from use of
! 295: the
! 296: .I regex_t
! 297: pointed to by
! 298: .IR preg ,
! 299: and if the error code came from
! 300: .IR regcomp ,
! 301: it should have been the result from the most recent
! 302: .I regcomp
! 303: using that
! 304: .IR regex_t .
! 305: .RI ( Regerror
! 306: may be able to supply a more detailed message using information
! 307: from the
! 308: .IR regex_t .)
! 309: .I Regerror
! 310: places the NUL-terminated message into the buffer pointed to by
! 311: .IR errbuf ,
! 312: limiting the length (including the NUL) to at most
! 313: .I errbuf_size
! 314: bytes.
! 315: If the whole message won't fit,
! 316: as much of it as will fit before the terminating NUL is supplied.
! 317: In any case,
! 318: the returned value is the size of buffer needed to hold the whole
! 319: message (including terminating NUL).
! 320: If
! 321: .I errbuf_size
! 322: is 0,
! 323: .I errbuf
! 324: is ignored but the return value is still correct.
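The sizing behavior described above permits a two-call pattern: ask for the required size first, then fetch the message. The sketch below assumes a hypothetical helper named compile_error_message; it is an illustration, not an interface of this library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: try to compile `pattern`; on failure return a
 * malloc()ed copy of the regerror() message, which the caller frees.
 * Returns NULL if the pattern compiles or if memory is exhausted.
 */
char *
compile_error_message(const char *pattern, int cflags)
{
    regex_t re;
    char *msg;
    size_t need;
    int rc;

    assert(pattern != NULL);

    rc = regcomp(&re, pattern, cflags);
    if (rc == 0) {
        regfree(&re);
        return NULL;
    }

    /* errbuf_size 0: errbuf is ignored, the needed size is returned. */
    need = regerror(rc, &re, NULL, 0);
    msg = malloc(need);
    if (msg != NULL)
        (void)regerror(rc, &re, msg, need);     /* size includes the NUL */
    return msg;
}
```

The regex_t passed to regerror here is the one used by the failing regcomp call, as the text above requires.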
! 325: .PP
! 326: If the
! 327: .I errcode
! 328: given to
! 329: .I regerror
! 330: is first ORed with REG_ITOA,
! 331: the ``message'' that results is the printable name of the error code,
! 332: e.g. ``REG_NOMATCH'',
! 333: rather than an explanation thereof.
! 334: If
! 335: .I errcode
! 336: is REG_ATOI,
! 337: then
! 338: .I preg
! 339: shall be non-NULL and the
! 340: .I re_endp
! 341: member of the structure it points to
! 342: must point to the printable name of an error code;
! 343: in this case, the result in
! 344: .I errbuf
! 345: is the decimal digits of
! 346: the numeric value of the error code
! 347: (0 if the name is not recognized).
! 348: REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
! 349: they are extensions,
! 350: compatible with but not specified by POSIX 1003.2,
! 351: and should be used with
! 352: caution in software intended to be portable to other systems.
! 353: Be warned also that they are considered experimental and changes are possible.
! 354: .PP
! 355: .I Regfree
! 356: frees any dynamically-allocated storage associated with the compiled RE
! 357: pointed to by
! 358: .IR preg .
! 359: The remaining
! 360: .I regex_t
! 361: is no longer a valid compiled RE
! 362: and the effect of supplying it to
! 363: .I regexec
! 364: or
! 365: .I regerror
! 366: is undefined.
! 367: .PP
! 368: None of these functions references global variables except for tables
! 369: of constants;
! 370: all are safe for use from multiple threads if the arguments are safe.
! 371: .SH IMPLEMENTATION CHOICES
! 372: There are a number of decisions that 1003.2 leaves up to the implementor,
! 373: either by explicitly saying ``undefined'' or by virtue of them being
! 374: forbidden by the RE grammar.
! 375: This implementation treats them as follows.
! 376: .PP
! 377: See
! 378: .ZR
! 379: for a discussion of the definition of case-independent matching.
! 380: .PP
! 381: There is no particular limit on the length of REs,
! 382: except insofar as memory is limited.
! 383: Memory usage is approximately linear in RE size, and largely insensitive
! 384: to RE complexity, except for bounded repetitions.
! 385: See BUGS for one short RE using them
! 386: that will run almost any system out of memory.
! 387: .PP
! 388: A backslashed character other than one specifically given a magic meaning
! 389: by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
! 390: is taken as an ordinary character.
! 391: .PP
! 392: Any unmatched [ is a REG_EBRACK error.
! 393: .PP
! 394: Equivalence classes cannot begin or end bracket-expression ranges.
! 395: The endpoint of one range cannot begin another.
! 396: .PP
! 397: RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
! 398: .PP
! 399: A repetition operator (?, *, +, or bounds) cannot follow another
! 400: repetition operator.
! 401: A repetition operator cannot begin an expression or subexpression
! 402: or follow `^' or `|'.
! 403: .PP
! 404: `|' cannot appear first or last in a (sub)expression or after another `|',
! 405: i.e. an operand of `|' cannot be an empty subexpression.
! 406: An empty parenthesized subexpression, `()', is legal and matches an
! 407: empty (sub)string.
! 408: An empty string is not a legal RE.
! 409: .PP
! 410: A `{' followed by a digit is considered the beginning of bounds for a
! 411: bounded repetition, which must then follow the syntax for bounds.
! 412: A `{' \fInot\fR followed by a digit is considered an ordinary character.
! 413: .PP
! 414: `^' and `$' beginning and ending subexpressions in obsolete (``basic'')
! 415: REs are anchors, not ordinary characters.
! 416: .SH SEE ALSO
! 417: grep(1), regex(7)
! 418: .PP
! 419: POSIX 1003.2, sections 2.8 (Regular Expression Notation)
! 420: and
! 421: B.5 (C Binding for Regular Expression Matching).
! 422: .SH DIAGNOSTICS
! 423: Non-zero error codes from
! 424: .I regcomp
! 425: and
! 426: .I regexec
! 427: include the following:
! 428: .PP
! 429: .nf
! 430: .ta \w'REG_ECOLLATE'u+3n
! 431: REG_NOMATCH	regexec() failed to match
! 432: REG_BADPAT	invalid regular expression
! 433: REG_ECOLLATE	invalid collating element
! 434: REG_ECTYPE	invalid character class
! 435: REG_EESCAPE	\e applied to unescapable character
! 436: REG_ESUBREG	invalid backreference number
! 437: REG_EBRACK	brackets [ ] not balanced
! 438: REG_EPAREN	parentheses ( ) not balanced
! 439: REG_EBRACE	braces { } not balanced
! 440: REG_BADBR	invalid repetition count(s) in { }
! 441: REG_ERANGE	invalid character range in [ ]
! 442: REG_ESPACE	ran out of memory
! 443: REG_BADRPT	?, *, or + operand invalid
! 444: REG_EMPTY	empty (sub)expression
! 445: REG_ASSERT	``can't happen''\(emyou found a bug
! 446: REG_INVARG	invalid argument, e.g. negative-length string
! 447: .fi
! 448: .SH HISTORY
! 449: Written by Henry Spencer at University of Toronto,
! 450: henry@zoo.toronto.edu.
! 451: .SH BUGS
! 452: This is an alpha release with known defects.
! 453: Please report problems.
! 454: .PP
! 455: There is one known functionality bug.
! 456: The implementation of internationalization is incomplete:
! 457: the locale is always assumed to be the default one of 1003.2,
! 458: and only the collating elements etc. of that locale are available.
! 459: .PP
! 460: The back-reference code is subtle and doubts linger about its correctness
! 461: in complex cases.
! 462: .PP
! 463: .I Regexec
! 464: performance is poor.
! 465: This will improve with later releases.
! 466: .I Nmatch
! 467: exceeding 0 is expensive;
! 468: .I nmatch
! 469: exceeding 1 is worse.
! 470: .I Regexec
! 471: is largely insensitive to RE complexity \fIexcept\fR that back
! 472: references are massively expensive.
! 473: RE length does matter; in particular, there is a strong speed bonus
! 474: for keeping RE length under about 30 characters,
! 475: with most special characters counting roughly double.
! 476: .PP
! 477: .I Regcomp
! 478: implements bounded repetitions by macro expansion,
! 479: which is costly in time and space if counts are large
! 480: or bounded repetitions are nested.
! 481: An RE like, say,
! 482: `((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
! 483: will (eventually) run almost any existing machine out of swap space.
! 484: .PP
! 485: There are suspected problems with response to obscure error conditions.
! 486: Notably,
! 487: certain kinds of internal overflow,
! 488: produced only by truly enormous REs or by multiply nested bounded repetitions,
! 489: are probably not handled well.
! 490: .PP
! 491: Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
! 492: a special character only in the presence of a previous unmatched `('.
! 493: This can't be fixed until the spec is fixed.
! 494: .PP
! 495: The standard's definition of back references is vague.
! 496: For example, does
! 497: `a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
! 498: Until the standard is clarified,
! 499: behavior in such cases should not be relied on.
! 500: .PP
! 501: The implementation of word-boundary matching is a bit of a kludge,
! 502: and bugs may lurk in combinations of word-boundary matching and anchoring.