Annotation of embedaddon/php/ext/ereg/regex/regex.7, revision 1.1
1.1 ! misho 1: .TH REGEX 7 "7 Feb 1994"
! 2: .BY "Henry Spencer"
! 3: .SH NAME
! 4: regex \- POSIX 1003.2 regular expressions
! 5: .SH DESCRIPTION
! 6: Regular expressions (``RE''s),
! 7: as defined in POSIX 1003.2, come in two forms:
! 8: modern REs (roughly those of
! 9: .IR egrep ;
! 10: 1003.2 calls these ``extended'' REs)
! 11: and obsolete REs (roughly those of
! 12: .IR ed ;
! 13: 1003.2 ``basic'' REs).
! 14: Obsolete REs mostly exist for backward compatibility in some old programs;
! 15: they will be discussed at the end.
! 16: 1003.2 leaves some aspects of RE syntax and semantics open;
! 17: `\(dg' marks decisions on these aspects that
! 18: may not be fully portable to other 1003.2 implementations.
! 19: .PP
! 20: A (modern) RE is one\(dg or more non-empty\(dg \fIbranches\fR,
! 21: separated by `|'.
! 22: It matches anything that matches one of the branches.
! 23: .PP
! 24: A branch is one\(dg or more \fIpieces\fR, concatenated.
! 25: It matches a match for the first piece, followed by a match for the second, etc.
! 26: .PP
! 27: A piece is an \fIatom\fR possibly followed
! 28: by a single\(dg `*', `+', `?', or \fIbound\fR.
! 29: An atom followed by `*' matches a sequence of 0 or more matches of the atom.
! 30: An atom followed by `+' matches a sequence of 1 or more matches of the atom.
! 31: An atom followed by `?' matches a sequence of 0 or 1 matches of the atom.
! 32: .PP
! 33: A \fIbound\fR is `{' followed by an unsigned decimal integer,
! 34: possibly followed by `,'
! 35: possibly followed by another unsigned decimal integer,
! 36: always followed by `}'.
! 37: The integers must lie between 0 and RE_DUP_MAX (255\(dg) inclusive,
! 38: and if there are two of them, the first may not exceed the second.
! 39: An atom followed by a bound containing one integer \fIi\fR
! 40: and no comma matches
! 41: a sequence of exactly \fIi\fR matches of the atom.
! 42: An atom followed by a bound
! 43: containing one integer \fIi\fR and a comma matches
! 44: a sequence of \fIi\fR or more matches of the atom.
! 45: An atom followed by a bound
! 46: containing two integers \fIi\fR and \fIj\fR matches
! 47: a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom.
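.PP
The three forms of bound can be tried out through regex(3).
The following is a minimal sketch, assuming a POSIX regcomp()/regexec()
and modern REs (REG_EXTENDED);
full_match is a hypothetical helper name, not part of any library,
and the 512-byte buffer is an arbitrary choice for the illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Return 1 if the modern (extended) RE `re` matches all of `text`,
 * 0 if it does not, and -1 if the RE is too long or fails to compile.
 * full_match is a hypothetical helper, not part of regex(3). */
int full_match(const char *re, const char *text)
{
    char anchored[512];
    regex_t preg;
    int n, rc;

    assert(re != NULL && text != NULL);

    /* Wrap the RE as ^(re)$ so that the bound must account for
     * every character of the text. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", re);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;

    if (regcomp(&preg, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&preg, text, 0, NULL, 0);
    regfree(&preg);
    return rc == 0;
}
```

Under these assumptions, full_match("a{2}", "aa") and
full_match("a{2,}", "aaaaa") should report a match,
while full_match("a{1,3}", "aaaa") should not.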
! 48: .PP
! 49: An atom is a regular expression enclosed in `()' (matching a match for the
! 50: regular expression),
! 51: an empty set of `()' (matching the null string)\(dg,
! 52: a \fIbracket expression\fR (see below), `.'
! 53: (matching any single character), `^' (matching the null string at the
! 54: beginning of a line), `$' (matching the null string at the
! 55: end of a line), a `\e' followed by one of the characters
! 56: `^.[$()|*+?{\e'
! 57: (matching that character taken as an ordinary character),
! 58: a `\e' followed by any other character\(dg
! 59: (matching that character taken as an ordinary character,
! 60: as if the `\e' had not been present\(dg),
! 61: or a single character with no other significance (matching that character).
! 62: A `{' followed by a character other than a digit is an ordinary
! 63: character, not the beginning of a bound\(dg.
! 64: It is illegal to end an RE with `\e'.
! 65: .PP
! 66: A \fIbracket expression\fR is a list of characters enclosed in `[]'.
! 67: It normally matches any single character from the list (but see below).
! 68: If the list begins with `^',
! 69: it matches any single character
! 70: (but see below) \fInot\fR from the rest of the list.
! 71: If two characters in the list are separated by `\-', this is shorthand
! 72: for the full \fIrange\fR of characters between those two (inclusive) in the
! 73: collating sequence,
! 74: e.g. `[0-9]' in ASCII matches any decimal digit.
! 75: It is illegal\(dg for two ranges to share an
! 76: endpoint, e.g. `a-c-e'.
! 77: Ranges are very collating-sequence-dependent,
! 78: and portable programs should avoid relying on them.
! 79: .PP
! 80: To include a literal `]' in the list, make it the first character
! 81: (following a possible `^').
! 82: To include a literal `\-', make it the first or last character,
! 83: or the second endpoint of a range.
! 84: To use a literal `\-' as the first endpoint of a range,
! 85: enclose it in `[.' and `.]' to make it a collating element (see below).
! 86: With the exception of these and some combinations using `[' (see next
! 87: paragraphs), all other special characters, including `\e', lose their
! 88: special significance within a bracket expression.
! 89: .PP
! 90: Within a bracket expression, a collating element (a character,
! 91: a multi-character sequence that collates as if it were a single character,
! 92: or a collating-sequence name for either)
! 93: enclosed in `[.' and `.]' stands for the
! 94: sequence of characters of that collating element.
! 95: The sequence is a single element of the bracket expression's list.
! 96: A bracket expression containing a multi-character collating element
! 97: can thus match more than one character,
! 98: e.g. if the collating sequence includes a `ch' collating element,
! 99: then the RE `[[.ch.]]*c' matches the first five characters
! 100: of `chchcc'.
! 101: .PP
! 102: Within a bracket expression, a collating element enclosed in `[=' and
! 103: `=]' is an equivalence class, standing for the sequences of characters
! 104: of all collating elements equivalent to that one, including itself.
! 105: (If there are no other equivalent collating elements,
! 106: the treatment is as if the enclosing delimiters were `[.' and `.]'.)
! 107: For example, if o and \o'o^' are the members of an equivalence class,
! 108: then `[[=o=]]', `[[=\o'o^'=]]', and `[o\o'o^']' are all synonymous.
! 109: An equivalence class may not\(dg be an endpoint
! 110: of a range.
! 111: .PP
! 112: Within a bracket expression, the name of a \fIcharacter class\fR enclosed
! 113: in `[:' and `:]' stands for the list of all characters belonging to that
! 114: class.
! 115: Standard character class names are:
! 116: .PP
! 117: .RS
! 118: .nf
! 119: .ta 3c 6c 9c
! 120: alnum digit punct
! 121: alpha graph space
! 122: blank lower upper
! 123: cntrl print xdigit
! 124: .fi
! 125: .RE
! 126: .PP
! 127: These stand for the character classes defined in
! 128: .IR ctype (3).
! 129: A locale may provide others.
! 130: A character class may not be used as an endpoint of a range.
! 131: .PP
! 132: There are two special cases\(dg of bracket expressions:
! 133: the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at
! 134: the beginning and end of a word respectively.
! 135: A word is defined as a sequence of
! 136: word characters
! 137: which is neither preceded nor followed by
! 138: word characters.
! 139: A word character is an
! 140: .I alnum
! 141: character (as defined by
! 142: .IR ctype (3))
! 143: or an underscore.
! 144: This is an extension,
! 145: compatible with but not specified by POSIX 1003.2,
! 146: and should be used with
! 147: caution in software intended to be portable to other systems.
! 148: .PP
! 149: In the event that an RE could match more than one substring of a given
! 150: string,
! 151: the RE matches the one starting earliest in the string.
! 152: If the RE could match more than one substring starting at that point,
! 153: it matches the longest.
! 154: Subexpressions also match the longest possible substrings, subject to
! 155: the constraint that the whole match be as long as possible,
! 156: with subexpressions starting earlier in the RE taking priority over
! 157: ones starting later.
! 158: Note that higher-level subexpressions thus take priority over
! 159: their lower-level component subexpressions.
! 160: .PP
! 161: Match lengths are measured in characters, not collating elements.
! 162: A null string is considered longer than no match at all.
! 163: For example,
! 164: `bb*' matches the three middle characters of `abbbc',
! 165: `(wee|week)(knights|nights)' matches all ten characters of `weeknights',
! 166: when `(.*).*' is matched against `abc' the parenthesized subexpression
! 167: matches all three characters, and
! 168: when `(a*)*' is matched against `bc' both the whole RE and the parenthesized
! 169: subexpression match the null string.
! 170: .PP
! 171: If case-independent matching is specified,
! 172: the effect is much as if all case distinctions had vanished from the
! 173: alphabet.
! 174: When an alphabetic character that exists in multiple cases appears as an
! 175: ordinary character outside a bracket expression, it is effectively
! 176: transformed into a bracket expression containing both cases,
! 177: e.g. `x' becomes `[xX]'.
! 178: When it appears inside a bracket expression, all case counterparts
! 179: of it are added to the bracket expression, so that (e.g.) `[x]'
! 180: becomes `[xX]' and `[^x]' becomes `[^xX]'.
! 181: .PP
! 182: No particular limit is imposed on the length of REs\(dg.
! 183: Programs intended to be portable should not employ REs longer
! 184: than 256 bytes,
! 185: as an implementation can refuse to accept such REs and remain
! 186: POSIX-compliant.
! 187: .PP
! 188: Obsolete (``basic'') regular expressions differ in several respects.
! 189: `|', `+', and `?' are ordinary characters and there is no equivalent
! 190: for their functionality.
! 191: The delimiters for bounds are `\e{' and `\e}',
! 192: with `{' and `}' by themselves ordinary characters.
! 193: The parentheses for nested subexpressions are `\e(' and `\e)',
! 194: with `(' and `)' by themselves ordinary characters.
! 195: `^' is an ordinary character except at the beginning of the
! 196: RE or\(dg the beginning of a parenthesized subexpression,
! 197: `$' is an ordinary character except at the end of the
! 198: RE or\(dg the end of a parenthesized subexpression,
! 199: and `*' is an ordinary character if it appears at the beginning of the
! 200: RE or the beginning of a parenthesized subexpression
! 201: (after a possible leading `^').
! 202: Finally, there is one new type of atom, a \fIback reference\fR:
! 203: `\e' followed by a non-zero decimal digit \fId\fR
! 204: matches the same sequence of characters
! 205: matched by the \fId\fRth parenthesized subexpression
! 206: (numbering subexpressions by the positions of their opening parentheses,
! 207: left to right),
! 208: so that (e.g.) `\e([bc]\e)\e1' matches `bb' or `cc' but not `bc'.
! 209: .SH SEE ALSO
! 210: regex(3)
! 211: .PP
! 212: POSIX 1003.2, section 2.8 (Regular Expression Notation).
! 213: .SH BUGS
! 214: Having two kinds of REs is a botch.
! 215: .PP
! 216: The current 1003.2 spec says that `)' is an ordinary character in
! 217: the absence of an unmatched `(';
! 218: this was an unintentional result of a wording error,
! 219: and change is likely.
! 220: Avoid relying on it.
! 221: .PP
! 222: Back references are a dreadful botch,
! 223: posing major problems for efficient implementations.
! 224: They are also somewhat vaguely defined
! 225: (does
! 226: `a\e(\e(b\e)*\e2\e)*d' match `abbbd'?).
! 227: Avoid using them.
! 228: .PP
! 229: 1003.2's specification of case-independent matching is vague.
! 230: The ``one case implies all cases'' definition given above
! 231: is the current consensus among implementors as to the right interpretation.
! 232: .PP
! 233: The syntax for word boundaries is incredibly ugly.