Annotation of embedaddon/php/ext/ereg/regex/regex.7, revision 1.1

1.1     ! misho       1: .TH REGEX 7 "7 Feb 1994"
        !             2: .BY "Henry Spencer"
        !             3: .SH NAME
        !             4: regex \- POSIX 1003.2 regular expressions
        !             5: .SH DESCRIPTION
        !             6: Regular expressions (``RE''s),
        !             7: as defined in POSIX 1003.2, come in two forms:
        !             8: modern REs (roughly those of
        !             9: .IR egrep ;
        !            10: 1003.2 calls these ``extended'' REs)
        !            11: and obsolete REs (roughly those of
        !            12: .IR ed ;
        !            13: 1003.2 ``basic'' REs).
        !            14: Obsolete REs mostly exist for backward compatibility in some old programs;
        !            15: they will be discussed at the end.
        !            16: 1003.2 leaves some aspects of RE syntax and semantics open;
        !            17: `\(dg' marks decisions on these aspects that
        !            18: may not be fully portable to other 1003.2 implementations.
        !            19: .PP
        !            20: A (modern) RE is one\(dg or more non-empty\(dg \fIbranches\fR,
        !            21: separated by `|'.
        !            22: It matches anything that matches one of the branches.
        !            23: .PP
        !            24: A branch is one\(dg or more \fIpieces\fR, concatenated.
        !            25: It matches a match for the first, followed by a match for the second, etc.
        !            26: .PP
        !            27: A piece is an \fIatom\fR possibly followed
        !            28: by a single\(dg `*', `+', `?', or \fIbound\fR.
        !            29: An atom followed by `*' matches a sequence of 0 or more matches of the atom.
        !            30: An atom followed by `+' matches a sequence of 1 or more matches of the atom.
        !            31: An atom followed by `?' matches a sequence of 0 or 1 matches of the atom.
        !            32: .PP
        !            33: A \fIbound\fR is `{' followed by an unsigned decimal integer,
        !            34: possibly followed by `,'
        !            35: possibly followed by another unsigned decimal integer,
        !            36: always followed by `}'.
        !            37: The integers must lie between 0 and RE_DUP_MAX (255\(dg) inclusive,
        !            38: and if there are two of them, the first may not exceed the second.
        !            39: An atom followed by a bound containing one integer \fIi\fR
        !            40: and no comma matches
        !            41: a sequence of exactly \fIi\fR matches of the atom.
        !            42: An atom followed by a bound
        !            43: containing one integer \fIi\fR and a comma matches
        !            44: a sequence of \fIi\fR or more matches of the atom.
        !            45: An atom followed by a bound
        !            46: containing two integers \fIi\fR and \fIj\fR matches
        !            47: a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom.
        !            48: .PP
        !            49: An atom is a regular expression enclosed in `()' (matching a match for the
        !            50: regular expression),
        !            51: an empty set of `()' (matching the null string)\(dg,
        !            52: a \fIbracket expression\fR (see below), `.'
        !            53: (matching any single character), `^' (matching the null string at the
        !            54: beginning of a line), `$' (matching the null string at the
        !            55: end of a line), a `\e' followed by one of the characters
        !            56: `^.[$()|*+?{\e'
        !            57: (matching that character taken as an ordinary character),
        !            58: a `\e' followed by any other character\(dg
        !            59: (matching that character taken as an ordinary character,
        !            60: as if the `\e' had not been present\(dg),
        !            61: or a single character with no other significance (matching that character).
        !            62: A `{' followed by a character other than a digit is an ordinary
        !            63: character, not the beginning of a bound\(dg.
        !            64: It is illegal to end an RE with `\e'.
        !            65: .PP
        !            66: A \fIbracket expression\fR is a list of characters enclosed in `[]'.
        !            67: It normally matches any single character from the list (but see below).
        !            68: If the list begins with `^',
        !            69: it matches any single character
        !            70: (but see below) \fInot\fR from the rest of the list.
        !            71: If two characters in the list are separated by `\-', this is shorthand
        !            72: for the full \fIrange\fR of characters between those two (inclusive) in the
        !            73: collating sequence,
        !            74: e.g. `[0-9]' in ASCII matches any decimal digit.
        !            75: It is illegal\(dg for two ranges to share an
        !            76: endpoint, e.g. `a-c-e'.
        !            77: Ranges are very collating-sequence-dependent,
        !            78: and portable programs should avoid relying on them.
        !            79: .PP
        !            80: To include a literal `]' in the list, make it the first character
        !            81: (following a possible `^').
        !            82: To include a literal `\-', make it the first or last character,
        !            83: or the second endpoint of a range.
        !            84: To use a literal `\-' as the first endpoint of a range,
        !            85: enclose it in `[.' and `.]' to make it a collating element (see below).
        !            86: With the exception of these and some combinations using `[' (see next
        !            87: paragraphs), all other special characters, including `\e', lose their
        !            88: special significance within a bracket expression.
        !            89: .PP
        !            90: Within a bracket expression, a collating element (a character,
        !            91: a multi-character sequence that collates as if it were a single character,
        !            92: or a collating-sequence name for either)
        !            93: enclosed in `[.' and `.]' stands for the
        !            94: sequence of characters of that collating element.
        !            95: The sequence is a single element of the bracket expression's list.
        !            96: A bracket expression containing a multi-character collating element 
        !            97: can thus match more than one character,
        !            98: e.g. if the collating sequence includes a `ch' collating element,
        !            99: then the RE `[[.ch.]]*c' matches the first five characters
        !           100: of `chchcc'.
        !           101: .PP
        !           102: Within a bracket expression, a collating element enclosed in `[=' and
        !           103: `=]' is an equivalence class, standing for the sequences of characters
        !           104: of all collating elements equivalent to that one, including itself.
        !           105: (If there are no other equivalent collating elements,
        !           106: the treatment is as if the enclosing delimiters were `[.' and `.]'.)
        !           107: For example, if o and \o'o^' are the members of an equivalence class,
        !           108: then `[[=o=]]', `[[=\o'o^'=]]', and `[o\o'o^']' are all synonymous.
        !           109: An equivalence class may not\(dg be an endpoint
        !           110: of a range.
        !           111: .PP
        !           112: Within a bracket expression, the name of a \fIcharacter class\fR enclosed
        !           113: in `[:' and `:]' stands for the list of all characters belonging to that
        !           114: class.
        !           115: Standard character class names are:
        !           116: .PP
        !           117: .RS
        !           118: .nf
        !           119: .ta 3c 6c 9c
        !           120: alnum  digit   punct
        !           121: alpha  graph   space
        !           122: blank  lower   upper
        !           123: cntrl  print   xdigit
        !           124: .fi
        !           125: .RE
        !           126: .PP
        !           127: These stand for the character classes defined in
        !           128: .IR ctype (3).
        !           129: A locale may provide others.
        !           130: A character class may not be used as an endpoint of a range.
        !           131: .PP
        !           132: There are two special cases\(dg of bracket expressions:
        !           133: the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at
        !           134: the beginning and end of a word respectively.
        !           135: A word is defined as a sequence of
        !           136: word characters
        !           137: which is neither preceded nor followed by
        !           138: word characters.
        !           139: A word character is an
        !           140: .I alnum
        !           141: character (as defined by
        !           142: .IR ctype (3))
        !           143: or an underscore.
        !           144: This is an extension,
        !           145: compatible with but not specified by POSIX 1003.2,
        !           146: and should be used with
        !           147: caution in software intended to be portable to other systems.
        !           148: .PP
        !           149: In the event that an RE could match more than one substring of a given
        !           150: string,
        !           151: the RE matches the one starting earliest in the string.
        !           152: If the RE could match more than one substring starting at that point,
        !           153: it matches the longest.
        !           154: Subexpressions also match the longest possible substrings, subject to
        !           155: the constraint that the whole match be as long as possible,
        !           156: with subexpressions starting earlier in the RE taking priority over
        !           157: ones starting later.
        !           158: Note that higher-level subexpressions thus take priority over
        !           159: their lower-level component subexpressions.
        !           160: .PP
        !           161: Match lengths are measured in characters, not collating elements.
        !           162: A null string is considered longer than no match at all.
        !           163: For example,
        !           164: `bb*' matches the three middle characters of `abbbc',
        !           165: `(wee|week)(knights|nights)' matches all ten characters of `weeknights',
        !           166: when `(.*).*' is matched against `abc' the parenthesized subexpression
        !           167: matches all three characters, and
        !           168: when `(a*)*' is matched against `bc' both the whole RE and the parenthesized
        !           169: subexpression match the null string.
        !           170: .PP
        !           171: If case-independent matching is specified,
        !           172: the effect is much as if all case distinctions had vanished from the
        !           173: alphabet.
        !           174: When an alphabetic that exists in multiple cases appears as an
        !           175: ordinary character outside a bracket expression, it is effectively
        !           176: transformed into a bracket expression containing both cases,
        !           177: e.g. `x' becomes `[xX]'.
        !           178: When it appears inside a bracket expression, all case counterparts
        !           179: of it are added to the bracket expression, so that (e.g.) `[x]'
        !           180: becomes `[xX]' and `[^x]' becomes `[^xX]'.
        !           181: .PP
        !           182: No particular limit is imposed on the length of REs\(dg.
        !           183: Programs intended to be portable should not employ REs longer
        !           184: than 256 bytes,
        !           185: as an implementation can refuse to accept such REs and remain
        !           186: POSIX-compliant.
        !           187: .PP
        !           188: Obsolete (``basic'') regular expressions differ in several respects.
        !           189: `|', `+', and `?' are ordinary characters and there is no equivalent
        !           190: for their functionality.
        !           191: The delimiters for bounds are `\e{' and `\e}',
        !           192: with `{' and `}' by themselves ordinary characters.
        !           193: The parentheses for nested subexpressions are `\e(' and `\e)',
        !           194: with `(' and `)' by themselves ordinary characters.
        !           195: `^' is an ordinary character except at the beginning of the
        !           196: RE or\(dg the beginning of a parenthesized subexpression,
        !           197: `$' is an ordinary character except at the end of the
        !           198: RE or\(dg the end of a parenthesized subexpression,
        !           199: and `*' is an ordinary character if it appears at the beginning of the
        !           200: RE or the beginning of a parenthesized subexpression
        !           201: (after a possible leading `^').
        !           202: Finally, there is one new type of atom, a \fIback reference\fR:
        !           203: `\e' followed by a non-zero decimal digit \fId\fR
        !           204: matches the same sequence of characters
        !           205: matched by the \fId\fRth parenthesized subexpression
        !           206: (numbering subexpressions by the positions of their opening parentheses,
        !           207: left to right),
        !           208: so that (e.g.) `\e([bc]\e)\e1' matches `bb' or `cc' but not `bc'.
        !           209: .SH SEE ALSO
        !           210: regex(3)
        !           211: .PP
        !           212: POSIX 1003.2, section 2.8 (Regular Expression Notation).
        !           213: .SH BUGS
        !           214: Having two kinds of REs is a botch.
        !           215: .PP
        !           216: The current 1003.2 spec says that `)' is an ordinary character in
        !           217: the absence of an unmatched `(';
        !           218: this was an unintentional result of a wording error,
        !           219: and change is likely.
        !           220: Avoid relying on it.
        !           221: .PP
        !           222: Back references are a dreadful botch,
        !           223: posing major problems for efficient implementations.
        !           224: They are also somewhat vaguely defined
        !           225: (does
        !           226: `a\e(\e(b\e)*\e2\e)*d' match `abbbd'?).
        !           227: Avoid using them.
        !           228: .PP
        !           229: 1003.2's specification of case-independent matching is vague.
        !           230: The ``one case implies all cases'' definition given above
        !           231: is current consensus among implementors as to the right interpretation.
        !           232: .PP
        !           233: The syntax for word boundaries is incredibly ugly.

FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>