embedaddon/pcre/doc/pcre.txt - annotate

Annotation of embedaddon/pcre/doc/pcre.txt, revision 1.1.1.4

1.1       misho       1: -----------------------------------------------------------------------------
                      2: This file contains a concatenation of the PCRE man pages, converted to plain
                      3: text format for ease of searching with a text editor, or for use on systems
                      4: that do not have a man page processor. The small individual files that give
                      5: synopses of each function in the library have not been included. Neither has
                      6: the pcredemo program. There are separate text files for the pcregrep and
                      7: pcretest commands.
                      8: -----------------------------------------------------------------------------
                      9: 
                     10: 
1.1.1.4 ! misho      11: PCRE(3)                    Library Functions Manual                    PCRE(3)
        !            12: 
1.1       misho      13: 
                     14: 
                     15: NAME
                     16:        PCRE - Perl-compatible regular expressions
                     17: 
                     18: INTRODUCTION
                     19: 
                     20:        The  PCRE  library is a set of functions that implement regular expres-
                     21:        sion pattern matching using the same syntax and semantics as Perl, with
                     22:        just  a few differences. Some features that appeared in Python and PCRE
                     23:        before they appeared in Perl are also available using the  Python  syn-
                     24:        tax,  there  is  some  support for one or two .NET and Oniguruma syntax
                     25:        items, and there is an option for requesting some  minor  changes  that
                     26:        give better JavaScript compatibility.
                     27: 
1.1.1.2   misho      28:        Starting with release 8.30, it is possible to compile two separate PCRE
                     29:        libraries:  the  original,  which  supports  8-bit  character   strings
                     30:        (including  UTF-8  strings),  and a second library that supports 16-bit
                     31:        character strings (including UTF-16 strings). The build process  allows
                     32:        either  one  or both to be built. The majority of the work to make this
                     33:        possible was done by Zoltan Herczeg.
                     34: 
1.1.1.4 ! misho      35:        Starting with release 8.32 it is possible to compile a  third  separate
        !            36:        PCRE  library  that supports 32-bit character strings (including UTF-32
        !            37:        strings). The build process allows any combination of the 8-,  16-  and
        !            38:        32-bit  libraries. The work to make this possible was done by Christian
        !            39:        Persch.
        !            40: 
        !            41:        The three libraries contain identical sets of  functions,  except  that
        !            42:        the  names  in  the 16-bit library start with pcre16_ instead of pcre_,
        !            43:        and the names in the 32-bit  library  start  with  pcre32_  instead  of
        !            44:        pcre_.  To avoid over-complication and reduce the documentation mainte-
        !            45:        nance load, most of the documentation describes the 8-bit library, with
        !            46:        the  differences  for  the  16-bit and 32-bit libraries described sepa-
        !            47:        rately in the pcre16 and  pcre32  pages.  References  to  functions  or
        !            48:        structures  of  the  form  pcre[16|32]_xxx  should  be  read as meaning
        !            49:        "pcre_xxx when using the  8-bit  library,  pcre16_xxx  when  using  the
        !            50:        16-bit library, or pcre32_xxx when using the 32-bit library".
1.1.1.2   misho      51: 
1.1       misho      52:        The  current implementation of PCRE corresponds approximately with Perl
1.1.1.4 ! misho      53:        5.12, including support for UTF-8/16/32  encoded  strings  and  Unicode
        !            54:        general  category  properties. However, UTF-8/16/32 and Unicode support
        !            55:        has to be explicitly enabled; it is not the default. The Unicode tables
        !            56:        correspond to Unicode release 6.2.0.
1.1       misho      57: 
                     58:        In  addition to the Perl-compatible matching function, PCRE contains an
                     59:        alternative function that matches the same compiled patterns in a  dif-
                     60:        ferent way. In certain circumstances, the alternative function has some
                     61:        advantages.  For a discussion of the two matching algorithms,  see  the
                     62:        pcrematching page.
                     63: 
                     64:        PCRE  is  written  in C and released as a C library. A number of people
                     65:        have written wrappers and interfaces of various kinds.  In  particular,
1.1.1.2   misho      66:        Google  Inc.   have  provided a comprehensive C++ wrapper for the 8-bit
                     67:        library. This is now included as part of  the  PCRE  distribution.  The
                     68:        pcrecpp  page  has  details of this interface. Other people's contribu-
                     69:        tions can be found in the Contrib directory at the  primary  FTP  site,
                     70:        which is:
1.1       misho      71: 
                     72:        ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
                     73: 
1.1.1.2   misho      74:        Details  of  exactly which Perl regular expression features are and are
1.1       misho      75:        not supported by PCRE are given in separate documents. See the pcrepat-
1.1.1.2   misho      76:        tern  and pcrecompat pages. There is a syntax summary in the pcresyntax
1.1       misho      77:        page.
                     78: 
1.1.1.2   misho      79:        Some features of PCRE can be included, excluded, or  changed  when  the
                     80:        library  is  built.  The pcre_config() function makes it possible for a
                     81:        client to discover which features are  available.  The  features  them-
                     82:        selves  are described in the pcrebuild page. Documentation about build-
                     83:        ing PCRE for various operating systems can be found in the  README  and
1.1.1.4 ! misho      84:        NON-AUTOTOOLS_BUILD files in the source distribution.
1.1       misho      85: 
1.1.1.2   misho      86:        The  libraries  contain a number of undocumented internal functions and
                     87:        data tables that are used by more than one  of  the  exported  external
                     88:        functions,  but  which  are  not  intended for use by external callers.
1.1.1.4 ! misho      89:        Their names all begin with "_pcre_" or "_pcre16_" or "_pcre32_",  which
        !            90:        hopefully  will  not provoke any name clashes. In some environments, it
        !            91:        is possible to control which  external  symbols  are  exported  when  a
        !            92:        shared  library  is  built, and in these cases the undocumented symbols
        !            93:        are not exported.
        !            94: 
        !            95: 
        !            96: SECURITY CONSIDERATIONS
        !            97: 
        !            98:        If you are using PCRE in a non-UTF application that  permits  users  to
        !            99:        supply  arbitrary  patterns  for  compilation, you should be aware of a
        !           100:        feature that allows users to turn on UTF support from within a pattern,
        !           101:        provided  that  PCRE  was built with UTF support. For example, an 8-bit
        !           102:        pattern that begins with "(*UTF8)" or "(*UTF)"  turns  on  UTF-8  mode,
        !           103:        which  interprets  patterns and subjects as strings of UTF-8 characters
        !           104:        instead of individual 8-bit characters.  This causes both  the  pattern
        !           105:        and any data against which it is matched to be checked for UTF-8 valid-
        !           106:        ity. If the data string is very long, such a  check  might  use  suffi-
        !           107:        ciently  many  resources  as  to cause your application to lose perfor-
        !           108:        mance.
        !           109: 
        !           110:        One  way  of  guarding  against  this  possibility  is   to   use   the
        !           111:        pcre_fullinfo()  function  to  check the compiled pattern's options for
        !           112:        UTF.  Alternatively, from release 8.33, you can set the  PCRE_NEVER_UTF
        !           113:        option  at compile time. This causes a  compile-time error if a pattern
        !           114:        contains a UTF-setting sequence.
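
The following is a minimal sketch, in C, of a conservative pre-screen for
the UTF-setting sequences described above. It is not part of PCRE and is
no substitute for pcre_fullinfo() or PCRE_NEVER_UTF. The helper name
sketch_has_leading_utf_item is hypothetical, and the sketch assumes that
such sequences only take effect in the run of "(*NAME)" items at the very
start of an 8-bit pattern.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE function. Returns 1 if the run of
   "(*NAME)" items at the very start of an 8-bit pattern contains
   "(*UTF8)" or "(*UTF)", otherwise 0. */
static int sketch_has_leading_utf_item(const char *pattern)
{
    static const char *const utf_items[] = { "UTF8)", "UTF)" };
    const char *p = pattern;

    while (p[0] == '(' && p[1] == '*') {
        const char *close = strchr(p + 2, ')');
        size_t i;

        if (close == NULL)
            return 0;   /* unterminated item: left for the compiler to reject */

        for (i = 0; i < sizeof utf_items / sizeof utf_items[0]; i++) {
            if (strncmp(p + 2, utf_items[i], strlen(utf_items[i])) == 0)
                return 1;
        }
        p = close + 1;  /* move on to the next leading item, if any */
    }
    return 0;
}
```

An application that accepts untrusted patterns could reject any pattern
for which this returns 1 before compiling it. Checking the compiled
pattern's options remains the more reliable test.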
        !           115: 
        !           116:        If your application is one that supports UTF, be  aware  that  validity
        !           117:        checking  can  take time. If the same data string is to be matched many
        !           118:        times, you can use the PCRE_NO_UTF[8|16|32]_CHECK option for the second
        !           119:        and subsequent matches to save redundant checks.
        !           120: 
        !           121:        Another  way  that  performance can be hit is by running a pattern that
        !           122:        has a very large search tree against a string that  will  never  match.
        !           123:        Nested  unlimited  repeats in a pattern are a common example. PCRE pro-
        !           124:        vides some protection against this: see the PCRE_EXTRA_MATCH_LIMIT fea-
        !           125:        ture in the pcreapi page.
1.1       misho     126: 
                    127: 
                    128: USER DOCUMENTATION
                    129: 
1.1.1.2   misho     130:        The  user  documentation  for PCRE comprises a number of different sec-
                    131:        tions. In the "man" format, each of these is a separate "man page".  In
                    132:        the  HTML  format, each is a separate page, linked from the index page.
                    133:        In the plain text format, all the sections, except  the  pcredemo  sec-
1.1       misho     134:        tion, are concatenated, for ease of searching. The sections are as fol-
                    135:        lows:
                    136: 
                    137:          pcre              this document
                    138:          pcre-config       show PCRE installation configuration information
1.1.1.4 ! misho     139:          pcre16            details of the 16-bit library
        !           140:          pcre32            details of the 32-bit library
1.1       misho     141:          pcreapi           details of PCRE's native C API
1.1.1.4 ! misho     142:          pcrebuild         building PCRE
1.1       misho     143:          pcrecallout       details of the callout feature
                    144:          pcrecompat        discussion of Perl compatibility
1.1.1.2   misho     145:          pcrecpp           details of the C++ wrapper for the 8-bit library
1.1       misho     146:          pcredemo          a demonstration C program that uses PCRE
1.1.1.2   misho     147:          pcregrep          description of the pcregrep command (8-bit only)
1.1       misho     148:          pcrejit           discussion of the just-in-time optimization support
                    149:          pcrelimits        details of size and other limits
                    150:          pcrematching      discussion of the two matching algorithms
                    151:          pcrepartial       details of the partial matching facility
                    152:          pcrepattern       syntax and semantics of supported
                    153:                              regular expressions
                    154:          pcreperform       discussion of performance issues
1.1.1.2   misho     155:          pcreposix         the POSIX-compatible C API for the 8-bit library
1.1       misho     156:          pcreprecompile    details of saving and re-using precompiled patterns
                    157:          pcresample        discussion of the pcredemo program
                    158:          pcrestack         discussion of stack usage
                    159:          pcresyntax        quick syntax reference
                    160:          pcretest          description of the pcretest testing command
1.1.1.4 ! misho     161:          pcreunicode       discussion of Unicode and UTF-8/16/32 support
1.1       misho     162: 
1.1.1.2   misho     163:        In addition, in the "man" and HTML formats, there is a short  page  for
1.1.1.4 ! misho     164:        each C library function, listing its arguments and results.
1.1       misho     165: 
                    166: 
                    167: AUTHOR
                    168: 
                    169:        Philip Hazel
                    170:        University Computing Service
                    171:        Cambridge CB2 3QH, England.
                    172: 
1.1.1.2   misho     173:        Putting  an actual email address here seems to have been a spam magnet,
                    174:        so I've taken it away. If you want to email me, use  my  two  initials,
1.1       misho     175:        followed by the two digits 10, at the domain cam.ac.uk.
                    176: 
                    177: 
                    178: REVISION
                    179: 
1.1.1.4 ! misho     180:        Last updated: 13 May 2013
        !           181:        Copyright (c) 1997-2013 University of Cambridge.
1.1.1.2   misho     182: ------------------------------------------------------------------------------
                    183: 
                    184: 
1.1.1.4 ! misho     185: PCRE(3)                    Library Functions Manual                    PCRE(3)
        !           186: 
1.1.1.2   misho     187: 
                    188: 
                    189: NAME
                    190:        PCRE - Perl-compatible regular expressions
                    191: 
                    192:        #include <pcre.h>
                    193: 
                    194: 
                    195: PCRE 16-BIT API BASIC FUNCTIONS
                    196: 
                    197:        pcre16 *pcre16_compile(PCRE_SPTR16 pattern, int options,
                    198:             const char **errptr, int *erroffset,
                    199:             const unsigned char *tableptr);
                    200: 
                    201:        pcre16 *pcre16_compile2(PCRE_SPTR16 pattern, int options,
                    202:             int *errorcodeptr,
                    203:             const char **errptr, int *erroffset,
                    204:             const unsigned char *tableptr);
                    205: 
                    206:        pcre16_extra *pcre16_study(const pcre16 *code, int options,
                    207:             const char **errptr);
                    208: 
                    209:        void pcre16_free_study(pcre16_extra *extra);
                    210: 
                    211:        int pcre16_exec(const pcre16 *code, const pcre16_extra *extra,
                    212:             PCRE_SPTR16 subject, int length, int startoffset,
                    213:             int options, int *ovector, int ovecsize);
                    214: 
                    215:        int pcre16_dfa_exec(const pcre16 *code, const pcre16_extra *extra,
                    216:             PCRE_SPTR16 subject, int length, int startoffset,
                    217:             int options, int *ovector, int ovecsize,
                    218:             int *workspace, int wscount);
                    219: 
                    220: 
                    221: PCRE 16-BIT API STRING EXTRACTION FUNCTIONS
                    222: 
                    223:        int pcre16_copy_named_substring(const pcre16 *code,
                    224:             PCRE_SPTR16 subject, int *ovector,
                    225:             int stringcount, PCRE_SPTR16 stringname,
                    226:             PCRE_UCHAR16 *buffer, int buffersize);
                    227: 
                    228:        int pcre16_copy_substring(PCRE_SPTR16 subject, int *ovector,
                    229:             int stringcount, int stringnumber, PCRE_UCHAR16 *buffer,
                    230:             int buffersize);
                    231: 
                    232:        int pcre16_get_named_substring(const pcre16 *code,
                    233:             PCRE_SPTR16 subject, int *ovector,
                    234:             int stringcount, PCRE_SPTR16 stringname,
                    235:             PCRE_SPTR16 *stringptr);
                    236: 
                    237:        int pcre16_get_stringnumber(const pcre16 *code,
                    238:             PCRE_SPTR16 name);
                    239: 
                    240:        int pcre16_get_stringtable_entries(const pcre16 *code,
                    241:             PCRE_SPTR16 name, PCRE_UCHAR16 **first, PCRE_UCHAR16 **last);
                    242: 
                    243:        int pcre16_get_substring(PCRE_SPTR16 subject, int *ovector,
                    244:             int stringcount, int stringnumber,
                    245:             PCRE_SPTR16 *stringptr);
                    246: 
                    247:        int pcre16_get_substring_list(PCRE_SPTR16 subject,
                    248:             int *ovector, int stringcount, PCRE_SPTR16 **listptr);
                    249: 
                    250:        void pcre16_free_substring(PCRE_SPTR16 stringptr);
                    251: 
                    252:        void pcre16_free_substring_list(PCRE_SPTR16 *stringptr);
                    253: 
                    254: 
                    255: PCRE 16-BIT API AUXILIARY FUNCTIONS
                    256: 
                    257:        pcre16_jit_stack *pcre16_jit_stack_alloc(int startsize, int maxsize);
                    258: 
                    259:        void pcre16_jit_stack_free(pcre16_jit_stack *stack);
                    260: 
                    261:        void pcre16_assign_jit_stack(pcre16_extra *extra,
                    262:             pcre16_jit_callback callback, void *data);
                    263: 
                    264:        const unsigned char *pcre16_maketables(void);
                    265: 
                    266:        int pcre16_fullinfo(const pcre16 *code, const pcre16_extra *extra,
                    267:             int what, void *where);
                    268: 
                    269:        int pcre16_refcount(pcre16 *code, int adjust);
                    270: 
                    271:        int pcre16_config(int what, void *where);
                    272: 
                    273:        const char *pcre16_version(void);
                    274: 
                    275:        int pcre16_pattern_to_host_byte_order(pcre16 *code,
                    276:             pcre16_extra *extra, const unsigned char *tables);
                    277: 
                    278: 
                    279: PCRE 16-BIT API INDIRECTED FUNCTIONS
                    280: 
                    281:        void *(*pcre16_malloc)(size_t);
                    282: 
                    283:        void (*pcre16_free)(void *);
                    284: 
                    285:        void *(*pcre16_stack_malloc)(size_t);
                    286: 
                    287:        void (*pcre16_stack_free)(void *);
                    288: 
                    289:        int (*pcre16_callout)(pcre16_callout_block *);
                    290: 
                    291: 
                    292: PCRE 16-BIT API 16-BIT-ONLY FUNCTION
                    293: 
                    294:        int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output,
                    295:             PCRE_SPTR16 input, int length, int *byte_order,
                    296:             int keep_boms);
                    297: 
                    298: 
                    299: THE PCRE 16-BIT LIBRARY
                    300: 
                    301:        Starting  with  release  8.30, it is possible to compile a PCRE library
                    302:        that supports 16-bit character strings, including  UTF-16  strings,  as
                    303:        well  as  or instead of the original 8-bit library. The majority of the
                    304:        work to make  this  possible  was  done  by  Zoltan  Herczeg.  The  two
                    305:        libraries contain identical sets of functions, used in exactly the same
                    306:        way. Only the names of the functions and the data types of their  argu-
                    307:        ments  and results are different. To avoid over-complication and reduce
                    308:        the documentation maintenance load,  most  of  the  PCRE  documentation
                    309:        describes  the  8-bit  library,  with only occasional references to the
                    310:        16-bit library. This page describes what is different when you use  the
                    311:        16-bit library.
                    312: 
                    313:        WARNING:  A  single  application can be linked with both libraries, but
                    314:        you must take care when processing any particular pattern to use  func-
                    315:        tions  from  just one library. For example, if you want to study a pat-
                    316:        tern that was compiled with  pcre16_compile(),  you  must  do  so  with
                    317:        pcre16_study(), not pcre_study(), and you must free the study data with
                    318:        pcre16_free_study().
                    319: 
                    320: 
                    321: THE HEADER FILE
                    322: 
                    323:        There is only one header file, pcre.h. It contains prototypes  for  all
1.1.1.4 ! misho     324:        the functions in all libraries, as well as definitions of flags, struc-
        !           325:        tures, error codes, etc.
1.1.1.2   misho     326: 
                    327: 
                    328: THE LIBRARY NAME
                    329: 
                    330:        In Unix-like systems, the 16-bit library is called libpcre16,  and  can
                     331:        normally  be  accessed by adding -lpcre16 to the command for linking an
                    332:        application that uses PCRE.
                    333: 
                    334: 
                    335: STRING TYPES
                    336: 
                    337:        In the 8-bit library, strings are passed to PCRE library  functions  as
                    338:        vectors  of  bytes  with  the  C  type "char *". In the 16-bit library,
                    339:        strings are passed as vectors of unsigned 16-bit quantities. The  macro
                    340:        PCRE_UCHAR16  specifies  an  appropriate  data type, and PCRE_SPTR16 is
                    341:        defined as "const PCRE_UCHAR16 *". In very  many  environments,  "short
                    342:        int" is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16
1.1.1.4 ! misho     343:        as "unsigned short int", but checks that it really  is  a  16-bit  data
        !           344:        type.  If  it is not, the build fails with an error message telling the
        !           345:        maintainer to modify the definition appropriately.
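
As an illustration of the kind of build-time check described above, here
is a minimal C sketch. The sketch_ names are stand-ins invented for this
example. PCRE's actual check lives in its own build sources and may be
written differently.

```c
#include <limits.h>

typedef unsigned short int sketch_uchar16;     /* stand-in for PCRE_UCHAR16 */
typedef const sketch_uchar16 *sketch_sptr16;   /* stand-in for PCRE_SPTR16 */

/* The array size is -1, which is a compile error, unless the type is
   exactly 16 bits wide. */
typedef char sketch_uchar16_is_16_bits[
    (sizeof(sketch_uchar16) * CHAR_BIT == 16) ? 1 : -1];
```

On a platform where "short int" is not 16 bits, this fragment fails to
compile, which is the point at which a maintainer would change the
typedef.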
1.1.1.2   misho     346: 
                    347: 
                    348: STRUCTURE TYPES
                    349: 
                    350:        The types of the opaque structures that are used  for  compiled  16-bit
                    351:        patterns  and  JIT stacks are pcre16 and pcre16_jit_stack respectively.
                    352:        The  type  of  the  user-accessible  structure  that  is  returned   by
                    353:        pcre16_study()  is  pcre16_extra, and the type of the structure that is
                    354:        used for passing data to a callout  function  is  pcre16_callout_block.
                    355:        These structures contain the same fields, with the same names, as their
                    356:        8-bit counterparts. The only difference is that pointers  to  character
                    357:        strings are 16-bit instead of 8-bit types.
                    358: 
                    359: 
                    360: 16-BIT FUNCTIONS
                    361: 
                    362:        For  every function in the 8-bit library there is a corresponding func-
                    363:        tion in the 16-bit library with a name that starts with pcre16_ instead
                    364:        of  pcre_.  The  prototypes are listed above. In addition, there is one
                    365:        extra function, pcre16_utf16_to_host_byte_order(). This  is  a  utility
                    366:        function  that converts a UTF-16 character string to host byte order if
                    367:        necessary. The other 16-bit  functions  expect  the  strings  they  are
                    368:        passed to be in host byte order.
                    369: 
                    370:        The input and output arguments of pcre16_utf16_to_host_byte_order() may
                    371:        point to the same address, that is, conversion in place  is  supported.
                    372:        The output buffer must be at least as long as the input.
                    373: 
                    374:        The  length  argument  specifies the number of 16-bit data units in the
                    375:        input string; a negative value specifies a zero-terminated string.
                    376: 
                    377:        If byte_order is NULL, it is assumed that the string starts off in host
                    378:        byte  order. This may be changed by byte-order marks (BOMs) anywhere in
                    379:        the string (commonly as the first character).
                    380: 
                    381:        If byte_order is not NULL, a non-zero value of the integer to which  it
                    382:        points  means  that  the input starts off in host byte order, otherwise
                    383:        the opposite order is assumed. Again, BOMs in  the  string  can  change
                    384:        this. The final byte order is passed back at the end of processing.
                    385: 
                    386:        If  keep_boms  is  not  zero,  byte-order  mark characters (0xfeff) are
                    387:        copied into the output string. Otherwise they are discarded.
                    388: 
                    389:        The result of the function is the number of 16-bit  units  placed  into
                    390:        the  output  buffer,  including  the  zero terminator if the string was
                    391:        zero-terminated.
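
The behaviour described in this section can be sketched in
self-contained C as follows. This is an illustrative re-implementation
written from the description above, not the library's source. The name
sketch_utf16_to_host is hypothetical, and uint16_t stands in for
PCRE_UCHAR16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for pcre16_utf16_to_host_byte_order(). */
static int sketch_utf16_to_host(uint16_t *output, const uint16_t *input,
                                int length, int *byte_order, int keep_boms)
{
    /* NULL means "assume host order"; otherwise non-zero means host order. */
    int host_order = (byte_order == NULL) ? 1 : (*byte_order != 0);
    int written = 0;
    int i;

    assert(output != NULL && input != NULL);

    if (length < 0) {               /* negative length: zero-terminated input */
        length = 0;
        while (input[length] != 0)
            length++;
        length++;                   /* the terminator is converted as well */
    }

    for (i = 0; i < length; i++) {
        uint16_t c = input[i];

        if (c == 0xfeff || c == 0xfffe) {
            /* A BOM anywhere in the string resets the assumed byte order. */
            host_order = (c == 0xfeff);
            if (keep_boms)
                output[written++] = 0xfeff;
        } else {
            output[written++] =
                host_order ? c : (uint16_t)((c >> 8) | (c << 8));
        }
    }

    if (byte_order != NULL)
        *byte_order = host_order;   /* pass back the final byte order */
    return written;                 /* number of 16-bit units written */
}
```

Because each unit is read before anything is written at or beyond its
position, passing the same buffer as input and output is safe in this
sketch.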
                    392: 
                    393: 
                    394: SUBJECT STRING OFFSETS
                    395: 
1.1.1.4 ! misho     396:        The lengths and starting offsets of subject strings must  be  specified
        !           397:        in  16-bit  data units, and the offsets within subject strings that are
        !           398:        returned by the matching functions are also in 16-bit units rather than
        !           399:        bytes.
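
A small C sketch of the unit-versus-byte distinction follows. The helper
names are hypothetical and uint16_t stands in for PCRE_UCHAR16. Nothing
here calls into PCRE.

```c
#include <stddef.h>
#include <stdint.h>

/* Length of a zero-terminated 16-bit string, counted in 16-bit units.
   This is the kind of value to pass as a subject length. */
static int sketch_strlen16(const uint16_t *s)
{
    int n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Convert a count of 16-bit units to bytes, for example when sizing a
   buffer with malloc() or copying with memcpy(). */
static size_t sketch_units_to_bytes(int units)
{
    return (size_t)units * sizeof(uint16_t);
}
```

For the three-character string "cat" stored as 16-bit units, the length
is 3 units, which occupy 6 bytes.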
1.1.1.2   misho     400: 
                    401: 
                    402: NAMED SUBPATTERNS
                    403: 
                    404:        The  name-to-number translation table that is maintained for named sub-
                    405:        patterns uses 16-bit characters.  The  pcre16_get_stringtable_entries()
                    406:        function returns the length of each entry in the table as the number of
                    407:        16-bit data units.
                    408: 
                    409: 
                    410: OPTION NAMES
                    411: 
                    412:        There   are   two   new   general   option   names,   PCRE_UTF16    and
                    413:        PCRE_NO_UTF16_CHECK,     which     correspond    to    PCRE_UTF8    and
                    414:        PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options
1.1.1.3   misho     415:        define  the  same bits in the options word. There is a discussion about
                    416:        the validity of UTF-16 strings in the pcreunicode page.
1.1.1.2   misho     417: 
1.1.1.3   misho     418:        For the pcre16_config() function there is an  option  PCRE_CONFIG_UTF16
                    419:        that  returns  1  if UTF-16 support is configured, otherwise 0. If this
1.1.1.4 ! misho     420:        option  is  given  to  pcre_config()  or  pcre32_config(),  or  if  the
        !           421:        PCRE_CONFIG_UTF8  or  PCRE_CONFIG_UTF32  option is given to pcre16_con-
        !           422:        fig(), the result is the PCRE_ERROR_BADOPTION error.
1.1.1.2   misho     423: 
                    424: 
                    425: CHARACTER CODES
                    426: 
1.1.1.4 ! misho     427:        In 16-bit mode, when  PCRE_UTF16  is  not  set,  character  values  are
1.1.1.2   misho     428:        treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
1.1.1.4 ! misho     429:        that they can range from 0 to 0xffff instead of 0  to  0xff.  Character
        !           430:        types  for characters less than 0xff can therefore be influenced by the
        !           431:        locale in the same way as before.  Characters greater  than  0xff  have
1.1.1.2   misho     432:        only one case, and no "type" (such as letter or digit).
                    433: 
1.1.1.4 ! misho     434:        In  UTF-16  mode,  the  character  code  is  Unicode, in the range 0 to
        !           435:        0x10ffff, with the exception of values in the range  0xd800  to  0xdfff
        !           436:        because  those  are "surrogate" values that are used in pairs to encode
1.1.1.2   misho     437:        values greater than 0xffff.
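
As a worked illustration of how a surrogate pair encodes a value greater
than 0xffff, here is a minimal C sketch. The function names are
hypothetical. The split of the surrogate range into high (0xd800 to
0xdbff) and low (0xdc00 to 0xdfff) halves comes from the UTF-16
definition in the Unicode Standard rather than from this page.

```c
#include <stdint.h>

static int sketch_is_high_surrogate(uint16_t u)
{
    return u >= 0xd800 && u <= 0xdbff;
}

static int sketch_is_low_surrogate(uint16_t u)
{
    return u >= 0xdc00 && u <= 0xdfff;
}

/* Combine a valid high/low pair into a code point in the range
   0x10000 to 0x10ffff. The caller must check validity first. */
static uint32_t sketch_combine_surrogates(uint16_t high, uint16_t low)
{
    return 0x10000u
         + (((uint32_t)high - 0xd800u) << 10)
         + ((uint32_t)low - 0xdc00u);
}
```

For example, the pair 0xd83d, 0xde00 combines to 0x1f600.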
                    438: 
1.1.1.4 ! misho     439:        A UTF-16 string can indicate its endianness by special code knows as  a
1.1.1.2   misho     440:        byte-order mark (BOM). The PCRE functions do not handle this, expecting
1.1.1.4 ! misho     441:        strings  to  be  in  host  byte  order.  A  utility   function   called
        !           442:        pcre16_utf16_to_host_byte_order()  is  provided  to help with this (see
1.1.1.2   misho     443:        above).
                    444: 
                    445: 
                    446: ERROR NAMES
                    447: 
1.1.1.4 ! misho     448:        The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16  corre-
        !           449:        spond  to  their  8-bit  counterparts.  The error PCRE_ERROR_BADMODE is
        !           450:        given when a compiled pattern is passed to a  function  that  processes
        !           451:        patterns  in  the  other  mode, for example, if a pattern compiled with
1.1.1.2   misho     452:        pcre_compile() is passed to pcre16_exec().
                    453: 
1.1.1.4 ! misho     454:        There are new error codes whose names  begin  with  PCRE_UTF16_ERR  for
        !           455:        invalid  UTF-16  strings,  corresponding to the PCRE_UTF8_ERR codes for
        !           456:        UTF-8 strings that are described in the section entitled "Reason  codes
        !           457:        for  invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors
1.1.1.2   misho     458:        are:
                    459: 
                    460:          PCRE_UTF16_ERR1  Missing low surrogate at end of string
                    461:          PCRE_UTF16_ERR2  Invalid low surrogate follows high surrogate
                    462:          PCRE_UTF16_ERR3  Isolated low surrogate
1.1.1.4 ! misho     463:          PCRE_UTF16_ERR4  Non-character
1.1.1.2   misho     464: 
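The four reason codes above can be illustrated with a small stand-alone
checker. This is only a sketch of the classification, not PCRE's own
validation code: the enum names and the function check_utf16() are
hypothetical, the enum values are not the ones defined in pcre.h, and the
non-character test uses the Unicode definition (U+FDD0 to U+FDEF, plus any
code point whose low 16 bits are 0xfffe or 0xffff), which is an assumption
about what the library checks.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reason codes for this sketch; they mirror the list above
   but are NOT the values defined in pcre.h. */
enum utf16_reason {
    UTF16_OK = 0,
    UTF16_MISSING_LOW_AT_END,   /* cf. PCRE_UTF16_ERR1 */
    UTF16_INVALID_LOW,          /* cf. PCRE_UTF16_ERR2 */
    UTF16_ISOLATED_LOW,         /* cf. PCRE_UTF16_ERR3 */
    UTF16_NON_CHARACTER         /* cf. PCRE_UTF16_ERR4 */
};

/* Unicode non-characters: U+FDD0..U+FDEF and U+xxFFFE / U+xxFFFF. */
static int utf16_is_non_character(uint32_t c)
{
    return (c >= 0xfdd0 && c <= 0xfdef) || (c & 0xfffe) == 0xfffe;
}

/* Scan len 16-bit units in host byte order. On failure, store the offset
   (in 16-bit units) of the character that failed and return the reason. */
static enum utf16_reason check_utf16(const uint16_t *s, size_t len,
                                     size_t *erroffset)
{
    size_t i = 0;
    while (i < len) {
        size_t start = i;
        uint32_t c = s[i++];

        if (c >= 0xdc00 && c <= 0xdfff) {        /* low surrogate first */
            *erroffset = start;
            return UTF16_ISOLATED_LOW;
        }
        if (c >= 0xd800 && c <= 0xdbff) {        /* high surrogate */
            uint32_t low;
            if (i >= len) {
                *erroffset = start;
                return UTF16_MISSING_LOW_AT_END;
            }
            low = s[i++];
            if (low < 0xdc00 || low > 0xdfff) {
                *erroffset = start;
                return UTF16_INVALID_LOW;
            }
            /* Combine the pair into a code point above 0xffff. */
            c = 0x10000 + ((c - 0xd800) << 10) + (low - 0xdc00);
        }
        if (utf16_is_non_character(c)) {
            *erroffset = start;
            return UTF16_NON_CHARACTER;
        }
    }
    return UTF16_OK;
}
```

The offset is reported in 16-bit units rather than bytes, in line with how
the 16-bit library expresses subject string offsets.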
                    465: 
                    466: ERROR TEXTS
                    467: 
1.1.1.4 ! misho     468:        If there is an error while compiling a pattern, the error text that  is
        !           469:        passed  back by pcre16_compile() or pcre16_compile2() is still an 8-bit
1.1.1.2   misho     470:        character string, zero-terminated.
                    471: 
                    472: 
                    473: CALLOUTS
                    474: 
1.1.1.4 ! misho     475:        The subject and mark fields in the callout block that is  passed  to  a
1.1.1.2   misho     476:        callout function point to 16-bit vectors.
                    477: 
                    478: 
                    479: TESTING
                    480: 
1.1.1.4 ! misho     481:        The  pcretest  program continues to operate with 8-bit input and output
        !           482:        files, but it can be used for testing the 16-bit library. If it is  run
1.1.1.2   misho     483:        with the command line option -16, patterns and subject strings are con-
                    484:        verted from 8-bit to 16-bit before being passed to PCRE, and the 16-bit
1.1.1.4 ! misho     485:        library  functions  are used instead of the 8-bit ones. Returned 16-bit
        !           486:        strings are converted to 8-bit for output. If neither the 8-bit nor the
        !           487:        32-bit library was compiled, pcretest defaults to 16-bit and the
        !           488:        -16 option is ignored.
1.1.1.2   misho     489: 
1.1.1.3   misho     490:        When PCRE is being built, the RunTest script that is  called  by  "make
1.1.1.4 ! misho     491:        check"  uses  the  pcretest  -C  option to discover which of the 8-bit,
        !           492:        16-bit and 32-bit libraries have been built, and runs the tests
        !           493:        appropriately.
1.1.1.2   misho     494: 
                    495: 
                    496: NOT SUPPORTED IN 16-BIT MODE
                    497: 
                    498:        Not all the features of the 8-bit library are available with the 16-bit
1.1.1.4 ! misho     499:        library. The C++ and POSIX wrapper functions  support  only  the  8-bit
1.1.1.2   misho     500:        library, and the pcregrep program is at present 8-bit only.
                    501: 
                    502: 
                    503: AUTHOR
                    504: 
                    505:        Philip Hazel
                    506:        University Computing Service
                    507:        Cambridge CB2 3QH, England.
                    508: 
                    509: 
                    510: REVISION
                    511: 
1.1.1.4 ! misho     512:        Last updated: 12 May 2013
        !           513:        Copyright (c) 1997-2013 University of Cambridge.
1.1       misho     514: ------------------------------------------------------------------------------
                    515: 
                    516: 
1.1.1.4 ! misho     517: PCRE(3)                    Library Functions Manual                    PCRE(3)
        !           518: 
1.1       misho     519: 
                    520: 
                    521: NAME
                    522:        PCRE - Perl-compatible regular expressions
                    523: 
1.1.1.4 ! misho     524:        #include <pcre.h>
        !           525: 
        !           526: 
        !           527: PCRE 32-BIT API BASIC FUNCTIONS
        !           528: 
        !           529:        pcre32 *pcre32_compile(PCRE_SPTR32 pattern, int options,
        !           530:             const char **errptr, int *erroffset,
        !           531:             const unsigned char *tableptr);
        !           532: 
        !           533:        pcre32 *pcre32_compile2(PCRE_SPTR32 pattern, int options,
        !           534:             int *errorcodeptr,
        !           535:             const char **errptr, int *erroffset,
        !           536:             const unsigned char *tableptr);
        !           537: 
        !           538:        pcre32_extra *pcre32_study(const pcre32 *code, int options,
        !           539:             const char **errptr);
        !           540: 
        !           541:        void pcre32_free_study(pcre32_extra *extra);
        !           542: 
        !           543:        int pcre32_exec(const pcre32 *code, const pcre32_extra *extra,
        !           544:             PCRE_SPTR32 subject, int length, int startoffset,
        !           545:             int options, int *ovector, int ovecsize);
        !           546: 
        !           547:        int pcre32_dfa_exec(const pcre32 *code, const pcre32_extra *extra,
        !           548:             PCRE_SPTR32 subject, int length, int startoffset,
        !           549:             int options, int *ovector, int ovecsize,
        !           550:             int *workspace, int wscount);
        !           551: 
        !           552: 
        !           553: PCRE 32-BIT API STRING EXTRACTION FUNCTIONS
        !           554: 
        !           555:        int pcre32_copy_named_substring(const pcre32 *code,
        !           556:             PCRE_SPTR32 subject, int *ovector,
        !           557:             int stringcount, PCRE_SPTR32 stringname,
        !           558:             PCRE_UCHAR32 *buffer, int buffersize);
        !           559: 
        !           560:        int pcre32_copy_substring(PCRE_SPTR32 subject, int *ovector,
        !           561:             int stringcount, int stringnumber, PCRE_UCHAR32 *buffer,
        !           562:             int buffersize);
        !           563: 
        !           564:        int pcre32_get_named_substring(const pcre32 *code,
        !           565:             PCRE_SPTR32 subject, int *ovector,
        !           566:             int stringcount, PCRE_SPTR32 stringname,
        !           567:             PCRE_SPTR32 *stringptr);
        !           568: 
        !           569:        int pcre32_get_stringnumber(const pcre32 *code,
        !           570:             PCRE_SPTR32 name);
        !           571: 
        !           572:        int pcre32_get_stringtable_entries(const pcre32 *code,
        !           573:             PCRE_SPTR32 name, PCRE_UCHAR32 **first, PCRE_UCHAR32 **last);
        !           574: 
        !           575:        int pcre32_get_substring(PCRE_SPTR32 subject, int *ovector,
        !           576:             int stringcount, int stringnumber,
        !           577:             PCRE_SPTR32 *stringptr);
        !           578: 
        !           579:        int pcre32_get_substring_list(PCRE_SPTR32 subject,
        !           580:             int *ovector, int stringcount, PCRE_SPTR32 **listptr);
        !           581: 
        !           582:        void pcre32_free_substring(PCRE_SPTR32 stringptr);
        !           583: 
        !           584:        void pcre32_free_substring_list(PCRE_SPTR32 *stringptr);
        !           585: 
        !           586: 
        !           587: PCRE 32-BIT API AUXILIARY FUNCTIONS
        !           588: 
        !           589:        pcre32_jit_stack *pcre32_jit_stack_alloc(int startsize, int maxsize);
        !           590: 
        !           591:        void pcre32_jit_stack_free(pcre32_jit_stack *stack);
        !           592: 
        !           593:        void pcre32_assign_jit_stack(pcre32_extra *extra,
        !           594:             pcre32_jit_callback callback, void *data);
        !           595: 
        !           596:        const unsigned char *pcre32_maketables(void);
        !           597: 
        !           598:        int pcre32_fullinfo(const pcre32 *code, const pcre32_extra *extra,
        !           599:             int what, void *where);
        !           600: 
        !           601:        int pcre32_refcount(pcre32 *code, int adjust);
        !           602: 
        !           603:        int pcre32_config(int what, void *where);
        !           604: 
        !           605:        const char *pcre32_version(void);
        !           606: 
        !           607:        int pcre32_pattern_to_host_byte_order(pcre32 *code,
        !           608:             pcre32_extra *extra, const unsigned char *tables);
        !           609: 
        !           610: 
        !           611: PCRE 32-BIT API INDIRECTED FUNCTIONS
        !           612: 
        !           613:        void *(*pcre32_malloc)(size_t);
        !           614: 
        !           615:        void (*pcre32_free)(void *);
        !           616: 
        !           617:        void *(*pcre32_stack_malloc)(size_t);
        !           618: 
        !           619:        void (*pcre32_stack_free)(void *);
        !           620: 
        !           621:        int (*pcre32_callout)(pcre32_callout_block *);
        !           622: 
        !           623: 
        !           624: PCRE 32-BIT API 32-BIT-ONLY FUNCTION
        !           625: 
        !           626:        int pcre32_utf32_to_host_byte_order(PCRE_UCHAR32 *output,
        !           627:             PCRE_SPTR32 input, int length, int *byte_order,
        !           628:             int keep_boms);
        !           629: 
        !           630: 
        !           631: THE PCRE 32-BIT LIBRARY
        !           632: 
        !           633:        Starting  with  release  8.32, it is possible to compile a PCRE library
        !           634:        that supports 32-bit character strings, including  UTF-32  strings,  as
        !           635:        well as or instead of the original 8-bit library. This work was done by
        !           636:        Christian Persch, based on the work done  by  Zoltan  Herczeg  for  the
        !           637:        16-bit  library.  All  three  libraries contain identical sets of func-
        !           638:        tions, used in exactly the same way.  Only the names of  the  functions
        !           639:        and  the  data  types  of their arguments and results are different. To
        !           640:        avoid over-complication and reduce the documentation maintenance  load,
        !           641:        most  of  the PCRE documentation describes the 8-bit library, with only
        !           642:        occasional references to the 16-bit and  32-bit  libraries.  This  page
        !           643:        describes what is different when you use the 32-bit library.
        !           644: 
        !           645:        WARNING:  A  single  application  can  be linked with all or any of the
        !           646:        three libraries, but you must take care when processing any  particular
        !           647:        pattern  to  use  functions  from just one library. For example, if you
        !           648:        want to study a pattern that was compiled  with  pcre32_compile(),  you
        !           649:        must do so with pcre32_study(), not pcre_study(), and you must free the
        !           650:        study data with pcre32_free_study().
        !           651: 
        !           652: 
        !           653: THE HEADER FILE
        !           654: 
        !           655:        There is only one header file, pcre.h. It contains prototypes  for  all
        !           656:        the functions in all libraries, as well as definitions of flags, struc-
        !           657:        tures, error codes, etc.
        !           658: 
        !           659: 
        !           660: THE LIBRARY NAME
        !           661: 
        !           662:        In Unix-like systems, the 32-bit library is called libpcre32,  and  can
        !           663:        normally be accessed by adding -lpcre32 to the command for linking an
        !           664:        application that uses PCRE.
        !           665: 
        !           666: 
        !           667: STRING TYPES
        !           668: 
        !           669:        In the 8-bit library, strings are passed to PCRE library  functions  as
        !           670:        vectors  of  bytes  with  the  C  type "char *". In the 32-bit library,
        !           671:        strings are passed as vectors of unsigned 32-bit quantities. The  macro
        !           672:        PCRE_UCHAR32  specifies  an  appropriate  data type, and PCRE_SPTR32 is
        !           673:        defined as "const PCRE_UCHAR32 *". In very many environments, "unsigned
        !           674:        int" is a 32-bit data type. When PCRE is built, it defines PCRE_UCHAR32
        !           675:        as "unsigned int", but checks that it really is a 32-bit data type.  If
        !           676:        it is not, the build fails with an error message telling the maintainer
        !           677:        to modify the definition appropriately.
        !           678: 
        !           679: 
        !           680: STRUCTURE TYPES
        !           681: 
        !           682:        The types of the opaque structures that are used  for  compiled  32-bit
        !           683:        patterns  and  JIT stacks are pcre32 and pcre32_jit_stack respectively.
        !           684:        The  type  of  the  user-accessible  structure  that  is  returned   by
        !           685:        pcre32_study()  is  pcre32_extra, and the type of the structure that is
        !           686:        used for passing data to a callout  function  is  pcre32_callout_block.
        !           687:        These structures contain the same fields, with the same names, as their
        !           688:        8-bit counterparts. The only difference is that pointers  to  character
        !           689:        strings are 32-bit instead of 8-bit types.
        !           690: 
        !           691: 
        !           692: 32-BIT FUNCTIONS
        !           693: 
        !           694:        For  every function in the 8-bit library there is a corresponding func-
        !           695:        tion in the 32-bit library with a name that starts with pcre32_ instead
        !           696:        of  pcre_.  The  prototypes are listed above. In addition, there is one
        !           697:        extra function, pcre32_utf32_to_host_byte_order(). This  is  a  utility
        !           698:        function  that converts a UTF-32 character string to host byte order if
        !           699:        necessary. The other 32-bit  functions  expect  the  strings  they  are
        !           700:        passed to be in host byte order.
        !           701: 
        !           702:        The input and output arguments of pcre32_utf32_to_host_byte_order() may
        !           703:        point to the same address, that is, conversion in place  is  supported.
        !           704:        The output buffer must be at least as long as the input.
        !           705: 
        !           706:        The  length  argument  specifies the number of 32-bit data units in the
        !           707:        input string; a negative value specifies a zero-terminated string.
        !           708: 
        !           709:        If byte_order is NULL, it is assumed that the string starts off in host
        !           710:        byte  order. This may be changed by byte-order marks (BOMs) anywhere in
        !           711:        the string (commonly as the first character).
        !           712: 
        !           713:        If byte_order is not NULL, a non-zero value of the integer to which  it
        !           714:        points  means  that  the input starts off in host byte order, otherwise
        !           715:        the opposite order is assumed. Again, BOMs in  the  string  can  change
        !           716:        this. The final byte order is passed back at the end of processing.
        !           717: 
        !           718:        If  keep_boms  is  not  zero,  byte-order  mark characters (0xfeff) are
        !           719:        copied into the output string. Otherwise they are discarded.
        !           720: 
        !           721:        The result of the function is the number of 32-bit  units  placed  into
        !           722:        the  output  buffer,  including  the  zero terminator if the string was
        !           723:        zero-terminated.
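
The behaviour described above can be sketched in portable C. The function
below, utf32_to_host_sketch(), is a hypothetical stand-in written from this
description alone; it is not the library's source and should not be taken
as a statement of how pcre32_utf32_to_host_byte_order() is implemented. It
assumes that a byte-swapped BOM appears as 0xfffe0000 when read in host
order.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the four bytes of a 32-bit unit. */
static uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Sketch of the documented behaviour; NOT PCRE's implementation.
   Returns the number of 32-bit units written to output. */
static int utf32_to_host_sketch(uint32_t *output, const uint32_t *input,
                                int length, int *byte_order, int keep_boms)
{
    /* NULL byte_order: assume the input starts in host byte order. */
    int host = (byte_order != NULL) ? (*byte_order != 0) : 1;
    uint32_t *out = output;
    const uint32_t *in = input;
    const uint32_t *end;

    if (length < 0) {
        /* Zero-terminated input: a zero unit is zero in either byte
           order, so it can be found before any swapping is done. The
           terminator is included in the range to be copied. */
        const uint32_t *p = input;
        while (*p != 0) p++;
        end = p + 1;
    } else {
        end = input + length;
    }

    while (in < end) {
        uint32_t c = *in++;
        if (c == 0x0000feffu || c == 0xfffe0000u) {
            /* A BOM anywhere in the string resets the byte order. */
            host = (c == 0x0000feffu);
            if (keep_boms) *out++ = 0x0000feffu;
        } else {
            *out++ = host ? c : swap32(c);
        }
    }

    /* Pass the final byte order back to the caller, if requested. */
    if (byte_order != NULL) *byte_order = host;
    return (int)(out - output);
}
```

Because the write position never gets ahead of the read position, passing
the same pointer for output and input is safe in this sketch, which matches
the in-place conversion that the text says is supported.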
        !           724: 
        !           725: 
        !           726: SUBJECT STRING OFFSETS
        !           727: 
        !           728:        The lengths and starting offsets of subject strings must  be  specified
        !           729:        in  32-bit  data units, and the offsets within subject strings that are
        !           730:        returned by the matching functions are also in 32-bit units rather than
        !           731:        bytes.
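
As a small illustration of working in 32-bit units, the helpers below
convert a unit offset to a byte offset and compute the length of a match
from an offsets vector. Both function names are hypothetical. The layout
assumed for the vector (pair n stored at ovector[2n] and ovector[2n+1]) is
the one described for the 8-bit library in the pcreapi page.

```c
#include <stddef.h>
#include <stdint.h>

/* Offsets count 32-bit units; scale by the unit size to get bytes. */
static size_t units_to_bytes(int unit_offset)
{
    return (size_t)unit_offset * sizeof(uint32_t);
}

/* Length, in 32-bit units, of the substring captured by pair number
   "pair" (0 is the whole match). */
static int match_length_units(const int *ovector, int pair)
{
    return ovector[2 * pair + 1] - ovector[2 * pair];
}
```

A caller that needs to copy a matched substring with memcpy() would pass
units_to_bytes(match_length_units(ovector, 0)) as the byte count.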
        !           732: 
        !           733: 
        !           734: NAMED SUBPATTERNS
        !           735: 
        !           736:        The  name-to-number translation table that is maintained for named sub-
        !           737:        patterns uses 32-bit characters.  The  pcre32_get_stringtable_entries()
        !           738:        function returns the length of each entry in the table as the number of
        !           739:        32-bit data units.
        !           740: 
        !           741: 
        !           742: OPTION NAMES
        !           743: 
        !           744:        There   are   two   new   general   option   names,   PCRE_UTF32    and
        !           745:        PCRE_NO_UTF32_CHECK,     which     correspond    to    PCRE_UTF8    and
        !           746:        PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options
        !           747:        define  the  same bits in the options word. There is a discussion about
        !           748:        the validity of UTF-32 strings in the pcreunicode page.
        !           749: 
        !           750:        For the pcre32_config() function there is an  option  PCRE_CONFIG_UTF32
        !           751:        that  returns  1  if UTF-32 support is configured, otherwise 0. If this
        !           752:        option  is  given  to  pcre_config()  or  pcre16_config(),  or  if  the
        !           753:        PCRE_CONFIG_UTF8  or  PCRE_CONFIG_UTF16  option is given to pcre32_con-
        !           754:        fig(), the result is the PCRE_ERROR_BADOPTION error.
        !           755: 
        !           756: 
        !           757: CHARACTER CODES
        !           758: 
        !           759:        In 32-bit mode, when  PCRE_UTF32  is  not  set,  character  values  are
        !           760:        treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
        !           761:        that they can range from 0 to 0x7fffffff instead of 0 to 0xff.  Charac-
        !           762:        ter  types for characters less than 0xff can therefore be influenced by
        !           763:        the locale in the same way as before.   Characters  greater  than  0xff
        !           764:        have only one case, and no "type" (such as letter or digit).
        !           765: 
        !           766:        In  UTF-32  mode,  the  character  code  is  Unicode, in the range 0 to
        !           767:        0x10ffff, with the exception of values in the range  0xd800  to  0xdfff
        !           768:        because those are "surrogate" values that are ill-formed in UTF-32.
        !           769: 
        !           770:        A UTF-32 string can indicate its endianness by a special code known as a
        !           771:        byte-order mark (BOM). The PCRE functions do not handle this, expecting
        !           772:        strings   to   be  in  host  byte  order.  A  utility  function  called
        !           773:        pcre32_utf32_to_host_byte_order() is provided to help  with  this  (see
        !           774:        above).
        !           775: 
        !           776: 
        !           777: ERROR NAMES
        !           778: 
        !           779:        The  error  PCRE_ERROR_BADUTF32  corresponds  to its 8-bit counterpart.
        !           780:        The error PCRE_ERROR_BADMODE is given when a compiled pattern is passed
        !           781:        to  a  function that processes patterns in the other mode, for example,
        !           782:        if a pattern compiled with pcre_compile() is passed to pcre32_exec().
        !           783: 
        !           784:        There are new error codes whose names  begin  with  PCRE_UTF32_ERR  for
        !           785:        invalid  UTF-32  strings,  corresponding to the PCRE_UTF8_ERR codes for
        !           786:        UTF-8 strings that are described in the section entitled "Reason  codes
        !           787:        for  invalid UTF-8 strings" in the main pcreapi page. The UTF-32 errors
        !           788:        are:
        !           789: 
        !           790:          PCRE_UTF32_ERR1  Surrogate character (range from 0xd800 to 0xdfff)
        !           791:          PCRE_UTF32_ERR2  Non-character
        !           792:          PCRE_UTF32_ERR3  Character > 0x10ffff
        !           793: 
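The three UTF-32 reason codes map onto simple range tests on each 32-bit
unit. The checker below is a sketch of that classification only: the enum
names and check_utf32() are hypothetical, the enum values are not those
from pcre.h, the order of the tests is a choice made for this example, and
the non-character test follows the Unicode definition, which is an
assumption about what the library checks.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reason codes for this sketch; NOT the pcre.h values. */
enum utf32_reason {
    UTF32_OK = 0,
    UTF32_SURROGATE,       /* cf. PCRE_UTF32_ERR1 */
    UTF32_NON_CHARACTER,   /* cf. PCRE_UTF32_ERR2 */
    UTF32_TOO_LARGE        /* cf. PCRE_UTF32_ERR3 */
};

/* Classify one 32-bit unit that is already in host byte order. */
static enum utf32_reason classify_utf32(uint32_t c)
{
    if (c > 0x10ffff) return UTF32_TOO_LARGE;
    if (c >= 0xd800 && c <= 0xdfff) return UTF32_SURROGATE;
    if ((c >= 0xfdd0 && c <= 0xfdef) || (c & 0xfffe) == 0xfffe)
        return UTF32_NON_CHARACTER;
    return UTF32_OK;
}

/* Scan len units; on failure, store the offset (in 32-bit units) of the
   first bad unit and return the reason. */
static enum utf32_reason check_utf32(const uint32_t *s, size_t len,
                                     size_t *erroffset)
{
    size_t i;
    for (i = 0; i < len; i++) {
        enum utf32_reason r = classify_utf32(s[i]);
        if (r != UTF32_OK) {
            *erroffset = i;
            return r;
        }
    }
    return UTF32_OK;
}
```

Since every UTF-32 character occupies exactly one unit, no pairing logic is
needed, unlike the UTF-16 case.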
        !           794: 
        !           795: ERROR TEXTS
        !           796: 
        !           797:        If there is an error while compiling a pattern, the error text that  is
        !           798:        passed  back by pcre32_compile() or pcre32_compile2() is still an 8-bit
        !           799:        character string, zero-terminated.
        !           800: 
        !           801: 
        !           802: CALLOUTS
        !           803: 
        !           804:        The subject and mark fields in the callout block that is  passed  to  a
        !           805:        callout function point to 32-bit vectors.
        !           806: 
        !           807: 
        !           808: TESTING
        !           809: 
        !           810:        The  pcretest  program continues to operate with 8-bit input and output
        !           811:        files, but it can be used for testing the 32-bit library. If it is  run
        !           812:        with the command line option -32, patterns and subject strings are con-
        !           813:        verted from 8-bit to 32-bit before being passed to PCRE, and the 32-bit
        !           814:        library  functions  are used instead of the 8-bit ones. Returned 32-bit
        !           815:        strings are converted to 8-bit for output. If neither the 8-bit nor the
        !           816:        16-bit library was compiled, pcretest defaults to 32-bit and the
        !           817:        -32 option is ignored.
        !           818: 
        !           819:        When PCRE is being built, the RunTest script that is  called  by  "make
        !           820:        check"  uses  the  pcretest  -C  option to discover which of the 8-bit,
        !           821:        16-bit and 32-bit libraries have been built, and runs the tests
        !           822:        appropriately.
        !           823: 
        !           824: 
        !           825: NOT SUPPORTED IN 32-BIT MODE
        !           826: 
        !           827:        Not all the features of the 8-bit library are available with the 32-bit
        !           828:        library. The C++ and POSIX wrapper functions  support  only  the  8-bit
        !           829:        library, and the pcregrep program is at present 8-bit only.
        !           830: 
        !           831: 
        !           832: AUTHOR
        !           833: 
        !           834:        Philip Hazel
        !           835:        University Computing Service
        !           836:        Cambridge CB2 3QH, England.
        !           837: 
        !           838: 
        !           839: REVISION
        !           840: 
        !           841:        Last updated: 12 May 2013
        !           842:        Copyright (c) 1997-2013 University of Cambridge.
        !           843: ------------------------------------------------------------------------------
        !           844: 
        !           845: 
        !           846: PCREBUILD(3)               Library Functions Manual               PCREBUILD(3)
        !           847: 
        !           848: 
        !           849: 
        !           850: NAME
        !           851:        PCRE - Perl-compatible regular expressions
        !           852: 
        !           853: BUILDING PCRE
        !           854: 
        !           855:        PCRE  is  distributed with a configure script that can be used to build
        !           856:        the library in Unix-like environments using the applications  known  as
        !           857:        Autotools.   Also  in  the  distribution  are files to support building
        !           858:        using CMake instead of configure. The text file README contains general
        !           859:        information  about  building  with Autotools (some of which is repeated
        !           860:        below), and also has some comments about building on various  operating
        !           861:        systems.  There  is  a lot more information about building PCRE without
        !           862:        using Autotools (including information about using CMake  and  building
        !           863:        "by  hand")  in  the  text file called NON-AUTOTOOLS-BUILD.  You should
        !           864:        consult this file as well as the README file if you are building  in  a
        !           865:        non-Unix-like environment.
        !           866: 
1.1       misho     867: 
                    868: PCRE BUILD-TIME OPTIONS
                    869: 
1.1.1.4 ! misho     870:        The  rest of this document describes the optional features of PCRE that
        !           871:        can be selected when the library is compiled. It  assumes  use  of  the
        !           872:        configure  script,  where  the  optional features are selected or dese-
        !           873:        lected by providing options to configure before running the  make  com-
        !           874:        mand.  However,  the same options can be selected in both Unix-like and
        !           875:        non-Unix-like environments using the GUI facility of cmake-gui  if  you
        !           876:        are using CMake instead of configure to build PCRE.
        !           877: 
        !           878:        If  you  are not using Autotools or CMake, option selection can be done
        !           879:        by editing the config.h file, or by passing parameter settings  to  the
        !           880:        compiler, as described in NON-AUTOTOOLS-BUILD.
1.1       misho     881: 
                    882:        The complete list of options for configure (which includes the standard
1.1.1.4 ! misho     883:        ones such as the  selection  of  the  installation  directory)  can  be
1.1       misho     884:        obtained by running
                    885: 
                    886:          ./configure --help
                    887: 
1.1.1.4 ! misho     888:        The  following  sections  include  descriptions  of options whose names
1.1       misho     889:        begin with --enable or --disable. These settings specify changes to the
1.1.1.4 ! misho     890:        defaults  for  the configure command. Because of the way that configure
        !           891:        works, --enable and --disable always come in pairs, so  the  complemen-
        !           892:        tary  option always exists as well, but as it specifies the default, it
1.1       misho     893:        is not described.
                    894: 
                    895: 
1.1.1.4 ! misho     896: BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
1.1.1.2   misho     897: 
1.1.1.4 ! misho     898:        By default, a library called libpcre  is  built,  containing  functions
        !           899:        that  take  string  arguments  contained in vectors of bytes, either as
        !           900:        single-byte characters, or interpreted as UTF-8 strings. You  can  also
        !           901:        build  a  separate library, called libpcre16, in which strings are con-
        !           902:        tained in vectors of 16-bit data units and interpreted either  as  sin-
1.1.1.2   misho     903:        gle-unit characters or UTF-16 strings, by adding
                    904: 
                    905:          --enable-pcre16
                    906: 
1.1.1.4 ! misho     907:        to  the  configure  command.  You  can  also build yet another separate
        !           908:        library, called libpcre32, in which strings are contained in vectors of
        !           909:        32-bit  data  units and interpreted either as single-unit characters or
        !           910:        UTF-32 strings, by adding
        !           911: 
        !           912:          --enable-pcre32
        !           913: 
1.1.1.2   misho     914:        to the configure command. If you do not want the 8-bit library, add
                    915: 
                    916:          --disable-pcre8
                    917: 
1.1.1.4 ! misho     918:        as well. At least one of the three libraries must be built.  Note  that
        !           919:        the  C++  and  POSIX  wrappers are for the 8-bit library only, and that
        !           920:        pcregrep is an 8-bit program. None of these are  built  if  you  select
        !           921:        only the 16-bit or 32-bit libraries.
1.1.1.2   misho     922: 
                    923: 
1.1       misho     924: BUILDING SHARED AND STATIC LIBRARIES
                    925: 
1.1.1.4 ! misho     926:        The  Autotools  PCRE building process uses libtool to build both shared
        !           927:        and static libraries by default. You  can  suppress  one  of  these  by
        !           928:        adding one of
1.1       misho     929: 
                    930:          --disable-shared
                    931:          --disable-static
                    932: 
                    933:        to the configure command, as required.
                    934: 
                    935: 
                    936: C++ SUPPORT
                    937: 
1.1.1.2   misho     938:        By  default,  if the 8-bit library is being built, the configure script
                    939:        will search for a C++ compiler and C++ header files. If it finds  them,
                    940:        it  automatically  builds  the C++ wrapper library (which supports only
                    941:        8-bit strings). You can disable this by adding
1.1       misho     942: 
                    943:          --disable-cpp
                    944: 
                    945:        to the configure command.
                    946: 
                    947: 
1.1.1.4 ! misho     948: UTF-8, UTF-16 AND UTF-32 SUPPORT
1.1       misho     949: 
1.1.1.2   misho     950:        To build PCRE with support for UTF Unicode character strings, add
1.1       misho     951: 
1.1.1.2   misho     952:          --enable-utf
1.1       misho     953: 
1.1.1.4 ! misho     954:        to the configure command. This setting applies to all three  libraries,
        !           955:        adding  support  for  UTF-8 to the 8-bit library, support for UTF-16 to
        !           956:        the  16-bit  library,  and  support  for  UTF-32  to  the 32-bit
        !           957:        library.  There  are no separate options for enabling UTF-8, UTF-16 and
        !           958:        UTF-32 independently because that would allow ridiculous settings  such
        !           959:        as  requesting UTF-16 support while building only the 8-bit library. It
        !           960:        is not possible to build one library with UTF support and another with-
        !           961:        out  in the same configuration. (For backwards compatibility, --enable-
        !           962:        utf8 is a synonym of --enable-utf.)
        !           963: 
        !           964:        Of itself, this setting does not make  PCRE  treat  strings  as  UTF-8,
        !           965:        UTF-16  or UTF-32. As well as compiling PCRE with this option, you also
        !           966:        have to set the  PCRE_UTF8,  PCRE_UTF16  or  PCRE_UTF32  option  (as
        !           967:        appropriate) when you call one of the pattern compiling functions.
1.1       misho     968: 
1.1.1.4 ! misho     969:        If  you  set --enable-utf when compiling in an EBCDIC environment, PCRE
        !           970:        expects its input to be either ASCII or UTF-8 (depending  on  the  run-
1.1.1.3   misho     971:        time option). It is not possible to support both EBCDIC and UTF-8 codes
1.1.1.4 ! misho     972:        in the same version of  the  library.  Consequently,  --enable-utf  and
1.1       misho     973:        --enable-ebcdic are mutually exclusive.
                    974: 
                    975: 
                    976: UNICODE CHARACTER PROPERTY SUPPORT
                    977: 
1.1.1.4 ! misho     978:        UTF  support allows the libraries to process character codepoints up to
        !           979:        0x10ffff in the strings that they handle. On its own, however, it  does
1.1.1.2   misho     980:        not provide any facilities for accessing the properties of such charac-
                    981:        ters. If you want to be able to use the pattern escapes \P, \p, and \X,
                    982:        which refer to Unicode character properties, you must add
1.1       misho     983: 
                    984:          --enable-unicode-properties
                    985: 
1.1.1.4 ! misho     986:        to  the  configure  command. This implies UTF support, even if you have
1.1       misho     987:        not explicitly requested it.
                    988: 
1.1.1.4 ! misho     989:        Including Unicode property support adds around 30K  of  tables  to  the
        !           990:        PCRE  library.  Only  the general category properties such as Lu and Nd
1.1       misho     991:        are supported. Details are given in the pcrepattern documentation.
                    992: 
                    993: 
                    994: JUST-IN-TIME COMPILER SUPPORT
                    995: 
                    996:        Just-in-time compiler support is included in the build by specifying
                    997: 
                    998:          --enable-jit
                    999: 
1.1.1.4 ! misho    1000:        This support is available only for certain hardware  architectures.  If
        !          1001:        this  option  is  set  for  an unsupported architecture, a compile time
        !          1002:        error occurs.  See the pcrejit documentation for a  discussion  of  JIT
1.1       misho    1003:        usage. When JIT support is enabled, pcregrep automatically makes use of
                   1004:        it, unless you add
                   1005: 
                   1006:          --disable-pcregrep-jit
                   1007: 
                   1008:        to the "configure" command.
                   1009: 
                   1010: 
                   1011: CODE VALUE OF NEWLINE
                   1012: 
1.1.1.4 ! misho    1013:        By default, PCRE interprets the linefeed (LF) character  as  indicating
        !          1014:        the  end  of  a line. This is the normal newline character on Unix-like
        !          1015:        systems. You can compile PCRE to use carriage return (CR)  instead,  by
1.1       misho    1016:        adding
                   1017: 
                   1018:          --enable-newline-is-cr
                   1019: 
1.1.1.4 ! misho    1020:        to  the  configure  command.  There  is  also  a --enable-newline-is-lf
1.1       misho    1021:        option, which explicitly specifies linefeed as the newline character.
                   1022: 
                   1023:        Alternatively, you can specify that line endings are to be indicated by
                   1024:        the two character sequence CRLF. If you want this, add
                   1025: 
                   1026:          --enable-newline-is-crlf
                   1027: 
                   1028:        to the configure command. There is a fourth option, specified by
                   1029: 
                   1030:          --enable-newline-is-anycrlf
                   1031: 
1.1.1.4 ! misho    1032:        which  causes  PCRE  to recognize any of the three sequences CR, LF, or
1.1       misho    1033:        CRLF as indicating a line ending. Finally, a fifth option, specified by
                   1034: 
                   1035:          --enable-newline-is-any
                   1036: 
                   1037:        causes PCRE to recognize any Unicode newline sequence.
                   1038: 
1.1.1.4 ! misho    1039:        Whatever line ending convention is selected when PCRE is built  can  be
        !          1040:        overridden  when  the library functions are called. At build time it is
1.1       misho    1041:        conventional to use the standard for your operating system.
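
The five conventions can be illustrated with a small stand-alone C sketch.
It is not PCRE's own code: the type nl_mode and the function
newline_length() are hypothetical names, and the input is assumed to be an
ASCII-based 8-bit string that is not interpreted as UTF-8. Under that
assumption, "any" adds only the single characters VT, FF and NEL to CR, LF
and CRLF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, one per --enable-newline-is-* choice. */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY } nl_mode;

/* Returns the length (0, 1 or 2) of the newline sequence that starts at p,
   or 0 if no newline of the selected convention starts there. */
size_t newline_length(const unsigned char *p, size_t remaining, nl_mode mode)
{
    assert(p != NULL);
    if (remaining == 0) return 0;

    int crlf = remaining >= 2 && p[0] == '\r' && p[1] == '\n';

    switch (mode) {
    case NL_CR:
        return p[0] == '\r' ? 1 : 0;
    case NL_LF:
        return p[0] == '\n' ? 1 : 0;
    case NL_CRLF:
        return crlf ? 2 : 0;          /* a lone CR or LF does not count */
    case NL_ANYCRLF:
        if (crlf) return 2;
        return (p[0] == '\r' || p[0] == '\n') ? 1 : 0;
    case NL_ANY:
        if (crlf) return 2;
        switch (p[0]) {
        case '\n': case 0x0b: case 0x0c: case '\r': case 0x85:
            return 1;                 /* LF, VT, FF, CR, NEL */
        default:
            return 0;
        }
    }
    return 0;
}
```

For example, the two bytes CR LF count as one two-character newline under
the CRLF and ANYCRLF conventions, but as no newline at all under LF, because
the first byte is CR.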
                   1042: 
                   1043: 
                   1044: WHAT \R MATCHES
                   1045: 
1.1.1.4 ! misho    1046:        By default, the sequence \R in a pattern matches  any  Unicode  newline
        !          1047:        sequence,  whatever  has  been selected as the line ending sequence. If
1.1       misho    1048:        you specify
                   1049: 
                   1050:          --enable-bsr-anycrlf
                   1051: 
1.1.1.4 ! misho    1052:        the default is changed so that \R matches only CR, LF, or  CRLF.  What-
        !          1053:        ever  is selected when PCRE is built can be overridden when the library
1.1       misho    1054:        functions are called.
                   1055: 
                   1056: 
                   1057: POSIX MALLOC USAGE
                   1058: 
1.1.1.4 ! misho    1059:        When the 8-bit library is called through the POSIX interface  (see  the
        !          1060:        pcreposix  documentation),  additional  working storage is required for
        !          1061:        holding the pointers to capturing  substrings,  because  PCRE  requires
1.1.1.2   misho    1062:        three integers per substring, whereas the POSIX interface provides only
1.1.1.4 ! misho    1063:        two. If the number of expected substrings is small, the  wrapper  func-
        !          1064:        tion  uses  space  on the stack, because this is faster than using mal-
        !          1065:        loc() for each call. The default threshold above which the stack is  no
1.1.1.2   misho    1066:        longer used is 10; it can be changed by adding a setting such as
1.1       misho    1067: 
                   1068:          --with-posix-malloc-threshold=20
                   1069: 
                   1070:        to the configure command.
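
The stack-or-heap decision described above can be sketched in a few lines of
C. This is an illustration only, not the wrapper's actual source: the macro
MALLOC_THRESHOLD and the function with_offset_vector() are hypothetical
names, and the "match" is replaced by a loop that merely fills the vector.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MALLOC_THRESHOLD 10   /* mirrors the documented default of 10 */

/* Obtains room for three integers per expected substring.
   Returns 0 if the stack buffer was used, 1 if malloc() was used,
   and -1 if malloc() failed. */
int with_offset_vector(size_t nmatch)
{
    int small_vector[MALLOC_THRESHOLD * 3];   /* stack storage */
    int *vector = small_vector;
    int on_heap = 0;

    assert(nmatch <= SIZE_MAX / (3 * sizeof(int)));   /* no size overflow */

    if (nmatch > MALLOC_THRESHOLD) {
        vector = malloc(sizeof(int) * 3 * nmatch);
        if (vector == NULL) return -1;
        on_heap = 1;
    }

    /* A real wrapper would run the match here; this sketch only shows
       that the storage is large enough for 3 * nmatch integers. */
    for (size_t i = 0; i < 3 * nmatch; i++) vector[i] = -1;

    if (on_heap) free(vector);
    return on_heap;
}
```

With the default threshold, a request for 10 substrings stays on the stack
and a request for 11 goes to the heap.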
                   1071: 
                   1072: 
                   1073: HANDLING VERY LARGE PATTERNS
                   1074: 
1.1.1.4 ! misho    1075:        Within  a  compiled  pattern,  offset values are used to point from one
        !          1076:        part to another (for example, from an opening parenthesis to an  alter-
        !          1077:        nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
        !          1078:        two-byte values are used for these offsets, leading to a  maximum  size
        !          1079:        for  a compiled pattern of around 64K. This is sufficient to handle all
        !          1080:        but the most gigantic patterns.  Nevertheless, some people do  want  to
        !          1081:        process  truly  enormous patterns, so it is possible to compile PCRE to
        !          1082:        use three-byte or four-byte offsets by adding a setting such as
1.1       misho    1083: 
                   1084:          --with-link-size=3
                   1085: 
1.1.1.4 ! misho    1086:        to the configure command. The value given must be 2, 3, or 4.  For  the
        !          1087:        16-bit  library,  a  value of 3 is rounded up to 4. In these libraries,
        !          1088:        using longer offsets slows down the operation of PCRE because it has to
        !          1089:        load  additional  data  when  handling them. For the 32-bit library the
        !          1090:        value is always 4 and cannot be overridden; the value  of  --with-link-
        !          1091:        size is ignored.
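
The rounding rules and the arithmetic behind the "around 64K" figure can be
written out as a short C sketch. The function names are hypothetical, and
offset_range() gives only the raw range of an n-byte offset; any further
limits that PCRE itself enforces are not modelled.

```c
#include <assert.h>

/* Link size actually used for a requested --with-link-size value
   (2, 3 or 4) in the library with the given data unit width. */
int effective_link_size(int requested, int unit_bits)
{
    assert(requested >= 2 && requested <= 4);
    assert(unit_bits == 8 || unit_bits == 16 || unit_bits == 32);

    if (unit_bits == 32) return 4;                    /* always 4 */
    if (unit_bits == 16 && requested == 3) return 4;  /* 3 rounds up to 4 */
    return requested;
}

/* Number of distinct values an offset of the given byte width can hold. */
unsigned long long offset_range(int link_size_bytes)
{
    assert(link_size_bytes >= 2 && link_size_bytes <= 4);
    return 1ULL << (8 * link_size_bytes);
}
```

A two-byte offset can hold 65536 distinct values, which is where the default
limit of around 64K comes from; a three-byte offset raises this to 16777216.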
1.1       misho    1092: 
                   1093: 
                   1094: AVOIDING EXCESSIVE STACK USAGE
                   1095: 
                   1096:        When matching with the pcre_exec() function, PCRE implements backtrack-
1.1.1.4 ! misho    1097:        ing by making recursive calls to an internal function  called  match().
        !          1098:        In  environments  where  the size of the stack is limited, this can se-
        !          1099:        verely limit PCRE's operation. (The Unix environment does  not  usually
1.1       misho    1100:        suffer from this problem, but it may sometimes be necessary to increase
1.1.1.4 ! misho    1101:        the maximum stack size.  There is a discussion in the  pcrestack  docu-
        !          1102:        mentation.)  An alternative approach to recursion that uses memory from
        !          1103:        the heap to remember data, instead of using recursive  function  calls,
        !          1104:        has  been  implemented to work round the problem of limited stack size.
1.1       misho    1105:        If you want to build a version of PCRE that works this way, add
                   1106: 
                   1107:          --disable-stack-for-recursion
                   1108: 
1.1.1.4 ! misho    1109:        to the configure command. With this configuration, PCRE  will  use  the
        !          1110:        pcre_stack_malloc  and pcre_stack_free variables to call memory manage-
        !          1111:        ment functions. By default these point to malloc() and free(), but  you
1.1       misho    1112:        can replace the pointers so that your own functions are used instead.
                   1113: 
1.1.1.4 ! misho    1114:        Separate  functions  are  provided  rather  than  using pcre_malloc and
        !          1115:        pcre_free because the  usage  is  very  predictable:  the  block  sizes
        !          1116:        requested  are  always  the  same,  and  the blocks are always freed in
        !          1117:        reverse order. A calling program might be able to  implement  optimized
        !          1118:        functions  that  perform  better  than  malloc()  and free(). PCRE runs
1.1       misho    1119:        noticeably more slowly when built in this way. This option affects only
                   1120:        the pcre_exec() function; it is not relevant for pcre_dfa_exec().
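
Because the requests are all the same size and are released in reverse
order, a replacement allocator can be little more than a stack of fixed-size
blocks. The following C sketch shows one possible shape. All names and both
constants are hypothetical, the block size is an assumption that would have
to be at least the size PCRE actually requests, and the code is not
thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define POOL_BLOCKS     64    /* hypothetical capacity */
#define POOL_BLOCK_SIZE 512   /* hypothetical; must cover the requested size */

typedef union {
    max_align_t   align;                  /* keeps each block aligned */
    unsigned char bytes[POOL_BLOCK_SIZE];
} pool_block;

static pool_block pool[POOL_BLOCKS];
static size_t pool_top = 0;               /* number of pool blocks in use */

/* True if p points into the static pool. */
static int in_pool(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    return a >= (uintptr_t)pool && a < (uintptr_t)(pool + POOL_BLOCKS);
}

/* Same signature as malloc(): hands out the next pool block, or falls
   back to malloc() when the pool is full or the request is too large. */
void *lifo_alloc(size_t size)
{
    if (size <= POOL_BLOCK_SIZE && pool_top < POOL_BLOCKS)
        return pool[pool_top++].bytes;
    return malloc(size);
}

/* Same signature as free(): pops the top pool block, or passes a
   fallback block on to free(). */
void lifo_free(void *block)
{
    if (pool_top > 0 && block == (void *)pool[pool_top - 1].bytes) {
        pool_top--;
        return;
    }
    assert(!in_pool(block));   /* a pool block freed out of order is a bug */
    free(block);
}
```

In a PCRE built with --disable-stack-for-recursion, functions of this shape
could be installed by assigning them to pcre_stack_malloc and
pcre_stack_free before any matching is done.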
                   1121: 
                   1122: 
                   1123: LIMITING PCRE RESOURCE USAGE
                   1124: 
1.1.1.4 ! misho    1125:        Internally,  PCRE has a function called match(), which it calls repeat-
        !          1126:        edly  (sometimes  recursively)  when  matching  a  pattern   with   the
        !          1127:        pcre_exec()  function.  By controlling the maximum number of times this
        !          1128:        function may be called during a single matching operation, a limit  can
        !          1129:        be  placed  on  the resources used by a single call to pcre_exec(). The
        !          1130:        limit can be changed at run time, as described in the pcreapi  documen-
        !          1131:        tation.  The default is 10 million, but this can be changed by adding a
1.1       misho    1132:        setting such as
                   1133: 
                   1134:          --with-match-limit=500000
                   1135: 
1.1.1.4 ! misho    1136:        to  the  configure  command.  This  setting  has  no  effect   on   the
1.1       misho    1137:        pcre_dfa_exec() matching function.
                   1138: 
1.1.1.4 ! misho    1139:        In  some  environments  it is desirable to limit the depth of recursive
1.1       misho    1140:        calls of match() more strictly than the total number of calls, in order
1.1.1.4 ! misho    1141:        to  restrict  the maximum amount of stack (or heap, if --disable-stack-
1.1       misho    1142:        for-recursion is specified) that is used. A second limit controls this;
1.1.1.4 ! misho    1143:        it  defaults  to  the  value  that is set for --with-match-limit, which
        !          1144:        imposes no additional constraints. However, you can set a  lower  limit
1.1       misho    1145:        by adding, for example,
                   1146: 
                   1147:          --with-match-limit-recursion=10000
                   1148: 
1.1.1.4 ! misho    1149:        to  the  configure  command.  This  value can also be overridden at run
1.1       misho    1150:        time.
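
The difference between the two limits can be seen in a toy backtracking
matcher. The C sketch below is not PCRE's match() function: every name in it
is hypothetical, and the "pattern" language is only literal characters plus
a glob-style * that stands for any run of characters. One counter caps the
total number of calls and a second check caps the nesting depth.

```c
#include <assert.h>
#include <stddef.h>

enum {
    TOY_NOMATCH            =  0,
    TOY_MATCH              =  1,
    TOY_ERR_MATCHLIMIT     = -1,   /* too many calls in total */
    TOY_ERR_RECURSIONLIMIT = -2    /* calls nested too deeply */
};

typedef struct {
    unsigned long match_limit;       /* cap on the total number of calls */
    unsigned long recursion_limit;   /* cap on the nesting depth */
    unsigned long calls;             /* calls made so far */
} toy_limits;

static int toy_match(const char *pat, const char *subj,
                     unsigned long depth, toy_limits *lim)
{
    if (++lim->calls > lim->match_limit) return TOY_ERR_MATCHLIMIT;
    if (depth > lim->recursion_limit) return TOY_ERR_RECURSIONLIMIT;

    if (*pat == '\0') return *subj == '\0' ? TOY_MATCH : TOY_NOMATCH;

    if (*pat == '*') {
        /* Try every possible length for the wildcard, shortest first. */
        for (const char *s = subj; ; s++) {
            int rc = toy_match(pat + 1, s, depth + 1, lim);
            if (rc != TOY_NOMATCH) return rc;   /* match or error */
            if (*s == '\0') return TOY_NOMATCH;
        }
    }

    if (*subj != '\0' && *pat == *subj)
        return toy_match(pat + 1, subj + 1, depth + 1, lim);
    return TOY_NOMATCH;
}

/* Runs one matching operation under the two limits. */
int toy_run(const char *pat, const char *subj,
            unsigned long match_limit, unsigned long recursion_limit)
{
    assert(pat != NULL && subj != NULL);
    toy_limits lim = { match_limit, recursion_limit, 0 };
    return toy_match(pat, subj, 0, &lim);
}
```

A pattern with several wildcards that cannot match, such as *a*a*a*b
against a run of twenty "a" characters, makes a large number of calls
without nesting deeply, so it is stopped by the first limit rather than the
second.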
                   1151: 
                   1152: 
                   1153: CREATING CHARACTER TABLES AT BUILD TIME
                   1154: 
1.1.1.4 ! misho    1155:        PCRE uses fixed tables for processing characters whose code values  are
        !          1156:        less  than 256. By default, PCRE is built with a set of tables that are
        !          1157:        distributed in the file pcre_chartables.c.dist. These  tables  are  for
1.1       misho    1158:        ASCII codes only. If you add
                   1159: 
                   1160:          --enable-rebuild-chartables
                   1161: 
1.1.1.4 ! misho    1162:        to  the  configure  command, the distributed tables are no longer used.
        !          1163:        Instead, a program called dftables is compiled and  run.  This  outputs
1.1       misho    1164:        the source for a new set of tables, created in the default locale of your
1.1.1.4 ! misho    1165:        C run-time system. (This method of replacing the tables does  not  work
        !          1166:        if  you are cross compiling, because dftables is run on the local host.
1.1.1.3   misho    1167:        If you need to create alternative tables when cross compiling, you will
1.1       misho    1168:        have to do so "by hand".)
                   1169: 
                   1170: 
                   1171: USING EBCDIC CODE
                   1172: 
1.1.1.4 ! misho    1173:        PCRE  assumes  by  default that it will run in an environment where the
        !          1174:        character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
        !          1175:        This  is  the  case for most computer operating systems. PCRE can, how-
1.1       misho    1176:        ever, be compiled to run in an EBCDIC environment by adding
                   1177: 
                   1178:          --enable-ebcdic
                   1179: 
                   1180:        to the configure command. This setting implies --enable-rebuild-charta-
1.1.1.4 ! misho    1181:        bles.  You  should  only  use  it if you know that you are in an EBCDIC
        !          1182:        environment (for example,  an  IBM  mainframe  operating  system).  The
1.1.1.2   misho    1183:        --enable-ebcdic option is incompatible with --enable-utf.
1.1       misho    1184: 
1.1.1.4 ! misho    1185:        The EBCDIC character that corresponds to an ASCII LF is assumed to have
        !          1186:        the value 0x15 by default. However, in some EBCDIC  environments,  0x25
        !          1187:        is used. In such an environment you should use
        !          1188: 
        !          1189:          --enable-ebcdic-nl25
        !          1190: 
        !          1191:        as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
        !          1192:        has the same value as in ASCII, namely, 0x0d.  Whichever  of  0x15  and
        !          1193:        0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
        !          1194:        acter (which, in Unicode, is 0x85).
        !          1195: 
        !          1196:        The options that select newline behaviour, such as --enable-newline-is-
        !          1197:        cr, and equivalent run-time options, refer to these character values in
        !          1198:        an EBCDIC environment.
        !          1199: 
1.1       misho    1200: 
                   1201: PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
                   1202: 
                   1203:        By default, pcregrep reads all files as plain text. You can build it so
                   1204:        that it recognizes files whose names end in .gz or .bz2, and reads them
                   1205:        with libz or libbz2, respectively, by adding one or both of
                   1206: 
                   1207:          --enable-pcregrep-libz
                   1208:          --enable-pcregrep-libbz2
                   1209: 
                   1210:        to the configure command. These options naturally require that the rel-
1.1.1.2   misho    1211:        evant  libraries  are installed on your system. Configuration will fail
1.1       misho    1212:        if they are not.
                   1213: 
                   1214: 
                   1215: PCREGREP BUFFER SIZE
                   1216: 
1.1.1.2   misho    1217:        pcregrep uses an internal buffer to hold a "window" on the file  it  is
1.1       misho    1218:        scanning, in order to be able to output "before" and "after" lines when
1.1.1.2   misho    1219:        it finds a match. The size of the buffer is controlled by  a  parameter
1.1       misho    1220:        whose default value is 20K. The buffer itself is three times this size,
                   1221:        but because of the way it is used for holding "before" lines, the long-
1.1.1.2   misho    1222:        est  line  that  is guaranteed to be processable is the parameter size.
1.1       misho    1223:        You can change the default parameter value by adding, for example,
                   1224: 
                   1225:          --with-pcregrep-bufsize=50K
                   1226: 
                   1227:        to the configure command. The caller of pcregrep can, however, override
                   1228:        this value by specifying a run-time option.
                   1229: 
                   1230: 
                   1231: PCRETEST OPTION FOR LIBREADLINE SUPPORT
                   1232: 
                   1233:        If you add
                   1234: 
                   1235:          --enable-pcretest-libreadline
                   1236: 
1.1.1.2   misho    1237:        to  the  configure  command,  pcretest  is  linked with the libreadline
                   1238:        library, and when its input is from a terminal, it reads it  using  the
1.1       misho    1239:        readline() function. This provides line-editing and history facilities.
                   1240:        Note that libreadline is GPL-licensed, so if you distribute a binary of
                   1241:        pcretest linked in this way, there may be licensing issues.
                   1242: 
1.1.1.2   misho    1243:        Setting  this  option  causes  the -lreadline option to be added to the
                   1244:        pcretest build. In many operating environments with a  system-installed
1.1       misho    1245:        libreadline this is sufficient. However, in some environments (e.g.  if
1.1.1.2   misho    1246:        an unmodified distribution version of readline is in use),  some  extra
                   1247:        configuration  may  be necessary. The INSTALL file for libreadline says
1.1       misho    1248:        this:
                   1249: 
                   1250:          "Readline uses the termcap functions, but does not link with the
                   1251:          termcap or curses library itself, allowing applications which link
                   1252:          with readline the to choose an appropriate library."
                   1253: 
1.1.1.2   misho    1254:        If your environment has not been set up so that an appropriate  library
1.1       misho    1255:        is automatically included, you may need to add something like
                   1256: 
                   1257:          LIBS="-lncurses"
                   1258: 
                   1259:        immediately before the configure command.
                   1260: 
                   1261: 
1.1.1.4 ! misho    1262: DEBUGGING WITH VALGRIND SUPPORT
        !          1263: 
        !          1264:        By adding the
        !          1265: 
        !          1266:          --enable-valgrind
        !          1267: 
        !          1268:        option  to  the  configure command, PCRE will use valgrind annotations
        !          1269:        to mark certain memory regions as  unaddressable.  This  allows  it  to
        !          1270:        detect invalid memory accesses, and is mostly useful for debugging PCRE
        !          1271:        itself.
        !          1272: 
        !          1273: 
        !          1274: CODE COVERAGE REPORTING
        !          1275: 
        !          1276:        If your C compiler is gcc, you can build a version  of  PCRE  that  can
        !          1277:        generate a code coverage report for its test suite. To enable this, you
        !          1278:        must install lcov version 1.6 or above. Then specify
        !          1279: 
        !          1280:          --enable-coverage
        !          1281: 
        !          1282:        to the configure command and build PCRE in the usual way.
        !          1283: 
        !          1284:        Note that using ccache (a caching C compiler) is incompatible with code
        !          1285:        coverage  reporting. If you have configured ccache to run automatically
        !          1286:        on your system, you must set the environment variable
        !          1287: 
        !          1288:          CCACHE_DISABLE=1
        !          1289: 
        !          1290:        before running make to build PCRE, so that ccache is not used.
        !          1291: 
        !          1292:        When --enable-coverage is used, the following  additional  targets  are
        !          1293:        added to the Makefile:
        !          1294: 
        !          1295:          make coverage
        !          1296: 
        !          1297:        This  creates  a  fresh  coverage report for the PCRE test suite. It is
        !          1298:        equivalent to running "make coverage-reset", "make  coverage-baseline",
        !          1299:        "make check", and then "make coverage-report".
        !          1300: 
        !          1301:          make coverage-reset
        !          1302: 
        !          1303:        This zeroes the coverage counters, but does nothing else.
        !          1304: 
        !          1305:          make coverage-baseline
        !          1306: 
        !          1307:        This captures baseline coverage information.
        !          1308: 
        !          1309:          make coverage-report
        !          1310: 
        !          1311:        This creates the coverage report.
        !          1312: 
        !          1313:          make coverage-clean-report
        !          1314: 
        !          1315:        This  removes the generated coverage report without cleaning the cover-
        !          1316:        age data itself.
        !          1317: 
        !          1318:          make coverage-clean-data
        !          1319: 
        !          1320:        This removes the captured coverage data without removing  the  coverage
        !          1321:        files created at compile time (*.gcno).
        !          1322: 
        !          1323:          make coverage-clean
        !          1324: 
        !          1325:        This  cleans all coverage data including the generated coverage report.
        !          1326:        For more information about code coverage, see the gcov and  lcov  docu-
        !          1327:        mentation.
        !          1328: 
        !          1329: 
1.1       misho    1330: SEE ALSO
                   1331: 
1.1.1.4 ! misho    1332:        pcreapi(3), pcre16, pcre32, pcre_config(3).
1.1       misho    1333: 
                   1334: 
                   1335: AUTHOR
                   1336: 
                   1337:        Philip Hazel
                   1338:        University Computing Service
                   1339:        Cambridge CB2 3QH, England.
                   1340: 
                   1341: 
                   1342: REVISION
                   1343: 
1.1.1.4 ! misho    1344:        Last updated: 12 May 2013
        !          1345:        Copyright (c) 1997-2013 University of Cambridge.
1.1       misho    1346: ------------------------------------------------------------------------------
                   1347: 
                   1348: 
1.1.1.4 ! misho    1349: PCREMATCHING(3)            Library Functions Manual            PCREMATCHING(3)
        !          1350: 
1.1       misho    1351: 
                   1352: 
                   1353: NAME
                   1354:        PCRE - Perl-compatible regular expressions
                   1355: 
                   1356: PCRE MATCHING ALGORITHMS
                   1357: 
                   1358:        This document describes the two different algorithms that are available
                   1359:        in PCRE for matching a compiled regular expression against a given sub-
                   1360:        ject  string.  The  "standard"  algorithm  is  the  one provided by the
1.1.1.4 ! misho    1361:        pcre_exec(), pcre16_exec() and pcre32_exec() functions. These  work  in
        !          1362:        the same way as Perl's matching function, and provide a Perl-compatible
        !          1363:        matching  operation.   The  just-in-time  (JIT)  optimization  that  is
        !          1364:        described  in  the pcrejit documentation is compatible with these func-
        !          1365:        tions.
        !          1366: 
        !          1367:        An  alternative  algorithm  is   provided   by   the   pcre_dfa_exec(),
        !          1368:        pcre16_dfa_exec()  and  pcre32_dfa_exec()  functions; they operate in a
        !          1369:        different way, and are not Perl-compatible. This alternative has advan-
        !          1370:        tages and disadvantages compared with the standard algorithm, and these
        !          1371:        are described below.
1.1       misho    1372: 
                   1373:        When there is only one possible way in which a given subject string can
                   1374:        match  a pattern, the two algorithms give the same answer. A difference
                   1375:        arises, however, when there are multiple possibilities. For example, if
                   1376:        the pattern
                   1377: 
                   1378:          ^<.*>
                   1379: 
                   1380:        is matched against the string
                   1381: 
                   1382:          <something> <something else> <something further>
                   1383: 
                   1384:        there are three possible answers. The standard algorithm finds only one
                   1385:        of them, whereas the alternative algorithm finds all three.
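
For this particular pattern the three answers are easy to enumerate by hand,
which the following C sketch does without calling PCRE at all. The function
all_match_ends() is a hypothetical name. It hard-codes the pattern ^<.*> and
assumes that "." stops at a newline, and it lists the candidate matches
longest first.

```c
#include <assert.h>
#include <stddef.h>

/* Stores the end offset of every match of ^<.*> at the start of subject,
   longest first, in ends[]. Returns the number of offsets stored, which
   is never more than max. */
size_t all_match_ends(const char *subject, size_t *ends, size_t max)
{
    assert(subject != NULL && ends != NULL);
    if (subject[0] != '<') return 0;

    /* "." can consume characters up to a newline or the end of string. */
    size_t limit = 1;
    while (subject[limit] != '\0' && subject[limit] != '\n') limit++;

    /* Every '>' after the opening '<' ends one possible match. */
    size_t count = 0;
    for (size_t i = limit; i-- > 1; ) {
        if (subject[i] == '>' && count < max)
            ends[count++] = i + 1;
    }
    return count;
}
```

For the subject shown above this yields three end offsets. With a greedy
quantifier the standard algorithm would report only the first (longest) of
them, and with an ungreedy one only the last (shortest).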
                   1386: 
                   1387: 
                   1388: REGULAR EXPRESSIONS AS TREES
                   1389: 
                   1390:        The set of strings that are matched by a regular expression can be rep-
                   1391:        resented  as  a  tree structure. An unlimited repetition in the pattern
                   1392:        makes the tree of infinite size, but it is still a tree.  Matching  the
                   1393:        pattern  to a given subject string (from a given starting point) can be
                   1394:        thought of as a search of the tree.  There are two  ways  to  search  a
                   1395:        tree:  depth-first  and  breadth-first, and these correspond to the two
                   1396:        matching algorithms provided by PCRE.
                   1397: 
                   1398: 
                   1399: THE STANDARD MATCHING ALGORITHM
                   1400: 
                   1401:        In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
                   1402:        sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
                   1403:        depth-first search of the pattern tree. That is, it  proceeds  along  a
                   1404:        single path through the tree, checking that the subject matches what is
                   1405:        required. When there is a mismatch, the algorithm  tries  any  alterna-
                   1406:        tives  at  the  current point, and if they all fail, it backs up to the
                   1407:        previous branch point in the  tree,  and  tries  the  next  alternative
                   1408:        branch  at  that  level.  This often involves backing up (moving to the
                   1409:        left) in the subject string as well.  The  order  in  which  repetition
                   1410:        branches  are  tried  is controlled by the greedy or ungreedy nature of
                   1411:        the quantifier.
                   1412: 
                   1413:        If a leaf node is reached, a matching string has  been  found,  and  at
                   1414:        that  point the algorithm stops. Thus, if there is more than one possi-
                   1415:        ble match, this algorithm returns the first one that it finds.  Whether
                   1416:        this  is the shortest, the longest, or some intermediate length depends
                   1417:        on the way the greedy and ungreedy repetition quantifiers are specified
                   1418:        in the pattern.
                   1419: 
                   1420:        Because  it  ends  up  with a single path through the tree, it is rela-
                   1421:        tively straightforward for this algorithm to keep  track  of  the  sub-
                   1422:        strings  that  are  matched  by portions of the pattern in parentheses.
                   1423:        This provides support for capturing parentheses and back references.
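
The depth-first search described above can be illustrated with a toy matcher. This is a minimal sketch and not PCRE's implementation: the function name dfs_match and the tiny pattern syntax (literal characters, "." for any character, and a postfix "*") are assumptions made for illustration. It returns the number of subject characters consumed by the first leaf it reaches, so the greedy flag alone decides whether the longest or the shortest match is reported.

```c
#include <stddef.h>

/* Does pattern character p match subject character s? */
static int single(char p, char s)
{
    return s != '\0' && (p == '.' || p == s);
}

/* Depth-first match of pat against the start of sub.
   Returns the number of subject characters consumed by the first
   successful path, or -1 if every path fails. */
int dfs_match(const char *pat, const char *sub, int greedy)
{
    int max = 0, n, rest;

    if (pat[0] == '\0')
        return 0;                       /* leaf node: a match was found */

    if (pat[1] == '*') {
        /* Branch point: count how many repetitions are possible. */
        while (single(pat[0], sub[max]))
            max++;
        if (greedy) {
            /* Try the longest repetition first, then back up. */
            for (n = max; n >= 0; n--) {
                rest = dfs_match(pat + 2, sub + n, greedy);
                if (rest >= 0)
                    return n + rest;
            }
        } else {
            /* Try the shortest repetition first, then extend. */
            for (n = 0; n <= max; n++) {
                rest = dfs_match(pat + 2, sub + n, greedy);
                if (rest >= 0)
                    return n + rest;
            }
        }
        return -1;                      /* all alternatives failed */
    }

    if (single(pat[0], sub[0])) {
        rest = dfs_match(pat + 1, sub + 1, greedy);
        return rest < 0 ? -1 : 1 + rest;
    }
    return -1;
}
```

For example, calling dfs_match("a.*c", "abcbc", 1) follows the greedy branch order, while passing 0 as the last argument makes the same search stop at the first "c".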
                   1424: 
                   1425: 
                   1426: THE ALTERNATIVE MATCHING ALGORITHM
                   1427: 
                   1428:        This algorithm conducts a breadth-first search of  the  tree.  Starting
                   1429:        from  the  first  matching  point  in the subject, it scans the subject
                   1430:        string from left to right, once, character by character, and as it does
                   1431:        this,  it remembers all the paths through the tree that represent valid
                   1432:        matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
                   1433:        though  it is not implemented as a traditional finite state machine (it
                   1434:        keeps multiple states active simultaneously).
                   1435: 
                   1436:        Although the general principle of this matching algorithm  is  that  it
                   1437:        scans  the subject string only once, without backtracking, there is one
                   1438:        exception: when a lookaround assertion is encountered,  the  characters
                   1439:        following  or  preceding  the  current  point  have to be independently
                   1440:        inspected.
                   1441: 
                   1442:        The scan continues until either the end of the subject is  reached,  or
                   1443:        there  are  no more unterminated paths. At this point, terminated paths
                   1444:        represent the different matching possibilities (if there are none,  the
                   1445:        match  has  failed).   Thus,  if there is more than one possible match,
                   1446:        this algorithm finds all of them, and in particular, it finds the long-
                   1447:        est.  The  matches are returned in decreasing order of length. There is
                   1448:        an option to stop the algorithm after the first match (which is  neces-
                   1449:        sarily the shortest) is found.
                   1450: 
                   1451:        Note that all the matches that are found start at the same point in the
                   1452:        subject. If the pattern
                   1453: 
                   1454:          cat(er(pillar)?)?
                   1455: 
                   1456:        is matched against the string "the caterpillar catchment",  the  result
                   1457:        will  be the three strings "caterpillar", "cater", and "cat" that start
                   1458:        at the fifth character of the subject. The algorithm does not automati-
                   1459:        cally move on to find matches that start at later positions.
                   1460: 
                   1461:        There are a number of features of PCRE regular expressions that are not
                   1462:        supported by the alternative matching algorithm. They are as follows:
                   1463: 
                   1464:        1. Because the algorithm finds all  possible  matches,  the  greedy  or
                   1465:        ungreedy  nature  of repetition quantifiers is not relevant. Greedy and
                   1466:        ungreedy quantifiers are treated in exactly the same way. However, pos-
                   1467:        sessive  quantifiers can make a difference when what follows could also
                   1468:        match what is quantified, for example in a pattern like this:
                   1469: 
                   1470:          ^a++\w!
                   1471: 
                   1472:        This pattern matches "aaab!" but not "aaa!", which would be matched  by
                   1473:        a  non-possessive quantifier. Similarly, if an atomic group is present,
                   1474:        it is matched as if it were a standalone pattern at the current  point,
                   1475:        and  the  longest match is then "locked in" for the rest of the overall
                   1476:        pattern.
                   1477: 
                   1478:        2. When dealing with multiple paths through the tree simultaneously, it
                   1479:        is  not  straightforward  to  keep track of captured substrings for the
                   1480:        different matching possibilities, and  PCRE's  implementation  of  this
                   1481:        algorithm does not attempt to do this. This means that no captured sub-
                   1482:        strings are available.
                   1483: 
                   1484:        3. Because no substrings are captured, back references within the  pat-
                   1485:        tern are not supported, and cause errors if encountered.
                   1486: 
                   1487:        4.  For  the same reason, conditional expressions that use a back  ref-
                   1488:        erence as the condition or test for a specific group recursion are  not
                   1489:        supported.
                   1490: 
                   1491:        5.  Because  many  paths  through the tree may be active, the \K escape
                   1492:        sequence, which resets the start of the match when encountered (but may
                   1493:        be  on  some  paths  and not on others), is not supported. It causes an
                   1494:        error if encountered.
                   1495: 
                   1496:        6. Callouts are supported, but the value of the  capture_top  field  is
                   1497:        always 1, and the value of the capture_last field is always -1.
                   1498: 
1.1.1.2   misho    1499:        7.  The  \C  escape  sequence, which (in the standard algorithm) always
1.1.1.4 ! misho    1500:        matches a single data unit, even in UTF-8, UTF-16 or UTF-32  modes,  is
        !          1501:        not  supported  in these modes, because the alternative algorithm moves
        !          1502:        through the subject string one character (not data unit) at a time, for
        !          1503:        all active paths through the tree.
1.1       misho    1504: 
1.1.1.2   misho    1505:        8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
                   1506:        are not supported. (*FAIL) is supported, and  behaves  like  a  failing
1.1       misho    1507:        negative assertion.
                   1508: 
                   1509: 
                   1510: ADVANTAGES OF THE ALTERNATIVE ALGORITHM
                   1511: 
1.1.1.2   misho    1512:        Using  the alternative matching algorithm provides the following advan-
1.1       misho    1513:        tages:
                   1514: 
                   1515:        1. All possible matches (at a single point in the subject) are automat-
1.1.1.2   misho    1516:        ically  found,  and  in particular, the longest match is found. To find
1.1       misho    1517:        more than one match using the standard algorithm, you have to do kludgy
                   1518:        things with callouts.
                   1519: 
1.1.1.2   misho    1520:        2.  Because  the  alternative  algorithm  scans the subject string just
                   1521:        once, and never needs to backtrack (except for lookbehinds), it is pos-
                   1522:        sible  to  pass  very  long subject strings to the matching function in
                   1523:        several pieces, checking for partial matching each time. Although it is
                   1524:        possible  to  do multi-segment matching using the standard algorithm by
                   1525:        retaining partially matched substrings, it  is  more  complicated.  The
                   1526:        pcrepartial  documentation  gives  details of partial matching and dis-
                   1527:        cusses multi-segment matching.
1.1       misho    1528: 
                   1529: 
                   1530: DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
                   1531: 
                   1532:        The alternative algorithm suffers from a number of disadvantages:
                   1533: 
1.1.1.2   misho    1534:        1. It is substantially slower than  the  standard  algorithm.  This  is
                   1535:        partly  because  it has to search for all possible matches, but is also
1.1       misho    1536:        because it is less susceptible to optimization.
                   1537: 
                   1538:        2. Capturing parentheses and back references are not supported.
                   1539: 
                   1540:        3. Although atomic groups are supported, their use does not provide the
                   1541:        performance advantage that it does for the standard algorithm.
                   1542: 
                   1543: 
                   1544: AUTHOR
                   1545: 
                   1546:        Philip Hazel
                   1547:        University Computing Service
                   1548:        Cambridge CB2 3QH, England.
                   1549: 
                   1550: 
                   1551: REVISION
                   1552: 
1.1.1.2   misho    1553:        Last updated: 08 January 2012
                   1554:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    1555: ------------------------------------------------------------------------------
                   1556: 
                   1557: 
1.1.1.4 ! misho    1558: PCREAPI(3)                 Library Functions Manual                 PCREAPI(3)
        !          1559: 
1.1       misho    1560: 
                   1561: 
                   1562: NAME
                   1563:        PCRE - Perl-compatible regular expressions
                   1564: 
1.1.1.2   misho    1565:        #include <pcre.h>
1.1       misho    1566: 
                   1567: 
1.1.1.2   misho    1568: PCRE NATIVE API BASIC FUNCTIONS
1.1       misho    1569: 
                   1570:        pcre *pcre_compile(const char *pattern, int options,
                   1571:             const char **errptr, int *erroffset,
                   1572:             const unsigned char *tableptr);
                   1573: 
                   1574:        pcre *pcre_compile2(const char *pattern, int options,
                   1575:             int *errorcodeptr,
                   1576:             const char **errptr, int *erroffset,
                   1577:             const unsigned char *tableptr);
                   1578: 
                   1579:        pcre_extra *pcre_study(const pcre *code, int options,
                   1580:             const char **errptr);
                   1581: 
                   1582:        void pcre_free_study(pcre_extra *extra);
                   1583: 
                   1584:        int pcre_exec(const pcre *code, const pcre_extra *extra,
                   1585:             const char *subject, int length, int startoffset,
                   1586:             int options, int *ovector, int ovecsize);
                   1587: 
                   1588:        int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
                   1589:             const char *subject, int length, int startoffset,
                   1590:             int options, int *ovector, int ovecsize,
                   1591:             int *workspace, int wscount);
                   1592: 
1.1.1.2   misho    1593: 
                   1594: PCRE NATIVE API STRING EXTRACTION FUNCTIONS
                   1595: 
1.1       misho    1596:        int pcre_copy_named_substring(const pcre *code,
                   1597:             const char *subject, int *ovector,
                   1598:             int stringcount, const char *stringname,
                   1599:             char *buffer, int buffersize);
                   1600: 
                   1601:        int pcre_copy_substring(const char *subject, int *ovector,
                   1602:             int stringcount, int stringnumber, char *buffer,
                   1603:             int buffersize);
                   1604: 
                   1605:        int pcre_get_named_substring(const pcre *code,
                   1606:             const char *subject, int *ovector,
                   1607:             int stringcount, const char *stringname,
                   1608:             const char **stringptr);
                   1609: 
                   1610:        int pcre_get_stringnumber(const pcre *code,
                   1611:             const char *name);
                   1612: 
                   1613:        int pcre_get_stringtable_entries(const pcre *code,
                   1614:             const char *name, char **first, char **last);
                   1615: 
                   1616:        int pcre_get_substring(const char *subject, int *ovector,
                   1617:             int stringcount, int stringnumber,
                   1618:             const char **stringptr);
                   1619: 
                   1620:        int pcre_get_substring_list(const char *subject,
                   1621:             int *ovector, int stringcount, const char ***listptr);
                   1622: 
                   1623:        void pcre_free_substring(const char *stringptr);
                   1624: 
                   1625:        void pcre_free_substring_list(const char **stringptr);
                   1626: 
1.1.1.2   misho    1627: 
                   1628: PCRE NATIVE API AUXILIARY FUNCTIONS
                   1629: 
1.1.1.4 ! misho    1630:        int pcre_jit_exec(const pcre *code, const pcre_extra *extra,
        !          1631:             const char *subject, int length, int startoffset,
        !          1632:             int options, int *ovector, int ovecsize,
        !          1633:             pcre_jit_stack *jstack);
        !          1634: 
1.1.1.2   misho    1635:        pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
                   1636: 
                   1637:        void pcre_jit_stack_free(pcre_jit_stack *stack);
                   1638: 
                   1639:        void pcre_assign_jit_stack(pcre_extra *extra,
                   1640:             pcre_jit_callback callback, void *data);
                   1641: 
1.1       misho    1642:        const unsigned char *pcre_maketables(void);
                   1643: 
                   1644:        int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
                   1645:             int what, void *where);
                   1646: 
                   1647:        int pcre_refcount(pcre *code, int adjust);
                   1648: 
                   1649:        int pcre_config(int what, void *where);
                   1650: 
1.1.1.2   misho    1651:        const char *pcre_version(void);
                   1652: 
                   1653:        int pcre_pattern_to_host_byte_order(pcre *code,
                   1654:             pcre_extra *extra, const unsigned char *tables);
1.1       misho    1655: 
                   1656: 
                   1657: PCRE NATIVE API INDIRECTED FUNCTIONS
                   1658: 
                   1659:        void *(*pcre_malloc)(size_t);
                   1660: 
                   1661:        void (*pcre_free)(void *);
                   1662: 
                   1663:        void *(*pcre_stack_malloc)(size_t);
                   1664: 
                   1665:        void (*pcre_stack_free)(void *);
                   1666: 
                   1667:        int (*pcre_callout)(pcre_callout_block *);
                   1668: 
                   1669: 
1.1.1.4 ! misho    1670: PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
1.1.1.2   misho    1671: 
1.1.1.4 ! misho    1672:        As  well  as  support  for  8-bit character strings, PCRE also supports
        !          1673:        16-bit strings (from release 8.30) and  32-bit  strings  (from  release
        !          1674:        8.32),  by means of two additional libraries. They can be built as well
        !          1675:        as, or instead of, the 8-bit library. To avoid too  much  complication,
        !          1676:        this  document describes the 8-bit versions of the functions, with only
        !          1677:        occasional references to the 16-bit and 32-bit libraries.
        !          1678: 
        !          1679:        The 16-bit and 32-bit functions operate in the same way as their  8-bit
        !          1680:        counterparts;  they  just  use different data types for their arguments
        !          1681:        and results, and their names start with pcre16_ or pcre32_  instead  of
        !          1682:        pcre_.  For  every  option  that  has  UTF8  in  its name (for example,
        !          1683:        PCRE_UTF8), there are corresponding 16-bit and 32-bit names  with  UTF8
        !          1684:        replaced by UTF16 or UTF32, respectively. This facility is in fact just
        !          1685:        cosmetic; the 16-bit and 32-bit option names define the same  bit  val-
1.1.1.2   misho    1686:        ues.
                   1687: 
                   1688:        References to bytes and UTF-8 in this document should be read as refer-
1.1.1.4 ! misho    1689:        ences to 16-bit data units and UTF-16 when using the 16-bit library, or
        !          1690:        32-bit  data  units  and  UTF-32  when using the 32-bit library, unless
        !          1691:        specified otherwise.  More details of the specific differences for  the
        !          1692:        16-bit and 32-bit libraries are given in the pcre16 and pcre32 pages.
1.1.1.2   misho    1693: 
                   1694: 
1.1       misho    1695: PCRE API OVERVIEW
                   1696: 
                   1697:        PCRE has its own native API, which is described in this document. There
1.1.1.4 ! misho    1698:        are also some wrapper functions (for the 8-bit library only) that  cor-
        !          1699:        respond  to  the  POSIX  regular  expression  API, but they do not give
        !          1700:        access to all the functionality. They are described  in  the  pcreposix
        !          1701:        documentation.  Both  of these APIs define a set of C function calls. A
1.1.1.2   misho    1702:        C++ wrapper (again for the 8-bit library only) is also distributed with
                   1703:        PCRE. It is documented in the pcrecpp page.
1.1       misho    1704: 
1.1.1.4 ! misho    1705:        The  native  API  C  function prototypes are defined in the header file
        !          1706:        pcre.h, and on Unix-like systems the (8-bit) library itself  is  called
        !          1707:        libpcre.  It  can  normally be accessed by adding -lpcre to the command
        !          1708:        for linking an application that uses PCRE. The header file defines  the
1.1.1.2   misho    1709:        macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
1.1.1.4 ! misho    1710:        numbers for the library. Applications can use these to include  support
1.1       misho    1711:        for different releases of PCRE.
                   1712: 
                   1713:        In a Windows environment, if you want to statically link an application
1.1.1.4 ! misho    1714:        program against a non-dll pcre.a  file,  you  must  define  PCRE_STATIC
        !          1715:        before  including  pcre.h or pcrecpp.h, because otherwise the pcre_mal-
1.1       misho    1716:        loc()   and   pcre_free()   exported   functions   will   be   declared
                   1717:        __declspec(dllimport), with unwanted results.
                   1718: 
1.1.1.4 ! misho    1719:        The   functions   pcre_compile(),  pcre_compile2(),  pcre_study(),  and
        !          1720:        pcre_exec() are used for compiling and matching regular expressions  in
        !          1721:        a  Perl-compatible  manner. A sample program that demonstrates the sim-
        !          1722:        plest way of using them is provided in the file  called  pcredemo.c  in
1.1       misho    1723:        the PCRE source distribution. A listing of this program is given in the
1.1.1.4 ! misho    1724:        pcredemo documentation, and the pcresample documentation describes  how
1.1       misho    1725:        to compile and run it.
                   1726: 
1.1.1.4 ! misho    1727:        Just-in-time  compiler  support is an optional feature of PCRE that can
1.1       misho    1728:        be built in appropriate hardware environments. It greatly speeds up the
1.1.1.4 ! misho    1729:        matching  performance  of  many  patterns.  Simple  programs can easily
        !          1730:        request that it be used if available, by  setting  an  option  that  is
        !          1731:        ignored  when  it is not relevant. More complicated programs might need
        !          1732:        to    make    use    of    the    functions     pcre_jit_stack_alloc(),
        !          1733:        pcre_jit_stack_free(),  and pcre_assign_jit_stack() in order to control
        !          1734:        the JIT code's memory usage.
        !          1735: 
        !          1736:        From release 8.32 there is also a direct interface for  JIT  execution,
        !          1737:        which  gives  improved performance. The JIT-specific functions are dis-
        !          1738:        cussed in the pcrejit documentation.
1.1       misho    1739: 
                   1740:        A second matching function, pcre_dfa_exec(), which is not Perl-compati-
                   1741:        ble,  is  also provided. This uses a different algorithm for the match-
                   1742:        ing. The alternative algorithm finds all possible matches (at  a  given
                   1743:        point  in  the  subject), and scans the subject just once (unless there
                   1744:        are lookbehind assertions). However, this  algorithm  does  not  return
                   1745:        captured  substrings.  A description of the two matching algorithms and
                   1746:        their advantages and disadvantages is given in the  pcrematching  docu-
                   1747:        mentation.
                   1748: 
                   1749:        In  addition  to  the  main compiling and matching functions, there are
                   1750:        convenience functions for extracting captured substrings from a subject
                   1751:        string that is matched by pcre_exec(). They are:
                   1752: 
                   1753:          pcre_copy_substring()
                   1754:          pcre_copy_named_substring()
                   1755:          pcre_get_substring()
                   1756:          pcre_get_named_substring()
                   1757:          pcre_get_substring_list()
                   1758:          pcre_get_stringnumber()
                   1759:          pcre_get_stringtable_entries()
                   1760: 
                   1761:        pcre_free_substring() and pcre_free_substring_list() are also provided,
                   1762:        to free the memory used for extracted strings.
                   1763: 
                   1764:        The function pcre_maketables() is used to  build  a  set  of  character
                   1765:        tables   in   the   current   locale  for  passing  to  pcre_compile(),
                   1766:        pcre_exec(), or pcre_dfa_exec(). This is an optional facility  that  is
                   1767:        provided  for  specialist  use.  Most  commonly,  no special tables are
                   1768:        passed, in which case internal tables that are generated when  PCRE  is
                   1769:        built are used.
                   1770: 
                   1771:        The  function  pcre_fullinfo()  is used to find out information about a
1.1.1.2   misho    1772:        compiled pattern. The function pcre_version() returns a  pointer  to  a
                   1773:        string containing the version of PCRE and its date of release.
1.1       misho    1774: 
                   1775:        The  function  pcre_refcount()  maintains  a  reference count in a data
                   1776:        block containing a compiled pattern. This is provided for  the  benefit
                   1777:        of object-oriented applications.
                   1778: 
                   1779:        The  global  variables  pcre_malloc and pcre_free initially contain the
                   1780:        entry points of the standard malloc()  and  free()  functions,  respec-
                   1781:        tively. PCRE calls the memory management functions via these variables,
                   1782:        so a calling program can replace them if it  wishes  to  intercept  the
                   1783:        calls. This should be done before calling any PCRE functions.
                   1784: 
                   1785:        The  global  variables  pcre_stack_malloc  and pcre_stack_free are also
                   1786:        indirections to memory management functions.  These  special  functions
                   1787:        are  used  only  when  PCRE is compiled to use the heap for remembering
                   1788:        data, instead of recursive function calls, when running the pcre_exec()
                   1789:        function.  See  the  pcrebuild  documentation  for details of how to do
                   1790:        this. It is a non-standard way of building PCRE, for  use  in  environ-
                   1791:        ments  that  have  limited stacks. Because of the greater use of memory
                   1792:        management, it runs more slowly. Separate  functions  are  provided  so
                   1793:        that  special-purpose  external  code  can  be used for this case. When
                   1794:        used, these functions are always called in a  stack-like  manner  (last
                   1795:        obtained,  first freed), and always for memory blocks of the same size.
                   1796:        There is a discussion about PCRE's stack usage in the  pcrestack  docu-
                   1797:        mentation.
                   1798: 
                   1799:        The global variable pcre_callout initially contains NULL. It can be set
                   1800:        by the caller to a "callout" function, which PCRE  will  then  call  at
                   1801:        specified  points during a matching operation. Details are given in the
                   1802:        pcrecallout documentation.
                   1803: 
                   1804: 
                   1805: NEWLINES
                   1806: 
                   1807:        PCRE supports five different conventions for indicating line breaks  in
                   1808:        strings:  a  single  CR (carriage return) character, a single LF (line-
                   1809:        feed) character, the two-character sequence CRLF, any of the three pre-
                   1810:        ceding,  or any Unicode newline sequence. The Unicode newline sequences
                   1811:        are the three just mentioned, plus the single characters  VT  (vertical
1.1.1.3   misho    1812:        tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
1.1       misho    1813:        separator, U+2028), and PS (paragraph separator, U+2029).
                   1814: 
                   1815:        Each of the first three conventions is used by at least  one  operating
                   1816:        system  as its standard newline sequence. When PCRE is built, a default
                   1817:        can be specified. If none is specified, the default is LF, which is the
                   1818:        Unix standard. When PCRE is run, the default can be overridden, either when a
                   1819:        pattern is compiled, or when it is matched.
                   1820: 
                   1821:        At compile time, the newline convention can be specified by the options
                   1822:        argument  of  pcre_compile(), or it can be specified by special text at
                   1823:        the start of the pattern itself; this overrides any other settings. See
                   1824:        the pcrepattern page for details of the special character sequences.
                   1825: 
                   1826:        In the PCRE documentation the word "newline" is used to mean "the char-
                   1827:        acter or pair of characters that indicate a line break". The choice  of
                   1828:        newline  convention  affects  the  handling of the dot, circumflex, and
                   1829:        dollar metacharacters, the handling of #-comments in /x mode, and, when
                   1830:        CRLF  is a recognized line ending sequence, the match position advance-
                   1831:        ment for a non-anchored pattern. There is more detail about this in the
                   1832:        section on pcre_exec() options below.
                   1833: 
                   1834:        The  choice of newline convention does not affect the interpretation of
                   1835:        the \n or \r escape sequences, nor does  it  affect  what  \R  matches,
                   1836:        which is controlled in a similar way, but by separate options.
                   1837: 
                   1838: 
                   1839: MULTITHREADING
                   1840: 
                   1841:        The  PCRE  functions  can  be  used in multithreaded applications, with
                   1842:        the  proviso  that  the  memory  management  functions  pointed  to  by
                   1843:        pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
                   1844:        callout function pointed to by pcre_callout, are shared by all threads.
                   1845: 
                   1846:        The compiled form of a regular expression is not altered during  match-
                   1847:        ing, so the same compiled pattern can safely be used by several threads
                   1848:        at once.
                   1849: 
                   1850:        If the just-in-time optimization feature is being used, it needs  sepa-
                   1851:        rate  memory stack areas for each thread. See the pcrejit documentation
                   1852:        for more details.
                   1853: 
                   1854: 
                   1855: SAVING PRECOMPILED PATTERNS FOR LATER USE
                   1856: 
                   1857:        The compiled form of a regular expression can be saved and re-used at a
                   1858:        later  time,  possibly by a different program, and even on a host other
                   1859:        than the one on which  it  was  compiled.  Details  are  given  in  the
1.1.1.2   misho    1860:        pcreprecompile  documentation,  which  includes  a  description  of the
                   1861:        pcre_pattern_to_host_byte_order() function. However, compiling a  regu-
                   1862:        lar  expression  with one version of PCRE for use with a different ver-
                   1863:        sion is not guaranteed to work and may cause crashes.
1.1       misho    1864: 
                   1865: 
                   1866: CHECKING BUILD-TIME OPTIONS
                   1867: 
                   1868:        int pcre_config(int what, void *where);
                   1869: 
1.1.1.2   misho    1870:        The function pcre_config() makes it possible for a PCRE client to  dis-
1.1       misho    1871:        cover which optional features have been compiled into the PCRE library.
1.1.1.2   misho    1872:        The pcrebuild documentation has more details about these optional  fea-
1.1       misho    1873:        tures.
                   1874: 
1.1.1.2   misho    1875:        The  first  argument  for pcre_config() is an integer, specifying which
1.1       misho    1876:        information is required; the second argument is a pointer to a variable
1.1.1.2   misho    1877:        into  which  the  information  is placed. The returned value is zero on
                   1878:        success, or the negative error code PCRE_ERROR_BADOPTION if  the  value
                   1879:        in  the  first argument is not recognized. The following information is
1.1       misho    1880:        available:
                   1881: 
                   1882:          PCRE_CONFIG_UTF8
                   1883: 
1.1.1.2   misho    1884:        The output is an integer that is set to one if UTF-8 support is  avail-
1.1.1.4 ! misho    1885:        able;  otherwise it is set to zero. This value should normally be given
        !          1886:        to the 8-bit version of this function, pcre_config(). If it is given to
        !          1887:        the   16-bit  or  32-bit  version  of  this  function,  the  result  is
1.1.1.2   misho    1888:        PCRE_ERROR_BADOPTION.
                   1889: 
                   1890:          PCRE_CONFIG_UTF16
                   1891: 
                   1892:        The output is an integer that is set to one if UTF-16 support is avail-
1.1.1.4 ! misho    1893:        able;  otherwise it is set to zero. This value should normally be given
1.1.1.2   misho    1894:        to the 16-bit version of this function, pcre16_config(). If it is given
1.1.1.4 ! misho    1895:        to  the  8-bit  or  32-bit  version  of  this  function,  the result is
        !          1896:        PCRE_ERROR_BADOPTION.
        !          1897: 
        !          1898:          PCRE_CONFIG_UTF32
        !          1899: 
        !          1900:        The output is an integer that is set to one if UTF-32 support is avail-
        !          1901:        able;  otherwise it is set to zero. This value should normally be given
        !          1902:        to the 32-bit version of this function, pcre32_config(). If it is given
        !          1903:        to  the  8-bit  or  16-bit  version  of  this  function,  the result is
        !          1904:        PCRE_ERROR_BADOPTION.
1.1       misho    1905: 
                   1906:          PCRE_CONFIG_UNICODE_PROPERTIES
                   1907: 
1.1.1.4 ! misho    1908:        The output is an integer that is set to  one  if  support  for  Unicode
1.1       misho    1909:        character properties is available; otherwise it is set to zero.
                   1910: 
                   1911:          PCRE_CONFIG_JIT
                   1912: 
                   1913:        The output is an integer that is set to one if support for just-in-time
                   1914:        compiling is available; otherwise it is set to zero.
                   1915: 
1.1.1.2   misho    1916:          PCRE_CONFIG_JITTARGET
                   1917: 
1.1.1.4 ! misho    1918:        The output is a pointer to a zero-terminated "const char *" string.  If
1.1.1.2   misho    1919:        JIT support is available, the string contains the name of the architec-
1.1.1.4 ! misho    1920:        ture for which the JIT compiler is configured, for example  "x86  32bit
        !          1921:        (little  endian  +  unaligned)".  If  JIT support is not available, the
1.1.1.2   misho    1922:        result is NULL.
                   1923: 
1.1       misho    1924:          PCRE_CONFIG_NEWLINE
                   1925: 
1.1.1.4 ! misho    1926:        The output is an integer whose value specifies  the  default  character
        !          1927:        sequence  that  is recognized as meaning "newline". The values that are
        !          1928:        supported in ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338
        !          1929:        for  CRLF,  -2 for ANYCRLF, and -1 for ANY. In EBCDIC environments, CR,
        !          1930:        ANYCRLF, and ANY yield the same values. However, the value  for  LF  is
        !          1931:        normally  21, though some EBCDIC environments use 37. The corresponding
        !          1932:        values for CRLF are 3349 and 3365. The default should  normally  corre-
1.1       misho    1933:        spond to the standard sequence for your operating system.
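
The values listed above for PCRE_CONFIG_NEWLINE can be turned into
readable names with a small lookup. The sketch below is illustrative
only: newline_config_name() and pack_pair() are hypothetical helpers,
not PCRE functions, and the lookup covers the ASCII/Unicode values
only. The CRLF values quoted in the text are consistent with packing
the CR code into the high byte and the LF code into the low byte,
which pack_pair() reproduces.

```c
#include <assert.h>

/* Hypothetical helper (not part of PCRE): map a value reported for
   PCRE_CONFIG_NEWLINE in an ASCII/Unicode build to a readable name. */
const char *newline_config_name(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return "unknown";
    }
}

/* Hypothetical helper: pack two character codes as first * 256 + second. */
int pack_pair(int first, int second)
{
    assert(first >= 0 && first < 256 && second >= 0 && second < 256);
    return (first << 8) | second;
}
```

With these helpers, pack_pair(13, 10) gives 3338, and pack_pair(13, 21)
and pack_pair(13, 37) give the two EBCDIC CRLF values, 3349 and 3365.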
                   1934: 
                   1935:          PCRE_CONFIG_BSR
                   1936: 
                   1937:        The output is an integer whose value indicates what character sequences
1.1.1.4 ! misho    1938:        the \R escape sequence matches by default. A value of 0 means  that  \R
        !          1939:        matches  any  Unicode  line ending sequence; a value of 1 means that \R
1.1       misho    1940:        matches only CR, LF, or CRLF. The default can be overridden when a pat-
                   1941:        tern is compiled or matched.
                   1942: 
                   1943:          PCRE_CONFIG_LINK_SIZE
                   1944: 
1.1.1.4 ! misho    1945:        The  output  is  an  integer that contains the number of bytes used for
1.1.1.2   misho    1946:        internal  linkage  in  compiled  regular  expressions.  For  the  8-bit
                   1947:        library, the value can be 2, 3, or 4. For the 16-bit library, the value
1.1.1.4 ! misho    1948:        is either 2 or 4 and is  still  a  number  of  bytes.  For  the  32-bit
        !          1949:        library, the value is either 2 or 4 and is still a number of bytes. The
        !          1950:        default value of 2 is sufficient for all but the most massive patterns,
        !          1951:        since  it  allows  the compiled pattern to be up to 64K in size. Larger
        !          1952:        values allow larger regular expressions to be compiled, at the  expense
        !          1953:        of slower matching.
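
The 64K figure follows from the width of a 2-byte offset. The sketch
below shows only that arithmetic; link_offset_span() is a hypothetical
helper, not a PCRE function, and the library may enforce a lower limit
than the offset width alone would allow.

```c
#include <assert.h>

/* Hypothetical helper (not part of PCRE): number of distinct values
   that an offset stored in link_size bytes can hold. */
unsigned long long link_offset_span(int link_size)
{
    assert(link_size >= 1 && link_size <= 4);
    return 1ULL << (8 * link_size);
}
```

For a link size of 2 the result is 65536, which is the 64K mentioned
above.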
1.1       misho    1954: 
                   1955:          PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
                   1956: 
1.1.1.2   misho    1957:        The  output  is  an integer that contains the threshold above which the
                   1958:        POSIX interface uses malloc() for output vectors. Further  details  are
1.1       misho    1959:        given in the pcreposix documentation.
                   1960: 
                   1961:          PCRE_CONFIG_MATCH_LIMIT
                   1962: 
1.1.1.2   misho    1963:        The  output is a long integer that gives the default limit for the num-
                   1964:        ber of internal matching function calls  in  a  pcre_exec()  execution.
1.1       misho    1965:        Further details are given with pcre_exec() below.
                   1966: 
                   1967:          PCRE_CONFIG_MATCH_LIMIT_RECURSION
                   1968: 
                   1969:        The output is a long integer that gives the default limit for the depth
1.1.1.2   misho    1970:        of  recursion  when  calling  the  internal  matching  function  in   a
                   1971:        pcre_exec()  execution.  Further  details  are  given  with pcre_exec()
1.1       misho    1972:        below.
                   1973: 
                   1974:          PCRE_CONFIG_STACKRECURSE
                   1975: 
1.1.1.2   misho    1976:        The output is an integer that is set to one if internal recursion  when
1.1       misho    1977:        running pcre_exec() is implemented by recursive function calls that use
1.1.1.2   misho    1978:        the stack to remember their state. This is the usual way that  PCRE  is
1.1       misho    1979:        compiled. The output is zero if PCRE was compiled to use blocks of data
1.1.1.2   misho    1980:        on the  heap  instead  of  recursive  function  calls.  In  this  case,
                   1981:        pcre_stack_malloc  and  pcre_stack_free  are  called  to  manage memory
1.1       misho    1982:        blocks on the heap, thus avoiding the use of the stack.
                   1983: 
                   1984: 
                   1985: COMPILING A PATTERN
                   1986: 
                   1987:        pcre *pcre_compile(const char *pattern, int options,
                   1988:             const char **errptr, int *erroffset,
                   1989:             const unsigned char *tableptr);
                   1990: 
                   1991:        pcre *pcre_compile2(const char *pattern, int options,
                   1992:             int *errorcodeptr,
                   1993:             const char **errptr, int *erroffset,
                   1994:             const unsigned char *tableptr);
                   1995: 
                   1996:        Either of the functions pcre_compile() or pcre_compile2() can be called
                   1997:        to compile a pattern into an internal form. The only difference between
1.1.1.2   misho    1998:        the two interfaces is that pcre_compile2() has an additional  argument,
                   1999:        errorcodeptr,  via  which  a  numerical  error code can be returned. To
                   2000:        avoid too much repetition, we refer just to pcre_compile()  below,  but
1.1       misho    2001:        the information applies equally to pcre_compile2().
                   2002: 
                   2003:        The pattern is a C string terminated by a binary zero, and is passed in
1.1.1.2   misho    2004:        the pattern argument. A pointer to a single block  of  memory  that  is
                   2005:        obtained  via  pcre_malloc is returned. This contains the compiled code
1.1       misho    2006:        and related data. The pcre type is defined for the returned block; this
                   2007:        is a typedef for a structure whose contents are not externally defined.
                   2008:        It is up to the caller to free the memory (via pcre_free) when it is no
                   2009:        longer required.
                   2010: 
1.1.1.2   misho    2011:        Although  the compiled code of a PCRE regex is relocatable, that is, it
1.1       misho    2012:        does not depend on memory location, the complete pcre data block is not
1.1.1.2   misho    2013:        fully  relocatable, because it may contain a copy of the tableptr argu-
1.1       misho    2014:        ment, which is an address (see below).
                   2015: 
                   2016:        The options argument contains various bit settings that affect the com-
1.1.1.2   misho    2017:        pilation.  It  should be zero if no options are required. The available
                   2018:        options are described below. Some of them (in  particular,  those  that
                   2019:        are  compatible with Perl, but some others as well) can also be set and
                   2020:        unset from within the pattern (see  the  detailed  description  in  the
                   2021:        pcrepattern  documentation). For those options that can be different in
                   2022:        different parts of the pattern, the contents of  the  options  argument
1.1       misho    2023:        specify their settings at the start of compilation and execution. The
1.1.1.2   misho    2024:        PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK,  and
1.1.1.3   misho    2025:        PCRE_NO_START_OPTIMIZE  options  can  be set at the time of matching as
                   2026:        well as at compile time.
1.1       misho    2027: 
                   2028:        If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
1.1.1.2   misho    2029:        if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and
1.1       misho    2030:        sets the variable pointed to by errptr to point to a textual error mes-
                   2031:        sage. This is a static string that is part of the library. You must not
1.1.1.2   misho    2032:        try to free it. Normally, the offset from the start of the  pattern  to
1.1.1.4 ! misho    2033:        the data unit that was being processed when the error was discovered is
1.1.1.2   misho    2034:        placed in the variable pointed to by erroffset, which must not be  NULL
                   2035:        (if  it is, an immediate error is given). However, for an invalid UTF-8
1.1.1.4 ! misho    2036:        or UTF-16 string, the offset is that of the  first  data  unit  of  the
        !          2037:        failing character.
1.1       misho    2038: 
1.1.1.4 ! misho    2039:        Some  errors are not detected until the whole pattern has been scanned;
        !          2040:        in these cases, the offset passed back is the length  of  the  pattern.
        !          2041:        Note  that  the  offset is in data units, not characters, even in a UTF
        !          2042:        mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
        !          2043:        acter.
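
Because the error offset counts data units, an application that wants
to report a character position for a UTF-8 pattern has to convert it.
The sketch below is one possible way to do that for the 8-bit library.
The helpers are hypothetical, not PCRE functions, and they assume that
the pattern is valid UTF-8 and that the offset does not exceed the
length of the pattern in bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers (not part of PCRE). Both assume that s holds
   valid UTF-8 and that 0 <= byte_offset <= strlen(s). */

/* True when the byte is a UTF-8 continuation byte (10xxxxxx). */
int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Count the characters that start before byte_offset. */
int utf8_chars_before(const char *s, int byte_offset)
{
    int i, count = 0;
    assert(s != NULL && byte_offset >= 0);
    for (i = 0; i < byte_offset; i++) {
        if (!is_continuation((unsigned char)s[i]))
            count++;
    }
    return count;
}

/* True when byte_offset lies inside a multi-byte character. */
int offset_splits_character(const char *s, int byte_offset)
{
    assert(s != NULL && byte_offset >= 0);
    return is_continuation((unsigned char)s[byte_offset]);
}
```

For the pattern "caf\xc3\xa9(" the length is 6 bytes but only 5
characters, so an offset of 6 corresponds to 5 characters, and an
offset of 4 falls inside the two-byte character.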
1.1       misho    2044: 
                   2045:        If  pcre_compile2()  is  used instead of pcre_compile(), and the error-
                   2046:        codeptr argument is not NULL, a non-zero error code number is  returned
                   2047:        via  this argument in the event of an error. This is in addition to the
                   2048:        textual error message. Error codes and messages are listed below.
                   2049: 
                   2050:        If the final argument, tableptr, is NULL, PCRE uses a  default  set  of
                   2051:        character  tables  that  are  built  when  PCRE  is compiled, using the
                   2052:        default C locale. Otherwise, tableptr must be an address  that  is  the
                   2053:        result  of  a  call to pcre_maketables(). This value is stored with the
                   2054:        compiled pattern, and used again by pcre_exec(), unless  another  table
                   2055:        pointer is passed to it. For more discussion, see the section on locale
                   2056:        support below.
                   2057: 
                   2058:        This code fragment shows a typical straightforward  call  to  pcre_com-
                   2059:        pile():
                   2060: 
                   2061:          pcre *re;
                   2062:          const char *error;
                   2063:          int erroffset;
                   2064:          re = pcre_compile(
                   2065:            "^A.*Z",          /* the pattern */
                   2066:            0,                /* default options */
                   2067:            &error,           /* for error message */
                   2068:            &erroffset,       /* for error offset */
                   2069:            NULL);            /* use default character tables */
                   2070: 
                   2071:        The  following  names  for option bits are defined in the pcre.h header
                   2072:        file:
                   2073: 
                   2074:          PCRE_ANCHORED
                   2075: 
                   2076:        If this bit is set, the pattern is forced to be "anchored", that is, it
                   2077:        is  constrained to match only at the first matching point in the string
                   2078:        that is being searched (the "subject string"). This effect can also  be
                   2079:        achieved  by appropriate constructs in the pattern itself, which is the
                   2080:        only way to do it in Perl.
                   2081: 
                   2082:          PCRE_AUTO_CALLOUT
                   2083: 
                   2084:        If this bit is set, pcre_compile() automatically inserts callout items,
                   2085:        all  with  number  255, before each pattern item. For discussion of the
                   2086:        callout facility, see the pcrecallout documentation.
                   2087: 
                   2088:          PCRE_BSR_ANYCRLF
                   2089:          PCRE_BSR_UNICODE
                   2090: 
                   2091:        These options (which are mutually exclusive) control what the \R escape
                   2092:        sequence  matches.  The choice is either to match only CR, LF, or CRLF,
                   2093:        or to match any Unicode newline sequence. The default is specified when
                   2094:        PCRE is built. It can be overridden from within the pattern, or by set-
                   2095:        ting an option when a compiled pattern is matched.
                   2096: 
                   2097:          PCRE_CASELESS
                   2098: 
                   2099:        If this bit is set, letters in the pattern match both upper  and  lower
                   2100:        case  letters.  It  is  equivalent  to  Perl's /i option, and it can be
                   2101:        changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE
                   2102:        always  understands the concept of case for characters whose values are
                   2103:        less than 128, so caseless matching is always possible. For  characters
                   2104:        with  higher  values,  the concept of case is supported if PCRE is com-
                   2105:        piled with Unicode property support, but not otherwise. If you want  to
                   2106:        use  caseless  matching  for  characters 128 and above, you must ensure
                   2107:        that PCRE is compiled with Unicode property support  as  well  as  with
                   2108:        UTF-8 support.
                   2109: 
                   2110:          PCRE_DOLLAR_ENDONLY
                   2111: 
                   2112:        If  this bit is set, a dollar metacharacter in the pattern matches only
                   2113:        at the end of the subject string. Without this option,  a  dollar  also
                   2114:        matches  immediately before a newline at the end of the string (but not
                   2115:        before any other newlines). The PCRE_DOLLAR_ENDONLY option  is  ignored
                   2116:        if  PCRE_MULTILINE  is  set.   There is no equivalent to this option in
                   2117:        Perl, and no way to set it within a pattern.
                   2118: 
                   2119:          PCRE_DOTALL
                   2120: 
                   2121:        If this bit is set, a dot metacharacter in the pattern matches a  char-
                   2122:        acter of any value, including one that indicates a newline. However, it
                   2123:        only ever matches one character, even if newlines are  coded  as  CRLF.
                   2124:        Without  this option, a dot does not match when the current position is
                   2125:        at a newline. This option is equivalent to Perl's /s option, and it can
                   2126:        be  changed within a pattern by a (?s) option setting. A negative class
                   2127:        such as [^a] always matches newline characters, independent of the set-
                   2128:        ting of this option.
                   2129: 
                   2130:          PCRE_DUPNAMES
                   2131: 
                   2132:        If  this  bit is set, names used to identify capturing subpatterns need
                   2133:        not be unique. This can be helpful for certain types of pattern when it
                   2134:        is  known  that  only  one instance of the named subpattern can ever be
                   2135:        matched. There are more details of named subpatterns  below;  see  also
                   2136:        the pcrepattern documentation.
                   2137: 
                   2138:          PCRE_EXTENDED
                   2139: 
1.1.1.3   misho    2140:        If  this  bit  is  set,  white space data characters in the pattern are
                   2141:        totally ignored except when escaped or inside a character class.  White
1.1       misho    2142:        space does not include the VT character (code 11). In addition, charac-
                   2143:        ters between an unescaped # outside a character class and the next new-
                   2144:        line,  inclusive,  are  also  ignored.  This is equivalent to Perl's /x
                   2145:        option, and it can be changed within a pattern by a  (?x)  option  set-
                   2146:        ting.
                   2147: 
                   2148:        Which  characters  are  interpreted  as  newlines  is controlled by the
                   2149:        options passed to pcre_compile() or by a special sequence at the  start
                   2150:        of  the  pattern, as described in the section entitled "Newline conven-
                   2151:        tions" in the pcrepattern documentation. Note that the end of this type
                   2152:        of  comment  is  a  literal  newline  sequence  in  the pattern; escape
                   2153:        sequences that happen to represent a newline do not count.
                   2154: 
                   2155:        This option makes it possible to include  comments  inside  complicated
                   2156:        patterns.   Note,  however,  that this applies only to data characters.
1.1.1.3   misho    2157:        White space  characters  may  never  appear  within  special  character
1.1       misho    2158:        sequences in a pattern, for example within the sequence (?( that intro-
                   2159:        duces a conditional subpattern.
                   2160: 
                   2161:          PCRE_EXTRA
                   2162: 
                   2163:        This option was invented in order to turn on  additional  functionality
                   2164:        of  PCRE  that  is  incompatible with Perl, but it is currently of very
                   2165:        little use. When set, any backslash in a pattern that is followed by  a
                   2166:        letter  that  has  no  special  meaning causes an error, thus reserving
                   2167:        these combinations for future expansion. By  default,  as  in  Perl,  a
                   2168:        backslash  followed by a letter with no special meaning is treated as a
                   2169:        literal. (Perl can, however, be persuaded to give an error for this, by
                   2170:        running  it with the -w option.) There are at present no other features
                   2171:        controlled by this option. It can also be set by a (?X) option  setting
                   2172:        within a pattern.
                   2173: 
                   2174:          PCRE_FIRSTLINE
                   2175: 
                   2176:        If  this  option  is  set,  an  unanchored pattern is required to match
                   2177:        before or at the first  newline  in  the  subject  string,  though  the
                   2178:        matched text may continue over the newline.
                   2179: 
                   2180:          PCRE_JAVASCRIPT_COMPAT
                   2181: 
                   2182:        If this option is set, PCRE's behaviour is changed in some ways so that
                   2183:        it is compatible with JavaScript rather than Perl. The changes  are  as
                   2184:        follows:
                   2185: 
                   2186:        (1)  A  lone  closing square bracket in a pattern causes a compile-time
                   2187:        error, because this is illegal in JavaScript (by default it is  treated
                   2188:        as a data character). Thus, the pattern AB]CD becomes illegal when this
                   2189:        option is set.
                   2190: 
                   2191:        (2) At run time, a back reference to an unset subpattern group  matches
                   2192:        an  empty  string (by default this causes the current matching alterna-
                   2193:        tive to fail). A pattern such as (\1)(a) succeeds when this  option  is
                   2194:        set  (assuming  it can find an "a" in the subject), whereas it fails by
                   2195:        default, for Perl compatibility.
                   2196: 
                   2197:        (3) \U matches an upper case "U" character; by default \U causes a com-
                   2198:        pile time error (Perl uses \U to upper case subsequent characters).
                   2199: 
                   2200:        (4) \u matches a lower case "u" character unless it is followed by four
                   2201:        hexadecimal digits, in which case the hexadecimal  number  defines  the
                   2202:        code  point  to match. By default, \u causes a compile time error (Perl
                   2203:        uses it to upper case the following character).
                   2204: 
                   2205:        (5) \x matches a lower case "x" character unless it is followed by  two
                   2206:        hexadecimal  digits,  in  which case the hexadecimal number defines the
                   2207:        code point to match. By default, as in Perl, a  hexadecimal  number  is
                   2208:        always expected after \x, but it may have zero, one, or two digits (so,
                   2209:        for example, \xz matches a binary zero character followed by z).
                   2210: 
                   2211:          PCRE_MULTILINE
                   2212: 
1.1.1.4 ! misho    2213:        By default, for the purposes of matching "start of line"  and  "end  of
        !          2214:        line", PCRE treats the subject string as consisting of a single line of
        !          2215:        characters, even if it actually contains newlines. The "start of  line"
        !          2216:        metacharacter (^) matches only at the start of the string, and the "end
        !          2217:        of line" metacharacter ($) matches only at the end of  the  string,  or
        !          2218:        before  a terminating newline (except when PCRE_DOLLAR_ENDONLY is set).
        !          2219:        Note, however, that unless PCRE_DOTALL  is  set,  the  "any  character"
        !          2220:        metacharacter  (.)  does not match at a newline. This behaviour (for ^,
        !          2221:        $, and dot) is the same as in Perl.
        !          2222: 
        !          2223:        When PCRE_MULTILINE is set, the "start of line" and  "end  of  line"
        !          2224:        constructs  match  immediately following or immediately before internal
        !          2225:        newlines in the subject string, respectively, as well as  at  the  very
        !          2226:        start  and  end.  This is equivalent to Perl's /m option, and it can be
1.1       misho    2227:        changed within a pattern by a (?m) option setting. If there are no new-
1.1.1.4 ! misho    2228:        lines  in  a  subject string, or no occurrences of ^ or $ in a pattern,
1.1       misho    2229:        setting PCRE_MULTILINE has no effect.
                   2230: 
1.1.1.4 ! misho    2231:          PCRE_NEVER_UTF
        !          2232: 
        !          2233:        This option locks out interpretation of the pattern as UTF-8 (or UTF-16
        !          2234:        or  UTF-32  in the 16-bit and 32-bit libraries). In particular, it pre-
        !          2235:        vents the creator of the pattern from switching to  UTF  interpretation
        !          2236:        by starting the pattern with (*UTF). This may be useful in applications
        !          2237:        that  process  patterns  from  external  sources.  The  combination  of
        !          2238:        PCRE_UTF8 and PCRE_NEVER_UTF also causes an error.
        !          2239: 
1.1       misho    2240:          PCRE_NEWLINE_CR
                   2241:          PCRE_NEWLINE_LF
                   2242:          PCRE_NEWLINE_CRLF
                   2243:          PCRE_NEWLINE_ANYCRLF
                   2244:          PCRE_NEWLINE_ANY
                   2245: 
                   2246:        These  options  override the default newline definition that was chosen
                   2247:        when PCRE was built. Setting the first or the second specifies  that  a
                   2248:        newline  is  indicated  by a single character (CR or LF, respectively).
                   2249:        Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
                   2250:        two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
                   2251:        that any of the three preceding sequences should be recognized. Setting
                   2252:        PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
1.1.1.4 ! misho    2253:        recognized.
1.1       misho    2254: 
1.1.1.4 ! misho    2255:        In an ASCII/Unicode environment, the Unicode newline sequences are  the
        !          2256:        three  just  mentioned,  plus  the  single characters VT (vertical tab,
        !          2257:        U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line sep-
        !          2258:        arator,  U+2028),  and  PS (paragraph separator, U+2029). For the 8-bit
        !          2259:        library, the last two are recognized only in UTF-8 mode.
        !          2260: 
        !          2261:        When PCRE is compiled to run in an EBCDIC (mainframe) environment,  the
        !          2262:        code for CR is 0x0d, the same as ASCII. However, the character code for
        !          2263:        LF is normally 0x15, though in some EBCDIC environments 0x25  is  used.
        !          2264:        Whichever  of  these  is  not LF is made to correspond to Unicode's NEL
        !          2265:        character. EBCDIC codes are all less than 256. For  more  details,  see
        !          2266:        the pcrebuild documentation.
        !          2267: 
        !          2268:        The  newline  setting  in  the  options  word  uses three bits that are
1.1       misho    2269:        treated as a number, giving eight possibilities. Currently only six are
1.1.1.4 ! misho    2270:        used  (default  plus the five values above). This means that if you set
        !          2271:        more than one newline option, the combination may or may not be  sensi-
1.1       misho    2272:        ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1.1.1.4 ! misho    2273:        PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers  and
1.1       misho    2274:        cause an error.
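
The effect of combining newline options can be sketched with plain bit
arithmetic. The NL_ constants below are local stand-ins that are assumed to
mirror the PCRE_NEWLINE_ values in an 8.x pcre.h; check your own header before
relying on them. The function newline_bits_in_use is hypothetical.

```c
#include <assert.h>

/* Assumed to match the PCRE_NEWLINE_xxx values in pcre.h (8.x series). */
#define NL_CR       0x00100000ul
#define NL_LF       0x00200000ul
#define NL_CRLF     0x00300000ul
#define NL_ANY      0x00400000ul
#define NL_ANYCRLF  0x00500000ul
#define NL_MASK     0x00700000ul   /* the three newline bits */

/* Returns 1 if the newline bits in options form one of the six values in
   use (0 for the build-time default, plus the five named settings), or 0
   if they form one of the two unused numbers. */
int newline_bits_in_use(unsigned long options)
{
  assert((NL_CR | NL_LF) == NL_CRLF);   /* the equivalence noted in the text */
  switch (options & NL_MASK)
    {
    case 0:
    case NL_CR:
    case NL_LF:
    case NL_CRLF:
    case NL_ANY:
    case NL_ANYCRLF:
      return 1;
    default:
      return 0;
    }
}
```

Under these assumed values, NL_CR | NL_ANY happens to equal NL_ANYCRLF, while
NL_LF | NL_ANY yields an unused number. This is why setting exactly one
newline option is the safe choice.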
                   2275: 
1.1.1.4 ! misho    2276:        The  only  time  that a line break in a pattern is specially recognized
        !          2277:        when compiling is when PCRE_EXTENDED is set. CR and LF are white  space
        !          2278:        characters,  and so are ignored in this mode. Also, an unescaped # out-
        !          2279:        side a character class indicates a comment that lasts until  after  the
        !          2280:        next  line break sequence. In other circumstances, line break sequences
1.1       misho    2281:        in patterns are treated as literal data.
                   2282: 
                   2283:        The newline option that is set at compile time becomes the default that
                   2284:        is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
                   2285: 
                   2286:          PCRE_NO_AUTO_CAPTURE
                   2287: 
                   2288:        If this option is set, it disables the use of numbered capturing paren-
1.1.1.4 ! misho    2289:        theses in the pattern. Any opening parenthesis that is not followed  by
        !          2290:        ?  behaves as if it were followed by ?: but named parentheses can still
        !          2291:        be used for capturing (and they acquire  numbers  in  the  usual  way).
1.1       misho    2292:        There is no equivalent of this option in Perl.
                   2293: 
1.1.1.4 ! misho    2294:          PCRE_NO_START_OPTIMIZE
1.1       misho    2295: 
1.1.1.4 ! misho    2296:        This  is an option that acts at matching time; that is, it is really an
        !          2297:        option for pcre_exec() or pcre_dfa_exec(). If  it  is  set  at  compile
        !          2298:        time,  it is remembered with the compiled pattern and assumed at match-
        !          2299:        ing time. This is necessary if you want to use JIT  execution,  because
        !          2300:        the  JIT  compiler needs to know whether or not this option is set. For
        !          2301:        details see the discussion of PCRE_NO_START_OPTIMIZE below.
1.1       misho    2302: 
                   2303:          PCRE_UCP
                   2304: 
                   2305:        This option changes the way PCRE processes \B, \b, \D, \d, \S, \s,  \W,
                   2306:        \w,  and  some  of  the POSIX character classes. By default, only ASCII
                   2307:        characters are recognized, but if PCRE_UCP is set,  Unicode  properties
                   2308:        are  used instead to classify characters. More details are given in the
                   2309:        section on generic character types in the pcrepattern page. If you  set
                   2310:        PCRE_UCP,  matching  one of the items it affects takes much longer. The
                   2311:        option is available only if PCRE has been compiled with  Unicode  prop-
                   2312:        erty support.
                   2313: 
                   2314:          PCRE_UNGREEDY
                   2315: 
                   2316:        This  option  inverts  the "greediness" of the quantifiers so that they
                   2317:        are not greedy by default, but become greedy if followed by "?". It  is
                   2318:        not  compatible  with Perl. It can also be set by a (?U) option setting
                   2319:        within the pattern.
                   2320: 
                   2321:          PCRE_UTF8
                   2322: 
                   2323:        This option causes PCRE to regard both the pattern and the  subject  as
1.1.1.2   misho    2324:        strings of UTF-8 characters instead of single-byte strings. However, it
                   2325:        is available only when PCRE is built to include UTF  support.  If  not,
                   2326:        the  use  of  this option provokes an error. Details of how this option
                   2327:        changes the behaviour of PCRE are given in the pcreunicode page.
1.1       misho    2328: 
                   2329:          PCRE_NO_UTF8_CHECK
                   2330: 
                   2331:        When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1.1.1.2   misho    2332:        automatically  checked.  There  is  a  discussion about the validity of
                   2333:        UTF-8 strings in the pcreunicode page. If an invalid UTF-8 sequence  is
                   2334:        found,  pcre_compile()  returns an error. If you already know that your
                   2335:        pattern is valid, and you want to skip this check for performance  rea-
                   2336:        sons,  you  can set the PCRE_NO_UTF8_CHECK option.  When it is set, the
                   2337:        effect of passing an invalid UTF-8 string as a pattern is undefined. It
                   2338:        may  cause  your  program  to  crash. Note that this option can also be
                   2339:        passed to pcre_exec() and pcre_dfa_exec(),  to  suppress  the  validity
1.1.1.4 ! misho    2340:        checking  of  subject strings only. If the same string is being matched
        !          2341:        many times, the option can be safely set for the second and  subsequent
        !          2342:        matchings to improve performance.
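
Setting PCRE_NO_UTF8_CHECK is safe only when the application already knows
that a string is valid. One way to know is to validate the string once
yourself. The following is a minimal sketch of such a check, assuming the
rules of RFC 3629; the helper utf8_is_valid is hypothetical and is not part of
PCRE, whose own checks are described in the pcreunicode page.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the len bytes at s are well-formed UTF-8 under RFC 3629
   (no overlong forms, no surrogates, nothing above U+10FFFF), else 0. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
  size_t i = 0;
  assert(s != NULL || len == 0);
  while (i < len)
    {
    unsigned char c = s[i++];
    unsigned long cp, min;
    int extra;
    if (c < 0x80) continue;                          /* ASCII */
    else if (c >= 0xc2 && c <= 0xdf) { extra = 1; cp = c & 0x1f; min = 0x80; }
    else if ((c & 0xf0) == 0xe0)     { extra = 2; cp = c & 0x0f; min = 0x800; }
    else if (c >= 0xf0 && c <= 0xf4) { extra = 3; cp = c & 0x07; min = 0x10000; }
    else return 0;                                   /* invalid lead byte */
    if (len - i < (size_t)extra) return 0;           /* truncated sequence */
    while (extra-- > 0)
      {
      if ((s[i] & 0xc0) != 0x80) return 0;           /* not a continuation */
      cp = (cp << 6) | (unsigned long)(s[i++] & 0x3f);
      }
    if (cp < min || cp > 0x10fffful) return 0;       /* overlong or too big */
    if (cp >= 0xd800 && cp <= 0xdfff) return 0;      /* surrogate */
    }
  return 1;
}
```

A subject that passes a check like this once can then be matched repeatedly
with the option set, as the paragraph above suggests.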
1.1       misho    2343: 
                   2344: 
                   2345: COMPILATION ERROR CODES
                   2346: 
1.1.1.2   misho    2347:        The  following  table  lists  the  error  codes that may be returned by
                   2348:        pcre_compile2(), along with the error messages that may be returned  by
                   2349:        both  compiling  functions.  Note  that error messages are always 8-bit
1.1.1.4 ! misho    2350:        ASCII strings, even in 16-bit or 32-bit mode. As  PCRE  has  developed,
        !          2351:        some  error codes have fallen out of use. To avoid confusion, they have
        !          2352:        not been re-used.
1.1       misho    2353: 
                   2354:           0  no error
                   2355:           1  \ at end of pattern
                   2356:           2  \c at end of pattern
                   2357:           3  unrecognized character follows \
                   2358:           4  numbers out of order in {} quantifier
                   2359:           5  number too big in {} quantifier
                   2360:           6  missing terminating ] for character class
                   2361:           7  invalid escape sequence in character class
                   2362:           8  range out of order in character class
                   2363:           9  nothing to repeat
                   2364:          10  [this code is not in use]
                   2365:          11  internal error: unexpected repeat
                   2366:          12  unrecognized character after (? or (?-
                   2367:          13  POSIX named classes are supported only within a class
                   2368:          14  missing )
                   2369:          15  reference to non-existent subpattern
                   2370:          16  erroffset passed as NULL
                   2371:          17  unknown option bit(s) set
                   2372:          18  missing ) after comment
                   2373:          19  [this code is not in use]
                   2374:          20  regular expression is too large
                   2375:          21  failed to get memory
                   2376:          22  unmatched parentheses
                   2377:          23  internal error: code overflow
                   2378:          24  unrecognized character after (?<
                   2379:          25  lookbehind assertion is not fixed length
                   2380:          26  malformed number or name after (?(
                   2381:          27  conditional group contains more than two branches
                   2382:          28  assertion expected after (?(
                   2383:          29  (?R or (?[+-]digits must be followed by )
                   2384:          30  unknown POSIX class name
                   2385:          31  POSIX collating elements are not supported
1.1.1.2   misho    2386:          32  this version of PCRE is compiled without UTF support
1.1       misho    2387:          33  [this code is not in use]
                   2388:          34  character value in \x{...} sequence is too large
                   2389:          35  invalid condition (?(0)
                   2390:          36  \C not allowed in lookbehind assertion
                   2391:          37  PCRE does not support \L, \l, \N{name}, \U, or \u
                   2392:          38  number after (?C is > 255
                   2393:          39  closing ) for (?C expected
                   2394:          40  recursive call could loop indefinitely
                   2395:          41  unrecognized character after (?P
                   2396:          42  syntax error in subpattern name (missing terminator)
                   2397:          43  two named subpatterns have the same name
1.1.1.2   misho    2398:          44  invalid UTF-8 string (specifically UTF-8)
1.1       misho    2399:          45  support for \P, \p, and \X has not been compiled
                   2400:          46  malformed \P or \p sequence
                   2401:          47  unknown property name after \P or \p
                   2402:          48  subpattern name is too long (maximum 32 characters)
                   2403:          49  too many named subpatterns (maximum 10000)
                   2404:          50  [this code is not in use]
1.1.1.2   misho    2405:          51  octal value is greater than \377 in 8-bit non-UTF-8 mode
1.1       misho    2406:          52  internal error: overran compiling workspace
                   2407:          53  internal error: previously-checked referenced subpattern
                   2408:                not found
                   2409:          54  DEFINE group contains more than one branch
                   2410:          55  repeating a DEFINE group is not allowed
                   2411:          56  inconsistent NEWLINE options
                   2412:          57  \g is not followed by a braced, angle-bracketed, or quoted
                   2413:                name/number or by a plain number
                   2414:          58  a numbered reference must not be zero
                   2415:          59  an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
1.1.1.4 ! misho    2416:          60  (*VERB) not recognized or malformed
1.1       misho    2417:          61  number is too big
                   2418:          62  subpattern name expected
                   2419:          63  digit expected after (?+
                   2420:          64  ] is an invalid data character in JavaScript compatibility mode
                   2421:          65  different names for subpatterns of the same number are
                   2422:                not allowed
                   2423:          66  (*MARK) must have an argument
1.1.1.2   misho    2424:          67  this version of PCRE is not compiled with Unicode property
                   2425:                support
1.1       misho    2426:          68  \c must be followed by an ASCII character
                   2427:          69  \k is not followed by a braced, angle-bracketed, or quoted name
1.1.1.2   misho    2428:          70  internal error: unknown opcode in find_fixedlength()
                   2429:          71  \N is not supported in a class
                   2430:          72  too many forward references
                   2431:          73  disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
                   2432:          74  invalid UTF-16 string (specifically UTF-16)
1.1.1.3   misho    2433:          75  name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
                   2434:          76  character value in \u.... sequence is too large
1.1.1.4 ! misho    2435:          77  invalid UTF-32 string (specifically UTF-32)
1.1       misho    2436: 
1.1.1.2   misho    2437:        The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different
1.1       misho    2438:        values may be used if the limits were changed when PCRE was built.
                   2439: 
                   2440: 
                   2441: STUDYING A PATTERN
                   2442: 
                   2443:        pcre_extra *pcre_study(const pcre *code, int options,
                   2444:             const char **errptr);
                   2445: 
1.1.1.2   misho    2446:        If  a  compiled  pattern is going to be used several times, it is worth
1.1       misho    2447:        spending more time analyzing it in order to speed up the time taken for
1.1.1.2   misho    2448:        matching.  The function pcre_study() takes a pointer to a compiled pat-
1.1       misho    2449:        tern as its first argument. If studying the pattern produces additional
1.1.1.2   misho    2450:        information  that  will  help speed up matching, pcre_study() returns a
                   2451:        pointer to a pcre_extra block, in which the study_data field points  to
1.1       misho    2452:        the results of the study.
                   2453: 
                   2454:        The  returned  value  from  pcre_study()  can  be  passed  directly  to
1.1.1.2   misho    2455:        pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block  also  con-
                   2456:        tains  other  fields  that can be set by the caller before the block is
1.1       misho    2457:        passed; these are described below in the section on matching a pattern.
                   2458: 
1.1.1.2   misho    2459:        If studying the  pattern  does  not  produce  any  useful  information,
1.1.1.4 ! misho    2460:        pcre_study()  returns  NULL  by  default.  In that circumstance, if the
        !          2461:        calling program wants to pass any of the other fields to pcre_exec() or
        !          2462:        pcre_dfa_exec(),  it  must set up its own pcre_extra block. However, if
        !          2463:        pcre_study() is called  with  the  PCRE_STUDY_EXTRA_NEEDED  option,  it
        !          2464:        returns a pcre_extra block even if studying did not find any additional
        !          2465:        information. It may still return NULL, however, if an error  occurs  in
        !          2466:        pcre_study().
1.1       misho    2467: 
1.1.1.3   misho    2468:        The  second  argument  of  pcre_study() contains option bits. There are
1.1.1.4 ! misho    2469:        three further options in addition to PCRE_STUDY_EXTRA_NEEDED:
1.1.1.3   misho    2470: 
                   2471:          PCRE_STUDY_JIT_COMPILE
                   2472:          PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
                   2473:          PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
                   2474: 
                   2475:        If any of these are set, and the just-in-time  compiler  is  available,
                   2476:        the  pattern  is  further compiled into machine code that executes much
                   2477:        faster than the pcre_exec()  interpretive  matching  function.  If  the
                   2478:        just-in-time  compiler is not available, these options are ignored. All
1.1.1.4 ! misho    2479:        undefined bits in the options argument must be zero.
1.1       misho    2480: 
1.1.1.2   misho    2481:        JIT compilation is a heavyweight optimization. It can  take  some  time
                   2482:        for  patterns  to  be analyzed, and for one-off matches and simple pat-
                   2483:        terns the benefit of faster execution might be offset by a much  slower
1.1       misho    2484:        study time.  Not all patterns can be optimized by the JIT compiler. For
1.1.1.2   misho    2485:        those that cannot be handled, matching automatically falls back to  the
                   2486:        pcre_exec()  interpreter.  For more details, see the pcrejit documenta-
1.1       misho    2487:        tion.
                   2488: 
1.1.1.2   misho    2489:        The third argument for pcre_study() is a pointer for an error  message.
                   2490:        If  studying  succeeds  (even  if no data is returned), the variable it
                   2491:        points to is set to NULL. Otherwise it is set to  point  to  a  textual
1.1       misho    2492:        error message. This is a static string that is part of the library. You
1.1.1.2   misho    2493:        must not try to free it. You should test the  error  pointer  for  NULL
1.1       misho    2494:        after calling pcre_study(), to be sure that it has run successfully.
                   2495: 
1.1.1.2   misho    2496:        When  you are finished with a pattern, you can free the memory used for
1.1       misho    2497:        the study data by calling pcre_free_study(). This function was added to
1.1.1.2   misho    2498:        the  API  for  release  8.20. For earlier versions, the memory could be
                   2499:        freed with pcre_free(), just like the pattern itself. This  will  still
1.1.1.3   misho    2500:        work  in  cases where JIT optimization is not used, but it is advisable
                   2501:        to change to the new function when convenient.
1.1       misho    2502: 
1.1.1.2   misho    2503:        This is a typical way in which pcre_study() is used (except that  in  a
1.1       misho    2504:        real application there should be tests for errors):
                   2505: 
                   2506:          int rc;
                   2507:          pcre *re;
                   2508:          pcre_extra *sd;
                   2509:          re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
                   2510:          sd = pcre_study(
                   2511:            re,             /* result of pcre_compile() */
                   2512:            0,              /* no options */
                   2513:            &error);        /* set to NULL or points to a message */
                   2514:          rc = pcre_exec(   /* see below for details of pcre_exec() options */
                   2515:            re, sd, "subject", 7, 0, 0, ovector, 30);
                   2516:          ...
                   2517:          pcre_free_study(sd);
                   2518:          pcre_free(re);
                   2519: 
                   2520:        Studying a pattern does two things: first, a lower bound for the length
                   2521:        of subject string that is needed to match the pattern is computed. This
                   2522:        does not mean that there are any strings of that length that match, but
1.1.1.4 ! misho    2523:        it does guarantee that no shorter strings match. The value is  used  to
        !          2524:        avoid wasting time by trying to match strings that are shorter than the
        !          2525:        lower bound. You can find out the value in a calling  program  via  the
        !          2526:        pcre_fullinfo() function.
1.1       misho    2527: 
                   2528:        Studying a pattern is also useful for non-anchored patterns that do not
1.1.1.2   misho    2529:        have a single fixed starting character. A bitmap of  possible  starting
                   2530:        bytes  is  created. This speeds up finding a position in the subject at
                   2531:        which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
1.1.1.4 ! misho    2532:        values  less  than  256.  In 32-bit mode, the bitmap is used for 32-bit
1.1.1.2   misho    2533:        values less than 256.)
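
The way a starting-byte bitmap speeds up the search for a match position can
be sketched as follows. This is an illustration of the idea only. The function
names are hypothetical, and the layout (bit c & 7 of byte c / 8 stands for
value c, giving 256 bits in 32 bytes) is an assumption of this sketch rather
than a documented part of the API.

```c
#include <assert.h>
#include <stddef.h>

/* Marks byte value c as a possible first byte of a match. */
void start_bits_set(unsigned char bits[32], unsigned char c)
{
  bits[c / 8] |= (unsigned char)(1u << (c & 7));
}

/* Returns the first offset >= from whose byte is marked in the bitmap,
   or len if there is none. A matcher would begin its attempts there
   instead of trying every position in turn. */
size_t first_candidate(const unsigned char bits[32],
                       const unsigned char *subject, size_t len, size_t from)
{
  size_t i;
  assert(bits != NULL && subject != NULL && from <= len);
  for (i = from; i < len; i++)
    {
    unsigned char c = subject[i];
    if (bits[c / 8] & (1u << (c & 7))) break;
    }
  return i;
}
```

For a pattern such as (cat|dog), only "c" and "d" would be marked, so every
subject position holding any other byte is skipped with a single table lookup.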
1.1       misho    2534: 
1.1.1.4 ! misho    2535:        These two optimizations apply to both pcre_exec() and  pcre_dfa_exec(),
        !          2536:        and  the  information  is also used by the JIT compiler.  The optimiza-
        !          2537:        tions can be disabled by  setting  the  PCRE_NO_START_OPTIMIZE  option.
        !          2538:        You  might want to do this if your pattern contains callouts or (*MARK)
        !          2539:        and you want to make use of these facilities in  cases  where  matching
        !          2540:        fails.
        !          2541: 
        !          2542:        PCRE_NO_START_OPTIMIZE can be specified at either compile time or
        !          2543:        execution time. However, if PCRE_NO_START_OPTIMIZE is passed to
        !          2544:        pcre_exec() (that is, after any JIT compilation has happened), JIT
        !          2545:        execution is disabled. For JIT execution to work with
        !          2546:        PCRE_NO_START_OPTIMIZE, the option must be set at compile time.
        !          2547: 
        !          2548:        There is a longer discussion of PCRE_NO_START_OPTIMIZE below.
1.1       misho    2549: 
                   2550: 
                   2551: LOCALE SUPPORT
                   2552: 
1.1.1.4 ! misho    2553:        PCRE  handles  caseless matching, and determines whether characters are
        !          2554:        letters, digits, or whatever, by reference to a set of tables,  indexed
        !          2555:        by  character  value.  When running in UTF-8 mode, this applies only to
        !          2556:        characters with codes less than 128. By  default,  higher-valued  codes
1.1       misho    2557:        never match escapes such as \w or \d, but they can be tested with \p if
1.1.1.4 ! misho    2558:        PCRE is built with Unicode character property  support.  Alternatively,
        !          2559:        the  PCRE_UCP  option  can  be  set at compile time; this causes \w and
1.1       misho    2560:        friends to use Unicode property support instead of built-in tables. The
                   2561:        use of locales with Unicode is discouraged. If you are handling
1.1.1.4 ! misho    2562:        characters with codes greater than 127, you should either use UTF-8 and
1.1       misho    2563:        Unicode, or use locales, but not try to mix the two.
                   2564: 
1.1.1.4 ! misho    2565:        PCRE  contains  an  internal set of tables that are used when the final
        !          2566:        argument of pcre_compile() is  NULL.  These  are  sufficient  for  many
1.1       misho    2567:        applications.  Normally, the internal tables recognize only ASCII char-
                   2568:        acters. However, when PCRE is built, it is possible to cause the inter-
                   2569:        nal tables to be rebuilt in the default "C" locale of the local system,
                   2570:        which may cause them to be different.
                   2571: 
1.1.1.4 ! misho    2572:        The internal tables can always be overridden by tables supplied by  the
1.1       misho    2573:        application that calls PCRE. These may be created in a different locale
1.1.1.4 ! misho    2574:        from the default. As more and more applications change  to  using  Uni-
1.1       misho    2575:        code, the need for this locale support is expected to die away.
                   2576: 
1.1.1.4 ! misho    2577:        External  tables  are  built by calling the pcre_maketables() function,
        !          2578:        which has no arguments, in the relevant locale. The result can then  be
        !          2579:        passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For
        !          2580:        example, to build and use tables that are appropriate  for  the  French
        !          2581:        locale  (where  accented  characters  with  values greater than 127 are
1.1       misho    2582:        treated as letters), the following code could be used:
                   2583: 
                   2584:          setlocale(LC_CTYPE, "fr_FR");
                   2585:          tables = pcre_maketables();
                   2586:          re = pcre_compile(..., tables);
                   2587: 
1.1.1.4 ! misho    2588:        The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
1.1       misho    2589:        if you are using Windows, the name for the French locale is "french".
                   2590: 
1.1.1.4 ! misho    2591:        When  pcre_maketables()  runs,  the  tables are built in memory that is
        !          2592:        obtained via pcre_malloc. It is the caller's responsibility  to  ensure
        !          2593:        that  the memory containing the tables remains available for as long as
1.1       misho    2594:        it is needed.
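
The idea of locale-dependent tables that the caller owns can be sketched with
the standard C library alone. This is a simplified model, not the layout that
pcre_maketables() actually produces: the functions make_letter_table and
is_letter are hypothetical, and the table records only whether isalpha() holds
for each of the 256 byte values in the locale that is current when it is
built.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Builds a 256-entry table in which entry c is 1 if isalpha(c) holds in
   the locale current at the time of the call, else 0. The caller owns the
   memory and releases it with free(). Returns NULL if malloc() fails. */
unsigned char *make_letter_table(void)
{
  unsigned char *t = malloc(256);
  int c;
  if (t == NULL) return NULL;
  for (c = 0; c < 256; c++) t[c] = isalpha(c) ? 1 : 0;
  return t;
}

/* Looks up one byte value in a table built by make_letter_table(). */
int is_letter(const unsigned char *table, unsigned char c)
{
  assert(table != NULL);
  return table[c];
}
```

As with the real tables, the result is a snapshot: a later call to setlocale()
does not change a table that has already been built, and the memory must stay
available for as long as anything still refers to it.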
                   2595: 
                   2596:        The pointer that is passed to pcre_compile() is saved with the compiled
1.1.1.4 ! misho    2597:        pattern,  and the same tables are used via this pointer by pcre_study()
1.1       misho    2598:        and normally also by pcre_exec(). Thus, by default, for any single pat-
                   2599:        tern, compilation, studying and matching all happen in the same locale,
                   2600:        but different patterns can be compiled in different locales.
                   2601: 
1.1.1.4 ! misho    2602:        It is possible to pass a table pointer or NULL (indicating the  use  of
        !          2603:        the  internal  tables)  to  pcre_exec(). Although not intended for this
        !          2604:        purpose, this facility could be used to match a pattern in a  different
1.1       misho    2605:        locale from the one in which it was compiled. Passing table pointers at
                   2606:        run time is discussed below in the section on matching a pattern.
                   2607: 
                   2608: 
                   2609: INFORMATION ABOUT A PATTERN
                   2610: 
                   2611:        int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
                   2612:             int what, void *where);
                   2613: 
1.1.1.4 ! misho    2614:        The pcre_fullinfo() function returns information about a compiled  pat-
        !          2615:        tern.  It replaces the pcre_info() function, which was removed from the
1.1.1.2   misho    2616:        library at version 8.30, after more than 10 years of obsolescence.
1.1       misho    2617: 
1.1.1.4 ! misho    2618:        The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
        !          2619:        pattern.  The second argument is the result of pcre_study(), or NULL if
        !          2620:        the pattern was not studied. The third argument specifies  which  piece
        !          2621:        of  information  is required, and the fourth argument is a pointer to a
        !          2622:        variable to receive the data. The yield of the  function  is  zero  for
1.1       misho    2623:        success, or one of the following negative numbers:
                   2624: 
1.1.1.2   misho    2625:          PCRE_ERROR_NULL           the argument code was NULL, or
                   2626:                                    the argument where was NULL
                   2627:          PCRE_ERROR_BADMAGIC       the "magic number" was not found
                   2628:          PCRE_ERROR_BADENDIANNESS  the pattern was compiled with different
                   2629:                                    endianness
                   2630:          PCRE_ERROR_BADOPTION      the value of what was invalid
1.1.1.4 ! misho    2631:          PCRE_ERROR_UNSET          the requested field is not set
1.1       misho    2632: 
1.1.1.4 ! misho    2633:        The  "magic  number" is placed at the start of each compiled pattern as
        !          2634:        a simple check against passing an arbitrary memory pointer. The
1.1.1.2   misho    2635:        endianness error can occur if a compiled pattern is saved and reloaded on a
1.1.1.4 ! misho    2636:        different host. Here is a typical call of  pcre_fullinfo(),  to  obtain
1.1.1.2   misho    2637:        the length of the compiled pattern:
1.1       misho    2638: 
                   2639:          int rc;
                   2640:          size_t length;
                   2641:          rc = pcre_fullinfo(
                   2642:            re,               /* result of pcre_compile() */
                   2643:            sd,               /* result of pcre_study(), or NULL */
                   2644:            PCRE_INFO_SIZE,   /* what is required */
                   2645:            &length);         /* where to put the data */
                   2646: 
1.1.1.4 ! misho    2647:        The  possible  values for the third argument are defined in pcre.h, and
1.1       misho    2648:        are as follows:
                   2649: 
                   2650:          PCRE_INFO_BACKREFMAX
                   2651: 
1.1.1.4 ! misho    2652:        Return the number of the highest back reference  in  the  pattern.  The
        !          2653:        fourth  argument  should  point to an int variable. Zero is returned if
1.1       misho    2654:        there are no back references.
                   2655: 
                   2656:          PCRE_INFO_CAPTURECOUNT
                   2657: 
1.1.1.4 ! misho    2658:        Return the number of capturing subpatterns in the pattern.  The  fourth
1.1       misho    2659:        argument should point to an int variable.
                   2660: 
                   2661:          PCRE_INFO_DEFAULT_TABLES
                   2662: 
1.1.1.4 ! misho    2663:        Return  a pointer to the internal default character tables within PCRE.
        !          2664:        The fourth argument should point to an unsigned char *  variable.  This
1.1       misho    2665:        information call is provided for internal use by the pcre_study() func-
1.1.1.4 ! misho    2666:        tion. External callers can cause PCRE to use  its  internal  tables  by
1.1       misho    2667:        passing a NULL table pointer.
                   2668: 
                   2669:          PCRE_INFO_FIRSTBYTE
                   2670: 
1.1.1.2   misho    2671:        Return information about the first data unit of any matched string, for
1.1.1.4 ! misho    2672:        a non-anchored pattern. (The name of this option refers  to  the  8-bit
        !          2673:        library,  where data units are bytes.) The fourth argument should point
1.1.1.2   misho    2674:        to an int variable.
                   2675: 
1.1.1.4 ! misho    2676:        If there is a fixed first value, for example, the  letter  "c"  from  a
        !          2677:        pattern  such  as (cat|cow|coyote), its value is returned. In the 8-bit
        !          2678:        library, the value is always less than 256. In the 16-bit  library  the
        !          2679:        value can be up to 0xffff. In the 32-bit library the value can be up to
        !          2680:        0x10ffff.
1.1       misho    2681: 
1.1.1.2   misho    2682:        If there is no fixed first value, and if either
1.1       misho    2683: 
1.1.1.3   misho    2684:        (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
1.1       misho    2685:        branch starts with "^", or
                   2686: 
                   2687:        (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
                   2688:        set (if it were set, the pattern would be anchored),
                   2689: 
1.1.1.3   misho    2690:        -1 is returned, indicating that the pattern matches only at  the  start
                   2691:        of  a  subject string or after any newline within the string. Otherwise
1.1       misho    2692:        -2 is returned. For anchored patterns, -2 is returned.
                   2693: 
1.1.1.4 ! misho    2694:        Since for the 32-bit library using the non-UTF-32 mode,  this  function
        !          2695:        is  unable to return the full 32-bit range of the character, this value
        !          2696:        is   deprecated;   instead   the   PCRE_INFO_FIRSTCHARACTERFLAGS    and
        !          2697:        PCRE_INFO_FIRSTCHARACTER values should be used.
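
The return convention described above for PCRE_INFO_FIRSTBYTE can be decoded with a small helper. This is only a sketch: the enum and the function name classify_firstbyte are hypothetical and are not part of PCRE. The argument is the int that pcre_fullinfo() stored through its fourth argument.

```c
/* Hypothetical helper, not part of the PCRE API: interpret the int value
   that PCRE_INFO_FIRSTBYTE stores through the fourth argument. */
enum first_kind {
    FIRST_FIXED,       /* value >= 0: a fixed first data unit is known     */
    FIRST_LINE_START,  /* -1: matches only at subject start or after a     */
                       /*     newline within the subject                   */
    FIRST_NONE         /* -2: nothing recorded, or the pattern is anchored */
};

static enum first_kind classify_firstbyte(int value)
{
    if (value >= 0)
        return FIRST_FIXED;
    if (value == -1)
        return FIRST_LINE_START;
    return FIRST_NONE;
}
```

For a pattern such as (cat|cow|coyote) the stored value is the code of "c", so the helper reports FIRST_FIXED.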
        !          2698: 
1.1       misho    2699:          PCRE_INFO_FIRSTTABLE
                   2700: 
1.1.1.4 ! misho    2701:        If  the pattern was studied, and this resulted in the construction of a
        !          2702:        256-bit table indicating a fixed set of values for the first data  unit
        !          2703:        in  any  matching string, a pointer to the table is returned. Otherwise
        !          2704:        NULL is returned. The fourth argument should point to an unsigned  char
1.1.1.2   misho    2705:        * variable.
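
A 256-bit table occupies 32 bytes, one bit per possible first data unit value. The sketch below tests one bit. The bit layout (bit c & 7 of byte c / 8) is an assumption, since the text above does not specify it, and the function name table_has_unit is hypothetical. The caller is expected to have checked that the returned pointer is not NULL.

```c
/* Hypothetical helper, not part of the PCRE API. ASSUMPTION: the table
   stores the flag for value c in bit (c & 7) of byte (c / 8). */
static int table_has_unit(const unsigned char *table, unsigned int c)
{
    if (c > 255)
        return 0;  /* the table covers only the values 0 to 255 */
    return (table[c / 8] >> (c & 7)) & 1;
}
```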
1.1       misho    2706: 
                   2707:          PCRE_INFO_HASCRORLF
                   2708: 
1.1.1.4 ! misho    2709:        Return  1  if  the  pattern  contains any explicit matches for CR or LF
        !          2710:        characters, otherwise 0. The fourth argument should  point  to  an  int
        !          2711:        variable.  An explicit match is either a literal CR or LF character, or
1.1       misho    2712:        \r or \n.
                   2713: 
                   2714:          PCRE_INFO_JCHANGED
                   2715: 
1.1.1.4 ! misho    2716:        Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
        !          2717:        otherwise  0. The fourth argument should point to an int variable. (?J)
1.1       misho    2718:        and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
                   2719: 
                   2720:          PCRE_INFO_JIT
                   2721: 
1.1.1.4 ! misho    2722:        Return 1 if the pattern was studied with one of the  JIT  options,  and
1.1.1.3   misho    2723:        just-in-time compiling was successful. The fourth argument should point
1.1.1.4 ! misho    2724:        to an int variable. A return value of 0 means that JIT support  is  not
        !          2725:        available  in this version of PCRE, or that the pattern was not studied
        !          2726:        with a JIT option, or that the JIT compiler could not handle this  par-
        !          2727:        ticular  pattern. See the pcrejit documentation for details of what can
1.1.1.3   misho    2728:        and cannot be handled.
1.1       misho    2729: 
                   2730:          PCRE_INFO_JITSIZE
                   2731: 
1.1.1.4 ! misho    2732:        If the pattern was successfully studied with a JIT option,  return  the
        !          2733:        size  of the JIT compiled code, otherwise return zero. The fourth argu-
1.1.1.3   misho    2734:        ment should point to a size_t variable.
1.1       misho    2735: 
                   2736:          PCRE_INFO_LASTLITERAL
                   2737: 
1.1.1.4 ! misho    2738:        Return the value of the rightmost literal data unit that must exist  in
        !          2739:        any  matched  string, other than at its start, if such a value has been
1.1       misho    2740:        recorded. The fourth argument should point to an int variable. If there
1.1.1.2   misho    2741:        is no such value, -1 is returned. For anchored patterns, a last literal
1.1.1.4 ! misho    2742:        value is recorded only if it follows something of variable length.  For
1.1       misho    2743:        example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
                   2744:        /^a\dz\d/ the returned value is -1.
                   2745: 
1.1.1.4 ! misho    2746:        Since for the 32-bit library using the non-UTF-32 mode,  this  function
        !          2747:        is  unable to return the full 32-bit range of the character, this value
        !          2748:        is   deprecated;   instead    the    PCRE_INFO_REQUIREDCHARFLAGS    and
        !          2749:        PCRE_INFO_REQUIREDCHAR values should be used.
        !          2750: 
        !          2751:          PCRE_INFO_MATCHLIMIT
        !          2752: 
        !          2753:        If  the  pattern  set  a  match  limit by including an item of the form
        !          2754:        (*LIMIT_MATCH=nnnn) at the start, the value  is  returned.  The  fourth
        !          2755:        argument  should  point to an unsigned 32-bit integer. If no such value
        !          2756:        has  been  set,  the  call  to  pcre_fullinfo()   returns   the   error
        !          2757:        PCRE_ERROR_UNSET.
        !          2758: 
1.1.1.3   misho    2759:          PCRE_INFO_MAXLOOKBEHIND
                   2760: 
1.1.1.4 ! misho    2761:        Return  the  number  of  characters  (NB not data units) in the longest
        !          2762:        lookbehind assertion in the pattern. This information  is  useful  when
        !          2763:        doing  multi-segment  matching  using  the partial matching facilities.
        !          2764:        Note that the simple assertions \b and \B require a one-character look-
        !          2765:        behind.  \A  also  registers a one-character lookbehind, though it does
        !          2766:        not actually inspect the previous character. This is to ensure that  at
        !          2767:        least one character from the old segment is retained when a new segment
        !          2768:        is processed. Otherwise, if there are no lookbehinds in the pattern, \A
        !          2769:        might match incorrectly at the start of a new segment.
1.1.1.3   misho    2770: 
1.1       misho    2771:          PCRE_INFO_MINLENGTH
                   2772: 
1.1.1.4 ! misho    2773:        If  the  pattern  was studied and a minimum length for matching subject
        !          2774:        strings was computed, its value is  returned.  Otherwise  the  returned
        !          2775:        value is -1. The value is a number of characters, which in UTF mode may
        !          2776:        be different from the number of data units. The fourth argument  should
        !          2777:        point  to an int variable. A non-negative value is a lower bound to the
        !          2778:        length of any matching string. There may not be  any  strings  of  that
        !          2779:        length  that  do actually match, but every string that does match is at
1.1.1.2   misho    2780:        least that long.
1.1       misho    2781: 
                   2782:          PCRE_INFO_NAMECOUNT
                   2783:          PCRE_INFO_NAMEENTRYSIZE
                   2784:          PCRE_INFO_NAMETABLE
                   2785: 
1.1.1.4 ! misho    2786:        PCRE supports the use of named as well as numbered capturing  parenthe-
        !          2787:        ses.  The names are just an additional way of identifying the parenthe-
1.1       misho    2788:        ses, which still acquire numbers. Several convenience functions such as
1.1.1.4 ! misho    2789:        pcre_get_named_substring()  are  provided  for extracting captured sub-
        !          2790:        strings by name. It is also possible to extract the data  directly,  by
        !          2791:        first  converting  the  name to a number in order to access the correct
1.1       misho    2792:        pointers in the output vector (described with pcre_exec() below). To do
1.1.1.4 ! misho    2793:        the  conversion,  you  need  to  use  the  name-to-number map, which is
1.1       misho    2794:        described by these three values.
                   2795: 
                   2796:        The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
                   2797:        gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
1.1.1.4 ! misho    2798:        of each entry; both of these  return  an  int  value.  The  entry  size
        !          2799:        depends  on the length of the longest name. PCRE_INFO_NAMETABLE returns
1.1.1.2   misho    2800:        a pointer to the first entry of the table. This is a pointer to char in
                   2801:        the 8-bit library, where the first two bytes of each entry are the num-
1.1.1.4 ! misho    2802:        ber of the capturing parenthesis, most significant byte first.  In  the
        !          2803:        16-bit  library,  the pointer points to 16-bit data units, the first of
        !          2804:        which contains the parenthesis  number.  In  the  32-bit  library,  the
        !          2805:        pointer  points  to  32-bit data units, the first of which contains the
        !          2806:        parenthesis number. The rest of the entry is  the  corresponding  name,
        !          2807:        zero terminated.
1.1       misho    2808: 
1.1.1.4 ! misho    2809:        The  names are in alphabetical order. Duplicate names may appear if (?|
1.1       misho    2810:        is used to create multiple groups with the same number, as described in
1.1.1.4 ! misho    2811:        the  section  on  duplicate subpattern numbers in the pcrepattern page.
        !          2812:        Duplicate names for subpatterns with different  numbers  are  permitted
        !          2813:        only  if  PCRE_DUPNAMES  is  set. In all cases of duplicate names, they
        !          2814:        appear in the table in the order in which they were found in  the  pat-
        !          2815:        tern.  In  the  absence  of (?| this is the order of increasing number;
1.1       misho    2816:        when (?| is used this is not necessarily the case because later subpat-
                   2817:        terns may have lower numbers.
                   2818: 
1.1.1.4 ! misho    2819:        As  a  simple  example of the name/number table, consider the following
1.1.1.2   misho    2820:        pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is
                   2821:        set, so white space - including newlines - is ignored):
1.1       misho    2822: 
                   2823:          (?<date> (?<year>(\d\d)?\d\d) -
                   2824:          (?<month>\d\d) - (?<day>\d\d) )
                   2825: 
1.1.1.4 ! misho    2826:        There  are  four  named subpatterns, so the table has four entries, and
        !          2827:        each entry in the table is eight bytes long. The table is  as  follows,
1.1       misho    2828:        with non-printing bytes shown in hexadecimal, and undefined bytes shown
                   2829:        as ??:
                   2830: 
                   2831:          00 01 d  a  t  e  00 ??
                   2832:          00 05 d  a  y  00 ?? ??
                   2833:          00 04 m  o  n  t  h  00
                   2834:          00 02 y  e  a  r  00 ??
                   2835: 
1.1.1.4 ! misho    2836:        When writing code to extract data  from  named  subpatterns  using  the
        !          2837:        name-to-number  map,  remember that the length of the entries is likely
1.1       misho    2838:        to be different for each compiled pattern.
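
As an illustration of the 8-bit table layout described above, the following sketch searches a name table for a name and returns the group number. The function lookup_group_number and the array example_table are hypothetical names, not part of PCRE. In a real program the table pointer, entry count and entry size would come from pcre_fullinfo() with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE. The array reproduces the example table, with the undefined bytes written as zero.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not a PCRE function: linear search of an 8-bit
   name table. Returns the group number, or -1 if the name is absent. */
static int lookup_group_number(const unsigned char *table, int name_count,
                               int entry_size, const char *name)
{
    int i;
    assert(entry_size >= 3);  /* two number bytes plus at least a terminator */
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        /* The zero-terminated name starts after the two number bytes. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];  /* most significant byte first */
    }
    return -1;
}

/* The four 8-byte entries from the example; "??" bytes are set to 0 here. */
static const unsigned char example_table[4 * 8] = {
    0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0x00,
    0x00, 0x05, 'd', 'a', 'y', 0x00, 0x00, 0x00,
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0x00
};
```

Because the entry size is passed in rather than hard-coded, the same helper works for patterns whose longest name has a different length. When duplicate names are present it returns the first entry found.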
                   2839: 
                   2840:          PCRE_INFO_OKPARTIAL
                   2841: 
1.1.1.4 ! misho    2842:        Return 1  if  the  pattern  can  be  used  for  partial  matching  with
        !          2843:        pcre_exec(),  otherwise  0.  The fourth argument should point to an int
        !          2844:        variable. From  release  8.00,  this  always  returns  1,  because  the
        !          2845:        restrictions  that  previously  applied  to  partial matching have been
        !          2846:        lifted. The pcrepartial documentation gives details of  partial  match-
1.1       misho    2847:        ing.
                   2848: 
                   2849:          PCRE_INFO_OPTIONS
                   2850: 
1.1.1.4 ! misho    2851:        Return  a  copy of the options with which the pattern was compiled. The
        !          2852:        fourth argument should point to an unsigned long  int  variable.  These
1.1       misho    2853:        option bits are those specified in the call to pcre_compile(), modified
                   2854:        by any top-level option settings at the start of the pattern itself. In
1.1.1.4 ! misho    2855:        other  words,  they are the options that will be in force when matching
        !          2856:        starts. For example, if the pattern /(?im)abc(?-i)d/ is  compiled  with
        !          2857:        the  PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
1.1       misho    2858:        and PCRE_EXTENDED.
                   2859: 
1.1.1.4 ! misho    2860:        A pattern is automatically anchored by PCRE if  all  of  its  top-level
1.1       misho    2861:        alternatives begin with one of the following:
                   2862: 
                   2863:          ^     unless PCRE_MULTILINE is set
                   2864:          \A    always
                   2865:          \G    always
                   2866:          .*    if PCRE_DOTALL is set and there are no back
                   2867:                  references to the subpattern in which .* appears
                   2868: 
                   2869:        For such patterns, the PCRE_ANCHORED bit is set in the options returned
                   2870:        by pcre_fullinfo().
                   2871: 
1.1.1.4 ! misho    2872:          PCRE_INFO_RECURSIONLIMIT
        !          2873: 
        !          2874:        If the pattern set a recursion limit by including an item of  the  form
        !          2875:        (*LIMIT_RECURSION=nnnn) at the start, the value is returned. The fourth
        !          2876:        argument should point to an unsigned 32-bit integer. If no  such  value
        !          2877:        has   been   set,   the  call  to  pcre_fullinfo()  returns  the  error
        !          2878:        PCRE_ERROR_UNSET.
        !          2879: 
1.1       misho    2880:          PCRE_INFO_SIZE
                   2881: 
1.1.1.4 ! misho    2882:        Return the size of  the  compiled  pattern  in  bytes  (for  all  three
        !          2883:        libraries). The fourth argument should point to a size_t variable. This
        !          2884:        value does not include the size of the pcre structure that is  returned
        !          2885:        by  pcre_compile().  The  value  that  is  passed  as  the  argument to
        !          2886:        pcre_malloc() when pcre_compile() is getting memory in which  to  place
        !          2887:        the compiled data is the value returned by this option plus the size of
        !          2888:        the pcre structure. Studying a compiled pattern, with or  without  JIT,
        !          2889:        does not alter the value returned by this option.
1.1       misho    2890: 
                   2891:          PCRE_INFO_STUDYSIZE
                   2892: 
1.1.1.4 ! misho    2893:        Return  the  size  in bytes (for all three libraries) of the data block
        !          2894:        pointed to by the study_data field in a pcre_extra block. If pcre_extra
        !          2895:        is  NULL, or there is no study data, zero is returned. The fourth argu-
        !          2896:        ment should point to a size_t variable. The study_data field is set  by
        !          2897:        pcre_study() to record information that will speed up matching (see the
        !          2898:        section entitled  "Studying  a  pattern"  above).  The  format  of  the
        !          2899:        study_data  block is private, but its length is made available via this
        !          2900:        option so that it can be saved and  restored  (see  the  pcreprecompile
        !          2901:        documentation for details).
        !          2902: 
        !          2903:          PCRE_INFO_FIRSTCHARACTERFLAGS
        !          2904: 
        !          2905:        Return information about the first data unit of any matched string, for
        !          2906:        a non-anchored pattern. The fourth argument  should  point  to  an  int
        !          2907:        variable.
        !          2908: 
        !          2909:        If  there  is  a  fixed first value, for example, the letter "c" from a
        !          2910:        pattern such as (cat|cow|coyote), 1  is  returned,  and  the  character
        !          2911:        value can be retrieved using PCRE_INFO_FIRSTCHARACTER.
        !          2912: 
        !          2913:        If there is no fixed first value, and if either
        !          2914: 
        !          2915:        (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
        !          2916:        branch starts with "^", or
        !          2917: 
        !          2918:        (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
        !          2919:        set (if it were set, the pattern would be anchored),
        !          2920: 
        !          2921:        2 is returned, indicating that the pattern matches only at the start of
        !          2922:        a subject string or after any newline within the string. Otherwise 0 is
        !          2923:        returned. For anchored patterns, 0 is returned.
        !          2924: 
        !          2925:          PCRE_INFO_FIRSTCHARACTER
        !          2926: 
        !          2927:        Return the fixed first character value if PCRE_INFO_FIRSTCHARACTERFLAGS
        !          2928:        returned 1; otherwise return 0. The fourth argument should point to a
        !          2929:        uint32_t variable.
        !          2930: 
        !          2931:        In  the 8-bit library, the value is always less than 256. In the 16-bit
        !          2932:        library the value can be up to 0xffff. In the 32-bit library in  UTF-32
        !          2933:        mode  the  value  can  be up to 0x10ffff, and up to 0xffffffff when not
        !          2934:        using UTF-32 mode.
        !          2935: 
        !          2936:        If there is no fixed first value, 0 is returned. The case of a pattern
        !          2937:        that matches only at the start of a subject string or after any newline
        !          2938:        within the string is reported by PCRE_INFO_FIRSTCHARACTERFLAGS (as the
        !          2939:        value 2), not by this option.
        !          2947: 
        !          2948:          PCRE_INFO_REQUIREDCHARFLAGS
        !          2949: 
        !          2950:        Returns 1 if there is a rightmost literal data unit that must exist  in
        !          2951:        any matched string, other than at its start. The fourth argument should
        !          2952:        point to an int variable. If there is no such value, 0 is returned.  If
        !          2953:        returning  1,  the  character  value  itself  can  be  retrieved  using
        !          2954:        PCRE_INFO_REQUIREDCHAR.
        !          2955: 
        !          2956:        For anchored patterns, a last literal value is recorded only if it fol-
        !          2957:        lows  something  of  variable  length.  For  example,  for  the pattern
        !          2958:        /^a\d+z\d+/ the returned value is 1 (with "z" returned from
        !          2959:        PCRE_INFO_REQUIREDCHAR), but for /^a\dz\d/ the returned value is 0.
        !          2960: 
        !          2961:          PCRE_INFO_REQUIREDCHAR
        !          2962: 
        !          2963:        Return  the value of the rightmost literal data unit that must exist in
        !          2964:        any matched string, other than at its start, if such a value  has  been
        !          2965:        recorded. The fourth argument should point to a uint32_t variable. If
        !          2966:        there is no such value, 0 is returned.
1.1       misho    2967: 
                   2968: 
                   2969: REFERENCE COUNTS
                   2970: 
                   2971:        int pcre_refcount(pcre *code, int adjust);
                   2972: 
1.1.1.2   misho    2973:        The pcre_refcount() function is used to maintain a reference  count  in
1.1       misho    2974:        the data block that contains a compiled pattern. It is provided for the
1.1.1.2   misho    2975:        benefit of applications that  operate  in  an  object-oriented  manner,
1.1       misho    2976:        where different parts of the application may be using the same compiled
                   2977:        pattern, but you want to free the block when they are all done.
                   2978: 
                   2979:        When a pattern is compiled, the reference count field is initialized to
1.1.1.2   misho    2980:        zero.   It is changed only by calling this function, whose action is to
                   2981:        add the adjust value (which may be positive or  negative)  to  it.  The
1.1       misho    2982:        yield of the function is the new value. However, the value of the count
1.1.1.2   misho    2983:        is constrained to lie between 0 and 65535, inclusive. If the new  value
1.1       misho    2984:        is outside these limits, it is forced to the appropriate limit value.
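
The clamping rule can be modelled in a few lines. This is an illustrative sketch of the behaviour described above, not PCRE's own code, and adjust_refcount is a hypothetical name.

```c
/* Illustrative model only: add adjust to the current count and force the
   result into the range 0 to 65535, as described for pcre_refcount(). */
static int adjust_refcount(int current, int adjust)
{
    long long next = (long long)current + adjust;  /* avoid int overflow */
    if (next < 0)
        next = 0;
    if (next > 65535)
        next = 65535;
    return (int)next;
}
```

One consequence is that a count forced to a limit no longer reflects the true number of users, so an application that relies on the count should keep it well inside the range.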
                   2985: 
1.1.1.2   misho    2986:        Except  when it is zero, the reference count is not correctly preserved
                   2987:        if a pattern is compiled on one host and then  transferred  to  a  host
1.1       misho    2988:        whose byte-order is different. (This seems a highly unlikely scenario.)
                   2989: 
                   2990: 
                   2991: MATCHING A PATTERN: THE TRADITIONAL FUNCTION
                   2992: 
                   2993:        int pcre_exec(const pcre *code, const pcre_extra *extra,
                   2994:             const char *subject, int length, int startoffset,
                   2995:             int options, int *ovector, int ovecsize);
                   2996: 
1.1.1.2   misho    2997:        The  function pcre_exec() is called to match a subject string against a
                   2998:        compiled pattern, which is passed in the code argument. If the  pattern
                   2999:        was  studied,  the  result  of  the study should be passed in the extra
                   3000:        argument. You can call pcre_exec() with the same code and  extra  argu-
                   3001:        ments  as  many  times as you like, in order to match different subject
1.1       misho    3002:        strings with the same pattern.
                   3003: 
1.1.1.2   misho    3004:        This function is the main matching facility  of  the  library,  and  it
                   3005:        operates  in  a  Perl-like  manner. For specialist use there is also an
                   3006:        alternative matching function, which is described below in the  section
1.1       misho    3007:        about the pcre_dfa_exec() function.
                   3008: 
1.1.1.2   misho    3009:        In  most applications, the pattern will have been compiled (and option-
                   3010:        ally studied) in the same process that calls pcre_exec().  However,  it
1.1       misho    3011:        is possible to save compiled patterns and study data, and then use them
1.1.1.2   misho    3012:        later in different processes, possibly even on different hosts.  For  a
1.1       misho    3013:        discussion about this, see the pcreprecompile documentation.
                   3014: 
                   3015:        Here is an example of a simple call to pcre_exec():
                   3016: 
                   3017:          int rc;
                   3018:          int ovector[30];
                   3019:          rc = pcre_exec(
                   3020:            re,             /* result of pcre_compile() */
                   3021:            NULL,           /* we didn't study the pattern */
                   3022:            "some string",  /* the subject string */
                   3023:            11,             /* the length of the subject string */
                   3024:            0,              /* start at offset 0 in the subject */
                   3025:            0,              /* default options */
                   3026:            ovector,        /* vector of integers for substring information */
                   3027:            30);            /* number of elements (NOT size in bytes) */
                   3028: 
                   3029:    Extra data for pcre_exec()
                   3030: 
1.1.1.2   misho    3031:        If  the  extra argument is not NULL, it must point to a pcre_extra data
                   3032:        block. The pcre_study() function returns such a block (when it  doesn't
                   3033:        return  NULL), but you can also create one for yourself, and pass addi-
                   3034:        tional information in it. The pcre_extra block contains  the  following
1.1       misho    3035:        fields (not necessarily in this order):
                   3036: 
                   3037:          unsigned long int flags;
                   3038:          void *study_data;
                   3039:          void *executable_jit;
                   3040:          unsigned long int match_limit;
                   3041:          unsigned long int match_limit_recursion;
                   3042:          void *callout_data;
                   3043:          const unsigned char *tables;
                   3044:          unsigned char **mark;
                   3045: 
1.1.1.2   misho    3046:        In  the  16-bit  version  of  this  structure,  the mark field has type
                   3047:        "PCRE_UCHAR16 **".
                   3048: 
1.1.1.4 ! misho    3049:        In the 32-bit version of  this  structure,  the  mark  field  has  type
        !          3050:        "PCRE_UCHAR32 **".
        !          3051: 
        !          3052:        The  flags  field is used to specify which of the other fields are set.
1.1.1.3   misho    3053:        The flag bits are:
1.1       misho    3054: 
1.1.1.3   misho    3055:          PCRE_EXTRA_CALLOUT_DATA
1.1       misho    3056:          PCRE_EXTRA_EXECUTABLE_JIT
1.1.1.3   misho    3057:          PCRE_EXTRA_MARK
1.1       misho    3058:          PCRE_EXTRA_MATCH_LIMIT
                   3059:          PCRE_EXTRA_MATCH_LIMIT_RECURSION
1.1.1.3   misho    3060:          PCRE_EXTRA_STUDY_DATA
1.1       misho    3061:          PCRE_EXTRA_TABLES
                   3062: 
1.1.1.4 ! misho    3063:        Other flag bits should be set to zero. The study_data field  and  some-
        !          3064:        times  the executable_jit field are set in the pcre_extra block that is
        !          3065:        returned by pcre_study(), together with the appropriate flag bits.  You
        !          3066:        should  not set these yourself, but you may add to the block by setting
1.1.1.3   misho    3067:        other fields and their corresponding flag bits.
1.1       misho    3068: 
                   3069:        The match_limit field provides a means of preventing PCRE from using up
1.1.1.4 ! misho    3070:        a  vast amount of resources when running patterns that are not going to
        !          3071:        match, but which have a very large number  of  possibilities  in  their
        !          3072:        search  trees. The classic example is a pattern that uses nested unlim-
1.1       misho    3073:        ited repeats.
                   3074: 
1.1.1.4 ! misho    3075:        Internally, pcre_exec() uses a function called match(), which it  calls
        !          3076:        repeatedly  (sometimes  recursively).  The  limit set by match_limit is
        !          3077:        imposed on the number of times this function is called during a  match,
        !          3078:        which  has  the  effect of limiting the amount of backtracking that can
1.1       misho    3079:        take place. For patterns that are not anchored, the count restarts from
                   3080:        zero for each position in the subject string.
                   3081: 
                   3082:        When pcre_exec() is called with a pattern that was successfully studied
1.1.1.4 ! misho    3083:        with a JIT option, the way that the matching is  executed  is  entirely
1.1.1.3   misho    3084:        different.  However, there is still the possibility of runaway matching
                   3085:        that goes on for a very long time, and so the match_limit value is also
                   3086:        used in this case (but in a different way) to limit how long the match-
                   3087:        ing can continue.
1.1       misho    3088: 
1.1.1.4 ! misho    3089:        The default value for the limit can be set when PCRE is built; the
        !          3090:        built-in default is 10 million, which handles all but the most extreme
        !          3091:        cases. You can override the default by supplying pcre_exec() with a
        !          3092:        pcre_extra     block    in    which    match_limit    is    set,    and
        !          3093:        PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is
1.1       misho    3094:        exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
                   3095: 
1.1.1.4 ! misho    3096:        A  value  for  the  match  limit may also be supplied by an item at the
        !          3097:        start of a pattern of the form
        !          3098: 
        !          3099:          (*LIMIT_MATCH=d)
        !          3100: 
        !          3101:        where d is a decimal number. However, such a setting is ignored  unless
        !          3102:        d  is  less  than  the limit set by the caller of pcre_exec() or, if no
        !          3103:        such limit is set, less than the default.
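
The precedence rule above amounts to taking the smaller of the pattern's value and the limit otherwise in force. The sketch below models that rule. The function effective_match_limit and its parameters are hypothetical, and it is not how PCRE stores or applies the limits internally.

```c
/* Hypothetical model of the precedence rule, not PCRE code.
   caller_set:  nonzero if the caller supplied match_limit in pcre_extra
   pattern_set: nonzero if the pattern starts with (*LIMIT_MATCH=d)       */
static unsigned long effective_match_limit(int caller_set,
                                           unsigned long caller_limit,
                                           int pattern_set,
                                           unsigned long pattern_limit,
                                           unsigned long default_limit)
{
    /* Start from the caller's limit, or the default if none was given. */
    unsigned long limit = caller_set ? caller_limit : default_limit;

    /* A pattern setting is honoured only if it lowers the limit. */
    if (pattern_set && pattern_limit < limit)
        limit = pattern_limit;
    return limit;
}
```

Under this rule a pattern can tighten the limit but never loosen it, so a limit chosen by the application cannot be raised by an untrusted pattern.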
        !          3104: 
1.1       misho    3105:        The match_limit_recursion field is similar to match_limit, but  instead
                   3106:        of limiting the total number of times that match() is called, it limits
                   3107:        the depth of recursion. The recursion depth is a  smaller  number  than
                   3108:        the  total number of calls, because not all calls to match() are recur-
                   3109:        sive.  This limit is of use only if it is set smaller than match_limit.
                   3110: 
                   3111:        Limiting the recursion depth limits the amount of  machine  stack  that
                   3112:        can  be used, or, when PCRE has been compiled to use memory on the heap
                   3113:        instead of the stack, the amount of heap memory that can be used.  This
1.1.1.3   misho    3114:        limit  is not relevant, and is ignored, when matching is done using JIT
                   3115:        compiled code.
1.1       misho    3116: 
                   3117:        The default value for match_limit_recursion can be  set  when  PCRE  is
                   3118:        built;  the  default  default  is  the  same  value  as the default for
                   3119:        match_limit. You can override the default by supplying pcre_exec() with
                   3120:        a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
                   3121:        PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
                   3122:        limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
                   3123: 
1.1.1.4 ! misho    3124:        A  value for the recursion limit may also be supplied by an item at the
        !          3125:        start of a pattern of the form
        !          3126: 
        !          3127:          (*LIMIT_RECURSION=d)
        !          3128: 
        !          3129:        where d is a decimal number. However, such a setting is ignored  unless
        !          3130:        d  is  less  than  the limit set by the caller of pcre_exec() or, if no
        !          3131:        such limit is set, less than the default.
        !          3132: 
        !          3133:        The callout_data field is used in conjunction with the  "callout"  fea-
1.1       misho    3134:        ture, and is described in the pcrecallout documentation.
                   3135: 
1.1.1.4 ! misho    3136:        The  tables  field  is  used  to  pass  a  character  tables pointer to
        !          3137:        pcre_exec(); this overrides the value that is stored with the  compiled
        !          3138:        pattern.  A  non-NULL value is stored with the compiled pattern only if
        !          3139:        custom tables were supplied to pcre_compile() via  its  tableptr  argu-
1.1       misho    3140:        ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
1.1.1.4 ! misho    3141:        PCRE's internal tables to be used. This facility is  helpful  when  re-
        !          3142:        using  patterns  that  have been saved after compiling with an external
        !          3143:        set of tables, because the external tables  might  be  at  a  different
        !          3144:        address  when  pcre_exec() is called. See the pcreprecompile documenta-
1.1       misho    3145:        tion for a discussion of saving compiled patterns for later use.
                   3146: 
1.1.1.4 ! misho    3147:        If PCRE_EXTRA_MARK is set in the flags field, the mark  field  must  be
        !          3148:        set  to point to a suitable variable. If the pattern contains any back-
        !          3149:        tracking control verbs such as (*MARK:NAME), and the execution ends  up
        !          3150:        with  a  name  to  pass back, a pointer to the name string (zero termi-
        !          3151:        nated) is placed in the variable pointed to  by  the  mark  field.  The
        !          3152:        names  are  within  the  compiled pattern; if you wish to retain such a
        !          3153:        name you must copy it before freeing the memory of a compiled  pattern.
        !          3154:        If  there  is no name to pass back, the variable pointed to by the mark
        !          3155:        field is set to NULL. For details of the  backtracking  control  verbs,
1.1.1.2   misho    3156:        see the section entitled "Backtracking control" in the pcrepattern doc-
                   3157:        umentation.
1.1       misho    3158: 
                   3159:    Option bits for pcre_exec()
                   3160: 
1.1.1.4 ! misho    3161:        The unused bits of the options argument for pcre_exec() must  be  zero.
        !          3162:        The  only  bits  that  may  be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
        !          3163:        PCRE_NOTBOL,   PCRE_NOTEOL,    PCRE_NOTEMPTY,    PCRE_NOTEMPTY_ATSTART,
        !          3164:        PCRE_NO_START_OPTIMIZE,   PCRE_NO_UTF8_CHECK,   PCRE_PARTIAL_HARD,  and
1.1.1.3   misho    3165:        PCRE_PARTIAL_SOFT.
1.1       misho    3166: 
1.1.1.4 ! misho    3167:        If the pattern was successfully studied with one  of  the  just-in-time
1.1.1.3   misho    3168:        (JIT) compile options, the only supported options for JIT execution are
1.1.1.4 ! misho    3169:        PCRE_NO_UTF8_CHECK,    PCRE_NOTBOL,     PCRE_NOTEOL,     PCRE_NOTEMPTY,
        !          3170:        PCRE_NOTEMPTY_ATSTART,  PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If an
        !          3171:        unsupported option is used, JIT execution is disabled  and  the  normal
1.1.1.3   misho    3172:        interpretive code in pcre_exec() is run.
1.1       misho    3173: 
                   3174:          PCRE_ANCHORED
                   3175: 
1.1.1.4 ! misho    3176:        The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first
        !          3177:        matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or
        !          3178:        turned  out to be anchored by virtue of its contents, it cannot be made
1.1       misho    3179:        unanchored at matching time.
                   3180: 
                   3181:          PCRE_BSR_ANYCRLF
                   3182:          PCRE_BSR_UNICODE
                   3183: 
                   3184:        These options (which are mutually exclusive) control what the \R escape
1.1.1.4 ! misho    3185:        sequence  matches.  The choice is either to match only CR, LF, or CRLF,
        !          3186:        or to match any Unicode newline sequence. These  options  override  the
1.1       misho    3187:        choice that was made or defaulted when the pattern was compiled.
                   3188: 
                   3189:          PCRE_NEWLINE_CR
                   3190:          PCRE_NEWLINE_LF
                   3191:          PCRE_NEWLINE_CRLF
                   3192:          PCRE_NEWLINE_ANYCRLF
                   3193:          PCRE_NEWLINE_ANY
                   3194: 
1.1.1.4 ! misho    3195:        These  options  override  the  newline  definition  that  was chosen or
        !          3196:        defaulted when the pattern was compiled. For details, see the  descrip-
        !          3197:        tion  of  pcre_compile()  above.  During  matching,  the newline choice
        !          3198:        affects the behaviour of the dot, circumflex,  and  dollar  metacharac-
        !          3199:        ters.  It may also alter the way the match position is advanced after a
1.1       misho    3200:        match failure for an unanchored pattern.
                   3201: 
1.1.1.4 ! misho    3202:        When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF,  or  PCRE_NEWLINE_ANY  is
        !          3203:        set,  and a match attempt for an unanchored pattern fails when the cur-
        !          3204:        rent position is at a  CRLF  sequence,  and  the  pattern  contains  no
        !          3205:        explicit  matches  for  CR  or  LF  characters,  the  match position is
1.1       misho    3206:        advanced by two characters instead of one, in other words, to after the
                   3207:        CRLF.
                   3208: 
                   3209:        The above rule is a compromise that makes the most common cases work as
1.1.1.4 ! misho    3210:        expected. For example, if the  pattern  is  .+A  (and  the  PCRE_DOTALL
1.1       misho    3211:        option is not set), it does not match the string "\r\nA" because, after
1.1.1.4 ! misho    3212:        failing at the start, it skips both the CR and the LF before  retrying.
        !          3213:        However,  the  pattern  [\r\n]A does match that string, because it con-
1.1       misho    3214:        tains an explicit CR or LF reference, and so advances only by one char-
                   3215:        acter after the first failure.
                   3216: 
                   3217:        An explicit match for CR or LF is either a literal appearance of one of
1.1.1.4 ! misho    3218:        those characters, or one of the \r or  \n  escape  sequences.  Implicit
        !          3219:        matches  such  as [^X] do not count, nor does \s (which includes CR and
1.1       misho    3220:        LF in the characters that it matches).
                   3221: 
1.1.1.4 ! misho    3222:        Notwithstanding the above, anomalous effects may still occur when  CRLF
1.1       misho    3223:        is a valid newline sequence and explicit \r or \n escapes appear in the
                   3224:        pattern.
                   3225: 
                   3226:          PCRE_NOTBOL
                   3227: 
                   3228:        This option specifies that the first character of the subject is not
1.1.1.4 ! misho    3229:        the  beginning  of  a  line, so the circumflex metacharacter should not
        !          3230:        match before it. Setting this without PCRE_MULTILINE (at compile  time)
        !          3231:        causes  circumflex  never to match. This option affects only the behav-
1.1       misho    3232:        iour of the circumflex metacharacter. It does not affect \A.
                   3233: 
                   3234:          PCRE_NOTEOL
                   3235: 
                   3236:        This option specifies that the end of the subject string is not the end
1.1.1.4 ! misho    3237:        of  a line, so the dollar metacharacter should not match it nor (except
        !          3238:        in multiline mode) a newline immediately before it. Setting this  with-
1.1       misho    3239:        out PCRE_MULTILINE (at compile time) causes dollar never to match. This
1.1.1.4 ! misho    3240:        option affects only the behaviour of the dollar metacharacter. It  does
1.1       misho    3241:        not affect \Z or \z.
                   3242: 
                   3243:          PCRE_NOTEMPTY
                   3244: 
                   3245:        An empty string is not considered to be a valid match if this option is
1.1.1.4 ! misho    3246:        set. If there are alternatives in the pattern, they are tried.  If  all
        !          3247:        the  alternatives  match  the empty string, the entire match fails. For
1.1       misho    3248:        example, if the pattern
                   3249: 
                   3250:          a?b?
                   3251: 
1.1.1.4 ! misho    3252:        is applied to a string not beginning with "a" or  "b",  it  matches  an
        !          3253:        empty  string at the start of the subject. With PCRE_NOTEMPTY set, this
1.1       misho    3254:        match is not valid, so PCRE searches further into the string for occur-
                   3255:        rences of "a" or "b".
                   3256: 
                   3257:          PCRE_NOTEMPTY_ATSTART
                   3258: 
1.1.1.4 ! misho    3259:        This  is  like PCRE_NOTEMPTY, except that an empty string match that is
        !          3260:        not at the start of  the  subject  is  permitted.  If  the  pattern  is
1.1       misho    3261:        anchored, such a match can occur only if the pattern contains \K.
                   3262: 
1.1.1.4 ! misho    3263:        Perl     has    no    direct    equivalent    of    PCRE_NOTEMPTY    or
        !          3264:        PCRE_NOTEMPTY_ATSTART, but it does make a special  case  of  a  pattern
        !          3265:        match  of  the empty string within its split() function, and when using
        !          3266:        the /g modifier. It is  possible  to  emulate  Perl's  behaviour  after
1.1       misho    3267:        matching a null string by first trying the match again at the same off-
1.1.1.4 ! misho    3268:        set with PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED,  and  then  if  that
1.1       misho    3269:        fails, by advancing the starting offset (see below) and trying an ordi-
1.1.1.4 ! misho    3270:        nary match again. There is some code that demonstrates how to  do  this
        !          3271:        in  the  pcredemo sample program. In the most general case, you have to
        !          3272:        check to see if the newline convention recognizes CRLF  as  a  newline,
        !          3273:        and  if so, and the current character is CR followed by LF, advance the
1.1       misho    3274:        starting offset by two characters instead of one.
                   3275: 
                   3276:          PCRE_NO_START_OPTIMIZE
                   3277: 
1.1.1.4 ! misho    3278:        There are a number of optimizations that pcre_exec() uses at the  start
        !          3279:        of  a  match,  in  order to speed up the process. For example, if it is
1.1       misho    3280:        known that an unanchored match must start with a specific character, it
1.1.1.4 ! misho    3281:        searches  the  subject  for that character, and fails immediately if it
        !          3282:        cannot find it, without actually running the  main  matching  function.
1.1       misho    3283:        This means that a special item such as (*COMMIT) at the start of a pat-
1.1.1.4 ! misho    3284:        tern is not considered until after a suitable starting  point  for  the
        !          3285:        match  has been found. Also, when callouts or (*MARK) items are in use,
        !          3286:        these "start-up" optimizations can cause them to be skipped if the pat-
        !          3287:        tern is never actually used. The start-up optimizations are in effect a
        !          3288:        pre-scan of the subject that takes place before the pattern is run.
        !          3289: 
        !          3290:        The PCRE_NO_START_OPTIMIZE option disables the start-up  optimizations,
        !          3291:        possibly  causing  performance  to  suffer,  but ensuring that in cases
        !          3292:        where the result is "no match", the callouts do occur, and  that  items
1.1       misho    3293:        such as (*COMMIT) and (*MARK) are considered at every possible starting
1.1.1.4 ! misho    3294:        position in the subject string. If  PCRE_NO_START_OPTIMIZE  is  set  at
        !          3295:        compile  time,  it  cannot  be  unset  at  matching  time.  The  use of
        !          3296:        PCRE_NO_START_OPTIMIZE  at  matching  time  (that  is,  passing  it  to
        !          3297:        pcre_exec())  disables  JIT  execution;  in this situation, matching is
        !          3298:        always done interpretively.
1.1       misho    3299: 
                   3300:        Setting PCRE_NO_START_OPTIMIZE can change the  outcome  of  a  matching
                   3301:        operation.  Consider the pattern
                   3302: 
                   3303:          (*COMMIT)ABC
                   3304: 
                   3305:        When  this  is  compiled, PCRE records the fact that a match must start
                   3306:        with the character "A". Suppose the subject  string  is  "DEFABC".  The
                   3307:        start-up  optimization  scans along the subject, finds "A" and runs the
                   3308:        first match attempt from there. The (*COMMIT) item means that the  pat-
                   3309:        tern must match at the current starting position, which in this case it
                   3310:        does. However, if the same match  is  run  with  PCRE_NO_START_OPTIMIZE
                   3311:        set,  the  initial  scan  along the subject string does not happen. The
                   3312:        first match attempt is run starting  from  "D"  and  when  this  fails,
                   3313:        (*COMMIT)  prevents  any  further  matches  being tried, so the overall
                   3314:        result is "no match". If the pattern is studied,  more  start-up  opti-
                   3315:        mizations  may  be  used. For example, a minimum length for the subject
                   3316:        may be recorded. Consider the pattern
                   3317: 
                   3318:          (*MARK:A)(X|Y)
                   3319: 
                   3320:        The minimum length for a match is one  character.  If  the  subject  is
                   3321:        "ABC",  there  will  be  attempts  to  match "ABC", "BC", "C", and then
                   3322:        finally an empty string.  If the pattern is studied, the final  attempt
                   3323:        does  not take place, because PCRE knows that the subject is too short,
                   3324:        and so the (*MARK) is never encountered.  In this  case,  studying  the
                   3325:        pattern  does  not  affect the overall match result, which is still "no
                   3326:        match", but it does affect the auxiliary information that is returned.
                   3327: 
                   3328:          PCRE_NO_UTF8_CHECK
                   3329: 
                   3330:        When PCRE_UTF8 is set at compile time, the validity of the subject as a
                   3331:        UTF-8  string is automatically checked when pcre_exec() is subsequently
1.1.1.3   misho    3332:        called.  The entire string is checked before any other processing takes
                   3333:        place.  The  value  of  startoffset  is  also checked to ensure that it
                   3334:        points to the start of a UTF-8 character. There is a  discussion  about
                   3335:        the  validity  of  UTF-8 strings in the pcreunicode page. If an invalid
                   3336:        sequence  of  bytes   is   found,   pcre_exec()   returns   the   error
1.1.1.2   misho    3337:        PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
                   3338:        truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
1.1.1.3   misho    3339:        both  cases, information about the precise nature of the error may also
                   3340:        be returned (see the descriptions of these errors in the section  enti-
                   3341:        tled  Error return values from pcre_exec() below).  If startoffset con-
1.1.1.2   misho    3342:        tains a value that does not point to the start of a UTF-8 character (or
                   3343:        to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
                   3344: 
1.1.1.3   misho    3345:        If  you  already  know that your subject is valid, and you want to skip
                   3346:        these   checks   for   performance   reasons,   you   can    set    the
                   3347:        PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
                   3348:        do this for the second and subsequent calls to pcre_exec() if  you  are
                   3349:        making  repeated  calls  to  find  all  the matches in a single subject
                   3350:        string. However, you should be  sure  that  the  value  of  startoffset
                   3351:        points  to  the  start of a character (or the end of the subject). When
1.1.1.2   misho    3352:        PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
1.1.1.3   misho    3353:        subject  or  an invalid value of startoffset is undefined. Your program
1.1.1.2   misho    3354:        may crash.
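
Before setting PCRE_NO_UTF8_CHECK, a caller may want a cheap sanity check
on startoffset. The following is a minimal sketch, assuming the subject is
already known to be valid UTF-8; the function name is hypothetical and it
is not part of the PCRE API. It relies only on the fact that UTF-8
continuation bytes have the bit pattern 10xxxxxx.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
 * Returns 1 if offset is the end of the subject or points at a byte that
 * is not a UTF-8 continuation byte; returns 0 otherwise.
 * Assumes subject[0..length-1] is valid UTF-8. */
int is_utf8_char_start(const unsigned char *subject, size_t length,
                       size_t offset)
{
    assert(subject != NULL);
    if (offset > length) return 0;   /* outside the subject */
    if (offset == length) return 1;  /* end of subject is allowed */
    return (subject[offset] & 0xC0) != 0x80;
}
```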
1.1       misho    3355: 
                   3356:          PCRE_PARTIAL_HARD
                   3357:          PCRE_PARTIAL_SOFT
                   3358: 
1.1.1.3   misho    3359:        These options turn on the partial matching feature. For backwards  com-
                   3360:        patibility,  PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
                   3361:        match occurs if the end of the subject string is reached  successfully,
                   3362:        but  there  are not enough subject characters to complete the match. If
1.1       misho    3363:        this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
1.1.1.3   misho    3364:        matching  continues  by  testing any remaining alternatives. Only if no
                   3365:        complete match can be found is PCRE_ERROR_PARTIAL returned  instead  of
                   3366:        PCRE_ERROR_NOMATCH.  In  other  words,  PCRE_PARTIAL_SOFT says that the
                   3367:        caller is prepared to handle a partial match, but only if  no  complete
1.1       misho    3368:        match can be found.
                   3369: 
1.1.1.3   misho    3370:        If  PCRE_PARTIAL_HARD  is  set, it overrides PCRE_PARTIAL_SOFT. In this
                   3371:        case, if a partial match  is  found,  pcre_exec()  immediately  returns
                   3372:        PCRE_ERROR_PARTIAL,  without  considering  any  other  alternatives. In
                   3373:        other words, when PCRE_PARTIAL_HARD is set, a partial match is
1.1       misho    3374:        considered to be more important than an alternative complete match.
                   3375: 
1.1.1.3   misho    3376:        In  both  cases,  the portion of the string that was inspected when the
1.1       misho    3377:        partial match was found is set as the first matching string. There is a
1.1.1.3   misho    3378:        more  detailed  discussion  of partial and multi-segment matching, with
1.1       misho    3379:        examples, in the pcrepartial documentation.
                   3380: 
                   3381:    The string to be matched by pcre_exec()
                   3382: 
1.1.1.3   misho    3383:        The subject string is passed to pcre_exec() as a pointer in subject,  a
1.1.1.4 ! misho    3384:        length  in  length, and a starting offset in startoffset. The units for
        !          3385:        length and startoffset are bytes for the  8-bit  library,  16-bit  data
        !          3386:        items  for  the  16-bit  library,  and 32-bit data items for the 32-bit
        !          3387:        library.
        !          3388: 
        !          3389:        If startoffset is negative or greater than the length of  the  subject,
1.1.1.3   misho    3390:        pcre_exec()  returns  PCRE_ERROR_BADOFFSET. When the starting offset is
                   3391:        zero, the search for a match starts at the beginning  of  the  subject,
1.1.1.4 ! misho    3392:        and  this  is by far the most common case. In UTF-8 or UTF-16 mode, the
        !          3393:        offset must point to the start of a character, or the end of  the  sub-
        !          3394:        ject  (in  UTF-32 mode, one data unit equals one character, so all off-
        !          3395:        sets are valid). Unlike the pattern string,  the  subject  may  contain
        !          3396:        binary zeroes.
        !          3397: 
        !          3398:        A  non-zero  starting offset is useful when searching for another match
        !          3399:        in the same subject by calling pcre_exec() again after a previous  suc-
        !          3400:        cess.   Setting  startoffset differs from just passing over a shortened
        !          3401:        string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins
1.1       misho    3402:        with any kind of lookbehind. For example, consider the pattern
                   3403: 
                   3404:          \Biss\B
                   3405: 
1.1.1.4 ! misho    3406:        which  finds  occurrences  of "iss" in the middle of words. (\B matches
        !          3407:        only if the current position in the subject is not  a  word  boundary.)
        !          3408:        When  applied to the string "Mississippi" the first call to pcre_exec()
        !          3409:        finds the first occurrence. If pcre_exec() is called  again  with  just
        !          3410:        the  remainder  of  the  subject,  namely "issippi", it does not match,
1.1       misho    3411:        because \B is always false at the start of the subject, which is deemed
1.1.1.4 ! misho    3412:        to  be  a  word  boundary. However, if pcre_exec() is passed the entire
1.1       misho    3413:        string again, but with startoffset set to 4, it finds the second occur-
1.1.1.4 ! misho    3414:        rence  of "iss" because it is able to look behind the starting point to
1.1       misho    3415:        discover that it is preceded by a letter.
                   3416: 
1.1.1.4 ! misho    3417:        Finding all the matches in a subject is tricky  when  the  pattern  can
1.1       misho    3418:        match an empty string. It is possible to emulate Perl's /g behaviour by
1.1.1.4 ! misho    3419:        first  trying  the  match  again  at  the   same   offset,   with   the
        !          3420:        PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED  options,  and  then  if that
        !          3421:        fails, advancing the starting  offset  and  trying  an  ordinary  match
1.1       misho    3422:        again. There is some code that demonstrates how to do this in the pcre-
                   3423:        demo sample program. In the most general case, you have to check to see
1.1.1.4 ! misho    3424:        if  the newline convention recognizes CRLF as a newline, and if so, and
1.1       misho    3425:        the current character is CR followed by LF, advance the starting offset
                   3426:        by two characters instead of one.
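
The "advance the starting offset" step described above can be sketched as
follows. This is an illustration under stated assumptions, not code taken
from pcredemo: the function name and the crlf_is_newline flag are
hypothetical, offsets are in bytes (8-bit library), and in UTF-8 mode a
real loop would also have to skip any continuation bytes so that the new
offset lands on the start of a character.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
 * Called after the anchored, not-empty retry at 'offset' has failed.
 * crlf_is_newline is nonzero when the newline convention in force
 * recognizes CRLF (that is, CRLF, ANYCRLF, or ANY).
 * Returns the offset at which to try the next ordinary match. */
size_t advance_after_empty_match(const char *subject, size_t length,
                                 size_t offset, int crlf_is_newline)
{
    assert(offset < length);  /* at end of subject the loop should stop */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;    /* step over the whole CRLF */
    return offset + 1;        /* step over one data unit */
}
```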
                   3427: 
1.1.1.4 ! misho    3428:        If  a  non-zero starting offset is passed when the pattern is anchored,
1.1       misho    3429:        one attempt to match at the given offset is made. This can only succeed
1.1.1.4 ! misho    3430:        if  the  pattern  does  not require the match to be at the start of the
1.1       misho    3431:        subject.
                   3432: 
                   3433:    How pcre_exec() returns captured substrings
                   3434: 
1.1.1.4 ! misho    3435:        In general, a pattern matches a certain portion of the subject, and  in
        !          3436:        addition,  further  substrings  from  the  subject may be picked out by
        !          3437:        parts of the pattern. Following the usage  in  Jeffrey  Friedl's  book,
        !          3438:        this  is  called "capturing" in what follows, and the phrase "capturing
        !          3439:        subpattern" is used for a fragment of a pattern that picks out  a  sub-
        !          3440:        string.  PCRE  supports several other kinds of parenthesized subpattern
1.1       misho    3441:        that do not cause substrings to be captured.
                   3442: 
                   3443:        Captured substrings are returned to the caller via a vector of integers
1.1.1.4 ! misho    3444:        whose  address is passed in ovector. The number of elements in the vec-
        !          3445:        tor is passed in ovecsize, which must be a non-negative  number.  Note:
1.1       misho    3446:        this argument is NOT the size of ovector in bytes.
                   3447: 
1.1.1.4 ! misho    3448:        The  first  two-thirds of the vector is used to pass back captured sub-
        !          3449:        strings, each substring using a pair of integers. The  remaining  third
        !          3450:        of  the  vector is used as workspace by pcre_exec() while matching cap-
        !          3451:        turing subpatterns, and is not available for passing back  information.
        !          3452:        The  number passed in ovecsize should always be a multiple of three. If
1.1       misho    3453:        it is not, it is rounded down.
                   3454: 
1.1.1.4 ! misho    3455:        When a match is successful, information about  captured  substrings  is
        !          3456:        returned  in  pairs  of integers, starting at the beginning of ovector,
        !          3457:        and continuing up to two-thirds of its length at the  most.  The  first
        !          3458:        element  of  each pair is set to the offset of the first character in a
        !          3459:        substring, and the second is set to the offset of the  first  character
        !          3460:        after  the  end  of a substring. These values are always data unit off-
        !          3461:        sets, even in UTF mode. They are byte offsets  in  the  8-bit  library,
        !          3462:        16-bit  data  item  offsets in the 16-bit library, and 32-bit data item
        !          3463:        offsets in the 32-bit library. Note: they are not character counts.
        !          3464: 
        !          3465:        The first pair of integers, ovector[0]  and  ovector[1],  identify  the
        !          3466:        portion  of  the subject string matched by the entire pattern. The next
        !          3467:        pair is used for the first capturing subpattern, and so on.  The  value
1.1       misho    3468:        returned by pcre_exec() is one more than the highest numbered pair that
1.1.1.4 ! misho    3469:        has been set.  For example, if two substrings have been  captured,  the
        !          3470:        returned  value is 3. If there are no capturing subpatterns, the return
1.1       misho    3471:        value from a successful match is 1, indicating that just the first pair
                   3472:        of offsets has been set.
                   3473: 
                   3474:        If a capturing subpattern is matched repeatedly, it is the last portion
                   3475:        of the string that it matched that is returned.
                   3476: 
1.1.1.4 ! misho    3477:        If the vector is too small to hold all the captured substring  offsets,
1.1       misho    3478:        it is used as far as possible (up to two-thirds of its length), and the
1.1.1.4 ! misho    3479:        function returns a value of zero. If neither the actual string  matched
        !          3480:        nor  any captured substrings are of interest, pcre_exec() may be called
        !          3481:        with ovector passed as NULL and ovecsize as zero. However, if the  pat-
        !          3482:        tern  contains  back  references  and  the ovector is not big enough to
        !          3483:        remember the related substrings, PCRE has to get additional memory  for
        !          3484:        use  during matching. Thus it is usually advisable to supply an ovector
1.1       misho    3485:        of reasonable size.
                   3486: 
1.1.1.4 ! misho    3487:        There are some cases where zero is returned  (indicating  vector  over-
        !          3488:        flow)  when  in fact the vector is exactly the right size for the final
1.1       misho    3489:        match. For example, consider the pattern
                   3490: 
                   3491:          (a)(?:(b)c|bd)
                   3492: 
1.1.1.4 ! misho    3493:        If a vector of 6 elements (allowing for only 1 captured  substring)  is
1.1       misho    3494:        given with subject string "abd", pcre_exec() will try to set the second
                   3495:        captured string, thereby recording a vector overflow, before failing to
1.1.1.4 ! misho    3496:        match  "c"  and  backing  up  to  try  the second alternative. The zero
        !          3497:        return, however, does correctly indicate that  the  maximum  number  of
1.1       misho    3498:        slots (namely 2) have been filled. In similar cases where there is tem-
1.1.1.4 ! misho    3499:        porary overflow, but the final number of used slots  is  actually  less
1.1       misho    3500:        than the maximum, a non-zero value is returned.
                   3501: 
                   3502:        The pcre_fullinfo() function can be used to find out how many capturing
1.1.1.4 ! misho    3503:        subpatterns there are in a compiled  pattern.  The  smallest  size  for
        !          3504:        ovector  that  will allow for n captured substrings, in addition to the
1.1       misho    3505:        offsets of the substring matched by the whole pattern, is (n+1)*3.
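The sizing rule above can be written as a one-line helper. This is only a minimal sketch: the name ovector_elems_needed is hypothetical and is not part of PCRE. In a real program the capture count n would typically be obtained from pcre_fullinfo() with PCRE_INFO_CAPTURECOUNT.

```c
#include <assert.h>

/* Hypothetical helper (not a PCRE function): smallest ovector length, in
   ints, that can hold the whole-match pair plus capture_count capture
   pairs. Only the first two-thirds of the vector carry offsets. */
int ovector_elems_needed(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}
```

For a pattern with two capturing subpatterns this gives 9, so a declaration such as "int ovector[9];" would be the minimum.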
                   3506: 
1.1.1.4 ! misho    3507:        It is possible for capturing subpattern number n+1 to match  some  part
1.1       misho    3508:        of the subject when subpattern n has not been used at all. For example,
1.1.1.4 ! misho    3509:        if the string "abc" is matched  against  the  pattern  (a|(z))(bc)  the
1.1       misho    3510:        return from the function is 4, and subpatterns 1 and 3 are matched, but
1.1.1.4 ! misho    3511:        2 is not. When this happens, both values in  the  offset  pairs  corre-
1.1       misho    3512:        sponding to unused subpatterns are set to -1.
                   3513: 
1.1.1.4 ! misho    3514:        Offset  values  that correspond to unused subpatterns at the end of the
        !          3515:        expression are also set to -1. For example,  if  the  string  "abc"  is
        !          3516:        matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
        !          3517:        matched. The return from the function is 2, because  the  highest  used
        !          3518:        capturing  subpattern  number  is  1,  and the offsets for the second
        !          3519:        and third capturing subpatterns (assuming the vector is  large  enough,
1.1       misho    3520:        of course) are set to -1.
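The two cases just described (an unused subpattern in the middle, and unused subpatterns at the end) can be tested in the same way. The sketch below assumes a hypothetical helper, group_is_set, which is not a PCRE function; pair_count stands for the positive value returned by pcre_exec().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function). pair_count is the number of
   offset pairs that pcre_exec() reported as set, that is, its positive
   return value. Group n is unset if it lies beyond that count or if its
   start offset is negative. */
int group_is_set(const int *ovector, int pair_count, int n)
{
    assert(ovector != NULL);
    if (n < 0 || n >= pair_count)
        return 0;
    return ovector[2 * n] >= 0;
}
```

For "abc" matched against (a|(z))(bc), the return value is 4 and the offset pairs are (0,3), (0,1), (-1,-1), (1,3), so the helper reports subpatterns 1 and 3 as set and 2 as unset.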
                   3521: 
1.1.1.4 ! misho    3522:        Note:  Elements  in  the first two-thirds of ovector that do not corre-
        !          3523:        spond to capturing parentheses in the pattern are never  changed.  That
        !          3524:        is,  if  a pattern contains n capturing parentheses, no more than ovec-
        !          3525:        tor[0] to ovector[2n+1] are set by pcre_exec(). The other elements  (in
1.1       misho    3526:        the first two-thirds) retain whatever values they previously had.
                   3527: 
1.1.1.4 ! misho    3528:        Some  convenience  functions  are  provided for extracting the captured
1.1       misho    3529:        substrings as separate strings. These are described below.
                   3530: 
                   3531:    Error return values from pcre_exec()
                   3532: 
1.1.1.4 ! misho    3533:        If pcre_exec() fails, it returns a negative number. The  following  are
1.1       misho    3534:        defined in the header file:
                   3535: 
                   3536:          PCRE_ERROR_NOMATCH        (-1)
                   3537: 
                   3538:        The subject string did not match the pattern.
                   3539: 
                   3540:          PCRE_ERROR_NULL           (-2)
                   3541: 
1.1.1.4 ! misho    3542:        Either  code  or  subject  was  passed as NULL, or ovector was NULL and
1.1       misho    3543:        ovecsize was not zero.
                   3544: 
                   3545:          PCRE_ERROR_BADOPTION      (-3)
                   3546: 
                   3547:        An unrecognized bit was set in the options argument.
                   3548: 
                   3549:          PCRE_ERROR_BADMAGIC       (-4)
                   3550: 
1.1.1.4 ! misho    3551:        PCRE stores a 4-byte "magic number" at the start of the compiled  code,
1.1       misho    3552:        to catch the case when it is passed a junk pointer and to detect when a
                   3553:        pattern that was compiled in an environment of one endianness is run in
1.1.1.4 ! misho    3554:        an  environment  with the other endianness. This is the error that PCRE
1.1       misho    3555:        gives when the magic number is not present.
                   3556: 
                   3557:          PCRE_ERROR_UNKNOWN_OPCODE (-5)
                   3558: 
                   3559:        While running the pattern match, an unknown item was encountered in the
1.1.1.4 ! misho    3560:        compiled  pattern.  This  error  could be caused by a bug in PCRE or by
1.1       misho    3561:        overwriting of the compiled pattern.
                   3562: 
                   3563:          PCRE_ERROR_NOMEMORY       (-6)
                   3564: 
1.1.1.4 ! misho    3565:        If a pattern contains back references, but the ovector that  is  passed
1.1       misho    3566:        to pcre_exec() is not big enough to remember the referenced substrings,
1.1.1.4 ! misho    3567:        PCRE gets a block of memory at the start of matching to  use  for  this
        !          3568:        purpose.  If the call via pcre_malloc() fails, this error is given. The
1.1       misho    3569:        memory is automatically freed at the end of matching.
                   3570: 
1.1.1.4 ! misho    3571:        This error is also given if pcre_stack_malloc() fails  in  pcre_exec().
        !          3572:        This  can happen only when PCRE has been compiled with --disable-stack-
1.1       misho    3573:        for-recursion.
                   3574: 
                   3575:          PCRE_ERROR_NOSUBSTRING    (-7)
                   3576: 
1.1.1.4 ! misho    3577:        This error is used by the pcre_copy_substring(),  pcre_get_substring(),
1.1       misho    3578:        and  pcre_get_substring_list()  functions  (see  below).  It  is  never
                   3579:        returned by pcre_exec().
                   3580: 
                   3581:          PCRE_ERROR_MATCHLIMIT     (-8)
                   3582: 
1.1.1.4 ! misho    3583:        The backtracking limit, as specified by  the  match_limit  field  in  a
        !          3584:        pcre_extra  structure  (or  defaulted) was reached. See the description
1.1       misho    3585:        above.
                   3586: 
                   3587:          PCRE_ERROR_CALLOUT        (-9)
                   3588: 
                   3589:        This error is never generated by pcre_exec() itself. It is provided for
1.1.1.4 ! misho    3590:        use  by  callout functions that want to yield a distinctive error code.
1.1       misho    3591:        See the pcrecallout documentation for details.
                   3592: 
                   3593:          PCRE_ERROR_BADUTF8        (-10)
                   3594: 
1.1.1.4 ! misho    3595:        A string that contains an invalid UTF-8 byte sequence was passed  as  a
        !          3596:        subject,  and the PCRE_NO_UTF8_CHECK option was not set. If the size of
        !          3597:        the output vector (ovecsize) is at least 2,  the  byte  offset  to  the
        !          3598:        start of the invalid UTF-8 character is placed in the  first  element,
        !          3599:        and  a  reason  code  is  placed  in the second element. The reason
1.1       misho    3600:        codes are listed in the following section.  For backward compatibility,
1.1.1.4 ! misho    3601:        if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8  char-
        !          3602:        acter   at   the   end   of   the   subject  (reason  codes  1  to  5),
1.1       misho    3603:        PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
                   3604: 
                   3605:          PCRE_ERROR_BADUTF8_OFFSET (-11)
                   3606: 
1.1.1.4 ! misho    3607:        The UTF-8 byte sequence that was passed as a subject  was  checked  and
        !          3608:        found  to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the
        !          3609:        value of startoffset did not point to the beginning of a UTF-8  charac-
1.1       misho    3610:        ter or the end of the subject.
                   3611: 
                   3612:          PCRE_ERROR_PARTIAL        (-12)
                   3613: 
1.1.1.4 ! misho    3614:        The  subject  string did not match, but it did match partially. See the
1.1       misho    3615:        pcrepartial documentation for details of partial matching.
                   3616: 
                   3617:          PCRE_ERROR_BADPARTIAL     (-13)
                   3618: 
1.1.1.4 ! misho    3619:        This code is no longer in  use.  It  was  formerly  returned  when  the
        !          3620:        PCRE_PARTIAL  option  was used with a compiled pattern containing items
        !          3621:        that were  not  supported  for  partial  matching.  From  release  8.00
1.1       misho    3622:        onwards, there are no restrictions on partial matching.
                   3623: 
                   3624:          PCRE_ERROR_INTERNAL       (-14)
                   3625: 
1.1.1.4 ! misho    3626:        An  unexpected  internal error has occurred. This error could be caused
1.1       misho    3627:        by a bug in PCRE or by overwriting of the compiled pattern.
                   3628: 
                   3629:          PCRE_ERROR_BADCOUNT       (-15)
                   3630: 
                   3631:        This error is given if the value of the ovecsize argument is negative.
                   3632: 
                   3633:          PCRE_ERROR_RECURSIONLIMIT (-21)
                   3634: 
                   3635:        The internal recursion limit, as specified by the match_limit_recursion
1.1.1.4 ! misho    3636:        field  in  a  pcre_extra  structure (or defaulted) was reached. See the
1.1       misho    3637:        description above.
                   3638: 
                   3639:          PCRE_ERROR_BADNEWLINE     (-23)
                   3640: 
                   3641:        An invalid combination of PCRE_NEWLINE_xxx options was given.
                   3642: 
                   3643:          PCRE_ERROR_BADOFFSET      (-24)
                   3644: 
                   3645:        The value of startoffset was negative or greater than the length of the
                   3646:        subject, that is, the value in length.
                   3647: 
                   3648:          PCRE_ERROR_SHORTUTF8      (-25)
                   3649: 
1.1.1.4 ! misho    3650:        This  error  is returned instead of PCRE_ERROR_BADUTF8 when the subject
        !          3651:        string ends with a truncated UTF-8 character and the  PCRE_PARTIAL_HARD
        !          3652:        option  is  set.   Information  about  the  failure  is returned as for
        !          3653:        PCRE_ERROR_BADUTF8. It is in fact sufficient to detect this  case,  but
        !          3654:        this  special error code for PCRE_PARTIAL_HARD precedes the implementa-
        !          3655:        tion of returned information; it is retained for backwards  compatibil-
1.1       misho    3656:        ity.
                   3657: 
                   3658:          PCRE_ERROR_RECURSELOOP    (-26)
                   3659: 
                   3660:        This error is returned when pcre_exec() detects a recursion loop within
1.1.1.4 ! misho    3661:        the pattern. Specifically, it means that either the whole pattern or  a
        !          3662:        subpattern  has been called recursively for the second time at the same
1.1       misho    3663:        position in the subject string. Some simple patterns that might do this
1.1.1.4 ! misho    3664:        are  detected  and faulted at compile time, but more complicated cases,
1.1       misho    3665:        in particular mutual recursions between two different subpatterns, can-
                   3666:        not be detected until run time.
                   3667: 
                   3668:          PCRE_ERROR_JIT_STACKLIMIT (-27)
                   3669: 
1.1.1.4 ! misho    3670:        This  error  is  returned  when a pattern that was successfully studied
        !          3671:        using a JIT compile option is being matched, but the  memory  available
        !          3672:        for  the  just-in-time  processing  stack  is not large enough. See the
1.1.1.3   misho    3673:        pcrejit documentation for more details.
1.1       misho    3674: 
1.1.1.3   misho    3675:          PCRE_ERROR_BADMODE        (-28)
1.1.1.2   misho    3676: 
                   3677:        This error is given if a pattern that was compiled by the 8-bit library
1.1.1.4 ! misho    3678:        is passed to a 16-bit or 32-bit library function, or vice versa.
1.1.1.2   misho    3679: 
1.1.1.3   misho    3680:          PCRE_ERROR_BADENDIANNESS  (-29)
1.1.1.2   misho    3681: 
1.1.1.4 ! misho    3682:        This  error  is  given  if  a  pattern  that  was compiled and saved is
        !          3683:        reloaded on a host with  different  endianness.  The  utility  function
1.1.1.2   misho    3684:        pcre_pattern_to_host_byte_order() can be used to convert such a pattern
                   3685:        so that it runs on the new host.
                   3686: 
1.1.1.4 ! misho    3687:          PCRE_ERROR_JIT_BADOPTION  (-31)
        !          3688: 
        !          3689:        This error is returned when a pattern  that  was  successfully  studied
        !          3690:        using  a  JIT  compile  option  is being matched, but the matching mode
        !          3691:        (partial or complete match) does not correspond to any JIT  compilation
        !          3692:        mode.  When  the JIT fast path function is used, this error may also be
        !          3693:        given for invalid options.  See  the  pcrejit  documentation  for  more
        !          3694:        details.
        !          3695: 
        !          3696:          PCRE_ERROR_BADLENGTH      (-32)
        !          3697: 
        !          3698:        This  error is given if pcre_exec() is called with a negative value for
        !          3699:        the length argument.
        !          3700: 
        !          3701:        Error numbers -16 to -20, -22, and -30 are not used by pcre_exec().
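A caller usually needs to separate four situations: a match, a match with vector overflow, no match (or a partial match), and a genuine error. The sketch below is one way to do that. The type and function names are hypothetical, and the numeric literals are the values listed above; a real program would use the PCRE_ERROR_* names from the header file instead.

```c
/* Hypothetical classification of a pcre_exec() return value. */
typedef enum {
    EXEC_MATCHED,          /* rc > 0: rc offset pairs were set        */
    EXEC_MATCHED_OVERFLOW, /* rc == 0: matched, ovector was too small */
    EXEC_NO_MATCH,         /* -1, PCRE_ERROR_NOMATCH                  */
    EXEC_PARTIAL,          /* -12, PCRE_ERROR_PARTIAL                 */
    EXEC_FAILED            /* any other negative value                */
} exec_outcome;

exec_outcome classify_exec_result(int rc)
{
    if (rc > 0)    return EXEC_MATCHED;
    if (rc == 0)   return EXEC_MATCHED_OVERFLOW;
    if (rc == -1)  return EXEC_NO_MATCH;
    if (rc == -12) return EXEC_PARTIAL;
    return EXEC_FAILED;
}
```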
1.1       misho    3702: 
                   3703:    Reason codes for invalid UTF-8 strings
                   3704: 
1.1.1.4 ! misho    3705:        This section applies only  to  the  8-bit  library.  The  corresponding
        !          3706:        information  for the 16-bit and 32-bit libraries is given in the pcre16
        !          3707:        and pcre32 pages.
1.1.1.2   misho    3708: 
1.1       misho    3709:        When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-
1.1.1.3   misho    3710:        UTF8,  and  the size of the output vector (ovecsize) is at least 2, the
                   3711:        offset of the start of the invalid UTF-8 character  is  placed  in  the
1.1       misho    3712:        first output vector element (ovector[0]) and a reason code is placed in
1.1.1.3   misho    3713:        the second element (ovector[1]). The reason codes are  given  names  in
1.1       misho    3714:        the pcre.h header file:
                   3715: 
                   3716:          PCRE_UTF8_ERR1
                   3717:          PCRE_UTF8_ERR2
                   3718:          PCRE_UTF8_ERR3
                   3719:          PCRE_UTF8_ERR4
                   3720:          PCRE_UTF8_ERR5
                   3721: 
1.1.1.3   misho    3722:        The  string  ends  with a truncated UTF-8 character; the code specifies
                   3723:        how many bytes are missing (1 to 5). Although RFC 3629 restricts  UTF-8
                   3724:        characters  to  be  no longer than 4 bytes, the encoding scheme (origi-
                   3725:        nally defined by RFC 2279) allows for  up  to  6  bytes,  and  this  is
1.1       misho    3726:        checked first; hence the possibility of 4 or 5 missing bytes.
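The "missing bytes" count follows from the lead byte alone. The sketch below illustrates the arithmetic under the RFC 2279 scheme; both function names are hypothetical, and this is not PCRE's own validation code.

```c
/* Hypothetical helper (not a PCRE function): sequence length implied by a
   lead byte under the RFC 2279 scheme (1 to 6 bytes), or 0 if the byte
   cannot start a character. */
int utf8_expected_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead < 0xc0) return 0;   /* 0b10xxxxxx: continuation byte */
    if (lead < 0xe0) return 2;
    if (lead < 0xf0) return 3;
    if (lead < 0xf8) return 4;
    if (lead < 0xfc) return 5;
    if (lead < 0xfe) return 6;
    return 0;                    /* 0xfe and 0xff are never valid */
}

/* Number of bytes missing when only `available` bytes remain in the
   subject, counting from the lead byte; 0 if nothing is missing. */
int utf8_missing_bytes(unsigned char lead, int available)
{
    int need = utf8_expected_len(lead);
    return (need > available) ? need - available : 0;
}
```

A subject that ends with the single byte 0xe2 (the start of a 3-byte character) is missing 2 bytes; one that ends with 0xfc alone is missing 5.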
                   3727: 
                   3728:          PCRE_UTF8_ERR6
                   3729:          PCRE_UTF8_ERR7
                   3730:          PCRE_UTF8_ERR8
                   3731:          PCRE_UTF8_ERR9
                   3732:          PCRE_UTF8_ERR10
                   3733: 
                   3734:        The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
1.1.1.3   misho    3735:        the character do not have the binary value 0b10 (that  is,  either  the
1.1       misho    3736:        most significant bit is 0, or the next bit is 1).
                   3737: 
                   3738:          PCRE_UTF8_ERR11
                   3739:          PCRE_UTF8_ERR12
                   3740: 
1.1.1.3   misho    3741:        A  character that is valid by the RFC 2279 rules is either 5 or 6 bytes
1.1       misho    3742:        long; these code points are excluded by RFC 3629.
                   3743: 
                   3744:          PCRE_UTF8_ERR13
                   3745: 
1.1.1.3   misho    3746:        A 4-byte character has a value greater than 0x10ffff; these code points
1.1       misho    3747:        are excluded by RFC 3629.
                   3748: 
                   3749:          PCRE_UTF8_ERR14
                   3750: 
1.1.1.3   misho    3751:        A  3-byte  character  has  a  value in the range 0xd800 to 0xdfff; this
                   3752:        range of code points is reserved by RFC 3629 for use with UTF-16,  and
1.1       misho    3753:        so are excluded from UTF-8.
                   3754: 
                   3755:          PCRE_UTF8_ERR15
                   3756:          PCRE_UTF8_ERR16
                   3757:          PCRE_UTF8_ERR17
                   3758:          PCRE_UTF8_ERR18
                   3759:          PCRE_UTF8_ERR19
                   3760: 
1.1.1.3   misho    3761:        A  2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
                   3762:        for a value that can be represented by fewer bytes, which  is  invalid.
                   3763:        For  example,  the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
1.1       misho    3764:        rect coding uses just one byte.
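The overlong example can be checked by hand: a 2-byte sequence carries the low 5 bits of its first byte followed by the low 6 bits of its second. The sketch below works this through for the 2-byte case only; the function names are hypothetical and are not part of PCRE.

```c
/* Hypothetical helper (not a PCRE function): decode a 2-byte UTF-8
   sequence without validating it. */
unsigned int utf8_decode2(unsigned char b0, unsigned char b1)
{
    return ((unsigned int)(b0 & 0x1f) << 6) | (unsigned int)(b1 & 0x3f);
}

/* A 2-byte sequence is overlong when its value would fit in one byte,
   that is, when it is less than 0x80. */
int utf8_is_overlong2(unsigned char b0, unsigned char b1)
{
    return utf8_decode2(b0, b1) < 0x80;
}
```

With the bytes 0xc0, 0xae the decoded value is 0x2e, which is below 0x80 and therefore overlong. The bytes 0xc3, 0xa9 decode to 0xe9, which genuinely needs two bytes.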
                   3765: 
                   3766:          PCRE_UTF8_ERR20
                   3767: 
                   3768:        The two most significant bits of the first byte of a character have the
1.1.1.3   misho    3769:        binary  value 0b10 (that is, the most significant bit is 1 and the sec-
                   3770:        ond is 0). Such a byte can only validly occur as the second  or  subse-
1.1       misho    3771:        quent byte of a multi-byte character.
                   3772: 
                   3773:          PCRE_UTF8_ERR21
                   3774: 
1.1.1.3   misho    3775:        The  first byte of a character has the value 0xfe or 0xff. These values
1.1       misho    3776:        can never occur in a valid UTF-8 string.
                   3777: 
1.1.1.4 ! misho    3778:          PCRE_UTF8_ERR22
        !          3779: 
        !          3780:        This error code was formerly used when  the  presence  of  a  so-called
        !          3781:        "non-character"  caused an error. Unicode corrigendum #9 makes it clear
        !          3782:        that such characters should not cause a string to be rejected,  and  so
        !          3783:        this code is no longer in use and is never returned.
        !          3784: 
1.1       misho    3785: 
                   3786: EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
                   3787: 
                   3788:        int pcre_copy_substring(const char *subject, int *ovector,
                   3789:             int stringcount, int stringnumber, char *buffer,
                   3790:             int buffersize);
                   3791: 
                   3792:        int pcre_get_substring(const char *subject, int *ovector,
                   3793:             int stringcount, int stringnumber,
                   3794:             const char **stringptr);
                   3795: 
                   3796:        int pcre_get_substring_list(const char *subject,
                   3797:             int *ovector, int stringcount, const char ***listptr);
                   3798: 
1.1.1.4 ! misho    3799:        Captured  substrings  can  be  accessed  directly  by using the offsets
        !          3800:        returned by pcre_exec() in  ovector.  For  convenience,  the  functions
1.1       misho    3801:        pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
1.1.1.4 ! misho    3802:        string_list() are provided for extracting captured substrings  as  new,
        !          3803:        separate,  zero-terminated strings. These functions identify substrings
        !          3804:        by number. The next section describes functions  for  extracting  named
1.1       misho    3805:        substrings.
                   3806: 
1.1.1.4 ! misho    3807:        A  substring that contains a binary zero is correctly extracted and has
        !          3808:        a further zero added on the end, but the result is not, of course, a  C
        !          3809:        string.   However,  you  can  process such a string by referring to the
        !          3810:        length that is  returned  by  pcre_copy_substring()  and  pcre_get_sub-
1.1       misho    3811:        string().  Unfortunately, the interface to pcre_get_substring_list() is
1.1.1.4 ! misho    3812:        not adequate for handling strings containing binary zeros, because  the
1.1       misho    3813:        end of the final string is not independently indicated.
                   3814: 
1.1.1.4 ! misho    3815:        The  first  three  arguments  are the same for all three of these func-
        !          3816:        tions: subject is the subject string that has  just  been  successfully
1.1       misho    3817:        matched, ovector is a pointer to the vector of integer offsets that was
                   3818:        passed to pcre_exec(), and stringcount is the number of substrings that
1.1.1.4 ! misho    3819:        were  captured  by  the match, including the substring that matched the
1.1       misho    3820:        entire regular expression. This is the value returned by pcre_exec() if
1.1.1.4 ! misho    3821:        it  is greater than zero. If pcre_exec() returned zero, indicating that
        !          3822:        it ran out of space in ovector, the value passed as stringcount  should
1.1       misho    3823:        be the number of elements in the vector divided by three.
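The rule for choosing stringcount can be captured in a small helper. This is a sketch only: stringcount_from_exec is a hypothetical name, not a PCRE function, and returning 0 for a negative rc is an assumption made here to signal that there is nothing to extract.

```c
/* Hypothetical helper (not a PCRE function): value to pass as stringcount,
   given the pcre_exec() return value rc and the ovecsize that was passed
   to it. */
int stringcount_from_exec(int rc, int ovecsize)
{
    if (rc > 0)  return rc;            /* rc pairs were set              */
    if (rc == 0) return ovecsize / 3;  /* overflow: every pair was used  */
    return 0;                          /* error or no match (assumption) */
}
```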
                   3824: 
1.1.1.4 ! misho    3825:        The  functions pcre_copy_substring() and pcre_get_substring() extract a
        !          3826:        single substring, whose number is given as  stringnumber.  A  value  of
        !          3827:        zero  extracts  the  substring that matched the entire pattern, whereas
        !          3828:        higher values  extract  the  captured  substrings.  For  pcre_copy_sub-
        !          3829:        string(),  the  string  is  placed  in buffer, whose length is given by
        !          3830:        buffersize, while for pcre_get_substring() a new  block  of  memory  is
        !          3831:        obtained  via  pcre_malloc,  and its address is returned via stringptr.
        !          3832:        The yield of the function is the length of the  string,  not  including
1.1       misho    3833:        the terminating zero, or one of these error codes:
                   3834: 
                   3835:          PCRE_ERROR_NOMEMORY       (-6)
                   3836: 
1.1.1.4 ! misho    3837:        The  buffer  was too small for pcre_copy_substring(), or the attempt to
1.1       misho    3838:        get memory failed for pcre_get_substring().
                   3839: 
                   3840:          PCRE_ERROR_NOSUBSTRING    (-7)
                   3841: 
                   3842:        There is no substring whose number is stringnumber.
                   3843: 
1.1.1.4 ! misho    3844:        The pcre_get_substring_list()  function  extracts  all  available  sub-
        !          3845:        strings  and  builds  a list of pointers to them. All this is done in a
1.1       misho    3846:        single block of memory that is obtained via pcre_malloc. The address of
1.1.1.4 ! misho    3847:        the  memory  block  is returned via listptr, which is also the start of
        !          3848:        the list of string pointers. The end of the list is marked  by  a  NULL
        !          3849:        pointer.  The  yield  of  the function is zero if all went well, or the
1.1       misho    3850:        error code
                   3851: 
                   3852:          PCRE_ERROR_NOMEMORY       (-6)
                   3853: 
                   3854:        if the attempt to get the memory block failed.
                   3855: 
1.1.1.4 ! misho    3856:        When any of these functions encounters a substring that is unset, which
        !          3857:        can  happen  when  capturing subpattern number n+1 matches some part of
        !          3858:        the subject, but subpattern n has not been used at all, they return  an
1.1       misho    3859:        empty string. This can be distinguished from a genuine zero-length sub-
1.1.1.4 ! misho    3860:        string by inspecting the appropriate offset in ovector, which is  nega-
1.1       misho    3861:        tive for unset substrings.
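Direct access through the offsets, including the unset-versus-empty distinction, might look like the sketch below. It is a hand-written illustration of the behaviour described in this section, not the library's implementation of pcre_copy_substring(); the function name and the SKETCH_* constants are hypothetical stand-ins for the documented values -6 and -7.

```c
#include <string.h>

enum { SKETCH_NOMEMORY = -6, SKETCH_NOSUBSTRING = -7 };

/* Copy capture group n into buffer as a zero-terminated string. Returns
   the length of the string, or a negative code on failure. An unset group
   yields an empty string; callers can tell it apart from a genuine empty
   match by checking ovector[2*n] < 0. */
int copy_group(const char *subject, const int *ovector, int stringcount,
               int n, char *buffer, int buffersize)
{
    int start, len;

    if (n < 0 || n >= stringcount)
        return SKETCH_NOSUBSTRING;
    start = ovector[2 * n];
    len = (start < 0) ? 0 : ovector[2 * n + 1] - start;
    if (buffersize < len + 1)
        return SKETCH_NOMEMORY;
    if (len > 0)
        memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```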
                   3862: 
1.1.1.4 ! misho    3863:        The  two convenience functions pcre_free_substring() and pcre_free_sub-
        !          3864:        string_list() can be used to free the memory  returned  by  a  previous
1.1       misho    3865:        call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
1.1.1.4 ! misho    3866:        tively. They do nothing more than  call  the  function  pointed  to  by
        !          3867:        pcre_free,  which  of course could be called directly from a C program.
        !          3868:        However, PCRE is used in some situations where it is linked via a  spe-
        !          3869:        cial   interface  to  another  programming  language  that  cannot  use
        !          3870:        pcre_free directly; it is for these cases that the functions  are  pro-
1.1       misho    3871:        vided.
                   3872: 
                   3873: 
                   3874: EXTRACTING CAPTURED SUBSTRINGS BY NAME
                   3875: 
                   3876:        int pcre_get_stringnumber(const pcre *code,
                   3877:             const char *name);
                   3878: 
                   3879:        int pcre_copy_named_substring(const pcre *code,
                   3880:             const char *subject, int *ovector,
                   3881:             int stringcount, const char *stringname,
                   3882:             char *buffer, int buffersize);
                   3883: 
                   3884:        int pcre_get_named_substring(const pcre *code,
                   3885:             const char *subject, int *ovector,
                   3886:             int stringcount, const char *stringname,
                   3887:             const char **stringptr);
                   3888: 
1.1.1.4 ! misho    3889:        To extract a substring by name, you first have to find  the  associated
1.1       misho    3890:        number. For example, for this pattern
                   3891: 
                   3892:          (a+)b(?<xxx>\d+)...
                   3893: 
                   3894:        the number of the subpattern called "xxx" is 2. If the name is known to
                   3895:        be unique (PCRE_DUPNAMES was not set), you can find the number from the
                   3896:        name by calling pcre_get_stringnumber(). The first argument is the com-
                   3897:        piled pattern, and the second is the name. The yield of the function is
1.1.1.4 ! misho    3898:        the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if  there  is  no
1.1       misho    3899:        subpattern of that name.
                   3900: 
                   3901:        Given the number, you can extract the substring directly, or use one of
                   3902:        the functions described in the previous section. For convenience, there
                   3903:        are also two functions that do the whole job.
                   3904: 
1.1.1.4 ! misho    3905:        Most    of    the    arguments   of   pcre_copy_named_substring()   and
        !          3906:        pcre_get_named_substring() are the same  as  those  for  the  similarly
        !          3907:        named  functions  that extract by number. As these are described in the
        !          3908:        previous section, they are not re-described here. There  are  just  two
1.1       misho    3909:        differences:
                   3910: 
1.1.1.4 ! misho    3911:        First,  instead  of a substring number, a substring name is given. Sec-
1.1       misho    3912:        ond, there is an extra argument, given at the start, which is a pointer
1.1.1.4 ! misho    3913:        to  the compiled pattern. This is needed in order to gain access to the
1.1       misho    3914:        name-to-number translation table.
                   3915: 
1.1.1.4 ! misho    3916:        These functions call pcre_get_stringnumber(), and if it succeeds,  they
        !          3917:        then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-
        !          3918:        ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate  names,  the
1.1       misho    3919:        behaviour may not be what you want (see the next section).
                   3920: 
                   3921:        Warning: If the pattern uses the (?| feature to set up multiple subpat-
1.1.1.4 ! misho    3922:        terns with the same number, as described in the  section  on  duplicate
        !          3923:        subpattern  numbers  in  the  pcrepattern page, you cannot use names to
        !          3924:        distinguish the different subpatterns, because names are  not  included
        !          3925:        in  the compiled code. The matching process uses only numbers. For this
        !          3926:        reason, the use of different names for subpatterns of the  same  number
1.1       misho    3927:        causes an error at compile time.
                   3928: 
                   3929: 
                   3930: DUPLICATE SUBPATTERN NAMES
                   3931: 
                   3932:        int pcre_get_stringtable_entries(const pcre *code,
                   3933:             const char *name, char **first, char **last);
                   3934: 
1.1.1.4 ! misho    3935:        When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for
        !          3936:        subpatterns are not required to be unique. (Duplicate names are  always
        !          3937:        allowed  for subpatterns with the same number, created by using the (?|
        !          3938:        feature. Indeed, if such subpatterns are named, they  are  required  to
1.1       misho    3939:        use the same names.)
                   3940: 
                   3941:        Normally, patterns with duplicate names are such that in any one match,
1.1.1.4 ! misho    3942:        only one of the named subpatterns participates. An example is shown  in
1.1       misho    3943:        the pcrepattern documentation.
                   3944: 
1.1.1.4 ! misho    3945:        When    duplicates   are   present,   pcre_copy_named_substring()   and
        !          3946:        pcre_get_named_substring() return the first substring corresponding  to
        !          3947:        the  given  name  that  is set. If none are set, PCRE_ERROR_NOSUBSTRING
        !          3948:        (-7) is returned; no  data  is  returned.  The  pcre_get_stringnumber()
        !          3949:        function  returns one of the numbers that are associated with the name,
1.1       misho    3950:        but it is not defined which it is.
                   3951: 
1.1.1.4 ! misho    3952:        If you want to get full details of all captured substrings for a  given
        !          3953:        name,  you  must  use  the pcre_get_stringtable_entries() function. The
1.1       misho    3954:        first argument is the compiled pattern, and the second is the name. The
1.1.1.4 ! misho    3955:        third  and  fourth  are  pointers to variables which are updated by the
1.1       misho    3956:        function. After it has run, they point to the first and last entries in
1.1.1.4 ! misho    3957:        the  name-to-number  table  for  the  given  name.  The function itself
        !          3958:        returns the length of each entry,  or  PCRE_ERROR_NOSUBSTRING  (-7)  if
        !          3959:        there  are none. The format of the table is described above in the sec-
        !          3960:        tion entitled "Information about a pattern". Given  all  the  relevant
        !          3961:        entries  for  the  name, you can extract each of their numbers, and
1.1       misho    3962:        hence the captured data, if any.
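
       The table walk described above can be sketched as follows. This is a
       minimal illustration, assuming the 8-bit library, where each entry
       holds a two-byte group number (most significant byte first) followed
       by the zero-terminated name. The helper name collect_group_numbers()
       is hypothetical; in real code, first, last, and the entry size would
       come from pcre_get_stringtable_entries().

```c
#include <assert.h>

/* Hypothetical helper: walk the name-table entries from first to last
   (inclusive) and store each entry's group number in numbers[].
   entry_size is the value returned by pcre_get_stringtable_entries().
   Returns the number of group numbers stored (at most max). */
int collect_group_numbers(const char *first, const char *last,
                          int entry_size, int *numbers, int max)
{
    int count = 0;
    const char *entry;

    assert(entry_size > 2);   /* two number bytes plus at least a NUL */

    for (entry = first; entry <= last && count < max; entry += entry_size) {
        const unsigned char *e = (const unsigned char *)entry;
        /* The group number is stored most significant byte first. */
        numbers[count++] = (e[0] << 8) | e[1];
    }
    return count;
}
```

       Each number n obtained this way indexes the pair ovector[2*n] and
       ovector[2*n+1] after a successful pcre_exec() call; a group that did
       not participate in the match has both offsets set to -1.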
                   3963: 
                   3964: 
                   3965: FINDING ALL POSSIBLE MATCHES
                   3966: 
1.1.1.4 ! misho    3967:        The traditional matching function uses a  similar  algorithm  to  Perl,
1.1       misho    3968:        which stops when it finds the first match, starting at a given point in
1.1.1.4 ! misho    3969:        the subject. If you want to find all possible matches, or  the  longest
        !          3970:        possible  match,  consider using the alternative matching function (see
        !          3971:        below) instead. If you cannot use the alternative function,  but  still
        !          3972:        need  to  find all possible matches, you can kludge it up by making use
1.1       misho    3973:        of the callout facility, which is described in the pcrecallout documen-
                   3974:        tation.
                   3975: 
                   3976:        What you have to do is to insert a callout right at the end of the pat-
1.1.1.4 ! misho    3977:        tern.  When your callout function is called, extract and save the  cur-
        !          3978:        rent  matched  substring.  Then  return  1, which forces pcre_exec() to
        !          3979:        backtrack and try other alternatives. Ultimately, when it runs  out  of
1.1       misho    3980:        matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
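
       The "extract and save" step might look like the sketch below. The
       helper copy_current_match() is hypothetical; in a real callout
       function its first three arguments would be read from the subject,
       start_match, and current_position fields of the callout block, which
       are described in the pcrecallout documentation.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: copy subject[start_match .. current_position)
   into buf as a zero-terminated string. Returns the length copied, or
   -1 if the range is empty-negative or buf is too small. */
int copy_current_match(const char *subject, int start_match,
                       int current_position, char *buf, size_t bufsize)
{
    int length = current_position - start_match;

    assert(subject != NULL && buf != NULL);

    if (length < 0 || (size_t)length >= bufsize)
        return -1;
    memcpy(buf, subject + start_match, (size_t)length);
    buf[length] = '\0';
    return length;
}
```

       After saving the string, the callout function would return 1 so that
       pcre_exec() backtracks and tries the next alternative.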
                   3981: 
                   3982: 
1.1.1.2   misho    3983: OBTAINING AN ESTIMATE OF STACK USAGE
                   3984: 
1.1.1.4 ! misho    3985:        Matching  certain  patterns  using pcre_exec() can use a lot of process
        !          3986:        stack, which in certain environments can be  rather  limited  in  size.
        !          3987:        Some  users  find it helpful to have an estimate of the amount of stack
        !          3988:        that is used by pcre_exec(), to help  them  set  recursion  limits,  as
        !          3989:        described  in  the pcrestack documentation. The estimate that is output
1.1.1.2   misho    3990:        by pcretest when called with the -m and -C options is obtained by call-
1.1.1.4 ! misho    3991:        ing  pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its
1.1.1.2   misho    3992:        first five arguments.
                   3993: 
1.1.1.4 ! misho    3994:        Normally, if  its  first  argument  is  NULL,  pcre_exec()  immediately
        !          3995:        returns  the negative error code PCRE_ERROR_NULL, but with this special
        !          3996:        combination of arguments, it returns instead a  negative  number  whose
        !          3997:        absolute  value  is the approximate stack frame size in bytes. (A nega-
        !          3998:        tive number is used so that it is clear that no  match  has  happened.)
        !          3999:        The  value  is  approximate  because  in some cases, recursive calls to
1.1.1.2   misho    4000:        pcre_exec() occur when there are one or two additional variables on the
                   4001:        stack.
                   4002: 
1.1.1.4 ! misho    4003:        If  PCRE  has  been  compiled  to use the heap instead of the stack for
        !          4004:        recursion, the value returned  is  the  size  of  each  block  that  is
1.1.1.2   misho    4005:        obtained from the heap.
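
       One way to turn the estimate into a recursion limit is sketched below.
       The helper max_recursion_depth() and the idea of a fixed stack budget
       are assumptions made for illustration; the frame size is approximate,
       so a real application should leave generous headroom before using the
       result, for example as a value for match_limit_recursion.

```c
#include <assert.h>

/* Hypothetical helper: rc is the negative value returned by the special
   pcre_exec() call described above; stack_budget is the number of bytes
   of stack the application is prepared to let matching use. Returns the
   approximate number of recursion levels that fit in that budget. */
long max_recursion_depth(int rc, long stack_budget)
{
    long frame_size;

    assert(rc < 0);            /* the estimate is always negative */
    frame_size = -(long)rc;    /* approximate bytes per recursion */
    return stack_budget / frame_size;
}
```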
                   4006: 
                   4007: 
1.1       misho    4008: MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
                   4009: 
                   4010:        int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
                   4011:             const char *subject, int length, int startoffset,
                   4012:             int options, int *ovector, int ovecsize,
                   4013:             int *workspace, int wscount);
                   4014: 
1.1.1.4 ! misho    4015:        The  function  pcre_dfa_exec()  is  called  to  match  a subject string
        !          4016:        against a compiled pattern, using a matching algorithm that  scans  the
        !          4017:        subject  string  just  once, and does not backtrack. This has different
        !          4018:        characteristics to the normal algorithm, and  is  not  compatible  with
        !          4019:        Perl.  Some  of the features of PCRE patterns are not supported. Never-
        !          4020:        theless, there are times when this kind of matching can be useful.  For
        !          4021:        a  discussion  of  the  two matching algorithms, and a list of features
        !          4022:        that pcre_dfa_exec() does not support, see the pcrematching  documenta-
1.1       misho    4023:        tion.
                   4024: 
1.1.1.4 ! misho    4025:        The  arguments  for  the  pcre_dfa_exec()  function are the same as for
1.1       misho    4026:        pcre_exec(), plus two extras. The ovector argument is used in a differ-
1.1.1.4 ! misho    4027:        ent  way,  and  this is described below. The other common arguments are
        !          4028:        used in the same way as for pcre_exec(), so their  description  is  not
1.1       misho    4029:        repeated here.
                   4030: 
1.1.1.4 ! misho    4031:        The  two  additional  arguments provide workspace for the function. The
        !          4032:        workspace vector should contain at least 20 elements. It  is  used  for
1.1       misho    4033:        keeping  track  of  multiple  paths  through  the  pattern  tree.  More
1.1.1.4 ! misho    4034:        workspace will be needed for patterns and subjects where  there  are  a
1.1       misho    4035:        lot of potential matches.
                   4036: 
                   4037:        Here is an example of a simple call to pcre_dfa_exec():
                   4038: 
                   4039:          int rc;
                   4040:          int ovector[10];
                   4041:          int wspace[20];
                   4042:          rc = pcre_dfa_exec(
                   4043:            re,             /* result of pcre_compile() */
                   4044:            NULL,           /* we didn't study the pattern */
                   4045:            "some string",  /* the subject string */
                   4046:            11,             /* the length of the subject string */
                   4047:            0,              /* start at offset 0 in the subject */
                   4048:            0,              /* default options */
                   4049:            ovector,        /* vector of integers for substring information */
                   4050:            10,             /* number of elements (NOT size in bytes) */
                   4051:            wspace,         /* working space vector */
                   4052:            20);            /* number of elements (NOT size in bytes) */
                   4053: 
                   4054:    Option bits for pcre_dfa_exec()
                   4055: 
1.1.1.4 ! misho    4056:        The  unused  bits  of  the options argument for pcre_dfa_exec() must be
        !          4057:        zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-
1.1       misho    4058:        LINE_xxx,        PCRE_NOTBOL,        PCRE_NOTEOL,        PCRE_NOTEMPTY,
1.1.1.4 ! misho    4059:        PCRE_NOTEMPTY_ATSTART,      PCRE_NO_UTF8_CHECK,       PCRE_BSR_ANYCRLF,
        !          4060:        PCRE_BSR_UNICODE,  PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-
        !          4061:        TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART.  All but  the  last
        !          4062:        four  of  these  are  exactly  the  same  as  for pcre_exec(), so their
1.1       misho    4063:        description is not repeated here.
                   4064: 
                   4065:          PCRE_PARTIAL_HARD
                   4066:          PCRE_PARTIAL_SOFT
                   4067: 
1.1.1.4 ! misho    4068:        These have the same general effect as they do for pcre_exec(), but  the
        !          4069:        details  are  slightly  different.  When  PCRE_PARTIAL_HARD  is set for
        !          4070:        pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of  the  sub-
        !          4071:        ject  is  reached  and there is still at least one matching possibility
1.1       misho    4072:        that requires additional characters. This happens even if some complete
                   4073:        matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
                   4074:        code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
1.1.1.4 ! misho    4075:        of  the  subject  is  reached, there have been no complete matches, but
        !          4076:        there is still at least one matching possibility. The  portion  of  the
        !          4077:        string  that  was inspected when the longest partial match was found is
        !          4078:        set as the first matching string  in  both  cases.   There  is  a  more
        !          4079:        detailed  discussion  of partial and multi-segment matching, with exam-
1.1       misho    4080:        ples, in the pcrepartial documentation.
                   4081: 
                   4082:          PCRE_DFA_SHORTEST
                   4083: 
1.1.1.4 ! misho    4084:        Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to
1.1       misho    4085:        stop as soon as it has found one match. Because of the way the alterna-
1.1.1.4 ! misho    4086:        tive algorithm works, this is necessarily the shortest  possible  match
1.1       misho    4087:        at the first possible matching point in the subject string.
                   4088: 
                   4089:          PCRE_DFA_RESTART
                   4090: 
                   4091:        When pcre_dfa_exec() returns a partial match, it is possible to call it
1.1.1.4 ! misho    4092:        again, with additional subject characters, and have  it  continue  with
        !          4093:        the  same match. The PCRE_DFA_RESTART option requests this action; when
        !          4094:        it is set, the workspace and wscount arguments must reference the  same
        !          4095:        vector  as  before  because data about the match so far is left in them
1.1       misho    4096:        after a partial match. There is more discussion of this facility in the
                   4097:        pcrepartial documentation.
                   4098: 
                   4099:    Successful returns from pcre_dfa_exec()
                   4100: 
1.1.1.4 ! misho    4101:        When  pcre_dfa_exec()  succeeds, it may have matched more than one sub-
1.1       misho    4102:        string in the subject. Note, however, that all the matches from one run
1.1.1.4 ! misho    4103:        of  the  function  start  at the same point in the subject. The shorter
        !          4104:        matches are all initial substrings of the longer matches. For  example,
1.1       misho    4105:        if the pattern
                   4106: 
                   4107:          <.*>
                   4108: 
                   4109:        is matched against the string
                   4110: 
                   4111:          This is <something> <something else> <something further> no more
                   4112: 
                   4113:        the three matched strings are
                   4114: 
                   4115:          <something>
                   4116:          <something> <something else>
                   4117:          <something> <something else> <something further>
                   4118: 
1.1.1.4 ! misho    4119:        On  success,  the  yield of the function is a number greater than zero,
        !          4120:        which is the number of matched substrings.  The  substrings  themselves
        !          4121:        are  returned  in  ovector. Each string uses two elements; the first is
        !          4122:        the offset to the start, and the second is the offset to  the  end.  In
        !          4123:        fact,  all  the  strings  have the same start offset. (Space could have
        !          4124:        been saved by giving this only once, but it was decided to retain  some
        !          4125:        compatibility  with  the  way pcre_exec() returns data, even though the
1.1       misho    4126:        meaning of the strings is different.)
                   4127: 
                   4128:        The strings are returned in reverse order of length; that is, the long-
1.1.1.4 ! misho    4129:        est  matching  string is given first. If there were too many matches to
        !          4130:        fit into ovector, the yield of the function is zero, and the vector  is
        !          4131:        filled  with  the  longest matches. Unlike pcre_exec(), pcre_dfa_exec()
1.1       misho    4132:        can use the entire ovector for returning matched strings.
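
       The layout of the returned data can be illustrated with the following
       sketch. Both helper names are hypothetical. It assumes that rc is a
       non-negative return value from pcre_dfa_exec() and that ovecsize is
       the value that was passed to that call.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: number of (start, end) pairs in ovector that
   hold matches. A return of zero from pcre_dfa_exec() means that the
   vector was filled completely, so every pair is in use. */
int dfa_match_count(int rc, int ovecsize)
{
    assert(rc >= 0);
    return (rc == 0) ? ovecsize / 2 : rc;
}

/* Hypothetical helper: print each matched string, longest first. */
void print_dfa_matches(const char *subject, const int *ovector, int count)
{
    int i;
    for (i = 0; i < count; i++) {
        int start = ovector[2 * i];       /* same for every match */
        int end = ovector[2 * i + 1];
        printf("%2d: %.*s\n", i, end - start, subject + start);
    }
}
```

       For the <.*> example above, the start offset of every pair is 8 and
       the end offsets are 56, 36, and 19, in that order.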
                   4133: 
                   4134:    Error returns from pcre_dfa_exec()
                   4135: 
1.1.1.4 ! misho    4136:        The pcre_dfa_exec() function returns a negative number when  it  fails.
        !          4137:        Many  of  the  errors  are  the  same as for pcre_exec(), and these are
        !          4138:        described above.  There are in addition the following errors  that  are
1.1       misho    4139:        specific to pcre_dfa_exec():
                   4140: 
                   4141:          PCRE_ERROR_DFA_UITEM      (-16)
                   4142: 
1.1.1.4 ! misho    4143:        This  return is given if pcre_dfa_exec() encounters an item in the pat-
        !          4144:        tern that it does not support, for instance, the use of \C  or  a  back
1.1       misho    4145:        reference.
                   4146: 
                   4147:          PCRE_ERROR_DFA_UCOND      (-17)
                   4148: 
1.1.1.4 ! misho    4149:        This  return  is  given  if pcre_dfa_exec() encounters a condition item
        !          4150:        that uses a back reference for the condition, or a test  for  recursion
1.1       misho    4151:        in a specific group. These are not supported.
                   4152: 
                   4153:          PCRE_ERROR_DFA_UMLIMIT    (-18)
                   4154: 
1.1.1.4 ! misho    4155:        This  return  is given if pcre_dfa_exec() is called with an extra block
        !          4156:        that contains a setting of  the  match_limit  or  match_limit_recursion
        !          4157:        fields.  This  is  not  supported (these fields are meaningless for DFA
1.1       misho    4158:        matching).
                   4159: 
                   4160:          PCRE_ERROR_DFA_WSSIZE     (-19)
                   4161: 
1.1.1.4 ! misho    4162:        This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the
1.1       misho    4163:        workspace vector.
                   4164: 
                   4165:          PCRE_ERROR_DFA_RECURSE    (-20)
                   4166: 
1.1.1.4 ! misho    4167:        When  a  recursive subpattern is processed, the matching function calls
        !          4168:        itself recursively, using private vectors for  ovector  and  workspace.
        !          4169:        This  error  is  given  if  the output vector is not large enough. This
1.1       misho    4170:        should be extremely rare, as a vector of size 1000 is used.
                   4171: 
1.1.1.3   misho    4172:          PCRE_ERROR_DFA_BADRESTART (-30)
                   4173: 
1.1.1.4 ! misho    4174:        When pcre_dfa_exec() is called with the PCRE_DFA_RESTART  option,  some
        !          4175:        plausibility  checks  are  made on the contents of the workspace, which
        !          4176:        should contain data about the previous partial match. If any  of  these
1.1.1.3   misho    4177:        checks fail, this error is given.
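
       For diagnostics, the codes listed above can be mapped to their names,
       as in the sketch below. The helper dfa_error_name() is hypothetical,
       and the numeric values are copied from this section only to keep the
       example self-contained; real code should use the macros from pcre.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the name of a pcre_dfa_exec()-specific
   error code, or NULL if rc is not one of the codes listed above. */
const char *dfa_error_name(int rc)
{
    assert(rc < 0);   /* only failures have names here */

    switch (rc) {
    case -16: return "PCRE_ERROR_DFA_UITEM";
    case -17: return "PCRE_ERROR_DFA_UCOND";
    case -18: return "PCRE_ERROR_DFA_UMLIMIT";
    case -19: return "PCRE_ERROR_DFA_WSSIZE";
    case -20: return "PCRE_ERROR_DFA_RECURSE";
    case -30: return "PCRE_ERROR_DFA_BADRESTART";
    default:  return NULL;
    }
}
```

       A caller might treat PCRE_ERROR_DFA_WSSIZE differently from the
       others, for instance by retrying with a larger workspace vector.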
                   4178: 
1.1       misho    4179: 
                   4180: SEE ALSO
                   4181: 
1.1.1.4 ! misho    4182:        pcre16(3),   pcre32(3),   pcrebuild(3),   pcrecallout(3),   pcrecpp(3),
        !          4183:        pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre-
        !          4184:        sample(3), pcrestack(3).
1.1       misho    4185: 
                   4186: 
                   4187: AUTHOR
                   4188: 
                   4189:        Philip Hazel
                   4190:        University Computing Service
                   4191:        Cambridge CB2 3QH, England.
                   4192: 
                   4193: 
                   4194: REVISION
                   4195: 
1.1.1.4 ! misho    4196:        Last updated: 12 May 2013
        !          4197:        Copyright (c) 1997-2013 University of Cambridge.
1.1       misho    4198: ------------------------------------------------------------------------------
                   4199: 
                   4200: 
1.1.1.4 ! misho    4201: PCRECALLOUT(3)             Library Functions Manual             PCRECALLOUT(3)
        !          4202: 
1.1       misho    4203: 
                   4204: 
                   4205: NAME
                   4206:        PCRE - Perl-compatible regular expressions
                   4207: 
1.1.1.4 ! misho    4208: SYNOPSIS
1.1       misho    4209: 
1.1.1.4 ! misho    4210:        #include <pcre.h>
1.1       misho    4211: 
                   4212:        int (*pcre_callout)(pcre_callout_block *);
                   4213: 
1.1.1.2   misho    4214:        int (*pcre16_callout)(pcre16_callout_block *);
                   4215: 
1.1.1.4 ! misho    4216:        int (*pcre32_callout)(pcre32_callout_block *);
        !          4217: 
        !          4218: 
        !          4219: DESCRIPTION
        !          4220: 
1.1       misho    4221:        PCRE provides a feature called "callout", which is a means of temporar-
                   4222:        ily passing control to the caller of PCRE  in  the  middle  of  pattern
                   4223:        matching.  The  caller of PCRE provides an external function by putting
1.1.1.2   misho    4224:        its entry point in the global variable pcre_callout (pcre16_callout for
1.1.1.4 ! misho    4225:        the 16-bit library, pcre32_callout for the 32-bit library). By default,
        !          4226:        this variable contains NULL, which disables all calling out.
1.1       misho    4227: 
1.1.1.2   misho    4228:        Within a regular expression, (?C) indicates the  points  at  which  the
                   4229:        external  function  is  to  be  called. Different callout points can be
                   4230:        identified by putting a number less than 256 after the  letter  C.  The
                   4231:        default  value  is  zero.   For  example,  this pattern has two callout
1.1       misho    4232:        points:
                   4233: 
                   4234:          (?C1)abc(?C2)def
                   4235: 
1.1.1.2   misho    4236:        If the PCRE_AUTO_CALLOUT option bit is set when a pattern is  compiled,
                   4237:        PCRE  automatically  inserts callouts, all with number 255, before each
                   4238:        item in the pattern. For example, if PCRE_AUTO_CALLOUT is used with the
                   4239:        pattern
1.1       misho    4240: 
                   4241:          A(\d{2}|--)
                   4242: 
                   4243:        it is processed as if it were
                   4244: 
                   4245:        (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
                   4246: 
1.1.1.2   misho    4247:        Notice  that  there  is a callout before and after each parenthesis and
1.1.1.4 ! misho    4248:        alternation bar. If the pattern contains a conditional group whose con-
        !          4249:        dition  is  an  assertion, an automatic callout is inserted immediately
        !          4250:        before the condition. Such a callout may also be  inserted  explicitly,
        !          4251:        for example:
        !          4252: 
        !          4253:          (?(?C9)(?=a)ab|de)
        !          4254: 
        !          4255:        This  applies only to assertion conditions (because they are themselves
        !          4256:        independent groups).
        !          4257: 
        !          4258:        Automatic callouts can be used for tracking  the  progress  of  pattern
        !          4259:        matching.  The pcretest command has an option that sets automatic call-
        !          4260:        outs; when it is used, the output indicates how the pattern is matched.
        !          4261:        This  is useful information when you are trying to optimize the perfor-
        !          4262:        mance of a particular pattern.
1.1       misho    4263: 
                   4264: 
                   4265: MISSING CALLOUTS
                   4266: 
1.1.1.2   misho    4267:        You should be aware that, because of  optimizations  in  the  way  PCRE
                   4268:        matches  patterns  by  default,  callouts  sometimes do not happen. For
1.1       misho    4269:        example, if the pattern is
                   4270: 
                   4271:          ab(?C4)cd
                   4272: 
                   4273:        PCRE knows that any matching string must contain the letter "d". If the
1.1.1.2   misho    4274:        subject  string  is "abyz", the lack of "d" means that matching doesn't
                   4275:        ever start, and the callout is never  reached.  However,  with  "abyd",
1.1       misho    4276:        though the result is still no match, the callout is obeyed.
                   4277: 
1.1.1.2   misho    4278:        If  the pattern is studied, PCRE knows the minimum length of a matching
                   4279:        string, and will immediately give a "no match" return without  actually
                   4280:        running  a  match if the subject is not long enough, or, for unanchored
1.1       misho    4281:        patterns, if it has been scanned far enough.
                   4282: 
1.1.1.2   misho    4283:        You can disable these optimizations by passing the  PCRE_NO_START_OPTI-
                   4284:        MIZE  option  to the matching function, or by starting the pattern with
                   4285:        (*NO_START_OPT). This slows down the matching process, but does  ensure
                   4286:        that callouts such as the example above are obeyed.
1.1       misho    4287: 
                   4288: 
                   4289: THE CALLOUT INTERFACE
                   4290: 
                   4291:        During  matching, when PCRE reaches a callout point, the external func-
1.1.1.4 ! misho    4292:        tion defined by pcre_callout or pcre[16|32]_callout is called (if it is
        !          4293:        set).  This  applies to both normal and DFA matching. The only argument
        !          4294:        to the callout function is a pointer to a pcre_callout_block or
        !          4295:        pcre[16|32]_callout_block structure. These structures  contain  the
        !          4296:        following fields:
1.1.1.2   misho    4297: 
                   4298:          int           version;
                   4299:          int           callout_number;
                   4300:          int          *offset_vector;
                   4301:          const char   *subject;           (8-bit version)
                   4302:          PCRE_SPTR16   subject;           (16-bit version)
1.1.1.4 ! misho    4303:          PCRE_SPTR32   subject;           (32-bit version)
1.1.1.2   misho    4304:          int           subject_length;
                   4305:          int           start_match;
                   4306:          int           current_position;
                   4307:          int           capture_top;
                   4308:          int           capture_last;
                   4309:          void         *callout_data;
                   4310:          int           pattern_position;
                   4311:          int           next_item_length;
                   4312:          const unsigned char *mark;       (8-bit version)
                   4313:          const PCRE_UCHAR16  *mark;       (16-bit version)
1.1.1.4 ! misho    4314:          const PCRE_UCHAR32  *mark;       (32-bit version)
1.1       misho    4315: 
1.1.1.4 ! misho    4316:        The version field is an integer containing the version  number  of  the
        !          4317:        block  format. The initial version was 0; the current version is 2. The
        !          4318:        version number will change again in future  if  additional  fields  are
1.1       misho    4319:        added, but the intention is never to remove any of the existing fields.
                   4320: 
1.1.1.4 ! misho    4321:        The  callout_number  field  contains the number of the callout, as com-
        !          4322:        piled into the pattern (that is, the number after ?C for  manual  call-
1.1       misho    4323:        outs, and 255 for automatically generated callouts).
                   4324: 
1.1.1.4 ! misho    4325:        The  offset_vector field is a pointer to the vector of offsets that was
        !          4326:        passed by the caller to the  matching  function.  When  pcre_exec()  or
        !          4327:        pcre[16|32]_exec()  is used, the contents can be inspected, in order to
        !          4328:        extract substrings that have been matched so far, in the  same  way  as
        !          4329:        for  extracting  substrings  after  a  match has completed. For the DFA
1.1.1.2   misho    4330:        matching functions, this field is not useful.
1.1       misho    4331: 
                   4332:        The subject and subject_length fields contain copies of the values that
1.1.1.2   misho    4333:        were passed to the matching function.
1.1       misho    4334: 
1.1.1.4 ! misho    4335:        The  start_match  field normally contains the offset within the subject
        !          4336:        at which the current match attempt  started.  However,  if  the  escape
        !          4337:        sequence  \K has been encountered, this value is changed to reflect the
        !          4338:        modified starting point. If the pattern is not  anchored,  the  callout
1.1       misho    4339:        function may be called several times from the same point in the pattern
                   4340:        for different starting points in the subject.
                   4341: 
1.1.1.4 ! misho    4342:        The current_position field contains the offset within  the  subject  of
1.1       misho    4343:        the current match pointer.
                   4344: 
1.1.1.4 ! misho    4345:        When  pcre_exec()  or  pcre[16|32]_exec()  is  used,  the  capture_top
        !          4346:        field contains one more than the number of the  highest  numbered  cap-
        !          4347:        tured  substring so far. If no substrings have been captured, the value
        !          4348:        of capture_top is one. This is always the case when the  DFA  functions
        !          4349:        are used, because they do not support captured substrings.
        !          4350: 
        !          4351:        The  capture_last  field  contains the number of the most recently cap-
        !          4352:        tured substring. However, when a recursion exits, the value reverts  to
        !          4353:        what  it  was  outside  the recursion, as do the values of all captured
        !          4354:        substrings. If no substrings have been  captured,  the  value  of  cap-
        !          4355:        ture_last  is  -1.  This  is always the case for the DFA matching func-
        !          4356:        tions.
1.1       misho    4357: 
1.1.1.2   misho    4358:        The callout_data field contains a value that is passed  to  a  matching
                   4359:        function  specifically so that it can be passed back in callouts. It is
1.1.1.4 ! misho    4360:        passed in the callout_data field of a pcre_extra  or  pcre[16|32]_extra
        !          4361:        data  structure.  If no such data was passed, the value of callout_data
        !          4362:        in a callout block is NULL. There is a description  of  the  pcre_extra
        !          4363:        structure in the pcreapi documentation.
1.1       misho    4364: 
1.1.1.2   misho    4365:        The  pattern_position  field  is  present from version 1 of the callout
                   4366:        structure. It contains the offset to the next item to be matched in the
                   4367:        pattern string.
                   4368: 
                   4369:        The  next_item_length  field  is  present from version 1 of the callout
                   4370:        structure. It contains the length of the next item to be matched in the
                   4371:        pattern  string.  When  the callout immediately precedes an alternation
                   4372:        bar, a closing parenthesis, or the end of the pattern,  the  length  is
                   4373:        zero.  When  the callout precedes an opening parenthesis, the length is
                   4374:        that of the entire subpattern.
1.1       misho    4375: 
                   4376:        The pattern_position and next_item_length fields are intended  to  help
                   4377:        in  distinguishing between different automatic callouts, which all have
                   4378:        the same callout number. However, they are set for all callouts.
                   4379: 
1.1.1.2   misho    4380:        The mark field is present from version 2 of the callout  structure.  In
1.1.1.4 ! misho    4381:        callouts  from  pcre_exec() or pcre[16|32]_exec() it contains a pointer
        !          4382:        to the zero-terminated  name  of  the  most  recently  passed  (*MARK),
        !          4383:        (*PRUNE),  or  (*THEN) item in the match, or NULL if no such items have
        !          4384:        been passed. Instances of (*PRUNE) or (*THEN) without  a  name  do  not
        !          4385:        obliterate  a previous (*MARK). In callouts from the DFA matching func-
        !          4386:        tions this field always contains NULL.
1.1       misho    4387: 
                   4388: 
                   4389: RETURN VALUES
                   4390: 
                   4391:        The external callout function returns an integer to PCRE. If the  value
                   4392:        is  zero,  matching  proceeds  as  normal. If the value is greater than
                   4393:        zero, matching fails at the current point, but  the  testing  of  other
                   4394:        matching possibilities goes ahead, just as if a lookahead assertion had
1.1.1.2   misho    4395:        failed. If the value is less than zero, the match is abandoned, and the
                   4396:        matching function returns the negative value.
1.1       misho    4397: 
                   4398:        Negative   values   should   normally   be   chosen  from  the  set  of
                   4399:        PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
                   4400:        dard  "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT is
                   4401:        reserved for use by callout functions; it will never be  used  by  PCRE
                   4402:        itself.
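
The three cases described under RETURN VALUES can be summarized in a small
C sketch. This is only an illustrative model of the rule, not PCRE source
code: the enum and the function name classify_callout_return() are
hypothetical and do not exist in the PCRE API.

```c
/* Hypothetical names for illustration only; not part of the PCRE API. */
enum callout_outcome {
    CALLOUT_CONTINUE,   /* zero: matching proceeds as normal */
    CALLOUT_BACKTRACK,  /* positive: fail at this point, try alternatives */
    CALLOUT_ABANDON     /* negative: abandon the match, return the value */
};

/* Map the integer returned by a callout function to the action that the
   text above says the matching function takes. */
enum callout_outcome classify_callout_return(int rc)
{
    if (rc == 0)
        return CALLOUT_CONTINUE;
    if (rc > 0)
        return CALLOUT_BACKTRACK;
    return CALLOUT_ABANDON;
}
```

In a real callout function, a negative return would normally be one of the
PCRE_ERROR_xxx values, with PCRE_ERROR_CALLOUT available for the callout's
own use.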
                   4403: 
                   4404: 
                   4405: AUTHOR
                   4406: 
                   4407:        Philip Hazel
                   4408:        University Computing Service
                   4409:        Cambridge CB2 3QH, England.
                   4410: 
                   4411: 
                   4412: REVISION
                   4413: 
1.1.1.4 ! misho    4414:        Last updated: 03 March 2013
        !          4415:        Copyright (c) 1997-2013 University of Cambridge.
1.1       misho    4416: ------------------------------------------------------------------------------
                   4417: 
                   4418: 
1.1.1.4 ! misho    4419: PCRECOMPAT(3)              Library Functions Manual              PCRECOMPAT(3)
        !          4420: 
1.1       misho    4421: 
                   4422: 
                   4423: NAME
                   4424:        PCRE - Perl-compatible regular expressions
                   4425: 
                   4426: DIFFERENCES BETWEEN PCRE AND PERL
                   4427: 
                   4428:        This  document describes the differences in the ways that PCRE and Perl
                   4429:        handle regular expressions. The differences  described  here  are  with
                   4430:        respect to Perl versions 5.10 and above.
                   4431: 
1.1.1.2   misho    4432:        1. PCRE has only a subset of Perl's Unicode support. Details of what it
                   4433:        does have are given in the pcreunicode page.
1.1       misho    4434: 
                   4435:        2. PCRE allows repeat quantifiers only on parenthesized assertions, but
                   4436:        they  do  not mean what you might think. For example, (?!a){3} does not
                   4437:        assert that the next three characters are not "a". It just asserts that
                   4438:        the next character is not "a" three times (in principle: PCRE optimizes
                   4439:        this to run the assertion just once). Perl allows repeat quantifiers on
                   4440:        other assertions such as \b, but these do not seem to have any use.
                   4441: 
                   4442:        3.  Capturing  subpatterns  that occur inside negative lookahead asser-
                   4443:        tions are counted, but their entries in the offsets  vector  are  never
1.1.1.4 ! misho    4444:        set.  Perl sometimes (but not always) sets its numerical variables from
        !          4445:        inside negative assertions.
1.1       misho    4446: 
                   4447:        4. Though binary zero characters are supported in the  subject  string,
                   4448:        they are not allowed in a pattern string because it is passed as a nor-
                   4449:        mal C string, terminated by zero. The escape sequence \0 can be used in
                   4450:        the pattern to represent a binary zero.
                   4451: 
                   4452:        5.  The  following Perl escape sequences are not supported: \l, \u, \L,
                   4453:        \U, and \N when followed by a character name or Unicode value.  (\N  on
                   4454:        its own, matching a non-newline character, is supported.) In fact these
                   4455:        are implemented by Perl's general string-handling and are not  part  of
                   4456:        its  pattern  matching engine. If any of these are encountered by PCRE,
                   4457:        an error is generated by default. However, if the  PCRE_JAVASCRIPT_COM-
                   4458:        PAT  option  is set, \U and \u are interpreted as JavaScript interprets
                   4459:        them.
                   4460: 
                   4461:        6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
                   4462:        is  built  with Unicode character property support. The properties that
                   4463:        can be tested with \p and \P are limited to the general category  prop-
                   4464:        erties  such  as  Lu and Nd, script names such as Greek or Han, and the
                   4465:        derived properties Any and L&. PCRE does  support  the  Cs  (surrogate)
                   4466:        property,  which  Perl  does  not; the Perl documentation says "Because
                   4467:        Perl hides the need for the user to understand the internal representa-
                   4468:        tion  of Unicode characters, there is no need to implement the somewhat
                   4469:        messy concept of surrogates."
                   4470: 
1.1.1.4 ! misho    4471:        7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
        !          4472:        ters  in  between  are  treated as literals. This is slightly different
        !          4473:        from Perl in that $ and @ are  also  handled  as  literals  inside  the
        !          4474:        quotes.  In Perl, they cause variable interpolation (but of course PCRE
1.1       misho    4475:        does not have variables). Note the following examples:
                   4476: 
                   4477:            Pattern            PCRE matches      Perl matches
                   4478: 
                   4479:            \Qabc$xyz\E        abc$xyz           abc followed by the
                   4480:                                                   contents of $xyz
                   4481:            \Qabc\$xyz\E       abc\$xyz          abc\$xyz
                   4482:            \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
                   4483: 
1.1.1.4 ! misho    4484:        The \Q...\E sequence is recognized both inside  and  outside  character
1.1       misho    4485:        classes.
                   4486: 
1.1.1.4 ! misho    4487:        8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
        !          4488:        constructions. However, there is support for recursive  patterns.  This
        !          4489:        is  not  available  in Perl 5.8, but it is in Perl 5.10. Also, the PCRE
        !          4490:        "callout" feature allows an external function to be called during  pat-
1.1       misho    4491:        tern matching. See the pcrecallout documentation for details.
                   4492: 
1.1.1.4 ! misho    4493:        9.  Subpatterns  that  are called as subroutines (whether or not recur-
        !          4494:        sively) are always treated as atomic  groups  in  PCRE.  This  is  like
        !          4495:        Python,  but  unlike Perl.  Captured values that are set outside a sub-
        !          4496:        routine call can be referenced from inside in PCRE, but  not  in  Perl.
1.1       misho    4497:        There is a discussion that explains these differences in more detail in
                   4498:        the section on recursion differences from Perl in the pcrepattern page.
                   4499: 
1.1.1.4 ! misho    4500:        10. If any of the backtracking control verbs are used in  a  subpattern
        !          4501:        that  is  called  as  a  subroutine (whether or not recursively), their
        !          4502:        effect is confined to that subpattern; it does not extend to  the  sur-
        !          4503:        rounding  pattern.  This is not always the case in Perl. In particular,
        !          4504:        if (*THEN) is present in a group that is called as  a  subroutine,  its
        !          4505:        action is limited to that group, even if the group does not contain any
        !          4506:        | characters. Note that such subpatterns are processed as  anchored  at
        !          4507:        the point where they are tested.
        !          4508: 
        !          4509:        11.  If a pattern contains more than one backtracking control verb, the
        !          4510:        first one that is backtracked onto acts. For example,  in  the  pattern
        !          4511:        A(*COMMIT)B(*PRUNE)C  a  failure in B triggers (*COMMIT), but a failure
        !          4512:        in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
        !          4513:        it is the same as PCRE, but there are examples where it differs.
        !          4514: 
        !          4515:        12.  Most  backtracking  verbs in assertions have their normal actions.
        !          4516:        They are not confined to the assertion.
        !          4517: 
        !          4518:        13. There are some differences that are concerned with the settings  of
        !          4519:        captured  strings  when  part  of  a  pattern is repeated. For example,
        !          4520:        matching "aba" against the  pattern  /^(a(b)?)+$/  in  Perl  leaves  $2
1.1       misho    4521:        unset, but in PCRE it is set to "b".
                   4522: 
1.1.1.4 ! misho    4523:        14.  PCRE's handling of duplicate subpattern numbers and duplicate sub-
1.1       misho    4524:        pattern names is not as general as Perl's. This is a consequence of the
                   4525:        fact that PCRE works internally just with numbers, using an external ta-
1.1.1.4 ! misho    4526:        ble to translate between numbers and names. In  particular,  a  pattern
        !          4527:        such  as (?|(?<a>A)|(?<b>B)),  where the two capturing parentheses have
        !          4528:        the same number but different names, is not supported,  and  causes  an
        !          4529:        error  at compile time. If it were allowed, it would not be possible to
        !          4530:        distinguish which parentheses matched, because both names map  to  cap-
1.1       misho    4531:        turing subpattern number 1. To avoid this confusing situation, an error
                   4532:        is given at compile time.
                   4533: 
1.1.1.4 ! misho    4534:        15. Perl recognizes comments in some places that  PCRE  does  not,  for
        !          4535:        example,  between  the  ( and ? at the start of a subpattern. If the /x
1.1.1.3   misho    4536:        modifier is set, Perl allows white space between ( and ? but PCRE never
1.1       misho    4537:        does, even if the PCRE_EXTENDED option is set.
                   4538: 
1.1.1.4 ! misho    4539:        16.  In  PCRE,  the upper/lower case character properties Lu and Ll are
        !          4540:        not affected when case-independent matching is specified. For  example,
        !          4541:        \p{Lu} always matches an upper case letter. I think Perl has changed in
        !          4542:        this respect; in the release at the time of writing (5.16), \p{Lu}  and
        !          4543:        \p{Ll} match all letters, regardless of case, when case independence is
        !          4544:        specified.
        !          4545: 
        !          4546:        17. PCRE provides some extensions to the Perl regular expression facil-
1.1       misho    4547:        ities.   Perl  5.10  includes new features that are not in earlier ver-
                   4548:        sions of Perl, some of which (such as named parentheses) have  been  in
                   4549:        PCRE for some time. This list is with respect to Perl 5.10:
                   4550: 
                   4551:        (a)  Although  lookbehind  assertions  in  PCRE must match fixed length
                   4552:        strings, each alternative branch of a lookbehind assertion can match  a
                   4553:        different  length  of  string.  Perl requires them all to have the same
                   4554:        length.
                   4555: 
                   4556:        (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the  $
                   4557:        meta-character matches only at the very end of the string.
                   4558: 
                   4559:        (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
                   4560:        cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
                   4561:        ignored.  (Perl can be made to issue a warning.)
                   4562: 
                   4563:        (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-
                   4564:        fiers is inverted, that is, by default they are not greedy, but if fol-
                   4565:        lowed by a question mark they are.
                   4566: 
                   4567:        (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
                   4568:        tried only at the first matching position in the subject string.
                   4569: 
                   4570:        (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
                   4571:        and  PCRE_NO_AUTO_CAPTURE  options for pcre_exec() have no Perl equiva-
                   4572:        lents.
                   4573: 
                   4574:        (g) The \R escape sequence can be restricted to match only CR,  LF,  or
                   4575:        CRLF by the PCRE_BSR_ANYCRLF option.
                   4576: 
                   4577:        (h) The callout facility is PCRE-specific.
                   4578: 
                   4579:        (i) The partial matching facility is PCRE-specific.
                   4580: 
                   4581:        (j) Patterns compiled by PCRE can be saved and re-used at a later time,
                   4582:        even on different hosts that have the other endianness.  However,  this
                   4583:        does not apply to optimized data created by the just-in-time compiler.
                   4584: 
1.1.1.4 ! misho    4585:        (k)    The    alternative    matching    functions    (pcre_dfa_exec(),
        !          4586:        pcre16_dfa_exec() and pcre32_dfa_exec()) match in a different  way  and
        !          4587:        are not Perl-compatible.
1.1       misho    4588: 
1.1.1.2   misho    4589:        (l)  PCRE  recognizes some special sequences such as (*CR) at the start
1.1       misho    4590:        of a pattern that set overall options that cannot be changed within the
                   4591:        pattern.
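
Item 7 above notes that PCRE treats every character between \Q and \E as a
literal, including $ and @. One way an application might use this is to
wrap arbitrary text in \Q...\E before passing it to the compiling function.
The following is a minimal sketch under stated assumptions: the helper name
quote_literal() is hypothetical (it is not a PCRE function), and the
handling of a "\E" that occurs inside the text (close the quote, match the
two characters explicitly, reopen the quote) is this sketch's own choice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of the PCRE API. Returns a newly allocated
   pattern string that quotes `literal` with \Q...\E, or NULL if memory
   cannot be allocated. The caller releases the result with free(). */
char *quote_literal(const char *literal)
{
    assert(literal != NULL);

    size_t len = strlen(literal);
    /* Worst case: each 2-byte "\E" in the input grows to 7 bytes, plus
       the surrounding "\Q" and "\E" and the terminating zero. */
    char *out = malloc(4 * len + 5);
    if (out == NULL)
        return NULL;

    char *p = out;
    *p++ = '\\';
    *p++ = 'Q';
    for (size_t i = 0; i < len; i++) {
        if (literal[i] == '\\' && literal[i + 1] == 'E') {
            /* An embedded \E would end the quoting early, so emit
               \E (end quote), \\E (the literal text), \Q (restart). */
            memcpy(p, "\\E\\\\E\\Q", 7);
            p += 7;
            i++;                      /* skip the 'E' as well */
        } else {
            *p++ = literal[i];
        }
    }
    *p++ = '\\';
    *p++ = 'E';
    *p = '\0';
    return out;
}
```

For the text abc$xyz this produces \Qabc$xyz\E, the first pattern in the
table under item 7. Because Perl interpolates $ and @ inside the quotes, a
pattern built this way is suitable for PCRE only.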
                   4592: 
                   4593: 
                   4594: AUTHOR
                   4595: 
                   4596:        Philip Hazel
                   4597:        University Computing Service
                   4598:        Cambridge CB2 3QH, England.
                   4599: 
                   4600: 
                   4601: REVISION
                   4602: 
1.1.1.4 ! misho    4603:        Last updated: 19 March 2013
        !          4604:        Copyright (c) 1997-2013 University of Cambridge.
1.1       misho    4605: ------------------------------------------------------------------------------
                   4606: 
                   4607: 
1.1.1.4 ! misho    4608: PCREPATTERN(3)             Library Functions Manual             PCREPATTERN(3)
        !          4609: 
1.1       misho    4610: 
                   4611: 
                   4612: NAME
                   4613:        PCRE - Perl-compatible regular expressions
                   4614: 
                   4615: PCRE REGULAR EXPRESSION DETAILS
                   4616: 
                   4617:        The  syntax and semantics of the regular expressions that are supported
                   4618:        by PCRE are described in detail below. There is a quick-reference  syn-
                   4619:        tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
                   4620:        semantics as closely as it can. PCRE  also  supports  some  alternative
                   4621:        regular  expression  syntax (which does not conflict with the Perl syn-
                   4622:        tax) in order to provide some compatibility with regular expressions in
                   4623:        Python, .NET, and Oniguruma.
                   4624: 
                   4625:        Perl's  regular expressions are described in its own documentation, and
                   4626:        regular expressions in general are covered in a number of  books,  some
                   4627:        of  which  have  copious  examples. Jeffrey Friedl's "Mastering Regular
                   4628:        Expressions", published by  O'Reilly,  covers  regular  expressions  in
                   4629:        great  detail.  This  description  of  PCRE's  regular  expressions  is
                   4630:        intended as reference material.
                   4631: 
1.1.1.4 ! misho    4632:        This document discusses the patterns that are supported  by  PCRE  when
        !          4633:        one  of  its  main  matching  functions,  pcre_exec()   (8-bit)   or
        !          4634:        pcre[16|32]_exec() (16- or 32-bit), is used. PCRE also has  alternative
        !          4635:        matching functions,  pcre_dfa_exec()  and pcre[16|32]_dfa_exec(), which
        !          4636:        match using a different algorithm that is not Perl-compatible. Some  of
        !          4637:        the  features  discussed  below  are not available when DFA matching is
        !          4638:        used. The advantages and disadvantages of  the  alternative  functions,
        !          4639:        and  how  they  differ  from the normal functions, are discussed in the
        !          4640:        pcrematching page.
        !          4641: 
        !          4642: 
        !          4643: SPECIAL START-OF-PATTERN ITEMS
        !          4644: 
        !          4645:        A number of options that can be passed to pcre_compile()  can  also  be
        !          4646:        set by special items at the start of a pattern. These are not Perl-com-
        !          4647:        patible, but are provided to make these options accessible  to  pattern
        !          4648:        writers  who are not able to change the program that processes the pat-
        !          4649:        tern. Any number of these items  may  appear,  but  they  must  all  be
        !          4650:        together right at the start of the pattern string, and the letters must
        !          4651:        be in upper case.
        !          4652: 
        !          4653:    UTF support
        !          4654: 
1.1       misho    4655:        The original operation of PCRE was on strings of  one-byte  characters.
1.1.1.2   misho    4656:        However,  there  is  now also support for UTF-8 strings in the original
1.1.1.4 ! misho    4657:        library, an extra library that supports  16-bit  and  UTF-16  character
        !          4658:        strings,  and a third library that supports 32-bit and UTF-32 character
1.1.1.2   misho    4659:        strings. To use these features, PCRE must be built to include appropri-
1.1.1.4 ! misho    4660:        ate  support. When using UTF strings you must either call the compiling
        !          4661:        function with the PCRE_UTF8, PCRE_UTF16, or PCRE_UTF32 option,  or  the
        !          4662:        pattern must start with one of these special sequences:
1.1       misho    4663: 
                   4664:          (*UTF8)
1.1.1.2   misho    4665:          (*UTF16)
1.1.1.4 ! misho    4666:          (*UTF32)
        !          4667:          (*UTF)
        !          4668: 
        !          4669:        (*UTF)  is  a  generic  sequence  that  can  be  used  with  any of the
        !          4670:        libraries.  Starting a pattern with such a sequence  is  equivalent  to
        !          4671:        setting  the  relevant  option.  How setting a UTF mode affects pattern
        !          4672:        matching is mentioned in several places below. There is also a  summary
        !          4673:        of features in the pcreunicode page.
        !          4674: 
        !          4675:        Some applications that allow their users to supply patterns may wish to
        !          4676:        restrict  them  to  non-UTF  data  for   security   reasons.   If   the
        !          4677:        PCRE_NEVER_UTF  option  is  set  at  compile  time, (*UTF) etc. are not
        !          4678:        allowed, and their appearance causes an error.
1.1       misho    4679: 
1.1.1.4 ! misho    4680:    Unicode property support
1.1       misho    4681: 
1.1.1.4 ! misho    4682:        Another special sequence that may appear at the start of a pattern is
1.1       misho    4683: 
                   4684:          (*UCP)
                   4685: 
1.1.1.2   misho    4686:        This has the same effect as setting  the  PCRE_UCP  option:  it  causes
                   4687:        sequences  such  as  \d  and  \w to use Unicode properties to determine
1.1       misho    4688:        character types, instead of recognizing only characters with codes less
                   4689:        than 128 via a lookup table.
                   4690: 
1.1.1.4 ! misho    4691:    Disabling start-up optimizations
        !          4692: 
1.1.1.2   misho    4693:        If  a  pattern  starts  with (*NO_START_OPT), it has the same effect as
1.1       misho    4694:        setting the PCRE_NO_START_OPTIMIZE option either at compile or matching
1.1.1.4 ! misho    4695:        time.
1.1       misho    4696: 
1.1.1.4 ! misho    4697:    Newline conventions
1.1       misho    4698: 
1.1.1.4 ! misho    4699:        PCRE  supports five different conventions for indicating line breaks in
        !          4700:        strings: a single CR (carriage return) character, a  single  LF  (line-
1.1       misho    4701:        feed) character, the two-character sequence CRLF, any of the three pre-
1.1.1.4 ! misho    4702:        ceding, or any Unicode newline sequence. The pcreapi page  has  further
        !          4703:        discussion  about newlines, and shows how to set the newline convention
1.1       misho    4704:        in the options arguments for the compiling and matching functions.
                   4705: 
1.1.1.4 ! misho    4706:        It is also possible to specify a newline convention by starting a  pat-
1.1       misho    4707:        tern string with one of the following five sequences:
                   4708: 
                   4709:          (*CR)        carriage return
                   4710:          (*LF)        linefeed
                   4711:          (*CRLF)      carriage return, followed by linefeed
                   4712:          (*ANYCRLF)   any of the three above
                   4713:          (*ANY)       all Unicode newline sequences
                   4714: 
1.1.1.2   misho    4715:        These override the default and the options given to the compiling func-
1.1.1.4 ! misho    4716:        tion. For example, on a Unix system where LF  is  the  default  newline
1.1.1.2   misho    4717:        sequence, the pattern
1.1       misho    4718: 
                   4719:          (*CR)a.b
                   4720: 
                   4721:        changes the convention to CR. That pattern matches "a\nb" because LF is
1.1.1.4 ! misho    4722:        no longer a newline. If more than one of these settings is present, the
        !          4723:        last one is used.
        !          4724: 
        !          4725:        The  newline  convention affects where the circumflex and dollar asser-
        !          4726:        tions are true. It also affects the interpretation of the dot metachar-
        !          4727:        acter when PCRE_DOTALL is not set, and the behaviour of \N. However, it
        !          4728:        does not affect what the \R escape sequence matches. By  default,  this
        !          4729:        is  any Unicode newline sequence, for Perl compatibility. However, this
        !          4730:        can be changed; see the description of \R in the section entitled "New-
        !          4731:        line  sequences"  below.  A change of \R setting can be combined with a
        !          4732:        change of newline convention.
        !          4733: 
        !          4734:    Setting match and recursion limits
        !          4735: 
        !          4736:        The caller of pcre_exec() can set a limit on the number  of  times  the
        !          4737:        internal  match() function is called and on the maximum depth of recur-
        !          4738:        sive calls. These facilities are provided to catch runaway matches that
        !          4739:        are provoked by patterns with huge matching trees (a typical example is
        !          4740:        a pattern with nested unlimited repeats) and to avoid  running  out  of
        !          4741:        system  stack  by  too  much  recursion.  When  one  of these limits is
        !          4742:        reached, pcre_exec() gives an error return. The limits can also be  set
        !          4743:        by items at the start of the pattern of the form
        !          4744: 
        !          4745:          (*LIMIT_MATCH=d)
        !          4746:          (*LIMIT_RECURSION=d)
        !          4747: 
        !          4748:        where d is any number of decimal digits. However, the value of the set-
        !          4749:        ting must be less than the value set by the caller of  pcre_exec()  for
        !          4750:        it to have any effect. In other words, the pattern writer can lower the
        !          4751:        limit set by the programmer, but not raise it. If there  is  more  than
        !          4752:        one setting of one of these limits, the lower value is used.
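
The rule in the preceding paragraph amounts to taking a minimum: the
limit in force is the smallest of the caller's limit and any limits given
in the pattern. The sketch below models only that arithmetic. It is not
taken from the PCRE source, and the function name effective_limit() and
its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model, not PCRE code. `caller_limit` is the value set by
   the program; `pattern_limits` holds the values from any
   (*LIMIT_MATCH=d) or (*LIMIT_RECURSION=d) items of one kind that appear
   at the start of the pattern. */
unsigned long effective_limit(unsigned long caller_limit,
                              const unsigned long *pattern_limits,
                              size_t count)
{
    assert(pattern_limits != NULL || count == 0);

    unsigned long limit = caller_limit;
    for (size_t i = 0; i < count; i++) {
        /* A pattern setting can only lower the limit, never raise it. */
        if (pattern_limits[i] < limit)
            limit = pattern_limits[i];
    }
    return limit;
}
```

With a caller limit of 1000, a pattern item requesting 5000 leaves the
limit at 1000, whereas items requesting 500 and 200 reduce it to 200.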
        !          4753: 
        !          4754: 
        !          4755: EBCDIC CHARACTER CODES
        !          4756: 
        !          4757:        PCRE  can  be compiled to run in an environment that uses EBCDIC as its
        !          4758:        character code rather than ASCII or Unicode (typically a mainframe sys-
        !          4759:        tem).  In  the  sections below, character code values are ASCII or Uni-
        !          4760:        code; in an EBCDIC environment these characters may have different code
        !          4761:        values, and there are no code points greater than 255.
1.1       misho    4762: 
                   4763: 
                   4764: CHARACTERS AND METACHARACTERS
                   4765: 
1.1.1.4 ! misho    4766:        A  regular  expression  is  a pattern that is matched against a subject
        !          4767:        string from left to right. Most characters stand for  themselves  in  a
        !          4768:        pattern,  and  match  the corresponding characters in the subject. As a
1.1       misho    4769:        trivial example, the pattern
                   4770: 
                   4771:          The quick brown fox
                   4772: 
                   4773:        matches a portion of a subject string that is identical to itself. When
1.1.1.4 ! misho    4774:        caseless  matching is specified (the PCRE_CASELESS option), letters are
        !          4775:        matched independently of case. In a UTF mode, PCRE  always  understands
        !          4776:        the  concept  of case for characters whose values are less than 128, so
        !          4777:        caseless matching is always possible. For characters with  higher  val-
        !          4778:        ues,  the concept of case is supported if PCRE is compiled with Unicode
        !          4779:        property support, but not otherwise.   If  you  want  to  use  caseless
        !          4780:        matching  for  characters  128  and above, you must ensure that PCRE is
1.1.1.2   misho    4781:        compiled with Unicode property support as well as with UTF support.
1.1       misho    4782: 
1.1.1.4 ! misho    4783:        The power of regular expressions comes  from  the  ability  to  include
        !          4784:        alternatives  and  repetitions in the pattern. These are encoded in the
1.1       misho    4785:        pattern by the use of metacharacters, which do not stand for themselves
                   4786:        but instead are interpreted in some special way.
                   4787: 
1.1.1.4 ! misho    4788:        There  are  two different sets of metacharacters: those that are recog-
        !          4789:        nized anywhere in the pattern except within square brackets, and  those
        !          4790:        that  are  recognized  within square brackets. Outside square brackets,
1.1       misho    4791:        the metacharacters are as follows:
                   4792: 
                   4793:          \      general escape character with several uses
                   4794:          ^      assert start of string (or line, in multiline mode)
                   4795:          $      assert end of string (or line, in multiline mode)
                   4796:          .      match any character except newline (by default)
                   4797:          [      start character class definition
                   4798:          |      start of alternative branch
                   4799:          (      start subpattern
                   4800:          )      end subpattern
                   4801:          ?      extends the meaning of (
                   4802:                 also 0 or 1 quantifier
                   4803:                 also quantifier minimizer
                   4804:          *      0 or more quantifier
                   4805:          +      1 or more quantifier
                   4806:                 also "possessive quantifier"
                   4807:          {      start min/max quantifier
                   4808: 
1.1.1.4 ! misho    4809:        Part of a pattern that is in square brackets  is  called  a  "character
1.1       misho    4810:        class". In a character class the only metacharacters are:
                   4811: 
                   4812:          \      general escape character
                   4813:          ^      negate the class, but only if the first character
                   4814:          -      indicates character range
                   4815:          [      POSIX character class (only if followed by POSIX
                   4816:                   syntax)
                   4817:          ]      terminates the character class
                   4818: 
                   4819:        The following sections describe the use of each of the metacharacters.
                   4820: 
                   4821: 
                   4822: BACKSLASH
                   4823: 
                   4824:        The backslash character has several uses. Firstly, if it is followed by
                   4825:        a character that is not a number or a letter, it takes away any special
1.1.1.4 ! misho    4826:        meaning  that  character  may  have. This use of backslash as an escape
1.1       misho    4827:        character applies both inside and outside character classes.
                   4828: 
1.1.1.4 ! misho    4829:        For example, if you want to match a * character, you write  \*  in  the
        !          4830:        pattern.   This  escaping  action  applies whether or not the following
        !          4831:        character would otherwise be interpreted as a metacharacter, so  it  is
        !          4832:        always  safe  to  precede  a non-alphanumeric with backslash to specify
        !          4833:        that it stands for itself. In particular, if you want to match a  back-
1.1       misho    4834:        slash, you write \\.
                   4835: 
1.1.1.4 ! misho    4836:        In  a UTF mode, only ASCII numbers and letters have any special meaning
        !          4837:        after a backslash. All other characters  (in  particular,  those  whose
1.1       misho    4838:        codepoints are greater than 127) are treated as literals.
                   4839: 
1.1.1.4 ! misho    4840:        If  a pattern is compiled with the PCRE_EXTENDED option, white space in
        !          4841:        the pattern (other than in a character class) and characters between  a
1.1       misho    4842:        # outside a character class and the next newline are ignored. An escap-
1.1.1.4 ! misho    4843:        ing backslash can be used to include a white space or  #  character  as
1.1       misho    4844:        part of the pattern.
                   4845: 
1.1.1.4 ! misho    4846:        If  you  want  to remove the special meaning from a sequence of charac-
        !          4847:        ters, you can do so by putting them between \Q and \E. This is  differ-
        !          4848:        ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
        !          4849:        sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
1.1       misho    4850:        tion. Note the following examples:
                   4851: 
                   4852:          Pattern            PCRE matches   Perl matches
                   4853: 
                   4854:          \Qabc$xyz\E        abc$xyz        abc followed by the
                   4855:                                              contents of $xyz
                   4856:          \Qabc\$xyz\E       abc\$xyz       abc\$xyz
                   4857:          \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
                   4858: 
1.1.1.4 ! misho    4859:        The  \Q...\E  sequence  is recognized both inside and outside character
        !          4860:        classes.  An isolated \E that is not preceded by \Q is ignored.  If  \Q
        !          4861:        is  not followed by \E later in the pattern, the literal interpretation
        !          4862:        continues to the end of the pattern (that is,  \E  is  assumed  at  the
        !          4863:        end).  If  the  isolated \Q is inside a character class, this causes an
1.1       misho    4864:        error, because the character class is not terminated.
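
As a rough illustration of the \Q...\E rules above, the following Python sketch rewrites a pattern so that each quoted run becomes a sequence of individually escaped characters. The name expand_quoted is hypothetical, and this is not how PCRE processes patterns internally: it ignores character classes and PCRE_EXTENDED and models only the quoting rules stated in the text.

```python
import string

# ASCII letters and digits keep a special meaning after a backslash,
# so only the remaining characters are escaped.
_ALNUM = set(string.ascii_letters + string.digits)


def expand_quoted(pattern):
    """Rewrite \\Q...\\E runs as backslash-escaped literals (illustrative model)."""
    out = []
    quoting = False
    i = 0
    while i < len(pattern):
        pair = pattern[i:i + 2]
        if pair == "\\E":
            # Ends a quoted run; an isolated \E is simply dropped.
            quoting = False
            i += 2
        elif quoting:
            ch = pattern[i]
            out.append(ch if ch in _ALNUM else "\\" + ch)
            i += 1
        elif pair == "\\Q":
            # Starts a quoted run; if no \E follows, it runs to the end.
            quoting = True
            i += 2
        elif pattern[i] == "\\":
            # Ordinary escape: copy the backslash and the next character.
            out.append(pair)
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

Applied to the three patterns in the table, the sketch yields abc\$xyz, abc\\\$xyz, and abc\$xyz, which correspond to the "PCRE matches" column.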
                   4865: 
                   4866:    Non-printing characters
                   4867: 
                   4868:        A second use of backslash provides a way of encoding non-printing char-
1.1.1.4 ! misho    4869:        acters  in patterns in a visible manner. There is no restriction on the
        !          4870:        appearance of non-printing characters, apart from the binary zero  that
        !          4871:        terminates  a  pattern,  but  when  a pattern is being prepared by text
        !          4872:        editing, it is  often  easier  to  use  one  of  the  following  escape
1.1       misho    4873:        sequences than the binary character it represents:
                   4874: 
                   4875:          \a        alarm, that is, the BEL character (hex 07)
                   4876:          \cx       "control-x", where x is any ASCII character
                   4877:          \e        escape (hex 1B)
1.1.1.3   misho    4878:          \f        form feed (hex 0C)
1.1       misho    4879:          \n        linefeed (hex 0A)
                   4880:          \r        carriage return (hex 0D)
                   4881:          \t        tab (hex 09)
                   4882:          \ddd      character with octal code ddd, or back reference
                   4883:          \xhh      character with hex code hh
                   4884:          \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
                   4885:          \uhhhh    character with hex code hhhh (JavaScript mode only)
                   4886: 
1.1.1.4 ! misho    4887:        The  precise effect of \cx on ASCII characters is as follows: if x is a
        !          4888:        lower case letter, it is converted to upper case. Then  bit  6  of  the
        !          4889:        character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
        !          4890:        (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and  \c;  becomes
        !          4891:        hex  7B (; is 3B). If the data item (byte or 16-bit value) following \c
1.1       misho    4892:        has a value greater than 127, a compile-time error occurs.  This  locks
1.1.1.4 ! misho    4893:        out non-ASCII characters in all modes.
        !          4894: 
        !          4895:        The  \c  facility  was designed for use with ASCII characters, but with
        !          4896:        the extension to Unicode it is even less useful than it  once  was.  It
        !          4897:        is,  however,  recognized  when  PCRE is compiled in EBCDIC mode, where
        !          4898:        data items are always bytes. In this mode, all values are  valid  after
        !          4899:        \c.  If  the  next character is a lower case letter, it is converted to
        !          4900:        upper case. Then the 0xc0 bits of  the  byte  are  inverted.  Thus  \cA
        !          4901:        becomes  hex  01, as in ASCII (A is C1), but because the EBCDIC letters
        !          4902:        are disjoint, \cZ becomes hex 29 (Z is E9), and other  characters  also
        !          4903:        generate different values.
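
The arithmetic behind \cx can be checked with a small Python sketch. The function names are hypothetical and the code models only the rules stated above. The EBCDIC helper takes the byte value of a character that is already upper case, because lower case folding in EBCDIC is not modelled here.

```python
def ascii_control(ch):
    """Value produced by \\cx for an ASCII character x (model of the rule above)."""
    code = ord(ch)
    if code > 127:
        # The text says this is a compile-time error in PCRE.
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= ch <= "z":
        code = ord(ch.upper())
    # Invert bit 6 (hex 40).
    return code ^ 0x40


def ebcdic_control(byte):
    """Value produced by \\cx in EBCDIC mode for an EBCDIC byte value.

    Assumes the byte is not a lower case letter (folding is not modelled).
    """
    # Invert the 0xc0 bits of the byte.
    return byte ^ 0xC0
```

For instance, ascii_control("{") gives hex 3B and ebcdic_control(0xE9) gives hex 29, matching the values quoted in the text.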
1.1       misho    4904: 
                   4905:        By  default,  after  \x,  from  zero to two hexadecimal digits are read
                   4906:        (letters can be in upper or lower case). Any number of hexadecimal dig-
1.1.1.2   misho    4907:        its may appear between \x{ and }, but the character code is constrained
                   4908:        as follows:
                   4909: 
                   4910:          8-bit non-UTF mode    less than 0x100
                   4911:          8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
                   4912:          16-bit non-UTF mode   less than 0x10000
                   4913:          16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
1.1.1.4 ! misho    4914:          32-bit non-UTF mode   less than 0x80000000
        !          4915:          32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
1.1       misho    4916: 
1.1.1.2   misho    4917:        Invalid Unicode codepoints are the range  0xd800  to  0xdfff  (the  so-
1.1.1.4 ! misho    4918:        called "surrogate" codepoints), and 0xffef.
1.1.1.2   misho    4919: 
                   4920:        If  characters  other than hexadecimal digits appear between \x{ and },
1.1       misho    4921:        or if there is no terminating }, this form of escape is not recognized.
1.1.1.2   misho    4922:        Instead,  the  initial  \x  will  be interpreted as a basic hexadecimal
                   4923:        escape, with no following digits, giving a  character  whose  value  is
1.1       misho    4924:        zero.
                   4925: 
1.1.1.2   misho    4926:        If  the  PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
                   4927:        is as just described only when it is followed by two  hexadecimal  dig-
                   4928:        its.   Otherwise,  it  matches  a  literal "x" character. In JavaScript
1.1       misho    4929:        mode, support for code points greater than 256 is provided by \u, which
1.1.1.2   misho    4930:        must  be  followed  by  four hexadecimal digits; otherwise it matches a
1.1.1.3   misho    4931:        literal "u" character.  Character codes specified by \u  in  JavaScript
                   4932:        mode  are  constrained in the same way as those specified by \x in non-
                   4933:        JavaScript mode.
1.1       misho    4934: 
                   4935:        Characters whose value is less than 256 can be defined by either of the
1.1.1.2   misho    4936:        two  syntaxes for \x (or by \u in JavaScript mode). There is no differ-
1.1       misho    4937:        ence in the way they are handled. For example, \xdc is exactly the same
                   4938:        as \x{dc} (or \u00dc in JavaScript mode).
                   4939: 
1.1.1.2   misho    4940:        After  \0  up  to two further octal digits are read. If there are fewer
                   4941:        than two digits, just  those  that  are  present  are  used.  Thus  the
1.1       misho    4942:        sequence \0\x\07 specifies two binary zeros followed by a BEL character
1.1.1.2   misho    4943:        (code value 7). Make sure you supply two digits after the initial  zero
1.1       misho    4944:        if the pattern character that follows is itself an octal digit.
                   4945: 
                   4946:        The handling of a backslash followed by a digit other than 0 is compli-
                   4947:        cated.  Outside a character class, PCRE reads it and any following dig-
1.1.1.2   misho    4948:        its  as  a  decimal  number. If the number is less than 10, or if there
1.1       misho    4949:        have been at least that many previous capturing left parentheses in the
1.1.1.2   misho    4950:        expression,  the  entire  sequence  is  taken  as  a  back reference. A
                   4951:        description of how this works is given later, following the  discussion
1.1       misho    4952:        of parenthesized subpatterns.
                   4953: 
1.1.1.2   misho    4954:        Inside  a  character  class, or if the decimal number is greater than 9
                   4955:        and there have not been that many capturing subpatterns, PCRE  re-reads
1.1       misho    4956:        up to three octal digits following the backslash, and uses them to gen-
1.1.1.2   misho    4957:        erate a data character. Any subsequent digits stand for themselves. The
                   4958:        value  of  the  character  is constrained in the same way as characters
                   4959:        specified in hexadecimal.  For example:
1.1       misho    4960: 
1.1.1.4 ! misho    4961:          \040   is another way of writing an ASCII space
1.1       misho    4962:          \40    is the same, provided there are fewer than 40
                   4963:                    previous capturing subpatterns
                   4964:          \7     is always a back reference
                   4965:          \11    might be a back reference, or another way of
                   4966:                    writing a tab
                   4967:          \011   is always a tab
                   4968:          \0113  is a tab followed by the character "3"
                   4969:          \113   might be a back reference, otherwise the
                   4970:                    character with octal code 113
                   4971:          \377   might be a back reference, otherwise
1.1.1.2   misho    4972:                    the value 255 (decimal)
1.1       misho    4973:          \81    is either a back reference, or a binary zero
                   4974:                    followed by the two characters "8" and "1"
                   4975: 
                   4976:        Note that octal values of 100 or greater must not be  introduced  by  a
                   4977:        leading zero, because no more than three octal digits are ever read.
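
The decision procedure for a backslash followed by digits can be sketched in Python as follows. The function name, its parameters, and the tuple it returns are hypothetical. The sketch models only the rules described above and does not apply the per-mode limit on character values.

```python
OCTAL_DIGITS = "01234567"


def interpret_digit_escape(digits, previous_groups, in_class=False):
    """Classify the run of decimal digits that follows a backslash.

    Returns ("backref", number, "") or ("char", value, literal_rest),
    where literal_rest holds digits that stand for themselves.
    """
    if not digits or any(d not in "0123456789" for d in digits):
        raise ValueError("expected a run of decimal digits")
    if digits[0] != "0" and not in_class:
        number = int(digits)
        # Below 10, or enough capturing groups so far: a back reference.
        if number < 10 or number <= previous_groups:
            return ("backref", number, "")
    # Otherwise read up to three octal digits as a character value.
    octal = ""
    for d in digits[:3]:
        if d not in OCTAL_DIGITS:
            break
        octal += d
    value = int(octal, 8) if octal else 0
    return ("char", value, digits[len(octal):])
```

With no previous capturing groups, interpret_digit_escape("0113", 0) returns ("char", 9, "3"), a tab followed by the character "3", and interpret_digit_escape("81", 0) returns ("char", 0, "81"), a binary zero followed by "8" and "1".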
                   4978: 
                   4979:        All the sequences that define a single character value can be used both
                   4980:        inside and outside character classes. In addition, inside  a  character
                   4981:        class, \b is interpreted as the backspace character (hex 08).
                   4982: 
                   4983:        \N  is not allowed in a character class. \B, \R, and \X are not special
                   4984:        inside a character class. Like  other  unrecognized  escape  sequences,
                   4985:        they  are  treated  as  the  literal  characters  "B",  "R", and "X" by
                   4986:        default, but cause an error if the PCRE_EXTRA option is set. Outside  a
                   4987:        character class, these sequences have different meanings.
                   4988: 
                   4989:    Unsupported escape sequences
                   4990: 
                   4991:        In  Perl, the sequences \l, \L, \u, and \U are recognized by its string
                   4992:        handler and used  to  modify  the  case  of  following  characters.  By
                   4993:        default,  PCRE does not support these escape sequences. However, if the
                   4994:        PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U"  character,  and
                   4995:        \u can be used to define a character by code point, as described in the
                   4996:        previous section.
                   4997: 
                   4998:    Absolute and relative back references
                   4999: 
                   5000:        The sequence \g followed by an unsigned or a negative  number,  option-
                   5001:        ally  enclosed  in braces, is an absolute or relative back reference. A
                   5002:        named back reference can be coded as \g{name}. Back references are dis-
                   5003:        cussed later, following the discussion of parenthesized subpatterns.
                   5004: 
                   5005:    Absolute and relative subroutine calls
                   5006: 
                   5007:        For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
                   5008:        name or a number enclosed either in angle brackets or single quotes, is
                   5009:        an  alternative  syntax for referencing a subpattern as a "subroutine".
                   5010:        Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
                   5011:        \g<...>  (Oniguruma  syntax)  are  not synonymous. The former is a back
                   5012:        reference; the latter is a subroutine call.
                   5013: 
                   5014:    Generic character types
                   5015: 
                   5016:        Another use of backslash is for specifying generic character types:
                   5017: 
                   5018:          \d     any decimal digit
                   5019:          \D     any character that is not a decimal digit
1.1.1.3   misho    5020:          \h     any horizontal white space character
                   5021:          \H     any character that is not a horizontal white space character
                   5022:          \s     any white space character
                   5023:          \S     any character that is not a white space character
                   5024:          \v     any vertical white space character
                   5025:          \V     any character that is not a vertical white space character
1.1       misho    5026:          \w     any "word" character
                   5027:          \W     any "non-word" character
                   5028: 
                   5029:        There is also the single sequence \N, which matches a non-newline char-
                   5030:        acter.   This  is the same as the "." metacharacter when PCRE_DOTALL is
                   5031:        not set. Perl also uses \N to match characters by name; PCRE  does  not
                   5032:        support this.
                   5033: 
                   5034:        Each  pair of lower and upper case escape sequences partitions the com-
                   5035:        plete set of characters into two disjoint  sets.  Any  given  character
                   5036:        matches  one, and only one, of each pair. The sequences can appear both
                   5037:        inside and outside character classes. They each match one character  of
                   5038:        the  appropriate  type.  If the current matching point is at the end of
                   5039:        the subject string, all of them fail, because there is no character  to
                   5040:        match.
                   5041: 
                   5042:        For  compatibility  with Perl, \s does not match the VT character (code
                   5043:        11).  This makes it different from the POSIX "space" class. The  \s
                   5044:        characters  are  HT  (9), LF (10), FF (12), CR (13), and space (32). If
                   5045:        "use locale;" is included in a Perl script, \s may match the VT charac-
                   5046:        ter. In PCRE, it never does.
                   5047: 
                   5048:        A  "word"  character is an underscore or any character that is a letter
                   5049:        or digit.  By default, the definition of letters  and  digits  is  con-
                   5050:        trolled  by PCRE's low-valued character tables, and may vary if locale-
                   5051:        specific matching is taking place (see "Locale support" in the  pcreapi
                   5052:        page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
                   5053:        systems, or "french" in Windows, some character codes greater than  128
                   5054:        are  used  for  accented letters, and these are then matched by \w. The
                   5055:        use of locales with Unicode is discouraged.
                   5056: 
1.1.1.2   misho    5057:        By default, in a UTF mode, characters  with  values  greater  than  128
1.1       misho    5058:        never  match  \d,  \s,  or  \w,  and always match \D, \S, and \W. These
1.1.1.2   misho    5059:        sequences retain their original meanings from before  UTF  support  was
1.1       misho    5060:        available,  mainly for efficiency reasons. However, if PCRE is compiled
                   5061:        with Unicode property support, and the PCRE_UCP option is set, the  be-
                   5062:        haviour  is  changed  so  that Unicode properties are used to determine
                   5063:        character types, as follows:
                   5064: 
                   5065:          \d  any character that \p{Nd} matches (decimal digit)
                   5066:          \s  any character that \p{Z} matches, plus HT, LF, FF, CR
                   5067:          \w  any character that \p{L} or \p{N} matches, plus underscore
                   5068: 
                   5069:        The upper case escapes match the inverse sets of characters. Note  that
                   5070:        \d  matches  only decimal digits, whereas \w matches any Unicode digit,
                   5071:        as well as any Unicode letter, and underscore. Note also that  PCRE_UCP
                   5072:        affects  \b,  and  \B  because  they are defined in terms of \w and \W.
                   5073:        Matching these sequences is noticeably slower when PCRE_UCP is set.
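
The PCRE_UCP definitions above can be approximated with Python's unicodedata module. This is only a sketch: the function names are hypothetical, and the Unicode data shipped with a given Python build may differ in version from the tables compiled into PCRE.

```python
import unicodedata


def ucp_digit(ch):
    """\\d under PCRE_UCP: any character that \\p{Nd} matches."""
    return unicodedata.category(ch) == "Nd"


def ucp_space(ch):
    """\\s under PCRE_UCP: \\p{Z} plus HT, LF, FF and CR (VT is excluded)."""
    return unicodedata.category(ch).startswith("Z") or ch in "\t\n\f\r"


def ucp_word(ch):
    """\\w under PCRE_UCP: \\p{L} or \\p{N}, plus underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"
```

A character such as U+00BD (vulgar fraction one half, category No) shows the difference noted in the text: ucp_word accepts it, but ucp_digit does not.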
                   5074: 
                   5075:        The sequences \h, \H, \v, and \V are features that were added  to  Perl
                   5076:        at  release  5.10. In contrast to the other sequences, which match only
                   5077:        ASCII characters by default, these  always  match  certain  high-valued
1.1.1.2   misho    5078:        codepoints,  whether or not PCRE_UCP is set. The horizontal space char-
                   5079:        acters are:
1.1       misho    5080: 
1.1.1.4 ! misho    5081:          U+0009     Horizontal tab (HT)
1.1       misho    5082:          U+0020     Space
                   5083:          U+00A0     Non-break space
                   5084:          U+1680     Ogham space mark
                   5085:          U+180E     Mongolian vowel separator
                   5086:          U+2000     En quad
                   5087:          U+2001     Em quad
                   5088:          U+2002     En space
                   5089:          U+2003     Em space
                   5090:          U+2004     Three-per-em space
                   5091:          U+2005     Four-per-em space
                   5092:          U+2006     Six-per-em space
                   5093:          U+2007     Figure space
                   5094:          U+2008     Punctuation space
                   5095:          U+2009     Thin space
                   5096:          U+200A     Hair space
                   5097:          U+202F     Narrow no-break space
                   5098:          U+205F     Medium mathematical space
                   5099:          U+3000     Ideographic space
                   5100: 
                   5101:        The vertical space characters are:
                   5102: 
1.1.1.4 ! misho    5103:          U+000A     Linefeed (LF)
        !          5104:          U+000B     Vertical tab (VT)
        !          5105:          U+000C     Form feed (FF)
        !          5106:          U+000D     Carriage return (CR)
        !          5107:          U+0085     Next line (NEL)
1.1       misho    5108:          U+2028     Line separator
                   5109:          U+2029     Paragraph separator
                   5110: 
1.1.1.2   misho    5111:        In 8-bit, non-UTF-8 mode, only the characters with codepoints less than
                   5112:        256 are relevant.
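
The following Python sketch transcribes the three white space sets from this section (the default \s set, \h, and \v) so that their overlap can be inspected. The names are hypothetical, and the sets are copied from the lists above rather than taken from PCRE's source.

```python
# Default \s: HT, LF, FF, CR and space (VT is deliberately absent).
DEFAULT_SPACE = {0x09, 0x0A, 0x0C, 0x0D, 0x20}

# \h: the horizontal space characters listed above.
HORIZONTAL_SPACE = {
    0x09, 0x20, 0xA0, 0x1680, 0x180E,
    *range(0x2000, 0x200B),  # U+2000 (en quad) to U+200A (hair space)
    0x202F, 0x205F, 0x3000,
}

# \v: the vertical space characters listed above.
VERTICAL_SPACE = {0x0A, 0x0B, 0x0C, 0x0D, 0x85, 0x2028, 0x2029}


def space_escapes(ch):
    """Return which of \\s, \\h and \\v match ch by default (no PCRE_UCP)."""
    code = ord(ch)
    names = []
    if code in DEFAULT_SPACE:
        names.append("\\s")
    if code in HORIZONTAL_SPACE:
        names.append("\\h")
    if code in VERTICAL_SPACE:
        names.append("\\v")
    return names
```

Calling space_escapes("\x0b") reports only \v, which reflects the statement that \s never matches the VT character in PCRE.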
                   5113: 
1.1       misho    5114:    Newline sequences
                   5115: 
1.1.1.2   misho    5116:        Outside  a  character class, by default, the escape sequence \R matches
                   5117:        any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
                   5118:        to the following:
1.1       misho    5119: 
                   5120:          (?>\r\n|\n|\x0b|\f|\r|\x85)
                   5121: 
1.1.1.2   misho    5122:        This  is  an  example  of an "atomic group", details of which are given
1.1       misho    5123:        below.  This particular group matches either the two-character sequence
1.1.1.2   misho    5124:        CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
1.1.1.3   misho    5125:        U+000A), VT (vertical tab, U+000B), FF (form feed,  U+000C),  CR  (car-
                   5126:        riage  return,  U+000D),  or NEL (next line, U+0085). The two-character
                   5127:        sequence is treated as a single unit that cannot be split.
1.1       misho    5128: 
1.1.1.2   misho    5129:        In other modes, two additional characters whose codepoints are  greater
1.1       misho    5130:        than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
1.1.1.2   misho    5131:        rator, U+2029).  Unicode character property support is not  needed  for
1.1       misho    5132:        these characters to be recognized.
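
The default behaviour of \R can be modelled in Python without a regular expression engine. In this sketch the function name and the wide parameter are hypothetical. Passing wide=False corresponds to 8-bit non-UTF-8 mode, where LS and PS are not available.

```python
def match_newline(subject, pos, wide=True):
    """Return the length of the newline sequence at pos, or 0 if none.

    CRLF is tested first and consumed as a unit, mirroring the atomic
    group (?>\\r\\n|\\n|\\x0b|\\f|\\r|\\x85) shown above.
    """
    if subject.startswith("\r\n", pos):
        return 2
    singles = "\n\x0b\f\r\x85"
    if wide:
        # LS and PS are added outside 8-bit non-UTF-8 mode.
        singles += "\u2028\u2029"
    if pos < len(subject) and subject[pos] in singles:
        return 1
    return 0
```

For the subject "a\r\nb", match_newline(subject, 1) returns 2, so the CR and LF are taken together and never split.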
                   5133: 
                   5134:        It is possible to restrict \R to match only CR, LF, or CRLF (instead of
1.1.1.2   misho    5135:        the complete set  of  Unicode  line  endings)  by  setting  the  option
1.1       misho    5136:        PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
                   5137:        (BSR is an abbreviation for "backslash R".) This can be made the default
1.1.1.2   misho    5138:        when  PCRE  is  built;  if this is the case, the other behaviour can be
                   5139:        requested via the PCRE_BSR_UNICODE option.   It  is  also  possible  to
                   5140:        specify  these  settings  by  starting a pattern string with one of the
1.1       misho    5141:        following sequences:
                   5142: 
                   5143:          (*BSR_ANYCRLF)   CR, LF, or CRLF only
                   5144:          (*BSR_UNICODE)   any Unicode newline sequence
                   5145: 
1.1.1.2   misho    5146:        These override the default and the options given to the compiling func-
                   5147:        tion,  but  they  can  themselves  be  overridden by options given to a
                   5148:        matching function. Note that these  special  settings,  which  are  not
                   5149:        Perl-compatible,  are  recognized  only at the very start of a pattern,
                   5150:        and that they must be in upper case.  If  more  than  one  of  them  is
                   5151:        present,  the  last  one is used. They can be combined with a change of
1.1       misho    5152:        newline convention; for example, a pattern can start with:
                   5153: 
                   5154:          (*ANY)(*BSR_ANYCRLF)
                   5155: 
1.1.1.4 ! misho    5156:        They can also be combined with the (*UTF8), (*UTF16), (*UTF32),  (*UTF)
        !          5157:        or (*UCP) special sequences. Inside a character class, \R is treated as
        !          5158:        an unrecognized escape sequence, and  so  matches  the  letter  "R"  by
        !          5159:        default, but causes an error if PCRE_EXTRA is set.
1.1       misho    5160: 
                   5161:    Unicode character properties
                   5162: 
                   5163:        When PCRE is built with Unicode character property support, three addi-
1.1.1.2   misho    5164:        tional escape sequences that match characters with specific  properties
                   5165:        are  available.   When  in 8-bit non-UTF-8 mode, these sequences are of
                   5166:        course limited to testing characters whose  codepoints  are  less  than
                   5167:        256, but they do work in this mode.  The extra escape sequences are:
1.1       misho    5168: 
                   5169:          \p{xx}   a character with the xx property
                   5170:          \P{xx}   a character without the xx property
1.1.1.4 ! misho    5171:          \X       a Unicode extended grapheme cluster
1.1       misho    5172: 
1.1.1.2   misho    5173:        The  property  names represented by xx above are limited to the Unicode
1.1       misho    5174:        script names, the general category properties, "Any", which matches any
1.1.1.2   misho    5175:        character   (including  newline),  and  some  special  PCRE  properties
                   5176:        (described in the next section).  Other Perl properties such as  "InMu-
                   5177:        sicalSymbols"  are  not  currently supported by PCRE. Note that \P{Any}
1.1       misho    5178:        does not match any characters, so always causes a match failure.
                   5179: 
                   5180:        Sets of Unicode characters are defined as belonging to certain scripts.
1.1.1.2   misho    5181:        A  character from one of these sets can be matched using a script name.
1.1       misho    5182:        For example:
                   5183: 
                   5184:          \p{Greek}
                   5185:          \P{Han}
                   5186: 
1.1.1.2   misho    5187:        Those that are not part of an identified script are lumped together  as
1.1       misho    5188:        "Common". The current list of scripts is:
                   5189: 
1.1.1.3   misho    5190:        Arabic,  Armenian,  Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,
                   5191:        Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,  Chakma,
                   5192:        Cham,  Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
                   5193:        Devanagari,  Egyptian_Hieroglyphs,  Ethiopic,   Georgian,   Glagolitic,
                   5194:        Gothic,  Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
                   5195:        gana,  Imperial_Aramaic,  Inherited,  Inscriptional_Pahlavi,   Inscrip-
                   5196:        tional_Parthian,   Javanese,   Kaithi,   Kannada,  Katakana,  Kayah_Li,
                   5197:        Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B,  Lisu,  Lycian,
                   5198:        Lydian,    Malayalam,    Mandaic,    Meetei_Mayek,    Meroitic_Cursive,
                   5199:        Meroitic_Hieroglyphs,  Miao,  Mongolian,  Myanmar,  New_Tai_Lue,   Nko,
                   5200:        Ogham,    Old_Italic,   Old_Persian,   Old_South_Arabian,   Old_Turkic,
                   5201:        Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic,  Samari-
                   5202:        tan,  Saurashtra,  Sharada,  Shavian, Sinhala, Sora_Sompeng, Sundanese,
                   5203:        Syloti_Nagri, Syriac, Tagalog, Tagbanwa,  Tai_Le,  Tai_Tham,  Tai_Viet,
                   5204:        Takri,  Tamil,  Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai,
                   5205:        Yi.
1.1       misho    5206: 
                   5207:        Each character has exactly one Unicode general category property, spec-
1.1.1.2   misho    5208:        ified  by a two-letter abbreviation. For compatibility with Perl, nega-
                   5209:        tion can be specified by including a  circumflex  between  the  opening
                   5210:        brace  and  the  property  name.  For  example,  \p{^Lu} is the same as
1.1       misho    5211:        \P{Lu}.
                   5212: 
                   5213:        If only one letter is specified with \p or \P, it includes all the gen-
1.1.1.2   misho    5214:        eral  category properties that start with that letter. In this case, in
                   5215:        the absence of negation, the curly brackets in the escape sequence  are
1.1       misho    5216:        optional; these two examples have the same effect:
                   5217: 
                   5218:          \p{L}
                   5219:          \pL
                   5220: 
                   5221:        The following general category property codes are supported:
                   5222: 
                   5223:          C     Other
                   5224:          Cc    Control
                   5225:          Cf    Format
                   5226:          Cn    Unassigned
                   5227:          Co    Private use
                   5228:          Cs    Surrogate
                   5229: 
                   5230:          L     Letter
                   5231:          Ll    Lower case letter
                   5232:          Lm    Modifier letter
                   5233:          Lo    Other letter
                   5234:          Lt    Title case letter
                   5235:          Lu    Upper case letter
                   5236: 
                   5237:          M     Mark
                   5238:          Mc    Spacing mark
                   5239:          Me    Enclosing mark
                   5240:          Mn    Non-spacing mark
                   5241: 
                   5242:          N     Number
                   5243:          Nd    Decimal number
                   5244:          Nl    Letter number
                   5245:          No    Other number
                   5246: 
                   5247:          P     Punctuation
                   5248:          Pc    Connector punctuation
                   5249:          Pd    Dash punctuation
                   5250:          Pe    Close punctuation
                   5251:          Pf    Final punctuation
                   5252:          Pi    Initial punctuation
                   5253:          Po    Other punctuation
                   5254:          Ps    Open punctuation
                   5255: 
                   5256:          S     Symbol
                   5257:          Sc    Currency symbol
                   5258:          Sk    Modifier symbol
                   5259:          Sm    Mathematical symbol
                   5260:          So    Other symbol
                   5261: 
                   5262:          Z     Separator
                   5263:          Zl    Line separator
                   5264:          Zp    Paragraph separator
                   5265:          Zs    Space separator
                   5266: 
1.1.1.2   misho    5267:        The  special property L& is also supported: it matches a character that
                   5268:        has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
1.1       misho    5269:        classified as a modifier or "other".
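
The relationship between the one-letter codes, the two-letter codes, and L& can be sketched outside PCRE. The following Python fragment is only an illustration: it uses the standard unicodedata module, whose Unicode tables may be a different version from those built into PCRE, and the helper names has_property and is_cased_letter are invented for this example.

```python
import unicodedata

def has_property(ch, code):
    """Rough analogue of \\p{code}: compare against the general category.

    A one-letter code such as "L" accepts every category that starts with
    that letter; a two-letter code such as "Lu" must match exactly.
    """
    return unicodedata.category(ch).startswith(code)

def is_cased_letter(ch):
    """Rough analogue of \\p{L&}: only Lu, Ll, or Lt."""
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")

# "A" is Lu, so it satisfies both L and Lu, but not Ll.
print(has_property("A", "L"), has_property("A", "Lu"), has_property("A", "Ll"))

# U+01C5 is a title case letter (Lt); U+02B0 is a modifier letter (Lm).
print(is_cased_letter("\u01c5"), is_cased_letter("\u02b0"))
```

Because every general category code is exactly two letters long, a single prefix test covers both the one-letter and the two-letter forms.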
                   5270: 
1.1.1.2   misho    5271:        The  Cs  (Surrogate)  property  applies only to characters in the range
                   5272:        U+D800 to U+DFFF. Such characters are not valid in Unicode strings  and
                   5273:        so  cannot  be  tested  by  PCRE, unless UTF validity checking has been
1.1.1.4 ! misho    5274:        turned    off    (see    the    discussion    of    PCRE_NO_UTF8_CHECK,
        !          5275:        PCRE_NO_UTF16_CHECK  and PCRE_NO_UTF32_CHECK in the pcreapi page). Perl
        !          5276:        does not support the Cs property.
1.1       misho    5277: 
                   5278:        The long synonyms for  property  names  that  Perl  supports  (such  as
                   5279:        \p{Letter})  are  not  supported by PCRE, nor is it permitted to prefix
                   5280:        any of these properties with "Is".
                   5281: 
                   5282:        No character that is in the Unicode table has the Cn (unassigned) prop-
                   5283:        erty.  Instead, this property is assumed for any code point that is not
                   5284:        in the Unicode table.
                   5285: 
                   5286:        Specifying caseless matching does not affect  these  escape  sequences.
1.1.1.4 ! misho    5287:        For  example,  \p{Lu}  always  matches only upper case letters. This is
        !          5288:        different from the behaviour of current versions of Perl.
        !          5289: 
        !          5290:        Matching characters by Unicode property is not fast, because  PCRE  has
        !          5291:        to  do  a  multistage table lookup in order to find a character's prop-
        !          5292:        erty. That is why the traditional escape sequences such as \d and \w do
        !          5293:        not use Unicode properties in PCRE by default, though you can make them
        !          5294:        do so by setting the PCRE_UCP option or by starting  the  pattern  with
        !          5295:        (*UCP).
        !          5296: 
        !          5297:    Extended grapheme clusters
1.1       misho    5298: 
                   5299:        The  \X  escape  matches  any number of Unicode characters that form an
1.1.1.4 ! misho    5300:        "extended grapheme cluster", and treats the sequence as an atomic group
        !          5301:        (see  below).   Up  to and including release 8.31, PCRE matched an ear-
        !          5302:        lier, simpler definition that was equivalent to
1.1       misho    5303: 
                   5304:          (?>\PM\pM*)
                   5305: 
1.1.1.4 ! misho    5306:        That is, it matched a character without the "mark"  property,  followed
        !          5307:        by  zero  or  more characters with the "mark" property. Characters with
        !          5308:        the "mark" property are typically non-spacing accents that  affect  the
        !          5309:        preceding character.
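
The earlier definition is simple enough to write out by hand. This Python sketch is not PCRE code: it assumes the "mark" property can be approximated by the general categories reported by the standard unicodedata module, and the names is_mark and match_legacy_x are hypothetical.

```python
import unicodedata

def is_mark(ch):
    # True for the general categories Mc, Me and Mn (the "M" property).
    return unicodedata.category(ch).startswith("M")

def match_legacy_x(subject, pos):
    """Return the end offset of (?>\\PM\\pM*) at pos, or None on failure."""
    if pos >= len(subject) or is_mark(subject[pos]):
        return None          # \PM needs one character without the mark property
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1             # \pM* takes every mark that follows
    return end

subject = "e\u0301a"          # "e", COMBINING ACUTE ACCENT, "a"
print(match_legacy_x(subject, 0))
print(match_legacy_x(subject, 1))
```

Starting at offset 0 the accent is absorbed into the match; starting at offset 1 there is no match at all, because the first character there is itself a mark.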
        !          5310: 
        !          5311:        This  simple definition was extended in Unicode to include more compli-
        !          5312:        cated kinds of composite character by giving each character a  grapheme
        !          5313:        breaking  property,  and  creating  rules  that use these properties to
        !          5314:        define the boundaries of extended grapheme  clusters.  In  releases  of
        !          5315:        PCRE later than 8.31, \X matches one of these clusters.
        !          5316: 
        !          5317:        \X  always  matches  at least one character. Then it decides whether to
        !          5318:        add additional characters according to the following rules for ending a
        !          5319:        cluster:
        !          5320: 
        !          5321:        1. End at the end of the subject string.
        !          5322: 
        !          5323:        2.  Do not end between CR and LF; otherwise end after any control char-
        !          5324:        acter.
        !          5325: 
        !          5326:        3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
        !          5327:        characters  are of five types: L, V, T, LV, and LVT. An L character may
        !          5328:        be followed by an L, V, LV, or LVT character; an LV or V character  may
        !          5329:        be followed by a V or T character; an LVT or T character may be followed
        !          5330:        only by a T character.
        !          5331: 
        !          5332:        4. Do not end before extending characters or spacing marks.  Characters
        !          5333:        with  the  "mark"  property  always have the "extend" grapheme breaking
        !          5334:        property.
        !          5335: 
        !          5336:        5. Do not end after prepend characters.
        !          5337: 
        !          5338:        6. Otherwise, end the cluster.
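
Rule 3 can be read as a small lookup table. The Python sketch below covers that rule only, not the full cluster algorithm, and it is not taken from the PCRE sources. The code point ranges are an assumption based on the Hangul Jamo and Hangul Syllables blocks; the extended Jamo blocks are left out, and the names MAY_FOLLOW, hangul_type and no_break_between are invented for this example.

```python
# Which Hangul syllable types may follow which, as stated in rule 3.
MAY_FOLLOW = {
    "L":   {"L", "V", "LV", "LVT"},
    "LV":  {"V", "T"},
    "V":   {"V", "T"},
    "LVT": {"T"},
    "T":   {"T"},
}

def hangul_type(cp):
    """Classify a code point; only the two blocks named above are covered."""
    if 0x1100 <= cp <= 0x115F:
        return "L"                      # leading consonants
    if 0x1160 <= cp <= 0x11A7:
        return "V"                      # vowels
    if 0x11A8 <= cp <= 0x11FF:
        return "T"                      # trailing consonants
    if 0xAC00 <= cp <= 0xD7A3:          # precomposed syllables
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None

def no_break_between(prev_cp, next_cp):
    """True if rule 3 forbids ending the cluster between the two code points."""
    prev_type, next_type = hangul_type(prev_cp), hangul_type(next_cp)
    return prev_type is not None and next_type in MAY_FOLLOW[prev_type]

print(no_break_between(0x1100, 0x1161))   # L followed by V
print(no_break_between(0xAC01, 0x1161))   # LVT followed by V
```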
1.1       misho    5339: 
                   5340:    PCRE's additional properties
                   5341: 
1.1.1.4 ! misho    5342:        As well as the standard Unicode properties described above,  PCRE  sup-
        !          5343:        ports  four  more  that  make it possible to convert traditional escape
        !          5344:        sequences such as \w and \s and POSIX character classes to use  Unicode
        !          5345:        properties.  PCRE  uses  these non-standard, non-Perl properties inter-
        !          5346:        nally when PCRE_UCP is set. However, they may also be used  explicitly.
        !          5347:        These properties are:
1.1       misho    5348: 
                   5349:          Xan   Any alphanumeric character
                   5350:          Xps   Any POSIX space character
                   5351:          Xsp   Any Perl space character
                   5352:          Xwd   Any Perl "word" character
                   5353: 
1.1.1.4 ! misho    5354:        Xan  matches  characters that have either the L (letter) or the N (num-
        !          5355:        ber) property. Xps matches the characters tab, linefeed, vertical  tab,
        !          5356:        form  feed,  or carriage return, and any other character that has the Z
1.1       misho    5357:        (separator) property.  Xsp is the same as Xps, except that vertical tab
                   5358:        is excluded. Xwd matches the same characters as Xan, plus underscore.
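
The four definitions translate directly into tests on the general category. This is a Python sketch using the standard unicodedata module rather than PCRE's own tables, so results for recently assigned characters may differ; the function names are hypothetical.

```python
import unicodedata

def is_xan(ch):
    # L (letter) or N (number) general category.
    return unicodedata.category(ch)[0] in "LN"

def is_xps(ch):
    # Tab, linefeed, vertical tab, form feed, carriage return, or any Z.
    return ch in "\t\n\x0b\x0c\r" or unicodedata.category(ch)[0] == "Z"

def is_xsp(ch):
    # Same as Xps, but vertical tab is excluded.
    return ch != "\x0b" and is_xps(ch)

def is_xwd(ch):
    # Same as Xan, plus underscore.
    return ch == "_" or is_xan(ch)

print(is_xps("\x0b"), is_xsp("\x0b"))   # vertical tab separates Xps from Xsp
print(is_xan("_"), is_xwd("_"))         # underscore separates Xan from Xwd
```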
                   5359: 
1.1.1.4 ! misho    5360:        There  is another non-standard property, Xuc, which matches any charac-
        !          5361:        ter that can be represented by a Universal Character Name  in  C++  and
        !          5362:        other  programming  languages.  These are the characters $, @, ` (grave
        !          5363:        accent), and all characters with Unicode code points  greater  than  or
        !          5364:        equal  to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
        !          5365:        most base (ASCII) characters are excluded. (Universal  Character  Names
        !          5366:        are  of  the  form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
        !          5367:        Note that the Xuc property does not match these sequences but the char-
        !          5368:        acters that they represent.)
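
As a cross-check of the description, the Xuc test can be written as a predicate on code points. This Python sketch restates the paragraph above and is not derived from the PCRE sources; is_xuc is a hypothetical name.

```python
def is_xuc(cp):
    """True if code point cp has the Xuc property as described above."""
    if cp in (0x24, 0x40, 0x60):        # "$", "@", and "`" (grave accent)
        return True
    if 0xD800 <= cp <= 0xDFFF:          # surrogates are excluded
        return False
    return cp >= 0xA0

print(is_xuc(ord("$")), is_xuc(ord("A")), is_xuc(0xA0), is_xuc(0xD800))
```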
        !          5369: 
1.1       misho    5370:    Resetting the match start
                   5371: 
1.1.1.4 ! misho    5372:        The  escape sequence \K causes any previously matched characters not to
1.1       misho    5373:        be included in the final matched sequence. For example, the pattern:
                   5374: 
                   5375:          foo\Kbar
                   5376: 
1.1.1.4 ! misho    5377:        matches "foobar", but reports that it has matched "bar".  This  feature
        !          5378:        is  similar  to  a lookbehind assertion (described below).  However, in
        !          5379:        this case, the part of the subject before the real match does not  have
        !          5380:        to  be of fixed length, as lookbehind assertions do. The use of \K does
        !          5381:        not interfere with the setting of captured  substrings.   For  example,
1.1       misho    5382:        when the pattern
                   5383: 
                   5384:          (foo)\Kbar
                   5385: 
                   5386:        matches "foobar", the first substring is still set to "foo".
                   5387: 
1.1.1.4 ! misho    5388:        Perl  documents  that  the  use  of  \K  within assertions is "not well
        !          5389:        defined". In PCRE, \K is acted upon  when  it  occurs  inside  positive
1.1       misho    5390:        assertions, but is ignored in negative assertions.
                   5391: 
                   5392:    Simple assertions
                   5393: 
1.1.1.4 ! misho    5394:        The  final use of backslash is for certain simple assertions. An asser-
        !          5395:        tion specifies a condition that has to be met at a particular point  in
        !          5396:        a  match, without consuming any characters from the subject string. The
        !          5397:        use of subpatterns for more complicated assertions is described  below.
1.1       misho    5398:        The backslashed assertions are:
                   5399: 
                   5400:          \b     matches at a word boundary
                   5401:          \B     matches when not at a word boundary
                   5402:          \A     matches at the start of the subject
                   5403:          \Z     matches at the end of the subject
                   5404:                  also matches before a newline at the end of the subject
                   5405:          \z     matches only at the end of the subject
                   5406:          \G     matches at the first matching position in the subject
                   5407: 
1.1.1.4 ! misho    5408:        Inside  a  character  class, \b has a different meaning; it matches the
        !          5409:        backspace character. If any other of  these  assertions  appears  in  a
        !          5410:        character  class, by default it matches the corresponding literal char-
1.1       misho    5411:        acter  (for  example,  \B  matches  the  letter  B).  However,  if  the
1.1.1.4 ! misho    5412:        PCRE_EXTRA  option is set, an "invalid escape sequence" error is gener-
1.1       misho    5413:        ated instead.
                   5414: 
1.1.1.4 ! misho    5415:        A word boundary is a position in the subject string where  the  current
        !          5416:        character  and  the previous character do not both match \w or \W (i.e.
        !          5417:        one matches \w and the other matches \W), or the start or  end  of  the
        !          5418:        string  if  the  first or last character matches \w, respectively. In a
        !          5419:        UTF mode, the meanings of \w and \W  can  be  changed  by  setting  the
        !          5420:        PCRE_UCP  option. When this is done, it also affects \b and \B. Neither
        !          5421:        PCRE nor Perl has a separate "start of word" or "end of  word"  metase-
        !          5422:        quence.  However,  whatever follows \b normally determines which it is.
1.1       misho    5423:        For example, the fragment \ba matches "a" at the start of a word.
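
The definition of a word boundary can be expressed as a comparison of the characters on either side of a position. The Python sketch below assumes the ASCII-only meaning of \w (PCRE_UCP not set); the names is_word_char and at_word_boundary are invented for this example.

```python
def is_word_char(ch):
    # ASCII-only meaning of \w: letters, digits, and underscore.
    return ch.isascii() and (ch.isalnum() or ch == "_")

def at_word_boundary(subject, pos):
    """True if \\b would succeed at offset pos (0 <= pos <= len(subject))."""
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    # A boundary exists when exactly one side is a word character. The
    # start and end of the subject count as "not a word character".
    return before != after

print([pos for pos in range(4) if at_word_boundary("a b", pos)])
print(at_word_boundary("ab", 1))
```

Treating the positions outside the subject as non-word characters reproduces the special cases for the start and end of the string without extra branches.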
                   5424: 
1.1.1.4 ! misho    5425:        The \A, \Z, and \z assertions differ from  the  traditional  circumflex
1.1       misho    5426:        and dollar (described in the next section) in that they only ever match
1.1.1.4 ! misho    5427:        at the very start and end of the subject string, whatever  options  are
        !          5428:        set.  Thus,  they are independent of multiline mode. These three asser-
1.1       misho    5429:        tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
1.1.1.4 ! misho    5430:        affect  only the behaviour of the circumflex and dollar metacharacters.
        !          5431:        However, if the startoffset argument of pcre_exec() is non-zero,  indi-
1.1       misho    5432:        cating that matching is to start at a point other than the beginning of
1.1.1.4 ! misho    5433:        the subject, \A can never match. The difference between \Z  and  \z  is
1.1       misho    5434:        that \Z matches before a newline at the end of the string as well as at
                   5435:        the very end, whereas \z matches only at the end.
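
The difference between \Z and \z comes down to one extra position. The following Python sketch assumes that newline is the single character "\n" and uses hypothetical function names. It deliberately avoids Python's own re module, whose \Z matches only at the very end and so corresponds to PCRE's \z.

```python
def at_big_z(subject, pos):
    # \Z: at the very end, or just before a newline that ends the subject.
    return pos == len(subject) or (
        pos == len(subject) - 1 and subject[pos] == "\n"
    )

def at_small_z(subject, pos):
    # \z: only at the very end.
    return pos == len(subject)

subject = "abc\n"
print([pos for pos in range(len(subject) + 1) if at_big_z(subject, pos)])
print([pos for pos in range(len(subject) + 1) if at_small_z(subject, pos)])
```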
                   5436: 
1.1.1.4 ! misho    5437:        The \G assertion is true only when the current matching position is  at
        !          5438:        the  start point of the match, as specified by the startoffset argument
        !          5439:        of pcre_exec(). It differs from \A when the  value  of  startoffset  is
        !          5440:        non-zero.  By calling pcre_exec() multiple times with appropriate argu-
1.1       misho    5441:        ments, you can mimic Perl's /g option, and it is in this kind of imple-
                   5442:        mentation where \G can be useful.
                   5443: 
1.1.1.4 ! misho    5444:        Note,  however,  that  PCRE's interpretation of \G, as the start of the
1.1       misho    5445:        current match, is subtly different from Perl's, which defines it as the
1.1.1.4 ! misho    5446:        end  of  the  previous  match. In Perl, these can be different when the
        !          5447:        previously matched string was empty. Because PCRE does just  one  match
1.1       misho    5448:        at a time, it cannot reproduce this behaviour.
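
The repeated-call technique can be sketched in Python, with the caveat that this is an analogy and not the PCRE API. Pattern.match(string, pos) attempts a match only at pos, so it stands in here for calling pcre_exec() with a startoffset and a pattern that begins with \G. The function name iterate_from_offset is hypothetical, and the handling of empty matches is a simplification.

```python
import re

def iterate_from_offset(pattern, subject):
    """Collect successive matches, each starting where the previous one ended."""
    compiled = re.compile(pattern)
    offset, found = 0, []
    while offset <= len(subject):
        m = compiled.match(subject, offset)   # anchored at offset, like \G
        if m is None:
            break
        found.append(m.group())
        # Step past an empty match so the loop always makes progress.
        offset = m.end() if m.end() > offset else offset + 1
    return found

print(iterate_from_offset(r"\d\d", "1234ab56"))
print(re.findall(r"\d\d", "1234ab56"))
```

The anchored loop stops at "ab" because no match can start there, whereas the unanchored search skips ahead and also finds the final pair of digits.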
                   5449: 
1.1.1.4 ! misho    5450:        If  all  the alternatives of a pattern begin with \G, the expression is
1.1       misho    5451:        anchored to the starting match position, and the "anchored" flag is set
                   5452:        in the compiled regular expression.
                   5453: 
                   5454: 
                   5455: CIRCUMFLEX AND DOLLAR
                   5456: 
1.1.1.4 ! misho    5457:        The  circumflex  and  dollar  metacharacters are zero-width assertions.
        !          5458:        That is, they test for a particular condition being true  without  con-
        !          5459:        suming any characters from the subject string.
        !          5460: 
1.1       misho    5461:        Outside a character class, in the default matching mode, the circumflex
1.1.1.4 ! misho    5462:        character is an assertion that is true only  if  the  current  matching
        !          5463:        point  is  at the start of the subject string. If the startoffset argu-
        !          5464:        ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the
        !          5465:        PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex
1.1       misho    5466:        has an entirely different meaning (see below).
                   5467: 
1.1.1.4 ! misho    5468:        Circumflex need not be the first character of the pattern if  a  number
        !          5469:        of  alternatives are involved, but it should be the first thing in each
        !          5470:        alternative in which it appears if the pattern is ever  to  match  that
        !          5471:        branch.  If all possible alternatives start with a circumflex, that is,
        !          5472:        if the pattern is constrained to match only at the start  of  the  sub-
        !          5473:        ject,  it  is  said  to be an "anchored" pattern. (There are also other
1.1       misho    5474:        constructs that can cause a pattern to be anchored.)
                   5475: 
1.1.1.4 ! misho    5476:        The dollar character is an assertion that is true only if  the  current
        !          5477:        matching  point  is  at  the  end of the subject string, or immediately
        !          5478:        before a newline at the end of the string (by default). Note,  however,
        !          5479:        that  it  does  not  actually match the newline. Dollar need not be the
        !          5480:        last character of the pattern if a number of alternatives are involved,
        !          5481:        but  it should be the last item in any branch in which it appears. Dol-
        !          5482:        lar has no special meaning in a character class.
1.1       misho    5483: 
                   5484:        The meaning of dollar can be changed so that it  matches  only  at  the
                   5485:        very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at
                   5486:        compile time. This does not affect the \Z assertion.
                   5487: 
                   5488:        The meanings of the circumflex and dollar characters are changed if the
                   5489:        PCRE_MULTILINE  option  is  set.  When  this  is the case, a circumflex
                   5490:        matches immediately after internal newlines as well as at the start  of
                   5491:        the  subject  string.  It  does not match after a newline that ends the
                   5492:        string. A dollar matches before any newlines in the string, as well  as
                   5493:        at  the very end, when PCRE_MULTILINE is set. When newline is specified
                   5494:        as the two-character sequence CRLF, isolated CR and  LF  characters  do
                   5495:        not indicate newlines.
                   5496: 
                   5497:        For  example, the pattern /^abc$/ matches the subject string "def\nabc"
                   5498:        (where \n represents a newline) in multiline mode, but  not  otherwise.
                   5499:        Consequently,  patterns  that  are anchored in single line mode because
                   5500:        all branches start with ^ are not anchored in  multiline  mode,  and  a
                   5501:        match  for  circumflex  is  possible  when  the startoffset argument of
                   5502:        pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is  ignored  if
                   5503:        PCRE_MULTILINE is set.
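
The /^abc$/ example can be tried with Python's re module. That module is a different engine from PCRE, but for these particular cases, with "\n" as the newline, it behaves the same way; re.MULTILINE plays the role of PCRE_MULTILINE here.

```python
import re

subject = "def\nabc"

single = re.search(r"^abc$", subject)                 # default mode
multi = re.search(r"^abc$", subject, re.MULTILINE)    # multiline mode

print(single)            # no match in the default mode
print(multi.span())      # "abc" is found after the internal newline

# By default, dollar also matches just before a newline that ends the subject.
trailing = re.search(r"abc$", "abc\n")
print(trailing.span())
```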
                   5504: 
                   5505:        Note  that  the sequences \A, \Z, and \z can be used to match the start
                   5506:        and end of the subject in both modes, and if all branches of a  pattern
                   5507:        start  with  \A it is always anchored, whether or not PCRE_MULTILINE is
                   5508:        set.
                   5509: 
                   5510: 
                   5511: FULL STOP (PERIOD, DOT) AND \N
                   5512: 
                   5513:        Outside a character class, a dot in the pattern matches any one charac-
                   5514:        ter  in  the subject string except (by default) a character that signi-
1.1.1.2   misho    5515:        fies the end of a line.
1.1       misho    5516: 
1.1.1.2   misho    5517:        When a line ending is defined as a single character, dot never  matches
                   5518:        that  character; when the two-character sequence CRLF is used, dot does
                   5519:        not match CR if it is immediately followed  by  LF,  but  otherwise  it
                   5520:        matches  all characters (including isolated CRs and LFs). When any Uni-
                   5521:        code line endings are being recognized, dot does not match CR or LF  or
1.1       misho    5522:        any of the other line ending characters.
                   5523: 
1.1.1.2   misho    5524:        The  behaviour  of  dot  with regard to newlines can be changed. If the
                   5525:        PCRE_DOTALL option is set, a dot matches  any  one  character,  without
1.1       misho    5526:        exception. If the two-character sequence CRLF is present in the subject
                   5527:        string, it takes two dots to match it.
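
A short experiment shows the effect of the option. It uses Python's re module as a stand-in, with re.DOTALL corresponding to PCRE_DOTALL; Python recognizes only "\n" as a line ending for this purpose, which matches PCRE only when the newline convention is LF.

```python
import re

# Default: dot does not match the newline character.
print(re.search(r"a.b", "a\nb"))

# With re.DOTALL, dot matches any one character.
print(re.search(r"a.b", "a\nb", re.DOTALL) is not None)

# CRLF is two characters, so it takes two dots to match it.
print(re.fullmatch(r"a.b", "a\r\nb", re.DOTALL))
print(re.fullmatch(r"a..b", "a\r\nb", re.DOTALL) is not None)
```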
                   5528: 
1.1.1.2   misho    5529:        The handling of dot is entirely independent of the handling of  circum-
                   5530:        flex  and  dollar,  the  only relationship being that they both involve
1.1       misho    5531:        newlines. Dot has no special meaning in a character class.
                   5532: 
1.1.1.2   misho    5533:        The escape sequence \N behaves like  a  dot,  except  that  it  is  not
                   5534:        affected  by  the  PCRE_DOTALL  option.  In other words, it matches any
                   5535:        character except one that signifies the end of a line. Perl  also  uses
1.1       misho    5536:        \N to match characters by name; PCRE does not support this.
                   5537: 
                   5538: 
1.1.1.2   misho    5539: MATCHING A SINGLE DATA UNIT
1.1       misho    5540: 
1.1.1.2   misho    5541:        Outside  a character class, the escape sequence \C matches any one data
                   5542:        unit, whether or not a UTF mode is set. In the 8-bit library, one  data
1.1.1.4 ! misho    5543:        unit  is  one  byte;  in the 16-bit library it is a 16-bit unit; in the
        !          5544:        32-bit library it is a 32-bit unit. Unlike a  dot,  \C  always  matches
        !          5545:        line-ending  characters.  The  feature  is provided in Perl in order to
        !          5546:        match individual bytes in UTF-8 mode, but it is unclear how it can use-
        !          5547:        fully  be  used.  Because  \C breaks up characters into individual data
        !          5548:        units, matching one unit with \C in a UTF mode means that the  rest  of
        !          5549:        the string may start with a malformed UTF character. This has undefined
        !          5550:        results, because PCRE assumes that it is dealing with valid UTF strings
        !          5551:        (and  by  default  it checks this at the start of processing unless the
        !          5552:        PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or  PCRE_NO_UTF32_CHECK  option
        !          5553:        is used).
1.1       misho    5554: 
1.1.1.4 ! misho    5555:        PCRE  does  not  allow \C to appear in lookbehind assertions (described
        !          5556:        below) in a UTF mode, because this would make it impossible  to  calcu-
1.1       misho    5557:        late the length of the lookbehind.
                   5558: 
1.1.1.2   misho    5559:        In general, the \C escape sequence is best avoided. However, one way of
1.1.1.4 ! misho    5560:        using it that avoids the problem of malformed UTF characters is to  use
        !          5561:        a  lookahead to check the length of the next character, as in this pat-
        !          5562:        tern, which could be used with a UTF-8 string (ignore white  space  and
1.1.1.2   misho    5563:        line breaks):
1.1       misho    5564: 
                   5565:          (?| (?=[\x00-\x7f])(\C) |
                   5566:              (?=[\x80-\x{7ff}])(\C)(\C) |
                   5567:              (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
                   5568:              (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
                   5569: 
1.1.1.4 ! misho    5570:        A  group  that starts with (?| resets the capturing parentheses numbers
        !          5571:        in each alternative (see "Duplicate  Subpattern  Numbers"  below).  The
        !          5572:        assertions  at  the start of each branch check the next UTF-8 character
        !          5573:        for values whose encoding uses 1, 2, 3, or 4 bytes,  respectively.  The
        !          5574:        character's  individual bytes are then captured by the appropriate num-
1.1       misho    5575:        ber of groups.
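
The four code point ranges in the lookaheads are the standard UTF-8 length boundaries. The Python sketch below checks them against Python's own encoder and shows the bytes that the \C groups would capture; it does not run the pattern itself, and the names utf8_length and capture_bytes are hypothetical.

```python
def utf8_length(cp):
    """Number of bytes UTF-8 uses for code point cp (same ranges as above)."""
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4

def capture_bytes(ch):
    """Return the individual bytes of one character, like the (\\C) groups."""
    encoded = ch.encode("utf-8")
    assert len(encoded) == utf8_length(ord(ch))
    return [bytes([b]) for b in encoded]

for ch in ("A", "\u00e9", "\u20ac", "\U0001f600"):
    print(hex(ord(ch)), capture_bytes(ch))
```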
                   5576: 
                   5577: 
                   5578: SQUARE BRACKETS AND CHARACTER CLASSES
                   5579: 
                   5580:        An opening square bracket introduces a character class, terminated by a
                   5581:        closing square bracket. A closing square bracket on its own is not spe-
                   5582:        cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
                   5583:        a lone closing square bracket causes a compile-time error. If a closing
1.1.1.4 ! misho    5584:        square bracket is required as a member of the class, it should  be  the
        !          5585:        first  data  character  in  the  class (after an initial circumflex, if
1.1       misho    5586:        present) or escaped with a backslash.
                   5587: 
1.1.1.4 ! misho    5588:        A character class matches a single character in the subject. In  a  UTF
        !          5589:        mode,  the  character  may  be  more than one data unit long. A matched
1.1.1.2   misho    5590:        character must be in the set of characters defined by the class, unless
1.1.1.4 ! misho    5591:        the  first  character in the class definition is a circumflex, in which
1.1.1.2   misho    5592:        case the subject character must not be in the set defined by the class.
1.1.1.4 ! misho    5593:        If  a  circumflex is actually required as a member of the class, ensure
1.1.1.2   misho    5594:        it is not the first character, or escape it with a backslash.
1.1       misho    5595: 
1.1.1.4 ! misho    5596:        For example, the character class [aeiou] matches any lower case  vowel,
        !          5597:        while  [^aeiou]  matches  any character that is not a lower case vowel.
1.1       misho    5598:        Note that a circumflex is just a convenient notation for specifying the
1.1.1.4 ! misho    5599:        characters  that  are in the class by enumerating those that are not. A
        !          5600:        class that starts with a circumflex is not an assertion; it still  con-
        !          5601:        sumes  a  character  from the subject string, and therefore it fails if
1.1       misho    5602:        the current pointer is at the end of the string.
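
These points can be observed with any engine that uses Perl-style classes. The sketch below uses Python's re module, which is not PCRE but treats these simple classes in the same way.

```python
import re

print(re.findall(r"[aeiou]", "regular"))     # the lower case vowels
print(re.findall(r"[^aeiou]", "regular"))    # everything else

# A negated class is not an assertion: it must consume a character,
# so it fails when the current position is at the end of the string.
print(re.search(r"r[^aeiou]", "regular"))

# A closing bracket placed first in the class is a literal member.
print(re.fullmatch(r"[]a]+", "a]]a") is not None)
```

In the third call the first "r" is followed by a vowel and the last "r" is followed by nothing, so neither position can satisfy the negated class.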
                   5603: 
1.1.1.4 ! misho    5604:        In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255
        !          5605:        (0xffff)  can be included in a class as a literal string of data units,
1.1.1.2   misho    5606:        or by using the \x{ escaping mechanism.
                   5607: 
1.1.1.4 ! misho    5608:        When caseless matching is set, any letters in a  class  represent  both
        !          5609:        their  upper  case  and lower case versions, so for example, a caseless
        !          5610:        [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
        !          5611:        match  "A", whereas a caseful version would. In a UTF mode, PCRE always
        !          5612:        understands the concept of case for characters whose  values  are  less
        !          5613:        than  128, so caseless matching is always possible. For characters with
        !          5614:        higher values, the concept of case is supported  if  PCRE  is  compiled
        !          5615:        with  Unicode  property support, but not otherwise.  If you want to use
        !          5616:        caseless matching in a UTF mode for characters 128 and above, you  must
        !          5617:        ensure  that  PCRE is compiled with Unicode property support as well as
1.1.1.2   misho    5618:        with UTF support.
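
The effect of caseless matching on a class and on its negation can be shown briefly. Python's re.IGNORECASE is used here as a stand-in for PCRE_CASELESS; only ASCII letters are tested, so the question of Unicode property support does not arise.

```python
import re

# A caseless [aeiou] matches "A" as well as "a".
print(re.fullmatch(r"[aeiou]", "A", re.IGNORECASE) is not None)

# A caseless [^aeiou] does not match "A" ...
print(re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE))

# ... whereas the caseful version does.
print(re.fullmatch(r"[^aeiou]", "A") is not None)
```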
                   5619: 
1.1.1.4 ! misho    5620:        Characters that might indicate line breaks are  never  treated  in  any
        !          5621:        special  way  when  matching  character  classes,  whatever line-ending
        !          5622:        sequence is in  use,  and  whatever  setting  of  the  PCRE_DOTALL  and
1.1       misho    5623:        PCRE_MULTILINE options is used. A class such as [^a] always matches one
                   5624:        of these characters.
                   5625: 
1.1.1.4 ! misho    5626:        The minus (hyphen) character can be used to specify a range of  charac-
        !          5627:        ters  in  a  character  class.  For  example,  [d-m] matches any letter
        !          5628:        between d and m, inclusive. If a  minus  character  is  required  in  a
        !          5629:        class,  it  must  be  escaped  with a backslash or appear in a position
        !          5630:        where it cannot be interpreted as indicating a range, typically as  the
1.1       misho    5631:        first or last character in the class.
                   5632: 
                   5633:        It is not possible to have the literal character "]" as the end charac-
1.1.1.4 ! misho    5634:        ter of a range. A pattern such as [W-]46] is interpreted as a class  of
        !          5635:        two  characters ("W" and "-") followed by a literal string "46]", so it
        !          5636:        would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
        !          5637:        backslash  it is interpreted as the end of range, so [W-\]46] is inter-
        !          5638:        preted as a class containing a range followed by two other  characters.
        !          5639:        The  octal or hexadecimal representation of "]" can also be used to end
1.1       misho    5640:        a range.
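
                                The hyphen and "]" rules above can be illustrated with a small
                                sketch. Python's re module is used as a stand-in because it
                                parses these three classes the same way; the sample strings
                                are made up for the illustration.

```python
import re

# [d-m] is a range: any letter from d to m, inclusive.
in_range = [c for c in "cdmn" if re.fullmatch(r"[d-m]", c)]

# [W-]46] is the class [W-] followed by the literal string "46]".
class_then_literal = [s for s in ("W46]", "-46]", "X46]")
                      if re.fullmatch(r"[W-]46]", s)]

# [W-\]46] is a single class: the range W..] plus "4" and "6".
escaped_end = [c for c in ("W", "\\", "]", "4", "6", "5", "a")
               if re.fullmatch(r"[W-\]46]", c)]
```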
                   5641: 
1.1.1.4 ! misho    5642:        Ranges operate in the collating sequence of character values. They  can
        !          5643:        also   be  used  for  characters  specified  numerically,  for  example
        !          5644:        [\000-\037]. Ranges can include any characters that are valid  for  the
1.1.1.2   misho    5645:        current mode.
1.1       misho    5646: 
                   5647:        If a range that includes letters is used when caseless matching is set,
                   5648:        it matches the letters in either case. For example, [W-c] is equivalent
1.1.1.4 ! misho    5649:        to  [][\\^_`wxyzabc],  matched  caselessly,  and  in a non-UTF mode, if
        !          5650:        character tables for a French locale are in  use,  [\xc8-\xcb]  matches
        !          5651:        accented  E  characters  in both cases. In UTF modes, PCRE supports the
        !          5652:        concept of case for characters with values greater than 127  only  when
1.1       misho    5653:        it is compiled with Unicode property support.
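
                                A brief sketch of numeric and caseless ranges follows. It
                                assumes that Python's re module (used here instead of PCRE)
                                handles octal escapes and caseless ASCII ranges in the same
                                way; the locale-dependent [\xc8-\xcb] case is not covered.

```python
import re

# [\000-\037] covers the ASCII control characters 0 to 31 (octal 037).
control = re.compile(r"[\000-\037]")
control_hits = [control.fullmatch(ch) is not None
                for ch in ("\x00", "\x1f", " ")]

# Caseless [W-c] matches w, x, y, z, a, b, c in either case.
w_to_c = re.compile(r"[W-c]", re.IGNORECASE)
letters = "".join(ch for ch in "vVwWzZaAcCdD" if w_to_c.fullmatch(ch))
```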
                   5654: 
1.1.1.4 ! misho    5655:        The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,
1.1       misho    5656:        \w, and \W may appear in a character class, and add the characters that
1.1.1.4 ! misho    5657:        they  match to the class. For example, [\dABCDEF] matches any hexadeci-
        !          5658:        mal digit. In UTF modes, the PCRE_UCP option affects  the  meanings  of
        !          5659:        \d,  \s,  \w  and  their upper case partners, just as it does when they
        !          5660:        appear outside a character class, as described in the section  entitled
1.1       misho    5661:        "Generic character types" above. The escape sequence \b has a different
1.1.1.4 ! misho    5662:        meaning inside a character class; it matches the  backspace  character.
        !          5663:        The  sequences  \B,  \N,  \R, and \X are not special inside a character
        !          5664:        class. Like any other unrecognized escape sequences, they  are  treated
        !          5665:        as  the literal characters "B", "N", "R", and "X" by default, but cause
1.1       misho    5666:        an error if the PCRE_EXTRA option is set.
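
                                The following sketch shows a character type escape and \b
                                inside a class. It uses Python's re module rather than PCRE,
                                and assumes that re.ASCII is a fair analogue of matching
                                without PCRE_UCP.

```python
import re

# re.ASCII keeps \d to 0-9, as in PCRE without PCRE_UCP (assumed analogue).
hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
hex_hits = "".join(ch for ch in "09AFGaf" if hex_digit.fullmatch(ch))

# Inside a class, \b means backspace (character 8), not a word boundary.
backspace_matches = re.fullmatch(r"[\b]", "\x08") is not None
```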
                   5667: 
1.1.1.4 ! misho    5668:        A circumflex can conveniently be used with  the  upper  case  character
        !          5669:        types  to specify a more restricted set of characters than the matching
        !          5670:        lower case type.  For example, the class [^\W_] matches any  letter  or
1.1       misho    5671:        digit, but not underscore, whereas [\w] includes underscore. A positive
                   5672:        character class should be read as "something OR something OR ..." and a
                   5673:        negative class as "NOT something AND NOT something AND NOT ...".
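
                                The "NOT something AND NOT something" reading can be checked
                                with a short sketch. Python's re module is used as a stand-in
                                for PCRE, with re.ASCII assumed to mirror the default
                                (non-UCP) meaning of \w and \W.

```python
import re

# [^\W_] reads as "NOT a non-word character AND NOT underscore".
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)
word_char = re.compile(r"[\w]", re.ASCII)

sample = "a7_-"
without_underscore = "".join(c for c in sample if letter_or_digit.fullmatch(c))
with_underscore = "".join(c for c in sample if word_char.fullmatch(c))
```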
                   5674: 
1.1.1.4 ! misho    5675:        The  only  metacharacters  that are recognized in character classes are
        !          5676:        backslash, hyphen (only where it can be  interpreted  as  specifying  a
        !          5677:        range),  circumflex  (only  at the start), opening square bracket (only
        !          5678:        when it can be interpreted as introducing a POSIX class name - see  the
        !          5679:        next  section),  and  the  terminating closing square bracket. However,
1.1       misho    5680:        escaping other non-alphanumeric characters does no harm.
                   5681: 
                   5682: 
                   5683: POSIX CHARACTER CLASSES
                   5684: 
                   5685:        Perl supports the POSIX notation for character classes. This uses names
1.1.1.4 ! misho    5686:        enclosed  by  [: and :] within the enclosing square brackets. PCRE also
1.1       misho    5687:        supports this notation. For example,
                   5688: 
                   5689:          [01[:alpha:]%]
                   5690: 
                   5691:        matches "0", "1", any alphabetic character, or "%". The supported class
                   5692:        names are:
                   5693: 
                   5694:          alnum    letters and digits
                   5695:          alpha    letters
                   5696:          ascii    character codes 0 - 127
                   5697:          blank    space or tab only
                   5698:          cntrl    control characters
                   5699:          digit    decimal digits (same as \d)
                   5700:          graph    printing characters, excluding space
                   5701:          lower    lower case letters
                   5702:          print    printing characters, including space
                   5703:          punct    printing characters, excluding letters and digits and space
                   5704:          space    white space (not quite the same as \s)
                   5705:          upper    upper case letters
                   5706:          word     "word" characters (same as \w)
                   5707:          xdigit   hexadecimal digits
                   5708: 
1.1.1.4 ! misho    5709:        The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
        !          5710:        and space (32). Notice that this list includes the VT  character  (code
1.1       misho    5711:        11). This makes "space" different to \s, which does not include VT (for
                   5712:        Perl compatibility).
                   5713: 
1.1.1.4 ! misho    5714:        The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
        !          5715:        from  Perl  5.8. Another Perl extension is negation, which is indicated
1.1       misho    5716:        by a ^ character after the colon. For example,
                   5717: 
                   5718:          [12[:^digit:]]
                   5719: 
1.1.1.4 ! misho    5720:        matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
1.1       misho    5721:        POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
                   5722:        these are not supported, and an error is given if they are encountered.
                   5723: 
1.1.1.4 ! misho    5724:        By default, in UTF modes, characters with values greater  than  127  do
        !          5725:        not  match any of the POSIX character classes. However, if the PCRE_UCP
        !          5726:        option is passed to pcre_compile(), some of the classes are changed  so
1.1       misho    5727:        that Unicode character properties are used. This is achieved by replac-
                   5728:        ing the POSIX classes by other sequences, as follows:
                   5729: 
                   5730:          [:alnum:]  becomes  \p{Xan}
                   5731:          [:alpha:]  becomes  \p{L}
                   5732:          [:blank:]  becomes  \h
                   5733:          [:digit:]  becomes  \p{Nd}
                   5734:          [:lower:]  becomes  \p{Ll}
                   5735:          [:space:]  becomes  \p{Xps}
                   5736:          [:upper:]  becomes  \p{Lu}
                   5737:          [:word:]   becomes  \p{Xwd}
                   5738: 
1.1.1.4 ! misho    5739:        Negated versions, such as [:^alpha:] use \P instead of  \p.  The  other
1.1       misho    5740:        POSIX classes are unchanged, and match only characters with code points
                   5741:        less than 128.
                   5742: 
                   5743: 
                   5744: VERTICAL BAR
                   5745: 
1.1.1.4 ! misho    5746:        Vertical bar characters are used to separate alternative patterns.  For
1.1       misho    5747:        example, the pattern
                   5748: 
                   5749:          gilbert|sullivan
                   5750: 
1.1.1.4 ! misho    5751:        matches  either "gilbert" or "sullivan". Any number of alternatives may
        !          5752:        appear, and an empty  alternative  is  permitted  (matching  the  empty
1.1       misho    5753:        string). The matching process tries each alternative in turn, from left
1.1.1.4 ! misho    5754:        to right, and the first one that succeeds is used. If the  alternatives
        !          5755:        are  within a subpattern (defined below), "succeeds" means matching the
1.1       misho    5756:        rest of the main pattern as well as the alternative in the subpattern.
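
                                The left-to-right rule can be seen in a small sketch. It uses
                                Python's re module, which is assumed to try alternatives in
                                the same order as PCRE; the patterns "a|ab" and "yes|" are
                                made up for the illustration.

```python
import re

either = re.compile(r"gilbert|sullivan")
names = [s for s in ("gilbert", "sullivan", "gilbert and sullivan")
         if either.fullmatch(s)]

# Alternatives are tried left to right; the first that succeeds is used,
# even when a later alternative would match more text.
first_wins = re.match(r"a|ab", "ab").group()

# An empty alternative matches the empty string.
empty_ok = re.fullmatch(r"yes|", "") is not None
```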
                   5757: 
                   5758: 
                   5759: INTERNAL OPTION SETTING
                   5760: 
1.1.1.4 ! misho    5761:        The settings of the  PCRE_CASELESS,  PCRE_MULTILINE,  PCRE_DOTALL,  and
        !          5762:        PCRE_EXTENDED  options  (which are Perl-compatible) can be changed from
        !          5763:        within the pattern by  a  sequence  of  Perl  option  letters  enclosed
1.1       misho    5764:        between "(?" and ")".  The option letters are
                   5765: 
                   5766:          i  for PCRE_CASELESS
                   5767:          m  for PCRE_MULTILINE
                   5768:          s  for PCRE_DOTALL
                   5769:          x  for PCRE_EXTENDED
                   5770: 
                   5771:        For example, (?im) sets caseless, multiline matching. It is also possi-
                   5772:        ble to unset these options by preceding the letter with a hyphen, and a
1.1.1.4 ! misho    5773:        combined  setting and unsetting such as (?im-sx), which sets PCRE_CASE-
        !          5774:        LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and  PCRE_EXTENDED,
        !          5775:        is  also  permitted.  If  a  letter  appears  both before and after the
1.1       misho    5776:        hyphen, the option is unset.
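
                                A sketch of setting and unsetting options from within a
                                pattern follows. It uses Python's re module in place of PCRE.
                                Python accepts unscoped settings such as (?im) only at the
                                start of a pattern, so the unsetting is shown in the scoped
                                form (?-i:...); the sample text is invented.

```python
import re

text = "first line\nABC\nlast line"

# (?im) sets caseless and multiline matching from inside the pattern,
# so ^ and $ match at internal newlines and "abc" matches "ABC".
with_options = re.search(r"(?im)^abc$", text)
without_options = re.search(r"^abc$", text)

# Unsetting an option: (?-i:...) is the scoped form that Python accepts.
mixed = [s for s in ("AbC", "abc", "ABC")
         if re.fullmatch(r"(?i)a(?-i:b)c", s)]
```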
                   5777: 
1.1.1.4 ! misho    5778:        The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and  PCRE_EXTRA
        !          5779:        can  be changed in the same way as the Perl-compatible options by using
1.1       misho    5780:        the characters J, U and X respectively.
                   5781: 
1.1.1.4 ! misho    5782:        When one of these option changes occurs at  top  level  (that  is,  not
        !          5783:        inside  subpattern parentheses), the change applies to the remainder of
1.1       misho    5784:        the pattern that follows. If the change is placed right at the start of
                   5785:        a pattern, PCRE extracts it into the global options (and it will there-
                   5786:        fore show up in data extracted by the pcre_fullinfo() function).
                   5787: 
1.1.1.4 ! misho    5788:        An option change within a subpattern (see below for  a  description  of
        !          5789:        subpatterns)  affects only that part of the subpattern that follows it,
1.1       misho    5790:        so
                   5791: 
                   5792:          (a(?i)b)c
                   5793: 
                   5794:        matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
1.1.1.4 ! misho    5795:        used).   By  this means, options can be made to have different settings
        !          5796:        in different parts of the pattern. Any changes made in one  alternative
        !          5797:        do  carry  on  into subsequent branches within the same subpattern. For
1.1       misho    5798:        example,
                   5799: 
                   5800:          (a(?i)b|c)
                   5801: 
1.1.1.4 ! misho    5802:        matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
        !          5803:        first  branch  is  abandoned before the option setting. This is because
        !          5804:        the effects of option settings happen at compile time. There  would  be
1.1       misho    5805:        some very weird behaviour otherwise.
                   5806: 
1.1.1.4 ! misho    5807:        Note:  There  are  other  PCRE-specific  options that can be set by the
        !          5808:        application when the compiling or matching  functions  are  called.  In
        !          5809:        some  cases  the  pattern can contain special leading sequences such as
        !          5810:        (*CRLF) to override what the application  has  set  or  what  has  been
        !          5811:        defaulted.   Details   are  given  in  the  section  entitled  "Newline
        !          5812:        sequences" above. There are also the  (*UTF8), (*UTF16), (*UTF32),  and
        !          5813:        (*UCP)  leading sequences that can be used to set UTF and Unicode prop-
        !          5814:        erty modes; they are equivalent to setting the  PCRE_UTF8,  PCRE_UTF16,
        !          5815:        PCRE_UTF32  and the PCRE_UCP options, respectively. The (*UTF) sequence
        !          5816:        is a generic version that can be used with any of the  libraries.  How-
        !          5817:        ever,  the  application  can set the PCRE_NEVER_UTF option, which locks
        !          5818:        out the use of the (*UTF) sequences.
1.1       misho    5819: 
                   5820: 
                   5821: SUBPATTERNS
                   5822: 
                   5823:        Subpatterns are delimited by parentheses (round brackets), which can be
                   5824:        nested.  Turning part of a pattern into a subpattern does two things:
                   5825: 
                   5826:        1. It localizes a set of alternatives. For example, the pattern
                   5827: 
                   5828:          cat(aract|erpillar|)
                   5829: 
1.1.1.3   misho    5830:        matches  "cataract",  "caterpillar", or "cat". Without the parentheses,
1.1       misho    5831:        it would match "cataract", "erpillar" or an empty string.
                   5832: 
1.1.1.3   misho    5833:        2. It sets up the subpattern as  a  capturing  subpattern.  This  means
                   5834:        that,  when  the  whole  pattern  matches,  that portion of the subject
1.1       misho    5835:        string that matched the subpattern is passed back to the caller via the
1.1.1.3   misho    5836:        ovector  argument  of  the matching function. (This applies only to the
                   5837:        traditional matching functions; the DFA matching functions do not  sup-
1.1.1.2   misho    5838:        port capturing.)
                   5839: 
                   5840:        Opening parentheses are counted from left to right (starting from 1) to
1.1.1.3   misho    5841:        obtain numbers for the  capturing  subpatterns.  For  example,  if  the
1.1.1.2   misho    5842:        string "the red king" is matched against the pattern
1.1       misho    5843: 
                   5844:          the ((red|white) (king|queen))
                   5845: 
                   5846:        the captured substrings are "red king", "red", and "king", and are num-
                   5847:        bered 1, 2, and 3, respectively.
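
                                Both effects of parentheses can be observed in a short sketch.
                                Python's re module is used here as a stand-in for PCRE; its
                                groups() method plays the role that the ovector argument plays
                                for the PCRE matching functions.

```python
import re

# 1. Parentheses localize the alternatives.
localized = [s for s in ("cataract", "caterpillar", "cat", "erpillar")
             if re.fullmatch(r"cat(aract|erpillar|)", s)]

# 2. Groups are numbered by counting opening parentheses from the left.
m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captures = m.groups()
```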
                   5848: 
1.1.1.3   misho    5849:        The fact that plain parentheses fulfil  two  functions  is  not  always
                   5850:        helpful.   There are often times when a grouping subpattern is required
                   5851:        without a capturing requirement. If an opening parenthesis is  followed
                   5852:        by  a question mark and a colon, the subpattern does not do any captur-
                   5853:        ing, and is not counted when computing the  number  of  any  subsequent
                   5854:        capturing  subpatterns. For example, if the string "the white queen" is
1.1       misho    5855:        matched against the pattern
                   5856: 
                   5857:          the ((?:red|white) (king|queen))
                   5858: 
                   5859:        the captured substrings are "white queen" and "queen", and are numbered
                   5860:        1 and 2. The maximum number of capturing subpatterns is 65535.
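
                                The effect of (?: on numbering can be checked with a short
                                sketch, again using Python's re module as a stand-in for PCRE.

```python
import re

# (?:red|white) groups the alternatives but is not counted,
# so (king|queen) becomes group 2 rather than group 3.
m = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
captures = m.groups()
group_count = m.re.groups
```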
                   5861: 
1.1.1.3   misho    5862:        As  a  convenient shorthand, if any option settings are required at the
                   5863:        start of a non-capturing subpattern,  the  option  letters  may  appear
1.1       misho    5864:        between the "?" and the ":". Thus the two patterns
                   5865: 
                   5866:          (?i:saturday|sunday)
                   5867:          (?:(?i)saturday|sunday)
                   5868: 
                   5869:        match exactly the same set of strings. Because alternative branches are
1.1.1.3   misho    5870:        tried from left to right, and options are not reset until  the  end  of
                   5871:        the  subpattern is reached, an option setting in one branch does affect
                   5872:        subsequent branches, so the above patterns match "SUNDAY"  as  well  as
1.1       misho    5873:        "Saturday".
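
                                A sketch of the shorthand form follows. It uses Python's re
                                module in place of PCRE. Only (?i:...) is shown, because
                                recent Python versions reject an unscoped (?i) that is not at
                                the start of the pattern.

```python
import re

# The i option applies to every branch inside the group.
weekend = re.compile(r"(?i:saturday|sunday)")
hits = [s for s in ("SUNDAY", "Saturday", "monday")
        if weekend.fullmatch(s)]
```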
                   5874: 
                   5875: 
                   5876: DUPLICATE SUBPATTERN NUMBERS
                   5877: 
                   5878:        Perl 5.10 introduced a feature whereby each alternative in a subpattern
1.1.1.3   misho    5879:        uses the same numbers for its capturing parentheses. Such a  subpattern
                   5880:        starts  with (?| and is itself a non-capturing subpattern. For example,
1.1       misho    5881:        consider this pattern:
                   5882: 
                   5883:          (?|(Sat)ur|(Sun))day
                   5884: 
1.1.1.3   misho    5885:        Because the two alternatives are inside a (?| group, both sets of  cap-
                   5886:        turing  parentheses  are  numbered one. Thus, when the pattern matches,
                   5887:        you can look at captured substring number  one,  whichever  alternative
                   5888:        matched.  This  construct  is useful when you want to capture part, but
1.1       misho    5889:        not all, of one of a number of alternatives. Inside a (?| group, paren-
1.1.1.3   misho    5890:        theses  are  numbered as usual, but the number is reset at the start of
                   5891:        each branch. The numbers of any capturing parentheses that  follow  the
                   5892:        subpattern  start after the highest number used in any branch. The fol-
1.1       misho    5893:        lowing example is taken from the Perl documentation. The numbers under-
                   5894:        neath show in which buffer the captured content will be stored.
                   5895: 
                   5896:          # before  ---------------branch-reset----------- after
                   5897:          / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
                   5898:          # 1            2         2  3        2     3     4
                   5899: 
1.1.1.3   misho    5900:        A  back  reference  to a numbered subpattern uses the most recent value
                   5901:        that is set for that number by any subpattern.  The  following  pattern
1.1       misho    5902:        matches "abcabc" or "defdef":
                   5903: 
                   5904:          /(?|(abc)|(def))\1/
                   5905: 
1.1.1.3   misho    5906:        In  contrast,  a subroutine call to a numbered subpattern always refers
                   5907:        to the first one in the pattern with the given  number.  The  following
1.1       misho    5908:        pattern matches "abcabc" or "defabc":
                   5909: 
                   5910:          /(?|(abc)|(def))(?1)/
                   5911: 
1.1.1.3   misho    5912:        If  a condition test for a subpattern's having matched refers to a non-
                   5913:        unique number, the test is true if any of the subpatterns of that  num-
1.1       misho    5914:        ber have matched.
                   5915: 
1.1.1.3   misho    5916:        An  alternative approach to using this "branch reset" feature is to use
1.1       misho    5917:        duplicate named subpatterns, as described in the next section.
                   5918: 
                   5919: 
                   5920: NAMED SUBPATTERNS
                   5921: 
1.1.1.3   misho    5922:        Identifying capturing parentheses by number is simple, but  it  can  be
                   5923:        very  hard  to keep track of the numbers in complicated regular expres-
                   5924:        sions. Furthermore, if an  expression  is  modified,  the  numbers  may
                   5925:        change.  To help with this difficulty, PCRE supports the naming of sub-
1.1       misho    5926:        patterns. This feature was not added to Perl until release 5.10. Python
1.1.1.3   misho    5927:        had  the  feature earlier, and PCRE introduced it at release 4.0, using
                   5928:        the Python syntax. PCRE now supports both the Perl and the Python  syn-
                   5929:        tax.  Perl  allows  identically  numbered subpatterns to have different
1.1       misho    5930:        names, but PCRE does not.
                   5931: 
1.1.1.3   misho    5932:        In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
                   5933:        or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
                   5934:        to capturing parentheses from other parts of the pattern, such as  back
                   5935:        references,  recursion,  and conditions, can be made by name as well as
1.1       misho    5936:        by number.
                   5937: 
1.1.1.3   misho    5938:        Names consist of up to  32  alphanumeric  characters  and  underscores.
                   5939:        Named  capturing  parentheses  are  still  allocated numbers as well as
                   5940:        names, exactly as if the names were not present. The PCRE API  provides
1.1       misho    5941:        function calls for extracting the name-to-number translation table from
                   5942:        a compiled pattern. There is also a convenience function for extracting
                   5943:        a captured substring by name.
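
                                The relationship between names and numbers can be sketched
                                with the Python syntax (?P<name>...). The example uses
                                Python's re module, whose groupindex attribute is only an
                                analogue of the PCRE name-to-number table; the pattern and
                                group names are invented.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2013-07")

# Named groups still receive ordinary numbers, counted as usual.
name_to_number = dict(date.groupindex)

# The same substring is available by name or by number.
by_name = m.group("year")
by_number = m.group(1)
```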
                   5944: 
1.1.1.3   misho    5945:        By  default, a name must be unique within a pattern, but it is possible
1.1       misho    5946:        to relax this constraint by setting the PCRE_DUPNAMES option at compile
1.1.1.3   misho    5947:        time.  (Duplicate  names are also always permitted for subpatterns with
                   5948:        the same number, set up as described in the previous  section.)  Dupli-
                   5949:        cate  names  can  be useful for patterns where only one instance of the
                   5950:        named parentheses can match. Suppose you want to match the  name  of  a
                   5951:        weekday,  either as a 3-letter abbreviation or as the full name, and in
1.1       misho    5952:        both cases you want to extract the abbreviation. This pattern (ignoring
                   5953:        the line breaks) does the job:
                   5954: 
                   5955:          (?<DN>Mon|Fri|Sun)(?:day)?|
                   5956:          (?<DN>Tue)(?:sday)?|
                   5957:          (?<DN>Wed)(?:nesday)?|
                   5958:          (?<DN>Thu)(?:rsday)?|
                   5959:          (?<DN>Sat)(?:urday)?
                   5960: 
1.1.1.3   misho    5961:        There  are  five capturing substrings, but only one is ever set after a
1.1       misho    5962:        match.  (An alternative way of solving this problem is to use a "branch
                   5963:        reset" subpattern, as described in the previous section.)
                   5964: 
1.1.1.3   misho    5965:        The  convenience  function  for extracting the data by name returns the
                   5966:        substring for the first (and in this example, the only)  subpattern  of
                   5967:        that  name  that  matched.  This saves searching to find which numbered
1.1       misho    5968:        subpattern it was.
                   5969: 
1.1.1.3   misho    5970:        If you make a back reference to  a  non-unique  named  subpattern  from
                   5971:        elsewhere  in the pattern, the one that corresponds to the first occur-
1.1       misho    5972:        rence of the name is used. In the absence of duplicate numbers (see the
1.1.1.3   misho    5973:        previous  section) this is the one with the lowest number. If you use a
                   5974:        named reference in a condition test (see the section  about  conditions
                   5975:        below),  either  to check whether a subpattern has matched, or to check
                   5976:        for recursion, all subpatterns with the same name are  tested.  If  the
                   5977:        condition  is  true for any one of them, the overall condition is true.
1.1       misho    5978:        This is the same behaviour as testing by number. For further details of
                   5979:        the interfaces for handling named subpatterns, see the pcreapi documen-
                   5980:        tation.
                   5981: 
                   5982:        Warning: You cannot use different names to distinguish between two sub-
1.1.1.3   misho    5983:        patterns  with  the same number because PCRE uses only the numbers when
1.1       misho    5984:        matching. For this reason, an error is given at compile time if differ-
1.1.1.3   misho    5985:        ent  names  are given to subpatterns with the same number. However, you
                   5986:        can give the same name to subpatterns with the same number,  even  when
1.1       misho    5987:        PCRE_DUPNAMES is not set.
                   5988: 
                   5989: 
                   5990: REPETITION
                   5991: 
1.1.1.3   misho    5992:        Repetition  is  specified  by  quantifiers, which can follow any of the
1.1       misho    5993:        following items:
                   5994: 
                   5995:          a literal data character
                   5996:          the dot metacharacter
                   5997:          the \C escape sequence
1.1.1.2   misho    5998:          the \X escape sequence
1.1       misho    5999:          the \R escape sequence
                   6000:          an escape such as \d or \pL that matches a single character
                   6001:          a character class
                   6002:          a back reference (see next section)
                   6003:          a parenthesized subpattern (including assertions)
                   6004:          a subroutine call to a subpattern (recursive or otherwise)
                   6005: 
1.1.1.3   misho    6006:        The general repetition quantifier specifies a minimum and maximum  num-
                   6007:        ber  of  permitted matches, by giving the two numbers in curly brackets
                   6008:        (braces), separated by a comma. The numbers must be  less  than  65536,
1.1       misho    6009:        and the first must be less than or equal to the second. For example:
                   6010: 
                   6011:          z{2,4}
                   6012: 
1.1.1.3   misho    6013:        matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
                   6014:        special character. If the second number is omitted, but  the  comma  is
                   6015:        present,  there  is  no upper limit; if the second number and the comma
                   6016:        are both omitted, the quantifier specifies an exact number of  required
1.1       misho    6017:        matches. Thus
                   6018: 
                   6019:          [aeiou]{3,}
                   6020: 
                   6021:        matches at least 3 successive vowels, but may match many more, while
                   6022: 
                   6023:          \d{8}
                   6024: 
1.1.1.3   misho    6025:        matches  exactly  8  digits. An opening curly bracket that appears in a
                   6026:        position where a quantifier is not allowed, or one that does not  match
                   6027:        the  syntax of a quantifier, is taken as a literal character. For exam-
1.1       misho    6028:        ple, {,6} is not a quantifier, but a literal string of four characters.
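
                                The three quantifier forms above can be tried out with a short
                                sketch using Python's re module as a stand-in for PCRE. The
                                {,6} case is left out because Python, unlike PCRE, treats it
                                as a quantifier. The subject strings are invented.

```python
import re

# {2,4}: between two and four repetitions.
two_to_four = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
               if re.fullmatch(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: an exact count.
eight_digits = [s for s in ("1234567", "12345678", "123456789")
                if re.fullmatch(r"\d{8}", s)]
```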
                   6029: 
1.1.1.2   misho    6030:        In UTF modes, quantifiers apply to characters rather than to individual
1.1.1.3   misho    6031:        data  units. Thus, for example, \x{100}{2} matches two characters, each
1.1.1.2   misho    6032:        of which is represented by a two-byte sequence in a UTF-8 string. Simi-
1.1.1.4 ! misho    6033:        larly,  \X{3} matches three Unicode extended grapheme clusters, each of
        !          6034:        which may be several data units long (and  they  may  be  of  different
        !          6035:        lengths).
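
                                The distinction between characters and data units can be
                                sketched as follows. Python's re module matches str subjects
                                character by character, which is assumed here to correspond
                                to PCRE in a UTF mode; \x{100} is written as \u0100.

```python
import re

# U+0100 takes two bytes in UTF-8, yet {2} counts characters, not bytes.
subject = "\u0100\u0100"
utf8_length = len(subject.encode("utf-8"))

two_chars = re.fullmatch("\u0100{2}", subject) is not None
one_char = re.fullmatch("\u0100{2}", "\u0100") is not None
```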
1.1       misho    6036: 
                   6037:        The quantifier {0} is permitted, causing the expression to behave as if
                   6038:        the previous item and the quantifier were not present. This may be use-
1.1.1.4 ! misho    6039:        ful  for  subpatterns that are referenced as subroutines from elsewhere
1.1       misho    6040:        in the pattern (but see also the section entitled "Defining subpatterns
1.1.1.4 ! misho    6041:        for  use  by  reference only" below). Items other than subpatterns that
1.1       misho    6042:        have a {0} quantifier are omitted from the compiled pattern.
                   6043: 
1.1.1.4 ! misho    6044:        For convenience, the three most common quantifiers have  single-charac-
1.1       misho    6045:        ter abbreviations:
                   6046: 
                   6047:          *    is equivalent to {0,}
                   6048:          +    is equivalent to {1,}
                   6049:          ?    is equivalent to {0,1}
                   6050: 
1.1.1.4 ! misho    6051:        It  is  possible  to construct infinite loops by following a subpattern
1.1       misho    6052:        that can match no characters with a quantifier that has no upper limit,
                   6053:        for example:
                   6054: 
                   6055:          (a?)*
                   6056: 
                   6057:        Earlier versions of Perl and PCRE used to give an error at compile time
1.1.1.4 ! misho    6058:        for such patterns. However, because there are cases where this  can  be
        !          6059:        useful,  such  patterns  are now accepted, but if any repetition of the
        !          6060:        subpattern does in fact match no characters, the loop is forcibly  bro-
1.1       misho    6061:        ken.
                   6062: 
1.1.1.4 ! misho    6063:        By  default,  the quantifiers are "greedy", that is, they match as much
        !          6064:        as possible (up to the maximum  number  of  permitted  times),  without
        !          6065:        causing  the  rest of the pattern to fail. The classic example of where
1.1       misho    6066:        this gives problems is in trying to match comments in C programs. These
1.1.1.4 ! misho    6067:        appear between /* and */, and within the comment, individual * and /
        !          6068:        characters may appear. An attempt to match C comments by  applying  the
1.1       misho    6069:        pattern
                   6070: 
                   6071:          /\*.*\*/
                   6072: 
                   6073:        to the string
                   6074: 
                   6075:          /* first comment */  not comment  /* second comment */
                   6076: 
1.1.1.4 ! misho    6077:        fails,  because it matches the entire string owing to the greediness of
1.1       misho    6078:        the .*  item.
                   6079: 
1.1.1.4 ! misho    6080:        However, if a quantifier is followed by a question mark, it  ceases  to
1.1       misho    6081:        be greedy, and instead matches the minimum number of times possible, so
                   6082:        the pattern
                   6083: 
                   6084:          /\*.*?\*/
                   6085: 
1.1.1.4 ! misho    6086:        does the right thing with the C comments. The meaning  of  the  various
        !          6087:        quantifiers  is  not  otherwise  changed,  just the preferred number of
        !          6088:        matches.  Do not confuse this use of question mark with its  use  as  a
        !          6089:        quantifier  in its own right. Because it has two uses, it can sometimes
1.1       misho    6090:        appear doubled, as in
                   6091: 
                   6092:          \d??\d
                   6093: 
                   6094:        which matches one digit by preference, but can match two if that is the
                   6095:        only way the rest of the pattern matches.
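
The greedy and lazy behaviour described above can be tried out with a short
script. This is only a sketch: it uses Python's standard "re" module rather
than PCRE itself, on the assumption that the two engines agree for these
simple patterns. The variable names are illustrative.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs on to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group(0)

# Lazy .*? stops at the first "*/" it can reach.
lazy = re.search(r"/\*.*?\*/", subject).group(0)

print(greedy == subject)  # True
print(lazy)               # /* first comment */

# \d??\d prefers one digit, but takes two when the rest of the pattern needs it.
one_digit = re.match(r"\d??\d", "12").group(0)
two_digits = re.match(r"\d??\d$", "12").group(0)
print(one_digit)   # 1
print(two_digits)  # 12
```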
                   6096: 
1.1.1.4 ! misho    6097:        If  the PCRE_UNGREEDY option is set (an option that is not available in
        !          6098:        Perl), the quantifiers are not greedy by default, but  individual  ones
        !          6099:        can  be  made  greedy  by following them with a question mark. In other
1.1       misho    6100:        words, it inverts the default behaviour.
                   6101: 
1.1.1.4 ! misho    6102:        When a parenthesized subpattern is quantified  with  a  minimum  repeat
        !          6103:        count  that is greater than 1 or with a limited maximum, more memory is
        !          6104:        required for the compiled pattern, in proportion to  the  size  of  the
1.1       misho    6105:        minimum or maximum.
                   6106: 
                   6107:        If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
1.1.1.4 ! misho    6108:        alent to Perl's /s) is set, thus allowing the dot  to  match  newlines,
        !          6109:        the  pattern  is  implicitly anchored, because whatever follows will be
        !          6110:        tried against every character position in the subject string, so  there
        !          6111:        is  no  point  in  retrying the overall match at any position after the
        !          6112:        first. PCRE normally treats such a pattern as though it  were  preceded
1.1       misho    6113:        by \A.
                   6114: 
1.1.1.4 ! misho    6115:        In  cases  where  it  is known that the subject string contains no new-
        !          6116:        lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-
1.1       misho    6117:        mization, or alternatively using ^ to indicate anchoring explicitly.
                   6118: 
1.1.1.4 ! misho    6119:        However,  there  are  some cases where the optimization cannot be used.
1.1       misho    6120:        When .*  is inside capturing parentheses that are the subject of a back
                   6121:        reference elsewhere in the pattern, a match at the start may fail where
                   6122:        a later one succeeds. Consider, for example:
                   6123: 
                   6124:          (.*)abc\1
                   6125: 
1.1.1.4 ! misho    6126:        If the subject is "xyz123abc123" the match point is the fourth  charac-
1.1       misho    6127:        ter. For this reason, such a pattern is not implicitly anchored.
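
The match point in this example can be checked with a small script. The
sketch below uses Python's "re" module as a stand-in for PCRE, assuming both
engines treat this pattern the same way; it reports a zero-based offset, so
the fourth character is offset 3.

```python
import re

match = re.search(r"(.*)abc\1", "xyz123abc123")

# No attempt starting at offsets 0, 1 or 2 can succeed, because the text
# after "abc" must repeat whatever group 1 captured.
print(match.start())   # 3
print(match.group(0))  # 123abc123
print(match.group(1))  # 123
```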
                   6128: 
1.1.1.4 ! misho    6129:        Another  case where implicit anchoring is not applied is when the lead-
        !          6130:        ing .* is inside an atomic group. Once again, a match at the start  may
        !          6131:        fail where a later one succeeds. Consider this pattern:
        !          6132: 
        !          6133:          (?>.*?a)b
        !          6134: 
        !          6135:        It  matches "ab" in the subject "aab". The use of the backtracking con-
        !          6136:        trol verbs (*PRUNE) and (*SKIP) also disables this optimization.
        !          6137: 
1.1       misho    6138:        When a capturing subpattern is repeated, the value captured is the sub-
                   6139:        string that matched the final iteration. For example, after
                   6140: 
                   6141:          (tweedle[dume]{3}\s*)+
                   6142: 
                   6143:        has matched "tweedledum tweedledee" the value of the captured substring
1.1.1.3   misho    6144:        is "tweedledee". However, if there are  nested  capturing  subpatterns,
                   6145:        the  corresponding captured values may have been set in previous itera-
1.1       misho    6146:        tions. For example, after
                   6147: 
                   6148:          /(a|(b))+/
                   6149: 
                   6150:        matches "aba" the value of the second captured substring is "b".
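
Both capturing rules can be observed directly. The following is a sketch
using Python's "re" module, which is assumed here to keep captured values
across iterations in the same way as PCRE; it is not a test of PCRE itself.

```python
import re

repeated = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
# Group 1 holds the substring matched by the final iteration only.
print(repeated.group(1))  # tweedledee

nested = re.match(r"(a|(b))+", "aba")
# Group 1 comes from the last iteration ("a"); group 2 was set in the
# second iteration and is not cleared by the third.
print(nested.group(1), nested.group(2))  # a b
```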
                   6151: 
                   6152: 
                   6153: ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
                   6154: 
1.1.1.3   misho    6155:        With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
                   6156:        repetition,  failure  of what follows normally causes the repeated item
                   6157:        to be re-evaluated to see if a different number of repeats  allows  the
                   6158:        rest  of  the pattern to match. Sometimes it is useful to prevent this,
                   6159:        either to change the nature of the match, or to cause it to fail earlier
                   6160:        than  it otherwise might, when the author of the pattern knows there is
1.1       misho    6161:        no point in carrying on.
                   6162: 
1.1.1.3   misho    6163:        Consider, for example, the pattern \d+foo when applied to  the  subject
1.1       misho    6164:        line
                   6165: 
                   6166:          123456bar
                   6167: 
                   6168:        After matching all 6 digits and then failing to match "foo", the normal
1.1.1.3   misho    6169:        action of the matcher is to try again with only 5 digits  matching  the
                   6170:        \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
                   6171:        "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
                   6172:        the  means for specifying that once a subpattern has matched, it is not
1.1       misho    6173:        to be re-evaluated in this way.
                   6174: 
1.1.1.3   misho    6175:        If we use atomic grouping for the previous example, the  matcher  gives
                   6176:        up  immediately  on failing to match "foo" the first time. The notation
1.1       misho    6177:        is a kind of special parenthesis, starting with (?> as in this example:
                   6178: 
                   6179:          (?>\d+)foo
                   6180: 
1.1.1.3   misho    6181:        This kind of parenthesis "locks up" the  part of the  pattern  it  con-
                   6182:        tains  once  it  has matched, and a failure further into the pattern is
                   6183:        prevented from backtracking into it. Backtracking past it  to  previous
1.1       misho    6184:        items, however, works as normal.
                   6185: 
1.1.1.3   misho    6186:        An  alternative  description  is that a subpattern of this type matches
                   6187:        the string of characters that an  identical  standalone  pattern  would
1.1       misho    6188:        match, if anchored at the current point in the subject string.
                   6189: 
                   6190:        Atomic grouping subpatterns are not capturing subpatterns. Simple cases
                   6191:        such as the above example can be thought of as a maximizing repeat that
1.1.1.3   misho    6192:        must  swallow  everything  it can. So, while both \d+ and \d+? are pre-
                   6193:        pared to adjust the number of digits they match in order  to  make  the
1.1       misho    6194:        rest of the pattern match, (?>\d+) can only match an entire sequence of
                   6195:        digits.
                   6196: 
1.1.1.3   misho    6197:        Atomic groups in general can of course contain arbitrarily  complicated
                   6198:        subpatterns,  and  can  be  nested. However, when the subpattern for an
1.1       misho    6199:        atomic group is just a single repeated item, as in the example above, a
1.1.1.3   misho    6200:        simpler notation, called a "possessive quantifier", can be used.  This
                   6201:        consists of an additional + character  following  a  quantifier.  Using
1.1       misho    6202:        this notation, the previous example can be rewritten as
                   6203: 
                   6204:          \d++foo
                   6205: 
                   6206:        Note that a possessive quantifier can be used with an entire group, for
                   6207:        example:
                   6208: 
                   6209:          (abc|xyz){2,3}+
                   6210: 
1.1.1.3   misho    6211:        Possessive  quantifiers  are  always  greedy;  the   setting   of   the
1.1       misho    6212:        PCRE_UNGREEDY option is ignored. They are a convenient notation for the
1.1.1.3   misho    6213:        simpler forms of atomic group. However, there is no difference  in  the
                   6214:        meaning  of  a  possessive  quantifier and the equivalent atomic group,
                   6215:        though there may be a performance  difference;  possessive  quantifiers
1.1       misho    6216:        should be slightly faster.
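
The effect of "locking up" a repeat can be seen by following the digits with
a digit. The sketch below again uses Python's "re" module, not PCRE. Python
accepts the (?>...) and ++ notations only from version 3.11 onwards, so that
part is guarded by a version check. The subject string is an illustrative
choice.

```python
import re
import sys

subject = "12345"

# An ordinary greedy repeat gives back the final digit so that "5" can match.
backtracking = re.search(r"\d+5", subject)
print(backtracking.group(0))  # 12345

# Atomic groups and possessive quantifiers need Python 3.11 or later.
if sys.version_info >= (3, 11):
    # Once \d+ has swallowed all the digits it cannot give any back,
    # so the trailing "5" can never match.
    atomic = re.search(r"(?>\d+)5", subject)
    possessive = re.search(r"\d++5", subject)
    print(atomic, possessive)  # None None
```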
                   6217: 
1.1.1.3   misho    6218:        The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
                   6219:        tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
1.1       misho    6220:        edition of his book. Mike McCloskey liked it, so implemented it when he
1.1.1.3   misho    6221:        built Sun's Java package, and PCRE copied it from there. It  ultimately
1.1       misho    6222:        found its way into Perl at release 5.10.
                   6223: 
                   6224:        PCRE has an optimization that automatically "possessifies" certain sim-
1.1.1.3   misho    6225:        ple pattern constructs. For example, the sequence  A+B  is  treated  as
                   6226:        A++B  because  there is no point in backtracking into a sequence of A's
1.1       misho    6227:        when B must follow.
                   6228: 
1.1.1.3   misho    6229:        When a pattern contains an unlimited repeat inside  a  subpattern  that
                   6230:        can  itself  be  repeated  an  unlimited number of times, the use of an
                   6231:        atomic group is the only way to avoid some  failing  matches  taking  a
1.1       misho    6232:        very long time indeed. The pattern
                   6233: 
                   6234:          (\D+|<\d+>)*[!?]
                   6235: 
1.1.1.3   misho    6236:        matches  an  unlimited number of substrings that either consist of non-
                   6237:        digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
1.1       misho    6238:        matches, it runs quickly. However, if it is applied to
                   6239: 
                   6240:          aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
                   6241: 
1.1.1.3   misho    6242:        it  takes  a  long  time  before reporting failure. This is because the
                   6243:        string can be divided between the internal \D+ repeat and the  external
                   6244:        *  repeat  in  a  large  number of ways, and all have to be tried. (The
                   6245:        example uses [!?] rather than a single character at  the  end,  because
                   6246:        both  PCRE  and  Perl have an optimization that allows for fast failure
                   6247:        when a single character is used. They remember the last single  charac-
                   6248:        ter  that  is required for a match, and fail early if it is not present
                   6249:        in the string.) If the pattern is changed so that  it  uses  an  atomic
1.1       misho    6250:        group, like this:
                   6251: 
                   6252:          ((?>\D+)|<\d+>)*[!?]
                   6253: 
                   6254:        sequences of non-digits cannot be broken, and failure happens quickly.
                   6255: 
                   6256: 
                   6257: BACK REFERENCES
                   6258: 
                   6259:        Outside a character class, a backslash followed by a digit greater than
                   6260:        0 (and possibly further digits) is a back reference to a capturing sub-
1.1.1.3   misho    6261:        pattern  earlier  (that is, to its left) in the pattern, provided there
1.1       misho    6262:        have been that many previous capturing left parentheses.
                   6263: 
                   6264:        However, if the decimal number following the backslash is less than 10,
1.1.1.3   misho    6265:        it  is  always  taken  as a back reference, and causes an error only if
                   6266:        there are not that many capturing left parentheses in the  entire  pat-
                   6267:        tern.  In  other words, the parentheses that are referenced need not be
                   6268:        to the left of the reference for numbers less than 10. A "forward  back
                   6269:        reference"  of  this  type can make sense when a repetition is involved
                   6270:        and the subpattern to the right has participated in an  earlier  itera-
1.1       misho    6271:        tion.
                   6272: 
1.1.1.3   misho    6273:        It  is  not  possible to have a numerical "forward back reference" to a
                   6274:        subpattern whose number is 10 or  more  using  this  syntax  because  a
                   6275:        sequence  such  as  \50 is interpreted as a character defined in octal.
1.1       misho    6276:        See the subsection entitled "Non-printing characters" above for further
1.1.1.3   misho    6277:        details  of  the  handling of digits following a backslash. There is no
                   6278:        such problem when named parentheses are used. A back reference  to  any
1.1       misho    6279:        subpattern is possible using named parentheses (see below).
                   6280: 
1.1.1.3   misho    6281:        Another  way  of  avoiding  the ambiguity inherent in the use of digits
                   6282:        following a backslash is to use the \g  escape  sequence.  This  escape
1.1       misho    6283:        must be followed by an unsigned number or a negative number, optionally
                   6284:        enclosed in braces. These examples are all identical:
                   6285: 
                   6286:          (ring), \1
                   6287:          (ring), \g1
                   6288:          (ring), \g{1}
                   6289: 
1.1.1.3   misho    6290:        An unsigned number specifies an absolute reference without the  ambigu-
1.1       misho    6291:        ity that is present in the older syntax. It is also useful when literal
                   6292:        digits follow the reference. A negative number is a relative reference.
                   6293:        Consider this example:
                   6294: 
                   6295:          (abc(def)ghi)\g{-1}
                   6296: 
                   6297:        The sequence \g{-1} is a reference to the most recently started captur-
                   6298:        ing subpattern before \g, that is, it is equivalent to \2 in this exam-
1.1.1.3   misho    6299:        ple.   Similarly, \g{-2} would be equivalent to \1. The use of relative
                   6300:        references can be helpful in long patterns, and also in  patterns  that
                   6301:        are  created  by  joining  together  fragments  that contain references
1.1       misho    6302:        within themselves.
                   6303: 
1.1.1.3   misho    6304:        A back reference matches whatever actually matched the  capturing  sub-
                   6305:        pattern  in  the  current subject string, rather than anything matching
1.1       misho    6306:        the subpattern itself (see "Subpatterns as subroutines" below for a way
                   6307:        of doing that). So the pattern
                   6308: 
                   6309:          (sens|respons)e and \1ibility
                   6310: 
1.1.1.3   misho    6311:        matches  "sense and sensibility" and "response and responsibility", but
                   6312:        not "sense and responsibility". If caseful matching is in force at  the
                   6313:        time  of the back reference, the case of letters is relevant. For exam-
1.1       misho    6314:        ple,
                   6315: 
                   6316:          ((?i)rah)\s+\1
                   6317: 
1.1.1.3   misho    6318:        matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
1.1       misho    6319:        original capturing subpattern is matched caselessly.
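
These two examples can be reproduced in a few lines. This is a sketch using
Python's "re" module, not PCRE. Python does not accept an option setting
such as (?i) in the middle of a pattern, so the second pattern is written
with the scoped form (?i:rah), which is assumed to be equivalent for this
purpose.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")
subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
# The back reference must repeat the text that group 1 actually matched.
results = [pattern.fullmatch(s) is not None for s in subjects]
print(results)  # [True, True, False]

# Group 1 is matched caselessly, but the back reference is caseful.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = [rah.fullmatch(s) is not None
               for s in ("rah rah", "RAH RAH", "RAH rah")]
print(rah_results)  # [True, True, False]
```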
                   6320: 
1.1.1.3   misho    6321:        There  are  several  different ways of writing back references to named
                   6322:        subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
                   6323:        \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
1.1       misho    6324:        unified back reference syntax, in which \g can be used for both numeric
1.1.1.3   misho    6325:        and  named  references,  is  also supported. We could rewrite the above
1.1       misho    6326:        example in any of the following ways:
                   6327: 
                   6328:          (?<p1>(?i)rah)\s+\k<p1>
                   6329:          (?'p1'(?i)rah)\s+\k{p1}
                   6330:          (?P<p1>(?i)rah)\s+(?P=p1)
                   6331:          (?<p1>(?i)rah)\s+\g{p1}
                   6332: 
1.1.1.3   misho    6333:        A subpattern that is referenced by  name  may  appear  in  the  pattern
1.1       misho    6334:        before or after the reference.
                   6335: 
1.1.1.3   misho    6336:        There  may be more than one back reference to the same subpattern. If a
                   6337:        subpattern has not actually been used in a particular match,  any  back
1.1       misho    6338:        references to it always fail by default. For example, the pattern
                   6339: 
                   6340:          (a|(bc))\2
                   6341: 
1.1.1.3   misho    6342:        always  fails  if  it starts to match "a" rather than "bc". However, if
1.1       misho    6343:        the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
                   6344:        ence to an unset value matches an empty string.
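
The default behaviour for an unset group can be sketched as follows, using
Python's "re" module, which likewise makes a back reference to an unset
group fail. The PCRE_JAVASCRIPT_COMPAT behaviour has no counterpart in this
sketch.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is unset when the "a" branch is taken, so \2 cannot match.
unset_result = pattern.match("abc")
print(unset_result)  # None

# When the "bc" branch is taken, \2 repeats "bc".
set_result = pattern.match("bcbc")
print(set_result.group(0))  # bcbc
```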
                   6345: 
1.1.1.3   misho    6346:        Because  there may be many capturing parentheses in a pattern, all dig-
                   6347:        its following a backslash are taken as part of a potential back  refer-
                   6348:        ence  number.   If  the  pattern continues with a digit character, some
                   6349:        delimiter must  be  used  to  terminate  the  back  reference.  If  the
                   6350:        PCRE_EXTENDED  option  is  set, this can be white space. Otherwise, the
                   6351:        \g{ syntax or an empty comment (see "Comments" below) can be used.
1.1       misho    6352: 
                   6353:    Recursive back references
                   6354: 
1.1.1.3   misho    6355:        A back reference that occurs inside the parentheses to which it  refers
                   6356:        fails  when  the subpattern is first used, so, for example, (a\1) never
                   6357:        matches.  However, such references can be useful inside  repeated  sub-
1.1       misho    6358:        patterns. For example, the pattern
                   6359: 
                   6360:          (a|b\1)+
                   6361: 
                   6362:        matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
1.1.1.3   misho    6363:        ation of the subpattern,  the  back  reference  matches  the  character
                   6364:        string  corresponding  to  the previous iteration. In order for this to
                   6365:        work, the pattern must be such that the first iteration does  not  need
                   6366:        to  match the back reference. This can be done using alternation, as in
1.1       misho    6367:        the example above, or by a quantifier with a minimum of zero.
                   6368: 
1.1.1.3   misho    6369:        Back references of this type cause the group that they reference to  be
                   6370:        treated  as  an atomic group.  Once the whole group has been matched, a
                   6371:        subsequent matching failure cannot cause backtracking into  the  middle
1.1       misho    6372:        of the group.
                   6373: 
                   6374: 
                   6375: ASSERTIONS
                   6376: 
1.1.1.3   misho    6377:        An  assertion  is  a  test on the characters following or preceding the
                   6378:        current matching point that does not actually consume  any  characters.
                   6379:        The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
1.1       misho    6380:        described above.
                   6381: 
1.1.1.3   misho    6382:        More complicated assertions are coded as  subpatterns.  There  are  two
                   6383:        kinds:  those  that  look  ahead of the current position in the subject
                   6384:        string, and those that look  behind  it.  An  assertion  subpattern  is
                   6385:        matched  in  the  normal way, except that it does not cause the current
1.1       misho    6386:        matching position to be changed.
                   6387: 
1.1.1.3   misho    6388:        Assertion subpatterns are not capturing subpatterns. If such an  asser-
                   6389:        tion  contains  capturing  subpatterns within it, these are counted for
                   6390:        the purposes of numbering the capturing subpatterns in the  whole  pat-
                   6391:        tern.  However,  substring  capturing  is carried out only for positive
1.1.1.4 ! misho    6392:        assertions. (Perl sometimes, but not always, does do capturing in nega-
        !          6393:        tive assertions.)
1.1       misho    6394: 
1.1.1.4 ! misho    6395:        For  compatibility  with  Perl,  assertion subpatterns may be repeated;
        !          6396:        though it makes no sense to assert the same thing  several  times,  the
        !          6397:        side  effect  of  capturing  parentheses may occasionally be useful. In
1.1       misho    6398:        practice, there are only three cases:
                   6399: 
1.1.1.4 ! misho    6400:        (1) If the quantifier is {0}, the  assertion  is  never  obeyed  during
        !          6401:        matching.   However,  it  may  contain internal capturing parenthesized
1.1       misho    6402:        groups that are called from elsewhere via the subroutine mechanism.
                   6403: 
1.1.1.4 ! misho    6404:        (2) If the quantifier is {0,n} where n is greater than zero, it is treated
        !          6405:        as  if  it  were  {0,1}.  At run time, the rest of the pattern match is
1.1       misho    6406:        tried with and without the assertion, the order depending on the greed-
                   6407:        iness of the quantifier.
                   6408: 
1.1.1.4 ! misho    6409:        (3)  If  the minimum repetition is greater than zero, the quantifier is
        !          6410:        ignored.  The assertion is obeyed just  once  when  encountered  during
1.1       misho    6411:        matching.
                   6412: 
                   6413:    Lookahead assertions
                   6414: 
                   6415:        Lookahead assertions start with (?= for positive assertions and (?! for
                   6416:        negative assertions. For example,
                   6417: 
                   6418:          \w+(?=;)
                   6419: 
1.1.1.4 ! misho    6420:        matches a word followed by a semicolon, but does not include the  semi-
1.1       misho    6421:        colon in the match, and
                   6422: 
                   6423:          foo(?!bar)
                   6424: 
1.1.1.4 ! misho    6425:        matches  any  occurrence  of  "foo" that is not followed by "bar". Note
1.1       misho    6426:        that the apparently similar pattern
                   6427: 
                   6428:          (?!foo)bar
                   6429: 
1.1.1.4 ! misho    6430:        does not find an occurrence of "bar"  that  is  preceded  by  something
        !          6431:        other  than "foo"; it finds any occurrence of "bar" whatsoever, because
1.1       misho    6432:        the assertion (?!foo) is always true when the next three characters are
                   6433:        "bar". A lookbehind assertion is needed to achieve the other effect.
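
The three lookahead patterns above behave as follows in a short script. The
sketch uses Python's "re" module, whose lookahead syntax is the same; the
subject strings are illustrative.

```python
import re

# The semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "value;").group(0)
print(word)  # value

# The first "foo" is followed by "bar", so the match is the second one.
foo_offset = re.search(r"foo(?!bar)", "foobar foobaz").start()
print(foo_offset)  # 7

# (?!foo) is tested where "bar" begins, so it does not exclude "foobar".
bar = re.search(r"(?!foo)bar", "foobar")
print(bar.group(0), bar.start())  # bar 3
```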
                   6434: 
                   6435:        If you want to force a matching failure at some point in a pattern, the
1.1.1.4 ! misho    6436:        most convenient way to do it is  with  (?!)  because  an  empty  string
        !          6437:        always  matches, so an assertion that requires there not to be an empty
1.1       misho    6438:        string must always fail.  The backtracking control verb (*FAIL) or (*F)
                   6439:        is a synonym for (?!).
                   6440: 
                   6441:    Lookbehind assertions
                   6442: 
1.1.1.4 ! misho    6443:        Lookbehind  assertions start with (?<= for positive assertions and (?<!
1.1       misho    6444:        for negative assertions. For example,
                   6445: 
                   6446:          (?<!foo)bar
                   6447: 
1.1.1.4 ! misho    6448:        does find an occurrence of "bar" that is not  preceded  by  "foo".  The
        !          6449:        contents  of  a  lookbehind  assertion are restricted such that all the
1.1       misho    6450:        strings it matches must have a fixed length. However, if there are sev-
1.1.1.4 ! misho    6451:        eral  top-level  alternatives,  they  do  not all have to have the same
1.1       misho    6452:        fixed length. Thus
                   6453: 
                   6454:          (?<=bullock|donkey)
                   6455: 
                   6456:        is permitted, but
                   6457: 
                   6458:          (?<!dogs?|cats?)
                   6459: 
1.1.1.4 ! misho    6460:        causes an error at compile time. Branches that match  different  length
        !          6461:        strings  are permitted only at the top level of a lookbehind assertion.
1.1       misho    6462:        This is an extension compared with Perl, which requires all branches to
                   6463:        match the same length of string. An assertion such as
                   6464: 
                   6465:          (?<=ab(c|de))
                   6466: 
1.1.1.4 ! misho    6467:        is  not  permitted,  because  its single top-level branch can match two
1.1       misho    6468:        different lengths, but it is acceptable to PCRE if rewritten to use two
                   6469:        top-level branches:
                   6470: 
                   6471:          (?<=abc|abde)
                   6472: 
1.1.1.4 ! misho    6473:        In  some  cases, the escape sequence \K (see above) can be used instead
1.1       misho    6474:        of a lookbehind assertion to get round the fixed-length restriction.
                   6475: 
1.1.1.4 ! misho    6476:        The implementation of lookbehind assertions is, for  each  alternative,
        !          6477:        to  temporarily  move the current position back by the fixed length and
1.1       misho    6478:        then try to match. If there are insufficient characters before the cur-
                   6479:        rent position, the assertion fails.
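
A short sketch of lookbehind behaviour follows, using Python's "re" module
in place of PCRE. Python's lookbehind rules are stricter than PCRE's, so
only single fixed-length strings are used here. The subject strings are
illustrative.

```python
import re

# "bar" preceded by "foo" is rejected by the negative lookbehind.
rejected = re.search(r"(?<!foo)bar", "foobar")
print(rejected)  # None

accepted = re.search(r"(?<!foo)bar", "bazbar")
print(accepted.start())  # 3

# With fewer than three characters before "bar", the attempt to match
# "foo" behind the current position cannot succeed: the positive
# assertion fails, and the negative one is therefore satisfied.
too_short_positive = re.search(r"(?<=foo)bar", "bar")
too_short_negative = re.search(r"(?<!foo)bar", "bar")
print(too_short_positive)          # None
print(too_short_negative.start())  # 0
```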
                   6480: 
1.1.1.4 ! misho    6481:        In  a UTF mode, PCRE does not allow the \C escape (which matches a sin-
        !          6482:        gle data unit even in a UTF mode) to appear in  lookbehind  assertions,
        !          6483:        because  it  makes it impossible to calculate the length of the lookbe-
        !          6484:        hind. The \X and \R escapes, which can match different numbers of  data
1.1.1.2   misho    6485:        units, are also not permitted.
1.1       misho    6486: 
1.1.1.4 ! misho    6487:        "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in
        !          6488:        lookbehinds, as long as the subpattern matches a  fixed-length  string.
1.1       misho    6489:        Recursion, however, is not supported.
                   6490: 
1.1.1.4 ! misho    6491:        Possessive  quantifiers  can  be  used  in  conjunction with lookbehind
1.1       misho    6492:        assertions to specify efficient matching of fixed-length strings at the
                   6493:        end of subject strings. Consider a simple pattern such as
                   6494: 
                   6495:          abcd$
                   6496: 
1.1.1.4 ! misho    6497:        when  applied  to  a  long string that does not match. Because matching
1.1       misho    6498:        proceeds from left to right, PCRE will look for each "a" in the subject
1.1.1.4 ! misho    6499:        and  then  see  if what follows matches the rest of the pattern. If the
1.1       misho    6500:        pattern is specified as
                   6501: 
                   6502:          ^.*abcd$
                   6503: 
1.1.1.4 ! misho    6504:        the initial .* matches the entire string at first, but when this  fails
1.1       misho    6505:        (because there is no following "a"), it backtracks to match all but the
1.1.1.4 ! misho    6506:        last character, then all but the last two characters, and so  on.  Once
        !          6507:        again  the search for "a" covers the entire string, from right to left,
1.1       misho    6508:        so we are no better off. However, if the pattern is written as
                   6509: 
                   6510:          ^.*+(?<=abcd)
                   6511: 
1.1.1.4 ! misho    6512:        there can be no backtracking for the .*+ item; it can  match  only  the
        !          6513:        entire  string.  The subsequent lookbehind assertion does a single test
        !          6514:        on the last four characters. If it fails, the match fails  immediately.
        !          6515:        For  long  strings, this approach makes a significant difference to the
1.1       misho    6516:        processing time.
                   6517: 
                   6518:    Using multiple assertions
                   6519: 
                   6520:        Several assertions (of any sort) may occur in succession. For example,
                   6521: 
                   6522:          (?<=\d{3})(?<!999)foo
                   6523: 
1.1.1.4 ! misho    6524:        matches "foo" preceded by three digits that are not "999". Notice  that
        !          6525:        each  of  the  assertions is applied independently at the same point in
        !          6526:        the subject string. First there is a  check  that  the  previous  three
        !          6527:        characters  are  all  digits,  and  then there is a check that the same
1.1       misho    6528:        three characters are not "999".  This pattern does not match "foo" pre-
1.1.1.4 ! misho    6529:        ceded  by  six  characters, the first three of which are digits and the
        !          6530:        last three of which are not "999". For  example,  it  does  not  match
1.1       misho    6531:        "123abcfoo". A pattern to do that is
                   6532: 
                   6533:          (?<=\d{3}...)(?<!999)foo
                   6534: 
1.1.1.4 ! misho    6535:        This  time  the  first assertion looks at the preceding six characters,
1.1       misho    6536:        checking that the first three are digits, and then the second assertion
                   6537:        checks that the preceding three characters are not "999".
                   6538: 
                   6539:        Assertions can be nested in any combination. For example,
                   6540: 
                   6541:          (?<=(?<!foo)bar)baz
                   6542: 
1.1.1.4 ! misho    6543:        matches  an occurrence of "baz" that is preceded by "bar" which in turn
1.1       misho    6544:        is not preceded by "foo", while
                   6545: 
                   6546:          (?<=\d{3}(?!999)...)foo
                   6547: 
1.1.1.4 ! misho    6548:        is another pattern that matches "foo" preceded by three digits and  any
1.1       misho    6549:        three characters that are not "999".
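
       The behaviour of the successive and nested assertions described in this
       section can be checked with a short script. The sketch below uses
       Python's re module as a stand-in (an assumption: it is not PCRE, though
       it accepts these particular fixed-length lookbehind patterns unchanged).
       The subject strings and the helper name matches() are made up for
       illustration.

```python
import re

# Two assertions applied at the same point: three digits, and not "999".
same_point = re.compile(r"(?<=\d{3})(?<!999)foo")
# First assertion widened to six characters, so the digits come first.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")
# Nested assertion: "baz" after "bar", where "bar" is not after "foo".
nested = re.compile(r"(?<=(?<!foo)bar)baz")


def matches(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


if __name__ == "__main__":
    for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
        print(subject, matches(same_point, subject), matches(six_back, subject))
    for subject in ("xyzbarbaz", "foobarbaz"):
        print(subject, matches(nested, subject))
```

       Only the six-character form accepts "123abcfoo", because both of its
       assertions are anchored at the start of "foo" but look back over
       different distances.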
                   6550: 
                   6551: 
                   6552: CONDITIONAL SUBPATTERNS
                   6553: 
1.1.1.4 ! misho    6554:        It  is possible to cause the matching process to obey a subpattern con-
        !          6555:        ditionally or to choose between two alternative subpatterns,  depending
        !          6556:        on  the result of an assertion, or whether a specific capturing subpat-
        !          6557:        tern has already been matched. The two possible  forms  of  conditional
1.1       misho    6558:        subpattern are:
                   6559: 
                   6560:          (?(condition)yes-pattern)
                   6561:          (?(condition)yes-pattern|no-pattern)
                   6562: 
1.1.1.4 ! misho    6563:        If  the  condition is satisfied, the yes-pattern is used; otherwise the
        !          6564:        no-pattern (if present) is used. If there are more  than  two  alterna-
        !          6565:        tives  in  the subpattern, a compile-time error occurs. Each of the two
1.1       misho    6566:        alternatives may itself contain nested subpatterns of any form, includ-
                   6567:        ing  conditional  subpatterns;  the  restriction  to  two  alternatives
                   6568:        applies only at the level of the condition. This pattern fragment is an
                   6569:        example where the alternatives are complex:
                   6570: 
                   6571:          (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
                   6572: 
                   6573: 
1.1.1.4 ! misho    6574:        There  are  four  kinds of condition: references to subpatterns, refer-
1.1       misho    6575:        ences to recursion, a pseudo-condition called DEFINE, and assertions.
                   6576: 
                   6577:    Checking for a used subpattern by number
                   6578: 
1.1.1.4 ! misho    6579:        If the text between the parentheses consists of a sequence  of  digits,
1.1       misho    6580:        the condition is true if a capturing subpattern of that number has pre-
1.1.1.4 ! misho    6581:        viously matched. If there is more than one  capturing  subpattern  with
        !          6582:        the  same  number  (see  the earlier section about duplicate subpattern
        !          6583:        numbers), the condition is true if any of them have matched. An  alter-
        !          6584:        native  notation is to precede the digits with a plus or minus sign. In
        !          6585:        this case, the subpattern number is relative rather than absolute.  The
        !          6586:        most  recently opened parentheses can be referenced by (?(-1), the next
        !          6587:        most recent by (?(-2), and so on. Inside loops it can also  make  sense
1.1       misho    6588:        to refer to subsequent groups. The next parentheses to be opened can be
1.1.1.4 ! misho    6589:        referenced as (?(+1), and so on. (The value zero in any of these  forms
1.1       misho    6590:        is not used; it provokes a compile-time error.)
                   6591: 
1.1.1.4 ! misho    6592:        Consider  the  following  pattern, which contains non-significant white
1.1       misho    6593:        space to make it more readable (assume the PCRE_EXTENDED option) and to
                   6594:        divide it into three parts for ease of discussion:
                   6595: 
                   6596:          ( \( )?    [^()]+    (?(1) \) )
                   6597: 
1.1.1.4 ! misho    6598:        The  first  part  matches  an optional opening parenthesis, and if that
1.1       misho    6599:        character is present, sets it as the first captured substring. The sec-
1.1.1.4 ! misho    6600:        ond  part  matches one or more characters that are not parentheses. The
        !          6601:        third part is a conditional subpattern that tests whether  or  not  the
        !          6602:        first set of parentheses matched. If they did, that is, if the  subject
        !          6603:        started with an opening parenthesis, the condition is true, and so  the
        !          6604:        yes-pattern  is  executed and a closing parenthesis is required. Other-
        !          6605:        wise, since no-pattern is not present, the subpattern matches  nothing.
        !          6606:        In  other  words,  this  pattern matches a sequence of non-parentheses,
1.1       misho    6607:        optionally enclosed in parentheses.
                   6608: 
1.1.1.4 ! misho    6609:        If you were embedding this pattern in a larger one,  you  could  use  a
1.1       misho    6610:        relative reference:
                   6611: 
                   6612:          ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
                   6613: 
1.1.1.4 ! misho    6614:        This  makes  the  fragment independent of the parentheses in the larger
1.1       misho    6615:        pattern.
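
       A minimal sketch of the numbered condition, again using Python's re
       module as a stand-in for PCRE (an assumption). re.VERBOSE plays the
       role of PCRE_EXTENDED here, and the helper name
       is_optionally_parenthesized() is hypothetical. The sketch uses only
       the absolute reference (?(1)...).

```python
import re

# ( \( )?  [^()]+  (?(1) \) )  with insignificant white space.
OPTIONAL_PARENS = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)


def is_optionally_parenthesized(subject):
    """True if the whole subject is a run of non-parentheses,
    optionally enclosed in one pair of parentheses."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None


if __name__ == "__main__":
    for subject in ("abc", "(abc)", "(abc", "abc)"):
        print(repr(subject), is_optionally_parenthesized(subject))
```

       The closing parenthesis is demanded only when group 1 took part in the
       match, which is why "(abc" and "abc)" are both rejected.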
                   6616: 
                   6617:    Checking for a used subpattern by name
                   6618: 
1.1.1.4 ! misho    6619:        Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
        !          6620:        used  subpattern  by  name.  For compatibility with earlier versions of
        !          6621:        PCRE, which had this facility before Perl, the syntax  (?(name)...)  is
        !          6622:        also  recognized. However, there is a possible ambiguity with this syn-
        !          6623:        tax, because subpattern names may  consist  entirely  of  digits.  PCRE
        !          6624:        looks  first for a named subpattern; if it cannot find one and the name
        !          6625:        consists entirely of digits, PCRE looks for a subpattern of  that  num-
        !          6626:        ber,  which must be greater than zero. Using subpattern names that con-
1.1       misho    6627:        sist entirely of digits is not recommended.
                   6628: 
                   6629:        Rewriting the above example to use a named subpattern gives this:
                   6630: 
                   6631:          (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
                   6632: 
1.1.1.4 ! misho    6633:        If the name used in a condition of this kind is a duplicate,  the  test
        !          6634:        is  applied to all subpatterns of the same name, and is true if any one
1.1       misho    6635:        of them has matched.
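
       The named form can be sketched the same way. This example assumes
       Python's re module rather than PCRE, so it uses the Python spelling
       (?P<OPEN>...) for the group and the (?(OPEN)...) form of the condition.
       The function name closing_required() is made up for illustration.

```python
import re

NAMED = re.compile(r"(?P<OPEN> \( )?  [^()]+  (?(OPEN) \) )", re.VERBOSE)


def closing_required(subject):
    """Report whether the OPEN group took part in a full match.

    Returns None when the subject does not match at all.
    """
    m = NAMED.fullmatch(subject)
    if m is None:
        return None
    return m.group("OPEN") is not None


if __name__ == "__main__":
    for subject in ("(abc)", "abc", "(abc"):
        print(repr(subject), closing_required(subject))
```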
                   6636: 
                   6637:    Checking for pattern recursion
                   6638: 
                   6639:        If the condition is the string (R), and there is no subpattern with the
1.1.1.4 ! misho    6640:        name  R, the condition is true if a recursive call to the whole pattern
1.1       misho    6641:        or any subpattern has been made. If digits or a name preceded by amper-
                   6642:        sand follow the letter R, for example:
                   6643: 
                   6644:          (?(R3)...) or (?(R&name)...)
                   6645: 
                   6646:        the condition is true if the most recent recursion is into a subpattern
                   6647:        whose number or name is given. This condition does not check the entire
1.1.1.4 ! misho    6648:        recursion  stack.  If  the  name  used in a condition of this kind is a
1.1       misho    6649:        duplicate, the test is applied to all subpatterns of the same name, and
                   6650:        is true if any one of them is the most recent recursion.
                   6651: 
1.1.1.4 ! misho    6652:        At  "top  level",  all  these recursion test conditions are false.  The
1.1       misho    6653:        syntax for recursive patterns is described below.
                   6654: 
                   6655:    Defining subpatterns for use by reference only
                   6656: 
1.1.1.4 ! misho    6657:        If the condition is the string (DEFINE), and  there  is  no  subpattern
        !          6658:        with  the  name  DEFINE,  the  condition is always false. In this case,
        !          6659:        there may be only one alternative  in  the  subpattern.  It  is  always
        !          6660:        skipped  if  control  reaches  this  point  in the pattern; the idea of
        !          6661:        DEFINE is that it can be used to define subroutines that can be  refer-
        !          6662:        enced  from elsewhere. (The use of subroutines is described below.) For
        !          6663:        example, a pattern to match an IPv4 address  such  as  "192.168.23.245"
1.1.1.3   misho    6664:        could be written like this (ignore white space and line breaks):
1.1       misho    6665: 
                   6666:          (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
                   6667:          \b (?&byte) (\.(?&byte)){3} \b
                   6668: 
1.1.1.4 ! misho    6669:        The  first part of the pattern is a DEFINE group inside  which  another
        !          6670:        group named "byte" is defined. This matches an individual component  of
        !          6671:        an  IPv4  address  (a number less than 256). When matching takes place,
        !          6672:        this part of the pattern is skipped because DEFINE acts  like  a  false
        !          6673:        condition.  The  rest of the pattern uses references to the named group
        !          6674:        to match the four dot-separated components of an IPv4 address,  insist-
1.1       misho    6675:        ing on a word boundary at each end.
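
       In an engine without DEFINE or subroutine calls, a similar effect can
       be approximated by building the pattern from a shared string. The
       sketch below does this with Python's re module (an assumption: this is
       textual substitution, not a DEFINE group, and the names BYTE, IPV4 and
       is_ipv4() are hypothetical).

```python
import re

# Same alternatives as the "byte" group in the DEFINE example above.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# \b (?&byte) (\.(?&byte)){3} \b, with each reference expanded textually.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def is_ipv4(subject):
    """True if the whole subject is four dot-separated byte values."""
    return IPV4.fullmatch(subject) is not None


if __name__ == "__main__":
    for subject in ("192.168.23.245", "256.1.1.1", "1.2.3"):
        print(subject, is_ipv4(subject))
```

       Unlike a real subroutine call, every expansion repeats the text of the
       subpattern, so the compiled pattern grows with each reference.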
                   6676: 
                   6677:    Assertion conditions
                   6678: 
1.1.1.4 ! misho    6679:        If  the  condition  is  not  in any of the above formats, it must be an
        !          6680:        assertion.  This may be a positive or negative lookahead or  lookbehind
        !          6681:        assertion.  Consider  this  pattern,  again  containing non-significant
1.1       misho    6682:        white space, and with the two alternatives on the second line:
                   6683: 
                   6684:          (?(?=[^a-z]*[a-z])
                   6685:          \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
                   6686: 
1.1.1.4 ! misho    6687:        The condition  is  a  positive  lookahead  assertion  that  matches  an
        !          6688:        optional  sequence of non-letters followed by a letter. In other words,
        !          6689:        it tests for the presence of at least one letter in the subject.  If  a
        !          6690:        letter  is found, the subject is matched against the first alternative;
        !          6691:        otherwise it is  matched  against  the  second.  This  pattern  matches
        !          6692:        strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1.1       misho    6693:        letters and dd are digits.
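
       An assertion condition (?(?=C)yes|no) can be rewritten for engines
       that lack it as (?:(?=C)yes|(?!C)no): the negated assertion stops the
       second alternative from being tried when the condition holds. The
       sketch below applies this rewrite with Python's re module, which is
       assumed here as a stand-in and is not PCRE. The names CONDITION, YES,
       NO and DATE_LIKE are made up for illustration.

```python
import re

CONDITION = r"[^a-z]*[a-z]"      # a letter occurs somewhere ahead
YES = r"\d{2}-[a-z]{3}-\d{2}"    # dd-aaa-dd
NO = r"\d{2}-\d{2}-\d{2}"        # dd-dd-dd

# (?(?=C)yes|no) rewritten as (?:(?=C)yes|(?!C)no)
DATE_LIKE = re.compile(rf"(?:(?={CONDITION}){YES}|(?!{CONDITION}){NO})")

if __name__ == "__main__":
    for subject in ("12-abc-34", "12-34-56", "12-34-5a"):
        print(subject, DATE_LIKE.fullmatch(subject) is not None)
```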
                   6694: 
                   6695: 
                   6696: COMMENTS
                   6697: 
                   6698:        There are two ways of including comments in patterns that are processed
                   6699:        by PCRE. In both cases, the start of the comment must not be in a char-
                   6700:        acter class, nor in the middle of any other sequence of related charac-
1.1.1.4 ! misho    6701:        ters  such  as  (?: or a subpattern name or number. The characters that
1.1       misho    6702:        make up a comment play no part in the pattern matching.
                   6703: 
1.1.1.4 ! misho    6704:        The sequence (?# marks the start of a comment that continues up to  the
        !          6705:        next  closing parenthesis. Nested parentheses are not permitted. If the
1.1       misho    6706:        PCRE_EXTENDED option is set, an unescaped # character also introduces a
1.1.1.4 ! misho    6707:        comment,  which  in  this  case continues to immediately after the next
        !          6708:        newline character or character sequence in the pattern.  Which  charac-
1.1       misho    6709:        ters are interpreted as newlines is controlled by the options passed to
1.1.1.4 ! misho    6710:        a compiling function or by a special sequence at the start of the  pat-
1.1.1.2   misho    6711:        tern, as described in the section entitled "Newline conventions" above.
                   6712:        Note that the end of this type of comment is a literal newline sequence
1.1.1.4 ! misho    6713:        in  the pattern; escape sequences that happen to represent a newline do
        !          6714:        not count. For example, consider this  pattern  when  PCRE_EXTENDED  is
1.1.1.2   misho    6715:        set, and the default newline convention is in force:
1.1       misho    6716: 
                   6717:          abc #comment \n still comment
                   6718: 
1.1.1.4 ! misho    6719:        On  encountering  the  # character, pcre_compile() skips along, looking
        !          6720:        for a newline in the pattern. The sequence \n is still literal at  this
        !          6721:        stage,  so  it does not terminate the comment. Only an actual character
1.1       misho    6722:        with the code value 0x0a (the default newline) does so.
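
       The difference between the escape sequence \n and a literal newline
       inside a # comment can be observed with the following sketch. It
       assumes Python's re module with re.VERBOSE as a stand-in for PCRE with
       PCRE_EXTENDED; the variable names are hypothetical.

```python
import re

# Raw string: the two characters backslash and "n" stay inside the comment,
# so the pattern is just "abc".
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real newline (0x0a) that ends the comment,
# so "def" is part of the pattern.
literal = re.compile("abc #comment \n def", re.VERBOSE)

# (?#...) comments do not need the extended option.
inline = re.compile(r"abc(?# up to the next closing parenthesis)def")

if __name__ == "__main__":
    print(escaped.fullmatch("abc") is not None)
    print(literal.fullmatch("abcdef") is not None)
    print(inline.fullmatch("abcdef") is not None)
```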
                   6723: 
                   6724: 
                   6725: RECURSIVE PATTERNS
                   6726: 
1.1.1.4 ! misho    6727:        Consider the problem of matching a string in parentheses, allowing  for
        !          6728:        unlimited  nested  parentheses.  Without the use of recursion, the best
        !          6729:        that can be done is to use a pattern that  matches  up  to  some  fixed
        !          6730:        depth  of  nesting.  It  is not possible to handle an arbitrary nesting
1.1       misho    6731:        depth.
                   6732: 
                   6733:        For some time, Perl has provided a facility that allows regular expres-
1.1.1.4 ! misho    6734:        sions  to recurse (amongst other things). It does this by interpolating
        !          6735:        Perl code in the expression at run time, and the code can refer to  the
1.1       misho    6736:        expression itself. A Perl pattern using code interpolation to solve the
                   6737:        parentheses problem can be created like this:
                   6738: 
                   6739:          $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
                   6740: 
                   6741:        The (?p{...}) item interpolates Perl code at run time, and in this case
                   6742:        refers recursively to the pattern in which it appears.
                   6743: 
                   6744:        Obviously, PCRE cannot support the interpolation of Perl code. Instead,
1.1.1.4 ! misho    6745:        it supports special syntax for recursion of  the  entire  pattern,  and
        !          6746:        also  for  individual  subpattern  recursion. After its introduction in
        !          6747:        PCRE and Python, this kind of  recursion  was  subsequently  introduced
1.1       misho    6748:        into Perl at release 5.10.
                   6749: 
1.1.1.4 ! misho    6750:        A  special  item  that consists of (? followed by a number greater than
        !          6751:        zero and a closing parenthesis is a recursive subroutine  call  of  the
        !          6752:        subpattern  of  the  given  number, provided that it occurs inside that
        !          6753:        subpattern. (If not, it is a non-recursive subroutine  call,  which  is
        !          6754:        described  in  the  next  section.)  The special item (?R) or (?0) is a
1.1       misho    6755:        recursive call of the entire regular expression.
                   6756: 
1.1.1.4 ! misho    6757:        This PCRE pattern solves the nested  parentheses  problem  (assume  the
1.1       misho    6758:        PCRE_EXTENDED option is set so that white space is ignored):
                   6759: 
                   6760:          \( ( [^()]++ | (?R) )* \)
                   6761: 
1.1.1.4 ! misho    6762:        First  it matches an opening parenthesis. Then it matches any number of
        !          6763:        substrings which can either be a  sequence  of  non-parentheses,  or  a
        !          6764:        recursive  match  of the pattern itself (that is, a correctly parenthe-
1.1       misho    6765:        sized substring).  Finally there is a closing parenthesis. Note the use
                   6766:        of a possessive quantifier to avoid backtracking into sequences of non-
                   6767:        parentheses.
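
       The structure of this pattern can be mirrored by a small hand-written
       recursive function. The sketch below is plain Python and does not use
       PCRE or any regular expression engine; the function names
       match_parens() and is_balanced_group() are made up for illustration.
       It shows only what the recursive pattern accepts, not how PCRE
       implements recursion.

```python
def match_parens(subject, pos=0):
    """Match one parenthesized group starting at pos.

    Returns the index just past the closing parenthesis, or None.
    """
    # Mirrors:  \( ( [^()]++ | (?R) )* \)
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1                    # the closing parenthesis
        if ch == "(":
            end = match_parens(subject, pos)  # the (?R) alternative
            if end is None:
                return None
            pos = end
        else:
            pos += 1                          # part of a [^()]+ run
    return None                               # ran out of subject


def is_balanced_group(subject):
    """True if the whole subject is one correctly nested group."""
    return match_parens(subject) == len(subject)


if __name__ == "__main__":
    for subject in ("(ab(cd)ef)", "()", "(a(b)", "(aaaa()"):
        print(subject, is_balanced_group(subject))
```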
                   6768: 
1.1.1.4 ! misho    6769:        If this were part of a larger pattern, you would not  want  to  recurse
1.1       misho    6770:        the entire pattern, so instead you could use this:
                   6771: 
                   6772:          ( \( ( [^()]++ | (?1) )* \) )
                   6773: 
1.1.1.4 ! misho    6774:        We  have  put the pattern into parentheses, and caused the recursion to
1.1       misho    6775:        refer to them instead of the whole pattern.
                   6776: 
1.1.1.4 ! misho    6777:        In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
        !          6778:        tricky.  This is made easier by the use of relative references. Instead
1.1       misho    6779:        of (?1) in the pattern above you can write (?-2) to refer to the second
1.1.1.4 ! misho    6780:        most  recently  opened  parentheses  preceding  the recursion. In other
        !          6781:        words, a negative number counts capturing  parentheses  leftwards  from
1.1       misho    6782:        the point at which it is encountered.
                   6783: 
1.1.1.4 ! misho    6784:        It  is  also  possible  to refer to subsequently opened parentheses, by
        !          6785:        writing references such as (?+2). However, these  cannot  be  recursive
        !          6786:        because  the  reference  is  not inside the parentheses that are refer-
        !          6787:        enced. They are always non-recursive subroutine calls, as described  in
1.1       misho    6788:        the next section.
                   6789: 
1.1.1.4 ! misho    6790:        An  alternative  approach is to use named parentheses instead. The Perl
        !          6791:        syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also
1.1       misho    6792:        supported. We could rewrite the above example as follows:
                   6793: 
                   6794:          (?<pn> \( ( [^()]++ | (?&pn) )* \) )
                   6795: 
1.1.1.4 ! misho    6796:        If  there  is more than one subpattern with the same name, the earliest
1.1       misho    6797:        one is used.
                   6798: 
1.1.1.4 ! misho    6799:        This particular example pattern that we have been looking  at  contains
1.1       misho    6800:        nested unlimited repeats, and so the use of a possessive quantifier for
                   6801:        matching strings of non-parentheses is important when applying the pat-
1.1.1.4 ! misho    6802:        tern  to  strings  that do not match. For example, when this pattern is
1.1       misho    6803:        applied to
                   6804: 
                   6805:          (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
                   6806: 
1.1.1.4 ! misho    6807:        it yields "no match" quickly. However, if a  possessive  quantifier  is
        !          6808:        not  used, the match runs for a very long time indeed because there are
        !          6809:        so many different ways the + and * repeats can carve  up  the  subject,
1.1       misho    6810:        and all have to be tested before failure can be reported.
                   6811: 
1.1.1.4 ! misho    6812:        At  the  end  of a match, the values of capturing parentheses are those
        !          6813:        from the outermost level. If you want to obtain intermediate values,  a
        !          6814:        callout  function can be used (see below and the pcrecallout documenta-
1.1       misho    6815:        tion). If the pattern above is matched against
                   6816: 
                   6817:          (ab(cd)ef)
                   6818: 
1.1.1.4 ! misho    6819:        the value for the inner capturing parentheses  (numbered  2)  is  "ef",
        !          6820:        which  is the last value taken on at the top level. If a capturing sub-
        !          6821:        pattern is not matched at the top level, its final  captured  value  is
        !          6822:        unset,  even  if  it was (temporarily) set at a deeper level during the
1.1       misho    6823:        matching process.
                   6824: 
1.1.1.4 ! misho    6825:        If there are more than 15 capturing parentheses in a pattern, PCRE  has
        !          6826:        to  obtain extra memory to store data during a recursion, which it does
1.1       misho    6827:        by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
                   6828:        can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
                   6829: 
1.1.1.4 ! misho    6830:        Do  not  confuse  the (?R) item with the condition (R), which tests for
        !          6831:        recursion.  Consider this pattern, which matches text in  angle  brack-
        !          6832:        ets,  allowing for arbitrary nesting. Only digits are allowed in nested
        !          6833:        brackets (that is, when recursing), whereas any characters are  permit-
1.1       misho    6834:        ted at the outer level.
                   6835: 
                   6836:          < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
                   6837: 
1.1.1.4 ! misho    6838:        In  this  pattern, (?(R) is the start of a conditional subpattern, with
        !          6839:        two different alternatives for the recursive and  non-recursive  cases.
1.1       misho    6840:        The (?R) item is the actual recursive call.
                   6841: 
                   6842:    Differences in recursion processing between PCRE and Perl
                   6843: 
1.1.1.4 ! misho    6844:        Recursion  processing  in PCRE differs from Perl in two important ways.
        !          6845:        In PCRE (like Python, but unlike Perl), a recursive subpattern call  is
1.1       misho    6846:        always treated as an atomic group. That is, once it has matched some of
                   6847:        the subject string, it is never re-entered, even if it contains untried
1.1.1.4 ! misho    6848:        alternatives  and  there  is a subsequent matching failure. This can be
        !          6849:        illustrated by the following pattern, which purports to match a  palin-
        !          6850:        dromic  string  that contains an odd number of characters (for example,
1.1       misho    6851:        "a", "aba", "abcba", "abcdcba"):
                   6852: 
                   6853:          ^(.|(.)(?1)\2)$
                   6854: 
                   6855:        The idea is that it either matches a single character, or two identical
1.1.1.4 ! misho    6856:        characters  surrounding  a sub-palindrome. In Perl, this pattern works;
        !          6857:        in PCRE it does not if the subject is  longer  than  three  characters.
1.1       misho    6858:        Consider the subject string "abcba":
                   6859: 
1.1.1.4 ! misho    6860:        At  the  top level, the first character is matched, but as it is not at
1.1       misho    6861:        the end of the string, the first alternative fails; the second alterna-
                   6862:        tive is taken and the recursion kicks in. The recursive call to subpat-
1.1.1.4 ! misho    6863:        tern 1 successfully matches the next character ("b").  (Note  that  the
1.1       misho    6864:        beginning and end of line tests are not part of the recursion).
                   6865: 
1.1.1.4 ! misho    6866:        Back  at  the top level, the next character ("c") is compared with what
        !          6867:        subpattern 2 matched, which was "a". This fails. Because the  recursion
        !          6868:        is  treated  as  an atomic group, there are now no backtracking points,
        !          6869:        and so the entire match fails. (Perl is able, at  this  point,  to  re-
        !          6870:        enter  the  recursion  and try the second alternative.) However, if the
1.1       misho    6871:        pattern is written with the alternatives in the other order, things are
                   6872:        different:
                   6873: 
                   6874:          ^((.)(?1)\2|.)$
                   6875: 
1.1.1.4 ! misho    6876:        This  time,  the recursing alternative is tried first, and continues to
        !          6877:        recurse until it runs out of characters, at which point  the  recursion
        !          6878:        fails.  But  this  time  we  do  have another alternative to try at the
        !          6879:        higher level. That is the big difference:  in  the  previous  case  the
1.1       misho    6880:        remaining alternative is at a deeper recursion level, which PCRE cannot
                   6881:        use.
                   6882: 
1.1.1.4 ! misho    6883:        To change the pattern so that it matches all palindromic  strings,  not
        !          6884:        just  those  with an odd number of characters, it is tempting to change
1.1       misho    6885:        the pattern to this:
                   6886: 
                   6887:          ^((.)(?1)\2|.?)$
                   6888: 
1.1.1.4 ! misho    6889:        Again, this works in Perl, but not in PCRE, and for  the  same  reason.
        !          6890:        When  a  deeper  recursion has matched a single character, it cannot be
        !          6891:        entered again in order to match an empty string.  The  solution  is  to
        !          6892:        separate  the two cases, and write out the odd and even cases as alter-
1.1       misho    6893:        natives at the higher level:
                   6894: 
                   6895:          ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
                   6896: 
1.1.1.4 ! misho    6897:        If you want to match typical palindromic phrases, the  pattern  has  to
1.1       misho    6898:        ignore all non-word characters, which can be done like this:
                   6899: 
                   6900:          ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
                   6901: 
                   6902:        If run with the PCRE_CASELESS option, this pattern matches phrases such
                   6903:        as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
1.1.1.4 ! misho    6904:        Perl.  Note the use of the possessive quantifier *+ to avoid backtrack-
        !          6905:        ing into sequences of non-word characters. Without this, PCRE  takes  a
        !          6906:        great  deal  longer  (ten  times or more) to match typical phrases, and
1.1       misho    6907:        Perl takes so long that you think it has gone into a loop.
                   6908: 
1.1.1.4 ! misho    6909:        WARNING: The palindrome-matching patterns above work only if  the  sub-
        !          6910:        ject  string  does not start with a palindrome that is shorter than the
        !          6911:        entire string.  For example, although "abcba" is correctly matched,  if
        !          6912:        the  subject  is "ababa", PCRE finds the palindrome "aba" at the start,
        !          6913:        then fails at top level because the end of the string does not  follow.
        !          6914:        Once  again, it cannot jump back into the recursion to try other alter-
1.1       misho    6915:        natives, so the entire match fails.
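The effect of atomic recursion can be illustrated with a small hand-written model of the simpler pattern ^((.)(?1)\2|.?)$ shown earlier. This is a sketch under stated assumptions, not PCRE's implementation: the function names are hypothetical, and the model encodes only the rule that a recursive call yields its first successful match and is never re-entered.

```python
def group1(s, i):
    """Offset at which group 1 = (.)(?1)\\2 | .?  first matches from i.

    Only the first success is returned, which models the atomic
    treatment of the recursive call (?1).
    """
    n = len(s)
    # Alternative 1: (.)(?1)\2
    if i < n:
        c = s[i]                      # text captured by group 2 at this level
        j = group1(s, i + 1)          # atomic: no second result is ever tried
        if j < n and s[j] == c:       # \2 must match the captured character
            return j + 1
    # Alternative 2: .?  (greedy, so one character is preferred to none)
    return i + 1 if i < n else i

def matches(s):
    """Model of ^((.)(?1)\\2|.?)$ with atomic recursion."""
    n = len(s)
    # Top level, alternative 1: (.)(?1)\2 followed by $
    if n >= 1:
        j = group1(s, 1)
        if j < n and s[j] == s[0] and j + 1 == n:
            return True
    # Top level, alternative 2: .? followed by $
    return n <= 1

for subject in ("abcba", "abba", "ababa"):
    print(subject, matches(subject))
```

In this model "abcba" is accepted, "abba" is rejected because the innermost call has already consumed one character and cannot be re-entered to match the empty string, and "ababa" is rejected because the palindrome "aba" is found first and the end of the string does not follow.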
                   6916: 
1.1.1.4 ! misho    6917:        The second way in which PCRE and Perl differ in  their  recursion  pro-
        !          6918:        cessing  is in the handling of captured values. In Perl, when a subpat-
        !          6919:        tern is called recursively or as a subroutine (see the  next  section),
        !          6920:        it  has  no  access to any values that were captured outside the recur-
        !          6921:        sion, whereas in PCRE these values can  be  referenced.  Consider  this
1.1       misho    6922:        pattern:
                   6923: 
                   6924:          ^(.)(\1|a(?2))
                   6925: 
1.1.1.4 ! misho    6926:        In  PCRE,  this  pattern matches "bab". The first capturing parentheses
        !          6927:        match "b", then in the second group, when the back reference  \1  fails
        !          6928:        to  match "b", the second alternative matches "a" and then recurses. In
        !          6929:        the recursion, \1 does now match "b" and so the whole  match  succeeds.
        !          6930:        In  Perl,  the pattern fails to match because inside the recursive call
1.1       misho    6931:        \1 cannot access the externally set value.
                   6932: 
                   6933: 
                   6934: SUBPATTERNS AS SUBROUTINES
                   6935: 
1.1.1.4 ! misho    6936:        If the syntax for a recursive subpattern call (either by number  or  by
        !          6937:        name)  is  used outside the parentheses to which it refers, it operates
        !          6938:        like a subroutine in a programming language. The called subpattern  may
        !          6939:        be  defined  before or after the reference. A numbered reference can be
1.1       misho    6940:        absolute or relative, as in these examples:
                   6941: 
                   6942:          (...(absolute)...)...(?2)...
                   6943:          (...(relative)...)...(?-1)...
                   6944:          (...(?+1)...(relative)...
                   6945: 
                   6946:        An earlier example pointed out that the pattern
                   6947: 
                   6948:          (sens|respons)e and \1ibility
                   6949: 
1.1.1.4 ! misho    6950:        matches "sense and sensibility" and "response and responsibility",  but
1.1       misho    6951:        not "sense and responsibility". If instead the pattern
                   6952: 
                   6953:          (sens|respons)e and (?1)ibility
                   6954: 
1.1.1.4 ! misho    6955:        is  used, it does match "sense and responsibility" as well as the other
        !          6956:        two strings. Another example is  given  in  the  discussion  of  DEFINE
1.1       misho    6957:        above.
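The difference between a back reference and a subroutine call can be seen even in an engine without (?1). The sketch below uses Python's `re` module, which supports \1 but not subroutine calls, so the call is emulated by writing the subpattern out a second time. That emulation is an assumption made for illustration; it is not how PCRE implements the call.

```python
import re

# Back reference: the second part must repeat the text matched by group 1.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Emulated subroutine call: the subpattern itself is applied again,
# so either alternative may match the second time.
subroutine_like = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    print(subject,
          bool(backref.fullmatch(subject)),
          bool(subroutine_like.fullmatch(subject)))
```

Only the third subject distinguishes the two forms: the back reference rejects it, while the repeated subpattern accepts it.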
                   6958: 
1.1.1.4 ! misho    6959:        All  subroutine  calls, whether recursive or not, are always treated as
        !          6960:        atomic groups. That is, once a subroutine has matched some of the  sub-
1.1       misho    6961:        ject string, it is never re-entered, even if it contains untried alter-
1.1.1.4 ! misho    6962:        natives and there is  a  subsequent  matching  failure.  Any  capturing
        !          6963:        parentheses  that  are  set  during the subroutine call revert to their
1.1       misho    6964:        previous values afterwards.
                   6965: 
1.1.1.4 ! misho    6966:        Processing options such as case-independence are fixed when  a  subpat-
        !          6967:        tern  is defined, so if it is used as a subroutine, such options cannot
1.1       misho    6968:        be changed for different calls. For example, consider this pattern:
                   6969: 
                   6970:          (abc)(?i:(?-1))
                   6971: 
1.1.1.4 ! misho    6972:        It matches "abcabc". It does not match "abcABC" because the  change  of
1.1       misho    6973:        processing option does not affect the called subpattern.
                   6974: 
                   6975: 
                   6976: ONIGURUMA SUBROUTINE SYNTAX
                   6977: 
1.1.1.4 ! misho    6978:        For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
1.1       misho    6979:        name or a number enclosed either in angle brackets or single quotes, is
1.1.1.4 ! misho    6980:        an  alternative  syntax  for  referencing a subpattern as a subroutine,
        !          6981:        possibly recursively. Here are two of the examples used above,  rewrit-
1.1       misho    6982:        ten using this syntax:
                   6983: 
                   6984:          (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
                   6985:          (sens|respons)e and \g'1'ibility
                   6986: 
1.1.1.4 ! misho    6987:        PCRE  supports  an extension to Oniguruma: if a number is preceded by a
1.1       misho    6988:        plus or a minus sign it is taken as a relative reference. For example:
                   6989: 
                   6990:          (abc)(?i:\g<-1>)
                   6991: 
1.1.1.4 ! misho    6992:        Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
        !          6993:        synonymous.  The former is a back reference; the latter is a subroutine
1.1       misho    6994:        call.
                   6995: 
                   6996: 
                   6997: CALLOUTS
                   6998: 
                   6999:        Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1.1.1.4 ! misho    7000:        Perl  code to be obeyed in the middle of matching a regular expression.
1.1       misho    7001:        This makes it possible, amongst other things, to extract different sub-
                   7002:        strings that match the same pair of parentheses when there is a repeti-
                   7003:        tion.
                   7004: 
                   7005:        PCRE provides a similar feature, but of course it cannot obey arbitrary
                   7006:        Perl code. The feature is called "callout". The caller of PCRE provides
1.1.1.4 ! misho    7007:        an external function by putting its entry point in the global  variable
        !          7008:        pcre_callout  (8-bit  library) or pcre[16|32]_callout (16-bit or 32-bit
        !          7009:        library).  By default, this variable contains NULL, which disables  all
        !          7010:        calling out.
1.1       misho    7011: 
1.1.1.3   misho    7012:        Within  a  regular  expression,  (?C) indicates the points at which the
                   7013:        external function is to be called. If you want  to  identify  different
                   7014:        callout  points, you can put a number less than 256 after the letter C.
                   7015:        The default value is zero.  For example, this pattern has  two  callout
1.1       misho    7016:        points:
                   7017: 
                   7018:          (?C1)abc(?C2)def
                   7019: 
1.1.1.3   misho    7020:        If  the PCRE_AUTO_CALLOUT flag is passed to a compiling function, call-
                   7021:        outs are automatically installed before each item in the pattern.  They
1.1.1.4 ! misho    7022:        are  all  numbered  255. If there is a conditional group in the pattern
        !          7023:        whose condition is an assertion, an additional callout is inserted just
        !          7024:        before the condition. An explicit callout may also be set at this posi-
        !          7025:        tion, as in this example:
        !          7026: 
        !          7027:          (?(?C9)(?=a)abc|def)
        !          7028: 
        !          7029:        Note that this applies only to assertion conditions, not to other types
        !          7030:        of condition.
1.1.1.2   misho    7031: 
1.1.1.3   misho    7032:        During  matching, when PCRE reaches a callout point, the external func-
                   7033:        tion is called. It is provided with the  number  of  the  callout,  the
                   7034:        position  in  the pattern, and, optionally, one item of data originally
                   7035:        supplied by the caller of the matching function. The  callout  function
                   7036:        may  cause  matching to proceed, to backtrack, or to fail altogether. A
                   7037:        complete description of the interface to the callout function is  given
1.1.1.2   misho    7038:        in the pcrecallout documentation.
1.1       misho    7039: 
                   7040: 
                   7041: BACKTRACKING CONTROL
                   7042: 
1.1.1.3   misho    7043:        Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
1.1.1.4 ! misho    7044:        which are still described in the Perl  documentation  as  "experimental
        !          7045:        and  subject to change or removal in a future version of Perl". It goes
        !          7046:        on to say: "Their usage in production code should  be  noted  to  avoid
        !          7047:        problems  during upgrades." The same remarks apply to the PCRE features
        !          7048:        described in this section.
1.1       misho    7049: 
1.1.1.4 ! misho    7050:        The new verbs make use of what was previously invalid syntax: an  open-
1.1       misho    7051:        ing parenthesis followed by an asterisk. They are generally of the form
1.1.1.4 ! misho    7052:        (*VERB) or (*VERB:NAME). Some may take either form,  possibly  behaving
        !          7053:        differently  depending  on  whether or not a name is present. A name is
1.1       misho    7054:        any sequence of characters that does not include a closing parenthesis.
1.1.1.3   misho    7055:        The maximum length of a name is 255 in the 8-bit library and 65535 in the
1.1.1.4 ! misho    7056:        16-bit and 32-bit libraries. If the name is  empty,  that  is,  if  the
        !          7057:        closing  parenthesis immediately follows the colon, the effect is as if
        !          7058:        the colon were not there.  Any number of these verbs  may  occur  in  a
        !          7059:        pattern.
        !          7060: 
        !          7061:        Since  these  verbs  are  specifically related to backtracking, most of
        !          7062:        them can be used only when the pattern is to be matched  using  one  of
        !          7063:        the  traditional  matching  functions, because these use a backtracking
        !          7064:        algorithm. With the exception of (*FAIL), which behaves like a  failing
        !          7065:        negative  assertion,  the  backtracking control verbs cause an error if
        !          7066:        encountered by a DFA matching function.
        !          7067: 
        !          7068:        The behaviour of these verbs in repeated  groups,  assertions,  and  in
        !          7069:        subpatterns called as subroutines (whether or not recursively) is docu-
        !          7070:        mented below.
1.1.1.3   misho    7071: 
                   7072:    Optimizations that affect backtracking verbs
1.1       misho    7073: 
1.1.1.4 ! misho    7074:        PCRE contains some optimizations that are used to speed up matching  by
1.1       misho    7075:        running some checks at the start of each match attempt. For example, it
1.1.1.4 ! misho    7076:        may know the minimum length of matching subject, or that  a  particular
        !          7077:        character must be present. When one of these optimizations bypasses the
        !          7078:        running of a match,  any  included  backtracking  verbs  will  not,  of
1.1       misho    7079:        course, be processed. You can suppress the start-of-match optimizations
1.1.1.4 ! misho    7080:        by setting the PCRE_NO_START_OPTIMIZE  option  when  calling  pcre_com-
1.1       misho    7081:        pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
1.1.1.3   misho    7082:        There is more discussion of this option in the section entitled "Option
                   7083:        bits for pcre_exec()" in the pcreapi documentation.
1.1       misho    7084: 
1.1.1.4 ! misho    7085:        Experiments  with  Perl  suggest that it too has similar optimizations,
1.1       misho    7086:        sometimes leading to anomalous results.
                   7087: 
                   7088:    Verbs that act immediately
                   7089: 
1.1.1.4 ! misho    7090:        The following verbs act as soon as they are encountered. They  may  not
1.1       misho    7091:        be followed by a name.
                   7092: 
                   7093:           (*ACCEPT)
                   7094: 
1.1.1.4 ! misho    7095:        This  verb causes the match to end successfully, skipping the remainder
        !          7096:        of the pattern. However, when it is inside a subpattern that is  called
        !          7097:        as  a  subroutine, only that subpattern is ended successfully. Matching
        !          7098:        then continues at the outer level. If (*ACCEPT) is triggered in a posi-
        !          7099:        tive  assertion,  the  assertion succeeds; in a negative assertion, the
        !          7100:        assertion fails.
        !          7101: 
        !          7102:        If (*ACCEPT) is inside capturing parentheses, the data so far  is  cap-
        !          7103:        tured. For example:
1.1       misho    7104: 
                   7105:          A((?:A|B(*ACCEPT)|C)D)
                   7106: 
1.1.1.4 ! misho    7107:        This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
1.1       misho    7108:        tured by the outer parentheses.
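For this particular pattern, the result of (*ACCEPT) can be reproduced in an engine that lacks the verb by moving the accepting branch out so that it does not require the trailing D. The sketch below uses Python's `re` module, which has no (*ACCEPT); the rewritten pattern is an assumption for illustration and does not generalise to every use of the verb.

```python
import re

# Equivalent of A((?:A|B(*ACCEPT)|C)D) for these subjects:
# "B" ends the match at once, while "A" and "C" must be followed by "D".
emulated = re.compile(r"A((?:A|C)D|B)")

for subject in ("AB", "AAD", "ACD"):
    m = emulated.match(subject)
    print(subject, m.group(0), m.group(1))
```

The captured text for "AB" is "B", which agrees with the description above.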
                   7109: 
                   7110:          (*FAIL) or (*F)
                   7111: 
1.1.1.4 ! misho    7112:        This verb causes a matching failure, forcing backtracking to occur.  It
        !          7113:        is  equivalent to (?!) but easier to read. The Perl documentation notes
        !          7114:        that it is probably useful only when combined  with  (?{})  or  (??{}).
        !          7115:        Those  are,  of course, Perl features that are not present in PCRE. The
        !          7116:        nearest equivalent is the callout feature, as for example in this  pat-
1.1       misho    7117:        tern:
                   7118: 
                   7119:          a+(?C)(*FAIL)
                   7120: 
1.1.1.4 ! misho    7121:        A  match  with the string "aaaa" always fails, but the callout is taken
1.1       misho    7122:        before each backtrack happens (in this example, 10 times).
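The figure of 10 follows from counting every way a+ can match at every starting position. The sketch below is a hand-written model of that count, not a call into PCRE; the function name is hypothetical, and it assumes that no start-of-match optimization skips any attempt.

```python
def count_callouts(subject):
    """Count how often the callout in a+(?C)(*FAIL) would be reached."""
    count = 0
    for start in range(len(subject)):
        # Length of the run of "a" characters beginning at this start.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then gives back one character
        # at a time; each of the run lengths reaches the callout once.
        count += run
    return count

print(count_callouts("aaaa"))   # 4 + 3 + 2 + 1
```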
                   7123: 
                   7124:    Recording which path was taken
                   7125: 
1.1.1.4 ! misho    7126:        There is one verb whose main purpose  is  to  track  how  a  match  was
        !          7127:        arrived  at,  though  it  also  has a secondary use in conjunction with
1.1       misho    7128:        advancing the match starting point (see (*SKIP) below).
                   7129: 
                   7130:          (*MARK:NAME) or (*:NAME)
                   7131: 
1.1.1.4 ! misho    7132:        A name is always  required  with  this  verb.  There  may  be  as  many
        !          7133:        instances  of  (*MARK) as you like in a pattern, and their names do not
1.1       misho    7134:        have to be unique.
                   7135: 
1.1.1.4 ! misho    7136:        When a match succeeds, the name of the  last-encountered  (*MARK:NAME),
        !          7137:        (*PRUNE:NAME),  or  (*THEN:NAME) on the matching path is passed back to
        !          7138:        the caller as  described  in  the  section  entitled  "Extra  data  for
        !          7139:        pcre_exec()"  in  the  pcreapi  documentation.  Here  is  an example of
        !          7140:        pcretest output, where the /K modifier requests the retrieval and  out-
        !          7141:        putting of (*MARK) data:
1.1       misho    7142: 
                   7143:            re> /X(*MARK:A)Y|X(*MARK:B)Z/K
                   7144:          data> XY
                   7145:           0: XY
                   7146:          MK: A
                   7147:          XZ
                   7148:           0: XZ
                   7149:          MK: B
                   7150: 
                   7151:        The (*MARK) name is tagged with "MK:" in this output, and in this exam-
1.1.1.2   misho    7152:        ple it indicates which of the two alternatives matched. This is a  more
                   7153:        efficient  way of obtaining this information than putting each alterna-
1.1       misho    7154:        tive in its own capturing parentheses.
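For comparison, the capturing-parentheses approach mentioned above looks like this in an engine without (*MARK). The sketch uses Python's `re` module, whose match objects expose the name of the last matched group as `lastgroup`; the group names A and B are chosen here to mirror the mark names.

```python
import re

# Each alternative gets its own named capturing group.
pattern = re.compile(r"(?P<A>XY)|(?P<B>XZ)")

for subject in ("XY", "XZ"):
    m = pattern.match(subject)
    print(subject, m.group(0), m.lastgroup)
```

This yields the same information as the MK: lines in the pcretest output, at the cost of one capture group per alternative.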
                   7155: 
1.1.1.4 ! misho    7156:        If a verb with a name is encountered in a positive  assertion  that  is
        !          7157:        true,  the  name  is recorded and passed back if it is the last-encoun-
        !          7158:        tered. This does not happen for negative assertions or failing positive
        !          7159:        assertions.
1.1       misho    7160: 
1.1.1.4 ! misho    7161:        After  a  partial match or a failed match, the last encountered name in
        !          7162:        the entire match process is returned. For example:
1.1       misho    7163: 
                   7164:            re> /X(*MARK:A)Y|X(*MARK:B)Z/K
                   7165:          data> XP
                   7166:          No match, mark = B
                   7167: 
1.1.1.4 ! misho    7168:        Note that in this unanchored example the  mark  is  retained  from  the
1.1.1.3   misho    7169:        match attempt that started at the letter "X" in the subject. Subsequent
                   7170:        match attempts starting at "P" and then with an empty string do not get
                   7171:        as far as the (*MARK) item, but nevertheless do not reset it.
                   7172: 
1.1.1.4 ! misho    7173:        If  you  are  interested  in  (*MARK)  values after failed matches, you
        !          7174:        should probably set the PCRE_NO_START_OPTIMIZE option  (see  above)  to
1.1.1.3   misho    7175:        ensure that the match is always attempted.
1.1       misho    7176: 
                   7177:    Verbs that act after backtracking
                   7178: 
                   7179:        The following verbs do nothing when they are encountered. Matching con-
1.1.1.4 ! misho    7180:        tinues with what follows, but if there is no subsequent match,  causing
        !          7181:        a  backtrack  to  the  verb, a failure is forced. That is, backtracking
        !          7182:        cannot pass to the left of the verb. However, when one of  these  verbs
        !          7183:        appears inside an atomic group or an assertion that is true, its effect
        !          7184:        is confined to that group, because once the  group  has  been  matched,
        !          7185:        there  is never any backtracking into it. In this situation, backtrack-
        !          7186:        ing can "jump back" to the left of the entire atomic  group  or  asser-
        !          7187:        tion.  (Remember  also,  as  stated  above, that this localization also
        !          7188:        applies in subroutine calls.)
1.1       misho    7189: 
1.1.1.2   misho    7190:        These verbs differ in exactly what kind of failure  occurs  when  back-
1.1.1.4 ! misho    7191:        tracking  reaches  them.  The behaviour described below is what happens
        !          7192:        when the verb is not in a subroutine or an assertion.  Subsequent  sec-
        !          7193:        tions cover these special cases.
1.1       misho    7194: 
                   7195:          (*COMMIT)
                   7196: 
1.1.1.2   misho    7197:        This  verb, which may not be followed by a name, causes the whole match
1.1.1.4 ! misho    7198:        to fail outright if there is a later matching failure that causes back-
        !          7199:        tracking  to  reach  it.  Even if the pattern is unanchored, no further
        !          7200:        attempts to find a match by advancing the starting point take place. If
        !          7201:        (*COMMIT)  is  the  only backtracking verb that is encountered, once it
        !          7202:        has been passed, pcre_exec() is committed to finding a match at the cur-
        !          7203:        rent starting point, or not at all. For example:
1.1       misho    7204: 
                   7205:          a+(*COMMIT)b
                   7206: 
1.1.1.4 ! misho    7207:        This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
1.1       misho    7208:        of dynamic anchor, or "I've started, so I must finish." The name of the
1.1.1.4 ! misho    7209:        most  recently passed (*MARK) in the path is passed back when (*COMMIT)
1.1       misho    7210:        forces a match failure.
                   7211: 
1.1.1.4 ! misho    7212:        If there is more than one backtracking verb in a pattern,  a  different
        !          7213:        one  that  follows  (*COMMIT) may be triggered first, so merely passing
        !          7214:        (*COMMIT) during a match does not always guarantee that a match must be
        !          7215:        at this starting point.
        !          7216: 
1.1.1.2   misho    7217:        Note  that  (*COMMIT)  at  the start of a pattern is not the same as an
                   7218:        anchor, unless PCRE's start-of-match optimizations are turned  off,  as
1.1       misho    7219:        shown in this pcretest example:
                   7220: 
                   7221:            re> /(*COMMIT)abc/
                   7222:          data> xyzabc
                   7223:           0: abc
                   7224:          xyzabc\Y
                   7225:          No match
                   7226: 
1.1.1.2   misho    7227:        PCRE  knows  that  any  match  must start with "a", so the optimization
                   7228:        skips along the subject to "a" before running the first match  attempt,
                   7229:        which  succeeds.  When the optimization is disabled by the \Y escape in
1.1       misho    7230:        the second subject, the match starts at "x" and so the (*COMMIT) causes
                   7231:        it to fail without trying any other starting points.
                   7232: 
                   7233:          (*PRUNE) or (*PRUNE:NAME)
                   7234: 
1.1.1.2   misho    7235:        This  verb causes the match to fail at the current starting position in
1.1.1.4 ! misho    7236:        the subject if there is a later matching failure that causes backtrack-
        !          7237:        ing  to  reach it. If the pattern is unanchored, the normal "bumpalong"
        !          7238:        advance to the next starting character then happens.  Backtracking  can
        !          7239:        occur  as  usual to the left of (*PRUNE), before it is reached, or when
        !          7240:        matching to the right of (*PRUNE), but if there  is  no  match  to  the
        !          7241:        right,  backtracking cannot cross (*PRUNE). In simple cases, the use of
        !          7242:        (*PRUNE) is just an alternative to an atomic group or possessive  quan-
        !          7243:        tifier, but there are some uses of (*PRUNE) that cannot be expressed in
        !          7244:        any other way. In an anchored pattern (*PRUNE) has the same  effect  as
        !          7245:        (*COMMIT).
        !          7246: 
        !          7247:        The  behaviour  of  (*PRUNE:NAME)  is   not   the   same   as
        !          7248:        (*MARK:NAME)(*PRUNE).  It is like (*MARK:NAME)  in  that  the  name  is
        !          7249:        remembered  for  passing  back  to  the  caller.  However, (*SKIP:NAME)
        !          7250:        searches only for names set with (*MARK).
1.1       misho    7251: 
                   7252:          (*SKIP)
                   7253: 
1.1.1.4 ! misho    7254:        This verb, when given without a name, is like (*PRUNE), except that  if
        !          7255:        the  pattern  is unanchored, the "bumpalong" advance is not to the next
1.1       misho    7256:        character, but to the position in the subject where (*SKIP) was encoun-
1.1.1.4 ! misho    7257:        tered.  (*SKIP)  signifies that whatever text was matched leading up to
1.1       misho    7258:        it cannot be part of a successful match. Consider:
                   7259: 
                   7260:          a+(*SKIP)b
                   7261: 
1.1.1.4 ! misho    7262:        If the subject is "aaaac...",  after  the  first  match  attempt  fails
        !          7263:        (starting  at  the  first  character in the string), the starting point
1.1       misho    7264:        skips on to start the next attempt at "c". Note that a possessive quan-
1.1.1.4 ! misho    7265:        tifier does not have the same effect as this example; although it would
        !          7266:        suppress backtracking  during  the  first  match  attempt,  the  second
        !          7267:        attempt  would  start at the second character instead of skipping on to
1.1       misho    7268:        "c".
                   7269: 
                   7270:          (*SKIP:NAME)
                   7271: 
1.1.1.4 ! misho    7272:        When (*SKIP) has an associated name, its behaviour is modified. When it
        !          7273:        is triggered, the previous path through the pattern is searched for the
        !          7274:        most recent (*MARK) that has the  same  name.  If  one  is  found,  the
        !          7275:        "bumpalong" advance is to the subject position that corresponds to that
        !          7276:        (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with
        !          7277:        a matching name is found, the (*SKIP) is ignored.
        !          7278: 
        !          7279:        Note  that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
        !          7280:        ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
1.1       misho    7281: 
                   7282:          (*THEN) or (*THEN:NAME)
                   7283: 
1.1.1.4 ! misho    7284:        This verb causes a skip to the next innermost  alternative  when  back-
        !          7285:        tracking  reaches  it.  That  is,  it  cancels any further backtracking
        !          7286:        within the current alternative. Its name  comes  from  the  observation
        !          7287:        that it can be used for a pattern-based if-then-else block:
1.1       misho    7288: 
                   7289:          ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
                   7290: 
1.1.1.2   misho    7291:        If  the COND1 pattern matches, FOO is tried (and possibly further items
                   7292:        after the end of the group if FOO succeeds); on  failure,  the  matcher
                   7293:        skips  to  the second alternative and tries COND2, without backtracking
1.1.1.4 ! misho    7294:        into COND1. If that succeeds and BAR fails, COND3 is tried.  If  subse-
        !          7295:        quently  BAZ fails, there are no more alternatives, so there is a back-
        !          7296:        track to whatever came before the  entire  group.  If  (*THEN)  is  not
        !          7297:        inside an alternation, it acts like (*PRUNE).
        !          7298: 
        !          7299:        The  behaviour  of  (*THEN:NAME)  is   not   the   same   as
        !          7300:        (*MARK:NAME)(*THEN).  It is like  (*MARK:NAME)  in  that  the  name  is
        !          7301:        remembered  for  passing  back  to  the  caller.  However, (*SKIP:NAME)
        !          7302:        searches only for names set with (*MARK).
        !          7303: 
        !          7304:        A subpattern that does not contain a | character is just a part of  the
        !          7305:        enclosing  alternative;  it  is  not a nested alternation with only one
        !          7306:        alternative. The effect of (*THEN) extends beyond such a subpattern  to
        !          7307:        the  enclosing alternative. Consider this pattern, where A, B, etc. are
        !          7308:        complex pattern fragments that do not contain any | characters at  this
        !          7309:        level:
1.1       misho    7310: 
                   7311:          A (B(*THEN)C) | D
                   7312: 
1.1.1.2   misho    7313:        If  A and B are matched, but there is a failure in C, matching does not
1.1       misho    7314:        backtrack into A; instead it moves to the next alternative, that is, D.
1.1.1.2   misho    7315:        However,  if the subpattern containing (*THEN) is given an alternative,
1.1       misho    7316:        it behaves differently:
                   7317: 
                   7318:          A (B(*THEN)C | (*FAIL)) | D
                   7319: 
1.1.1.2   misho    7320:        The effect of (*THEN) is now confined to the inner subpattern. After  a
1.1       misho    7321:        failure in C, matching moves to (*FAIL), which causes the whole subpat-
1.1.1.2   misho    7322:        tern to fail because there are no more alternatives  to  try.  In  this
1.1       misho    7323:        case, matching does now backtrack into A.
                   7324: 
1.1.1.4 ! misho    7325:        Note  that  a  conditional  subpattern  is not considered as having two
1.1.1.2   misho    7326:        alternatives, because only one is ever used.  In  other  words,  the  |
1.1       misho    7327:        character in a conditional subpattern has a different meaning. Ignoring
                   7328:        white space, consider:
                   7329: 
                   7330:          ^.*? (?(?=a) a | b(*THEN)c )
                   7331: 
1.1.1.2   misho    7332:        If the subject is "ba", this pattern does not  match.  Because  .*?  is
                   7333:        ungreedy,  it  initially  matches  zero characters. The condition (?=a)
                   7334:        then fails, the character "b" is matched,  but  "c"  is  not.  At  this
                   7335:        point,  matching does not backtrack to .*? as might perhaps be expected
                   7336:        from the presence of the | character.  The  conditional  subpattern  is
1.1       misho    7337:        part of the single alternative that comprises the whole pattern, and so
1.1.1.2   misho    7338:        the match fails. (If there was a backtrack into  .*?,  allowing  it  to
1.1       misho    7339:        match "b", the match would succeed.)
                   7340: 
1.1.1.2   misho    7341:        The  verbs just described provide four different "strengths" of control
1.1       misho    7342:        when subsequent matching fails. (*THEN) is the weakest, carrying on the
1.1.1.2   misho    7343:        match  at  the next alternative. (*PRUNE) comes next, failing the match
                   7344:        at the current starting position, but allowing an advance to  the  next
                   7345:        character  (for an unanchored pattern). (*SKIP) is similar, except that
1.1       misho    7346:        the advance may be more than one character. (*COMMIT) is the strongest,
                   7347:        causing the entire match to fail.
                   7348: 
1.1.1.4 ! misho    7349:    More than one backtracking verb
        !          7350: 
        !          7351:        If  more  than  one  backtracking verb is present in a pattern, the one
        !          7352:        that is backtracked onto first acts. For example,  consider  this  pat-
        !          7353:        tern, where A, B, etc. are complex pattern fragments:
        !          7354: 
        !          7355:          (A(*COMMIT)B(*THEN)C|ABD)
        !          7356: 
        !          7357:        If  A matches but B fails, the backtrack to (*COMMIT) causes the entire
        !          7358:        match to fail. However, if A and B match, but C fails, the backtrack to
        !          7359:        (*THEN)  causes  the next alternative (ABD) to be tried. This behaviour
        !          7360:        is consistent, but is not always the same as Perl's. It means  that  if
        !          7361:        two  or  more backtracking verbs appear in succession, all but the last
        !          7362:        of them have no effect. Consider this example:
        !          7363: 
        !          7364:          ...(*COMMIT)(*PRUNE)...
        !          7365: 
        !          7366:        If there is a matching failure to the right, backtracking onto (*PRUNE)
        !          7367:        causes it to be triggered, and its action is taken. There can never  be
        !          7368:        a backtrack onto (*COMMIT).
        !          7369: 
        !          7370:    Backtracking verbs in repeated groups
        !          7371: 
        !          7372:        PCRE differs from  Perl  in  its  handling  of  backtracking  verbs  in
        !          7373:        repeated groups. For example, consider:
        !          7374: 
        !          7375:          /(a(*COMMIT)b)+ac/
        !          7376: 
        !          7377:        If  the  subject  is  "abac",  Perl matches, but PCRE fails because the
        !          7378:        (*COMMIT) in the second repeat of the group acts.
        !          7379: 
        !          7380:    Backtracking verbs in assertions
        !          7381: 
        !          7382:        (*FAIL) in an assertion has its normal effect: it forces  an  immediate
        !          7383:        backtrack.
        !          7384: 
        !          7385:        (*ACCEPT) in a positive assertion causes the assertion to succeed with-
        !          7386:        out any further processing. In a negative assertion,  (*ACCEPT)  causes
        !          7387:        the assertion to fail without any further processing.
        !          7388: 
        !          7389:        The  other  backtracking verbs are not treated specially if they appear
        !          7390:        in a positive assertion. In  particular,  (*THEN)  skips  to  the  next
        !          7391:        alternative  in  the  innermost  enclosing group that has alternations,
        !          7392:        whether or not this is within the assertion.
        !          7393: 
        !          7394:        Negative assertions are, however, different, in order  to  ensure  that
        !          7395:        changing  a  positive  assertion  into a negative assertion changes its
        !          7396:        result. Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a neg-
        !          7397:        ative assertion to be true, without considering any further alternative
        !          7398:        branches in the assertion.  Backtracking into (*THEN) causes it to skip
        !          7399:        to  the next enclosing alternative within the assertion (the normal be-
        !          7400:        haviour), but if the assertion  does  not  have  such  an  alternative,
        !          7401:        (*THEN) behaves like (*PRUNE).
        !          7402: 
        !          7403:    Backtracking verbs in subroutines
        !          7404: 
        !          7405:        These  behaviours  occur whether or not the subpattern is called recur-
        !          7406:        sively.  Perl's treatment of subroutines is different in some cases.
        !          7407: 
        !          7408:        (*FAIL) in a subpattern called as a subroutine has its  normal  effect:
        !          7409:        it forces an immediate backtrack.
        !          7410: 
        !          7411:        (*ACCEPT)  in a subpattern called as a subroutine causes the subroutine
        !          7412:        match to succeed without any further processing. Matching then  contin-
        !          7413:        ues after the subroutine call.
        !          7414: 
        !          7415:        (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine
        !          7416:        cause the subroutine match to fail.
        !          7417: 
        !          7418:        (*THEN) skips to the next alternative in the innermost enclosing  group
        !          7419:        within  the subpattern that has alternatives. If there is no such group
        !          7420:        within the subpattern, (*THEN) causes the subroutine match to fail.
1.1       misho    7421: 
                   7422: 
                   7423: SEE ALSO
                   7424: 
1.1.1.2   misho    7425:        pcreapi(3), pcrecallout(3),  pcrematching(3),  pcresyntax(3),  pcre(3),
1.1.1.4 ! misho    7426:        pcre16(3), pcre32(3).
1.1       misho    7427: 
                   7428: 
                   7429: AUTHOR
                   7430: 
                   7431:        Philip Hazel
                   7432:        University Computing Service
                   7433:        Cambridge CB2 3QH, England.
                   7434: 
                   7435: 
                   7436: REVISION
                   7437: 
1.1.1.4 ! misho    7438:        Last updated: 26 April 2013
        !          7439:        Copyright (c) 1997-2013 University of Cambridge.
1.1       misho    7440: ------------------------------------------------------------------------------
                   7441: 
                   7442: 
1.1.1.4 ! misho    7443: PCRESYNTAX(3)              Library Functions Manual              PCRESYNTAX(3)
        !          7444: 
1.1       misho    7445: 
                   7446: 
                   7447: NAME
                   7448:        PCRE - Perl-compatible regular expressions
                   7449: 
                   7450: PCRE REGULAR EXPRESSION SYNTAX SUMMARY
                   7451: 
                   7452:        The  full syntax and semantics of the regular expressions that are sup-
                   7453:        ported by PCRE are described in  the  pcrepattern  documentation.  This
1.1.1.2   misho    7454:        document contains a quick-reference summary of the syntax.
1.1       misho    7455: 
                   7456: 
                   7457: QUOTING
                   7458: 
                   7459:          \x         where x is non-alphanumeric is a literal x
                   7460:          \Q...\E    treat enclosed characters as literal
                   7461: 
                   7462: 
                   7463: CHARACTERS
                   7464: 
                   7465:          \a         alarm, that is, the BEL character (hex 07)
                   7466:          \cx        "control-x", where x is any ASCII character
                   7467:          \e         escape (hex 1B)
1.1.1.3   misho    7468:          \f         form feed (hex 0C)
1.1       misho    7469:          \n         newline (hex 0A)
                   7470:          \r         carriage return (hex 0D)
                   7471:          \t         tab (hex 09)
                   7472:          \ddd       character with octal code ddd, or backreference
                   7473:          \xhh       character with hex code hh
                   7474:          \x{hhh..}  character with hex code hhh..
                   7475: 
                   7476: 
                   7477: CHARACTER TYPES
                   7478: 
                   7479:          .          any character except newline;
                   7480:                       in dotall mode, any character whatsoever
1.1.1.2   misho    7481:          \C         one data unit, even in UTF mode (best avoided)
1.1       misho    7482:          \d         a decimal digit
                   7483:          \D         a character that is not a decimal digit
1.1.1.3   misho    7484:          \h         a horizontal white space character
                   7485:          \H         a character that is not a horizontal white space character
1.1       misho    7486:          \N         a character that is not a newline
                   7487:          \p{xx}     a character with the xx property
                   7488:          \P{xx}     a character without the xx property
                   7489:          \R         a newline sequence
1.1.1.3   misho    7490:          \s         a white space character
                   7491:          \S         a character that is not a white space character
                   7492:          \v         a vertical white space character
                   7493:          \V         a character that is not a vertical white space character
1.1       misho    7494:          \w         a "word" character
                   7495:          \W         a "non-word" character
1.1.1.4 ! misho    7496:          \X         a Unicode extended grapheme cluster
1.1       misho    7497: 
                   7498:        In  PCRE,  by  default, \d, \D, \s, \S, \w, and \W recognize only ASCII
1.1.1.2   misho    7499:        characters, even in a UTF mode. However, this can be changed by setting
1.1       misho    7500:        the PCRE_UCP option.
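
The ASCII-only default and the effect of PCRE_UCP can be illustrated by
analogy. The sketch below uses Python's standard re module, not PCRE: it is
assumed here only because its re.ASCII flag restricts \d, \s, and \w to ASCII
(roughly PCRE's default), while omitting the flag makes them use Unicode
properties (roughly what PCRE_UCP selects). The variable names are
illustrative.

```python
import re

# Analogy only: this is Python's re module, not the PCRE library.
# re.ASCII ~ PCRE's default; no flag ~ PCRE with PCRE_UCP set.
arabic_digits = "\u0661\u0662\u0663"  # ARABIC-INDIC DIGITS ONE, TWO, THREE

ascii_only = re.fullmatch(r"\d+", arabic_digits, re.ASCII)
unicode_aware = re.fullmatch(r"\d+", arabic_digits)

print(ascii_only is None)         # ASCII-only \d rejects these digits
print(unicode_aware is not None)  # property-based \d accepts them
```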
                   7501: 
                   7502: 
                   7503: GENERAL CATEGORY PROPERTIES FOR \p and \P
                   7504: 
                   7505:          C          Other
                   7506:          Cc         Control
                   7507:          Cf         Format
                   7508:          Cn         Unassigned
                   7509:          Co         Private use
                   7510:          Cs         Surrogate
                   7511: 
                   7512:          L          Letter
                   7513:          Ll         Lower case letter
                   7514:          Lm         Modifier letter
                   7515:          Lo         Other letter
                   7516:          Lt         Title case letter
                   7517:          Lu         Upper case letter
                   7518:          L&         Ll, Lu, or Lt
                   7519: 
                   7520:          M          Mark
                   7521:          Mc         Spacing mark
                   7522:          Me         Enclosing mark
                   7523:          Mn         Non-spacing mark
                   7524: 
                   7525:          N          Number
                   7526:          Nd         Decimal number
                   7527:          Nl         Letter number
                   7528:          No         Other number
                   7529: 
                   7530:          P          Punctuation
                   7531:          Pc         Connector punctuation
                   7532:          Pd         Dash punctuation
                   7533:          Pe         Close punctuation
                   7534:          Pf         Final punctuation
                   7535:          Pi         Initial punctuation
                   7536:          Po         Other punctuation
                   7537:          Ps         Open punctuation
                   7538: 
                   7539:          S          Symbol
                   7540:          Sc         Currency symbol
                   7541:          Sk         Modifier symbol
                   7542:          Sm         Mathematical symbol
                   7543:          So         Other symbol
                   7544: 
                   7545:          Z          Separator
                   7546:          Zl         Line separator
                   7547:          Zp         Paragraph separator
                   7548:          Zs         Space separator
                   7549: 
                   7550: 
                   7551: PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P
                   7552: 
                   7553:          Xan        Alphanumeric: union of properties L and N
                   7554:          Xps        POSIX space: property Z or tab, NL, VT, FF, CR
                   7555:          Xsp        Perl space: property Z or tab, NL, FF, CR
1.1.1.4 ! misho    7556:          Xuc        Universally-named character: one that can be
        !          7557:                       represented by a Universal Character Name
1.1       misho    7558:          Xwd        Perl word: property Xan or underscore
                   7559: 
                   7560: 
                   7561: SCRIPT NAMES FOR \p AND \P
                   7562: 
1.1.1.3   misho    7563:        Arabic,  Armenian,  Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,
                   7564:        Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,  Chakma,
                   7565:        Cham,  Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
                   7566:        Devanagari,  Egyptian_Hieroglyphs,  Ethiopic,   Georgian,   Glagolitic,
                   7567:        Gothic,  Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
                   7568:        gana,  Imperial_Aramaic,  Inherited,  Inscriptional_Pahlavi,   Inscrip-
                   7569:        tional_Parthian,   Javanese,   Kaithi,   Kannada,  Katakana,  Kayah_Li,
                   7570:        Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B,  Lisu,  Lycian,
                   7571:        Lydian,    Malayalam,    Mandaic,    Meetei_Mayek,    Meroitic_Cursive,
                   7572:        Meroitic_Hieroglyphs,  Miao,  Mongolian,  Myanmar,  New_Tai_Lue,   Nko,
                   7573:        Ogham,    Old_Italic,   Old_Persian,   Old_South_Arabian,   Old_Turkic,
                   7574:        Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic,  Samari-
                   7575:        tan,  Saurashtra,  Sharada,  Shavian, Sinhala, Sora_Sompeng, Sundanese,
                   7576:        Syloti_Nagri, Syriac, Tagalog, Tagbanwa,  Tai_Le,  Tai_Tham,  Tai_Viet,
                   7577:        Takri,  Tamil,  Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai,
                   7578:        Yi.
1.1       misho    7579: 
                   7580: 
                   7581: CHARACTER CLASSES
                   7582: 
                   7583:          [...]       positive character class
                   7584:          [^...]      negative character class
                   7585:          [x-y]       range (can be used for hex characters)
                   7586:          [[:xxx:]]   positive POSIX named set
                   7587:          [[:^xxx:]]  negative POSIX named set
                   7588: 
                   7589:          alnum       alphanumeric
                   7590:          alpha       alphabetic
                   7591:          ascii       0-127
                   7592:          blank       space or tab
                   7593:          cntrl       control character
                   7594:          digit       decimal digit
                   7595:          graph       printing, excluding space
                   7596:          lower       lower case letter
                   7597:          print       printing, including space
                   7598:          punct       printing, excluding alphanumeric
1.1.1.3   misho    7599:          space       white space
1.1       misho    7600:          upper       upper case letter
                   7601:          word        same as \w
                   7602:          xdigit      hexadecimal digit
                   7603: 
                   7604:        In PCRE, POSIX character set names recognize only ASCII  characters  by
                   7605:        default,  but  some  of them use Unicode properties if PCRE_UCP is set.
                   7606:        You can use \Q...\E inside a character class.
                   7607: 
                   7608: 
                   7609: QUANTIFIERS
                   7610: 
                   7611:          ?           0 or 1, greedy
                   7612:          ?+          0 or 1, possessive
                   7613:          ??          0 or 1, lazy
                   7614:          *           0 or more, greedy
                   7615:          *+          0 or more, possessive
                   7616:          *?          0 or more, lazy
                   7617:          +           1 or more, greedy
                   7618:          ++          1 or more, possessive
                   7619:          +?          1 or more, lazy
                   7620:          {n}         exactly n
                   7621:          {n,m}       at least n, no more than m, greedy
                   7622:          {n,m}+      at least n, no more than m, possessive
                   7623:          {n,m}?      at least n, no more than m, lazy
                   7624:          {n,}        n or more, greedy
                   7625:          {n,}+       n or more, possessive
                   7626:          {n,}?       n or more, lazy
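
The difference between greedy and lazy quantifiers is easiest to see on a
small subject. This is a minimal sketch using Python's standard re module as
a stand-in, on the assumption that these particular quantifiers behave the
same way there as in PCRE. Possessive forms are left out because older
versions of re do not accept them.

```python
import re

# Stand-in for PCRE: Python's re module shares these quantifier forms.
subject = "<a><b>"

greedy = re.match(r"<.+>", subject).group()   # takes as much as possible
lazy = re.match(r"<.+?>", subject).group()    # takes as little as possible

bounded = re.match(r"a{2,3}", "aaaa").group()        # greedy, capped at 3
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()  # stops at the minimum

print(greedy, lazy, bounded, bounded_lazy)
```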
                   7627: 
                   7628: 
                   7629: ANCHORS AND SIMPLE ASSERTIONS
                   7630: 
                   7631:          \b          word boundary
                   7632:          \B          not a word boundary
                   7633:          ^           start of subject
                   7634:                       also after internal newline in multiline mode
                   7635:          \A          start of subject
                   7636:          $           end of subject
                   7637:                       also before newline at end of subject
                   7638:                       also before internal newline in multiline mode
                   7639:          \Z          end of subject
                   7640:                       also before newline at end of subject
                   7641:          \z          end of subject
                   7642:          \G          first matching position in subject
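
The interaction of ^ and $ with newlines can be sketched as follows. The
example uses Python's standard re module as an approximation of PCRE; its
re.MULTILINE flag is assumed to play the role of multiline mode. \Z is
deliberately left out, because in Python it corresponds to PCRE's \z rather
than to PCRE's \Z.

```python
import re

# Approximation only: Python's re module, not PCRE.
subject = "one\ntwo\n"

# $ also matches before a newline at the very end of the subject.
end_default = re.search(r"two$", subject)

# Without multiline mode, $ does not match before an internal newline.
internal_default = re.search(r"one$", subject)
internal_multiline = re.search(r"one$", subject, re.MULTILINE)

# ^ matches only at the start unless multiline mode is on.
starts_default = re.findall(r"^\w+", subject)
starts_multiline = re.findall(r"^\w+", subject, re.MULTILINE)

print(starts_default, starts_multiline)
```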
                   7643: 
                   7644: 
                   7645: MATCH POINT RESET
                   7646: 
                   7647:          \K          reset start of match
                   7648: 
                   7649: 
                   7650: ALTERNATION
                   7651: 
                   7652:          expr|expr|expr...
                   7653: 
                   7654: 
                   7655: CAPTURING
                   7656: 
                   7657:          (...)           capturing group
                   7658:          (?<name>...)    named capturing group (Perl)
                   7659:          (?'name'...)    named capturing group (Perl)
                   7660:          (?P<name>...)   named capturing group (Python)
                   7661:          (?:...)         non-capturing group
                   7662:          (?|...)         non-capturing group; reset group numbers for
                   7663:                           capturing groups in each alternative
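
As a brief illustration of how group numbering works, the sketch below uses
the Python-style named group, which Python's standard re module (used here
in place of PCRE) also accepts. The date string and group name are made up
for the example.

```python
import re

# Python's re module stands in for PCRE; (?P<name>...) is common to both.
m = re.match(r"(?P<year>\d{4})-(?:\d{2})-(\d{2})", "2013-04-26")

year_by_name = m.group("year")  # a named group is also numbered
year_by_number = m.group(1)
day = m.group(2)                # (?:...) does not consume a group number

print(year_by_name, year_by_number, day, m.groups())
```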
                   7664: 
                   7665: 
                   7666: ATOMIC GROUPS
                   7667: 
                   7668:          (?>...)         atomic, non-capturing group
                   7669: 
                   7670: 
                   7671: COMMENT
                   7672: 
                   7673:          (?#....)        comment (not nestable)
                   7674: 
                   7675: 
                   7676: OPTION SETTING
                   7677: 
                   7678:          (?i)            caseless
                   7679:          (?J)            allow duplicate names
                   7680:          (?m)            multiline
                   7681:          (?s)            single line (dotall)
                   7682:          (?U)            default ungreedy (lazy)
                   7683:          (?x)            extended (ignore white space)
                   7684:          (?-...)         unset option(s)
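
A few of these settings can be tried out with Python's standard re module,
which is assumed here as a stand-in because it accepts (?i), (?m), (?s), and
(?x) at the start of a pattern with the same meaning. Only options that re
also accepts are shown; the subjects are illustrative.

```python
import re

# Stand-in for PCRE: only the inline options that Python's re shares.
caseless = re.search(r"(?i)pcre", "The PCRE library")

dotall = re.match(r"(?s)a.b", "a\nb")   # dot matches the newline
no_dotall = re.match(r"a.b", "a\nb")    # dot does not match the newline

# In extended mode, white space in the pattern is ignored.
extended = re.match(r"(?x) \d+ \s* kg", "42 kg")

multiline = re.findall(r"(?m)^\w+$", "ab\ncd")

print(caseless.group(), extended.group(), multiline)
```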
                   7685: 
                   7686:        The following are recognized only at the start of a  pattern  or  after
                   7687:        one of the newline-setting options with similar syntax:
                   7688: 
1.1.1.4 ! misho    7689:          (*LIMIT_MATCH=d) set the match limit to d (decimal number)
        !          7690:          (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
1.1       misho    7691:          (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
1.1.1.2   misho    7692:          (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
                   7693:          (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
1.1.1.4 ! misho    7694:          (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
        !          7695:          (*UTF)          set appropriate UTF mode for the library in use
1.1       misho    7696:          (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
                   7697: 
                   7698: 
                   7699: LOOKAHEAD AND LOOKBEHIND ASSERTIONS
                   7700: 
                   7701:          (?=...)         positive look ahead
                   7702:          (?!...)         negative look ahead
                   7703:          (?<=...)        positive look behind
                   7704:          (?<!...)        negative look behind
                   7705: 
                   7706:        Each top-level branch of a look behind must be of a fixed length.
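
The four assertion forms, and the fixed-length rule for look behinds, can be
sketched with Python's standard re module. It is used here as an assumption
in place of PCRE, since it accepts the same syntax and likewise rejects a
variable-length look behind. The subject string is invented for the example.

```python
import re

# Python's re module stands in for PCRE here.
prices = "cost $45, tax 10%, total $50"

after_dollar = re.findall(r"(?<=\$)\d+", prices)         # positive look behind
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)   # negative look behind
before_percent = re.findall(r"\d+(?=%)", prices)         # positive look ahead
not_before_percent = re.findall(r"\b\d+\b(?!%)", prices) # negative look ahead

# A look behind whose branch has no fixed length is rejected at compile time.
try:
    re.compile(r"(?<=\$+)\d+")
    variable_lookbehind_ok = True
except re.error:
    variable_lookbehind_ok = False

print(after_dollar, not_after_dollar, before_percent, not_before_percent)
```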
                   7707: 
                   7708: 
                   7709: BACKREFERENCES
                   7710: 
                   7711:          \n              reference by number (can be ambiguous)
                   7712:          \gn             reference by number
                   7713:          \g{n}           reference by number
                   7714:          \g{-n}          relative reference by number
                   7715:          \k<name>        reference by name (Perl)
                   7716:          \k'name'        reference by name (Perl)
                   7717:          \g{name}        reference by name (Perl)
                   7718:          \k{name}        reference by name (.NET)
                   7719:          (?P=name)       reference by name (Python)
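
Two of these forms, \n and (?P=name), are also accepted by Python's standard
re module, which the following sketch assumes as a stand-in for PCRE. It
finds a doubled word by number and a quoted string by name; the subjects and
the group name q are illustrative.

```python
import re

# Stand-in for PCRE: \1 and (?P=name) are shared with Python's re module.
doubled = re.search(r"\b(\w+) \1\b", "Paris in the the spring").group()

# The closing quote must be the same character as the opening one.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group()

print(doubled)
print(quoted)
```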
                   7720: 
                   7721: 
                   7722: SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
                   7723: 
                   7724:          (?R)            recurse whole pattern
                   7725:          (?n)            call subpattern by absolute number
                   7726:          (?+n)           call subpattern by relative number
                   7727:          (?-n)           call subpattern by relative number
                   7728:          (?&name)        call subpattern by name (Perl)
                   7729:          (?P>name)       call subpattern by name (Python)
                   7730:          \g<name>        call subpattern by name (Oniguruma)
                   7731:          \g'name'        call subpattern by name (Oniguruma)
                   7732:          \g<n>           call subpattern by absolute number (Oniguruma)
                   7733:          \g'n'           call subpattern by absolute number (Oniguruma)
                   7734:          \g<+n>          call subpattern by relative number (PCRE extension)
                   7735:          \g'+n'          call subpattern by relative number (PCRE extension)
                   7736:          \g<-n>          call subpattern by relative number (PCRE extension)
                   7737:          \g'-n'          call subpattern by relative number (PCRE extension)
                   7738: 
                   7739: 
                   7740: CONDITIONAL PATTERNS
                   7741: 
                   7742:          (?(condition)yes-pattern)
                   7743:          (?(condition)yes-pattern|no-pattern)
                   7744: 
                   7745:          (?(n)...        absolute reference condition
                   7746:          (?(+n)...       relative reference condition
                   7747:          (?(-n)...       relative reference condition
                   7748:          (?(<name>)...   named reference condition (Perl)
                   7749:          (?('name')...   named reference condition (Perl)
                   7750:          (?(name)...     named reference condition (PCRE)
                   7751:          (?(R)...        overall recursion condition
                   7752:          (?(Rn)...       specific group recursion condition
                   7753:          (?(R&name)...   specific recursion condition
                   7754:          (?(DEFINE)...   define subpattern for reference
                   7755:          (?(assert)...   assertion condition
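
An absolute reference condition can be sketched as follows, again using
Python's standard re module on the assumption that it treats (?(n)...) the
same way as PCRE for this simple case. The pattern accepts a number with or
without surrounding parentheses, but not with only one of them.

```python
import re

# Python's re module stands in for PCRE. Group 1 is the optional "(";
# the condition (?(1)\)) demands ")" only if group 1 took part in the match.
balanced = re.compile(r"^(\()?\d+(?(1)\))$")

subjects = ("12", "(12)", "(12", "12)")
results = [bool(balanced.match(s)) for s in subjects]

print(dict(zip(subjects, results)))
```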
                   7756: 
                   7757: 
                   7758: BACKTRACKING CONTROL
                   7759: 
                   7760:        The following act as soon as they are reached:
                   7761: 
                   7762:          (*ACCEPT)       force successful match
                   7763:          (*FAIL)         force backtrack; synonym (*F)
1.1.1.2   misho    7764:          (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
1.1       misho    7765: 
                   7766:        The  following  act only when a subsequent match failure causes a back-
                   7767:        track to reach them. They all force a match failure, but they differ in
                   7768:        what happens afterwards. Those that advance the start-of-match point do
                   7769:        so only if the pattern is not anchored.
                   7770: 
                   7771:          (*COMMIT)       overall failure, no advance of starting point
                   7772:          (*PRUNE)        advance to next starting character
1.1.1.2   misho    7773:          (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
                   7774:          (*SKIP)         advance to current matching position
                   7775:          (*SKIP:NAME)    advance to position corresponding to an earlier
                   7776:                          (*MARK:NAME); if not found, the (*SKIP) is ignored
1.1       misho    7777:          (*THEN)         local failure, backtrack to next alternation
1.1.1.2   misho    7778:          (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
1.1       misho    7779: 
                   7780: 
                   7781: NEWLINE CONVENTIONS
                   7782: 
                   7783:        These are recognized only at the very start of the pattern or  after  a
1.1.1.4 ! misho    7784:        (*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
1.1       misho    7785: 
                   7786:          (*CR)           carriage return only
                   7787:          (*LF)           linefeed only
                   7788:          (*CRLF)         carriage return followed by linefeed
                   7789:          (*ANYCRLF)      all three of the above
                   7790:          (*ANY)          any Unicode newline sequence
                   7791: 
                   7792: 
                   7793: WHAT \R MATCHES
                   7794: 
                   7795:        These  are  recognized only at the very start of the pattern or after a
1.1.1.2   misho    7796:        (*...) option that sets the newline convention or a UTF or UCP mode.
1.1       misho    7797: 
                   7798:          (*BSR_ANYCRLF)  CR, LF, or CRLF
                   7799:          (*BSR_UNICODE)  any Unicode newline sequence
                   7800: 
                   7801: 
                   7802: CALLOUTS
                   7803: 
                   7804:          (?C)      callout
                   7805:          (?Cn)     callout with data n
                   7806: 
                   7807: 
                   7808: SEE ALSO
                   7809: 
                   7810:        pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
                   7811: 
                   7812: 
                   7813: AUTHOR
                   7814: 
                   7815:        Philip Hazel
                   7816:        University Computing Service
                   7817:        Cambridge CB2 3QH, England.
                   7818: 
                   7819: 
                   7820: REVISION
                   7821: 
1.1.1.4 ! misho    7822:        Last updated: 26 April 2013
        !          7823:        Copyright (c) 1997-2013 University of Cambridge.
1.1       misho    7824: ------------------------------------------------------------------------------
                   7825: 
                   7826: 
1.1.1.4 ! misho    7827: PCREUNICODE(3)             Library Functions Manual             PCREUNICODE(3)
        !          7828: 
1.1       misho    7829: 
                   7830: 
                   7831: NAME
                   7832:        PCRE - Perl-compatible regular expressions
                   7833: 
1.1.1.4 ! misho    7834: UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT
1.1       misho    7835: 
1.1.1.4 ! misho    7836:        As well as UTF-8 support, PCRE also supports UTF-16 (from release 8.30)
        !          7837:        and UTF-32 (from release 8.32), by means of two  additional  libraries.
        !          7838:        They can be built as well as, or instead of, the 8-bit library.
1.1.1.2   misho    7839: 
                   7840: 
                   7841: UTF-8 SUPPORT
1.1       misho    7842: 
1.1.1.2   misho    7843:        In  order to process UTF-8 strings, you must build PCRE's 8-bit library
                   7844:        with UTF support, and, in addition, you must call  pcre_compile()  with
                   7845:        the  PCRE_UTF8 option flag, or the pattern must start with the sequence
1.1.1.4 ! misho    7846:        (*UTF8) or (*UTF). When either of these is the case, both  the  pattern
        !          7847:        and  any  subject  strings  that  are matched against it are treated as
        !          7848:        UTF-8 strings instead of strings of individual 1-byte characters.
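
What it means to treat a subject as characters rather than as individual
1-byte units can be shown by analogy. The sketch below is not PCRE: it uses
Python's standard re module, where a bytes pattern matches one byte at a
time and a str pattern matches whole characters. In PCRE itself the switch
is made by PCRE_UTF8, (*UTF8), or (*UTF), as described above.

```python
import re

# Analogy only: bytes patterns ~ non-UTF mode, str patterns ~ UTF mode.
text = "\u00e9t\u00e9"             # three characters
utf8_bytes = text.encode("utf-8")  # five bytes: U+00E9 takes two bytes each

units_as_bytes = re.findall(rb"(?s).", utf8_bytes)  # one item per byte
units_as_chars = re.findall(r".", text)             # one item per character

print(len(units_as_bytes), len(units_as_chars))
```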
        !          7849: 
        !          7850: 
        !          7851: UTF-16 AND UTF-32 SUPPORT
        !          7852: 
        !          7853:        In  order  to  process  UTF-16 or UTF-32 strings, you must build PCRE's
        !          7854:        16-bit  or  32-bit library with UTF support, and, in addition, you must
        !          7855:        call  pcre16_compile()  or  pcre32_compile()  with  the  PCRE_UTF16  or
        !          7856:        PCRE_UTF32 option flag, as appropriate. Alternatively, the pattern must
        !          7857:        start  with  the  sequence  (*UTF16)  or  (*UTF32), as appropriate,  or
        !          7858:        (*UTF),  which  can  be used with either library. When UTF mode is set,
        !          7859:        both  the  pattern  and any subject strings that are matched against it
        !          7860:        are  treated  as  UTF-16  or  UTF-32  strings  instead  of  strings  of
        !          7861:        individual 16-bit or 32-bit characters.
1.1.1.2   misho    7862: 
                   7863: 
                   7864: UTF SUPPORT OVERHEAD
                   7865: 
1.1.1.4 ! misho    7866:        If you compile PCRE with UTF support, but do not use it  at  run  time,
        !          7867:        the  library will be a bit bigger, but the additional run time overhead
        !          7868:        is limited to  testing  the  PCRE_UTF[8|16|32]  flag  occasionally,  so
        !          7869:        should not be very big.
1.1.1.2   misho    7870: 
                   7871: 
                   7872: UNICODE PROPERTY SUPPORT
1.1       misho    7873: 
                   7874:        If PCRE is built with Unicode character property support (which implies
1.1.1.4 ! misho    7875:        UTF support), the escape sequences \p{..}, \P{..}, and \X can be  used.
        !          7876:        The  available properties that can be tested are limited to the general
        !          7877:        category properties such as Lu for an upper case letter  or  Nd  for  a
1.1.1.2   misho    7878:        decimal number, the Unicode script names such as Arabic or Han, and the
1.1.1.4 ! misho    7879:        derived properties Any and L&. Full lists are given in the pcrepattern
        !          7880:        and  pcresyntax  documentation. Only the short names for properties are
        !          7881:        supported. For example, \p{L}  matches  a  letter.  Its  Perl  synonym,
        !          7882:        \p{Letter},  is  not  supported.  Furthermore, in Perl, many properties
        !          7883:        may optionally be prefixed by "Is", for compatibility  with  Perl  5.6.
        !          7884:        PCRE does not support this.
1.1       misho    7885: 
                   7886:    Validity of UTF-8 strings
                   7887: 
1.1.1.4 ! misho    7888:        When  you  set  the PCRE_UTF8 flag, the byte strings passed as patterns
1.1.1.2   misho    7889:        and subjects are (by default) checked for validity on entry to the
1.1.1.3   misho    7890:        relevant functions. The entire string is checked before any other
1.1.1.4 ! misho    7891:        processing takes place. From release 7.3 of PCRE, the check follows the
1.1.1.2   misho    7892:        rules of RFC 3629, which are themselves derived from the Unicode speci-
1.1.1.4 ! misho    7893:        fication. Earlier releases of PCRE followed  the  rules  of  RFC  2279,
        !          7894:        which  allows  the  full  range of 31-bit values (0 to 0x7FFFFFFF). The
        !          7895:        current check allows only values in the range U+0 to U+10FFFF,  exclud-
        !          7896:        ing  the  surrogate area. (From release 8.33 the so-called "non-charac-
        !          7897:        ter" code points are no longer excluded because Unicode corrigendum  #9
        !          7898:        makes it clear that they should not be.)
        !          7899: 
        !          7900:        Characters  in  the "Surrogate Area" of Unicode are reserved for use by
        !          7901:        UTF-16, where they are used in pairs to encode codepoints  with  values
        !          7902:        greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs
        !          7903:        are available independently in the  UTF-8  and  UTF-32  encodings.  (In
        !          7904:        other  words,  the  whole  surrogate  thing is a fudge for UTF-16 which
        !          7905:        unfortunately messes up UTF-8 and UTF-32.)
1.1       misho    7906: 
                   7907:        If an invalid UTF-8 string is passed to PCRE, an error return is given.
1.1.1.4 ! misho    7908:        At  compile  time, the only additional information is the offset to the
1.1.1.3   misho    7909:        first byte of the failing character. The run-time functions pcre_exec()
1.1.1.4 ! misho    7910:        and  pcre_dfa_exec() also pass back this information, as well as a more
        !          7911:        detailed reason code if the caller has provided memory in which  to  do
1.1       misho    7912:        this.
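
The rules above can be sketched as a small stand-alone checker. This is
not PCRE's own validation code; it is an illustrative sketch, and the
function name utf8_first_invalid() is hypothetical. It accepts only
code points in the range U+0 to U+10FFFF outside the surrogate area,
rejects overlong encodings (as RFC 3629 requires), and, like PCRE's
compile-time error, reports the offset of the first byte of the
failing character.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
 * Returns -1 if s[0..len-1] is valid UTF-8 under the rules described
 * above, otherwise the offset of the first byte of the failing character. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;            /* number of continuation bytes expected */
        unsigned long cp, min;   /* decoded value and smallest legal value */
        size_t k;

        if (c < 0x80) { i++; continue; }   /* 1-byte (ASCII) character */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return (long)i;     /* stray continuation byte or 5/6-byte lead */

        if (len - i <= extra) return (long)i;      /* truncated character */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min) return (long)i;                     /* overlong form */
        if (cp > 0x10FFFF) return (long)i;                /* out of range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return (long)i; /* surrogate */

        i += extra + 1;
    }
    return -1;
}
```

For example, the three bytes ED A0 80 would encode the surrogate U+D800,
so a string containing them after one ASCII byte is reported as failing
at offset 1.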
                   7913: 
1.1.1.4 ! misho    7914:        In  some  situations, you may already know that your strings are valid,
        !          7915:        and therefore want to skip these checks in  order  to  improve  perfor-
        !          7916:        mance,  for  example in the case of a long subject string that is being
        !          7917:        scanned repeatedly.  If you set the PCRE_NO_UTF8_CHECK flag at  compile
        !          7918:        time  or  at  run  time, PCRE assumes that the pattern or subject it is
        !          7919:        given (respectively) contains only valid UTF-8 codes. In this case,  it
        !          7920:        does not diagnose an invalid UTF-8 string.
        !          7921: 
        !          7922:        Note  that  passing  PCRE_NO_UTF8_CHECK to pcre_compile() just disables
        !          7923:        the check for the pattern; it does not also apply to  subject  strings.
        !          7924:        If  you  want  to  disable the check for a subject string you must pass
        !          7925:        this option to pcre_exec() or pcre_dfa_exec().
        !          7926: 
        !          7927:        If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, the
        !          7928:        result is undefined and your program may crash.
1.1       misho    7929: 
1.1.1.2   misho    7930:    Validity of UTF-16 strings
1.1       misho    7931: 
1.1.1.2   misho    7932:        When you set the PCRE_UTF16 flag, the strings of 16-bit data units that
                   7933:        are passed as patterns and subjects are (by default) checked for valid-
1.1.1.4 ! misho    7934:        ity  on entry to the relevant functions. Values other than those in the
1.1.1.2   misho    7935:        surrogate range U+D800 to U+DFFF are independent code points. Values in
                   7936:        the surrogate range must be used in pairs in the correct manner.
                   7937: 
1.1.1.4 ! misho    7938:        If  an  invalid  UTF-16  string  is  passed to PCRE, an error return is
        !          7939:        given. At compile time, the only additional information is  the  offset
1.1.1.3   misho    7940:        to the first data unit of the failing character. The run-time functions
1.1.1.2   misho    7941:        pcre16_exec() and pcre16_dfa_exec() also pass back this information, as
1.1.1.4 ! misho    7942:        well  as  a more detailed reason code if the caller has provided memory
        !          7943:        in which to do this.
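
The pairing rule for surrogates can be illustrated in the same way.
The sketch below is not taken from PCRE; the names
utf16_first_invalid() and utf16_decode_pair() are hypothetical. The
first function returns the offset of the first data unit of a failing
character, and the second shows how a correctly paired high and low
surrogate combine into one code point greater than 0xFFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE API.
 * Returns -1 if the 16-bit units form valid UTF-16, otherwise the
 * offset (in data units) of the first unit of the failing character. */
long utf16_first_invalid(const uint16_t *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint16_t u = s[i];

        /* Values outside U+D800..U+DFFF are independent code points. */
        if (u < 0xD800 || u > 0xDFFF) { i++; continue; }

        if (u > 0xDBFF) return (long)i;        /* isolated low surrogate */
        if (i + 1 >= len) return (long)i;      /* high surrogate at end */
        if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
            return (long)i;                    /* high not followed by low */
        i += 2;
    }
    return -1;
}

/* Combine a valid high/low surrogate pair into a code point. */
uint32_t utf16_decode_pair(uint16_t high, uint16_t low)
{
    return 0x10000u + (((uint32_t)high - 0xD800u) << 10)
                    + ((uint32_t)low - 0xDC00u);
}
```

As a usage example, the pair D83D DE00 decodes to U+1F600, whereas a
string whose second unit is a lone DE00 fails at offset 1.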
        !          7944: 
        !          7945:        In some situations, you may already know that your strings  are  valid,
        !          7946:        and  therefore  want  to  skip these checks in order to improve perfor-
        !          7947:        mance. If you set the PCRE_NO_UTF16_CHECK flag at compile  time  or  at
        !          7948:        run time, PCRE assumes that the pattern or subject it is given (respec-
        !          7949:        tively) contains only valid UTF-16 sequences. In this case, it does not
        !          7950:        diagnose  an  invalid  UTF-16 string.  However, if an invalid string is
        !          7951:        passed, the result is undefined.
        !          7952: 
        !          7953:    Validity of UTF-32 strings
        !          7954: 
        !          7955:        When you set the PCRE_UTF32 flag, the strings of 32-bit data units that
        !          7956:        are passed as patterns and subjects are (by default) checked for valid-
        !          7957:        ity on entry to the relevant functions.  This check allows only  values
        !          7958:        in  the  range  U+0 to U+10FFFF, excluding the surrogate area U+D800 to
        !          7959:        U+DFFF.
        !          7960: 
        !          7961:        If an invalid UTF-32 string is passed  to  PCRE,  an  error  return  is
        !          7962:        given.  At  compile time, the only additional information is the offset
        !          7963:        to the first data unit of the failing character. The run-time functions
        !          7964:        pcre32_exec() and pcre32_dfa_exec() also pass back this information, as
1.1.1.3   misho    7965:        well as a more detailed reason code if the caller has  provided  memory
1.1.1.2   misho    7966:        in which to do this.
                   7967: 
1.1.1.3   misho    7968:        In  some  situations, you may already know that your strings are valid,
                   7969:        and therefore want to skip these checks in  order  to  improve  perfor-
1.1.1.4 ! misho    7970:        mance.  If  you  set the PCRE_NO_UTF32_CHECK flag at compile time or at
1.1.1.2   misho    7971:        run time, PCRE assumes that the pattern or subject it is given (respec-
1.1.1.4 ! misho    7972:        tively) contains only valid UTF-32 sequences. In this case, it does not
        !          7973:        diagnose an invalid UTF-32 string.  However, if an  invalid  string  is
        !          7974:        passed, the result is undefined.
1.1.1.2   misho    7975: 
                   7976:    General comments about UTF modes
                   7977: 
1.1.1.4 ! misho    7978:        1.  Codepoints  less  than  256  can be specified in patterns by either
        !          7979:        braced or unbraced hexadecimal escape sequences (for example, \x{b3} or
        !          7980:        \xb3). Larger values have to use braced sequences.
1.1.1.2   misho    7981: 
1.1.1.4 ! misho    7982:        2.  Octal  numbers  up  to  \777 are recognized, and in UTF-8 mode they
1.1.1.2   misho    7983:        match two-byte characters for values greater than \177.
                   7984: 
                   7985:        3. Repeat quantifiers apply to complete UTF characters, not to individ-
                   7986:        ual data units, for example: \x{100}{3}.
                   7987: 
1.1.1.4 ! misho    7988:        4.  The dot metacharacter matches one UTF character instead of a single
1.1.1.2   misho    7989:        data unit.
                   7990: 
1.1.1.4 ! misho    7991:        5. The escape sequence \C can be used to match a single byte  in  UTF-8
        !          7992:        mode,  or  a single 16-bit data unit in UTF-16 mode, or a single 32-bit
        !          7993:        data unit in UTF-32 mode, but its use can lead to some strange  effects
        !          7994:        because  it  breaks up multi-unit characters (see the description of \C
        !          7995:        in the pcrepattern documentation). The use of \C is  not  supported  in
        !          7996:        the  alternative  matching  function  pcre[16|32]_dfa_exec(), nor is it
        !          7997:        supported in UTF mode by the JIT optimization of pcre[16|32]_exec(). If
        !          7998:        JIT  optimization  is  requested for a UTF pattern that contains \C, it
        !          7999:        will not succeed, and so the matching will be carried out by the normal
        !          8000:        interpretive function.
1.1       misho    8001: 
1.1.1.3   misho    8002:        6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
1.1       misho    8003:        test characters of any code value, but, by default, the characters that
1.1.1.3   misho    8004:        PCRE  recognizes  as digits, spaces, or word characters remain the same
                   8005:        set as in non-UTF mode, all with values less  than  256.  This  remains
                   8006:        true  even  when  PCRE  is  built  to include Unicode property support,
1.1.1.2   misho    8007:        because to do otherwise would slow down PCRE in many common cases. Note
1.1.1.3   misho    8008:        in  particular that this applies to \b and \B, because they are defined
1.1.1.2   misho    8009:        in terms of \w and \W. If you really want to test for a wider sense of,
1.1.1.3   misho    8010:        say,  "digit",  you  can  use  explicit  Unicode property tests such as
1.1.1.2   misho    8011:        \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the
1.1.1.3   misho    8012:        character  escapes  work is changed so that Unicode properties are used
1.1.1.2   misho    8013:        to determine which characters match. There are more details in the sec-
                   8014:        tion on generic character types in the pcrepattern documentation.
1.1       misho    8015: 
1.1.1.3   misho    8016:        7.  Similarly,  characters that match the POSIX named character classes
1.1       misho    8017:        are all low-valued characters, unless the PCRE_UCP option is set.
                   8018: 
1.1.1.3   misho    8019:        8. However, the horizontal and vertical white  space  matching  escapes
                   8020:        (\h,  \H,  \v, and \V) do match all the appropriate Unicode characters,
1.1       misho    8021:        whether or not PCRE_UCP is set.
                   8022: 
1.1.1.3   misho    8023:        9. Case-insensitive matching applies only to  characters  whose  values
                   8024:        are  less than 128, unless PCRE is built with Unicode property support.
1.1.1.4 ! misho    8025:        A few Unicode characters such as Greek sigma have more than  two  code-
        !          8026:        points that are case-equivalent. Up to and including PCRE release 8.31,
        !          8027:        only one-to-one case mappings were supported, but later releases  (with
        !          8028:        Unicode  property  support) do treat as case-equivalent all versions of
        !          8029:        characters such as Greek sigma.
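
Points 2 and 3 above depend on how many bytes a code point occupies in
UTF-8. The following encoder is an illustrative sketch only (the name
utf8_encode() is hypothetical and is not a PCRE function). It shows
that octal values from \200 to \777 all fall in the two-byte range, and
that a character such as \x{100} is two bytes long, so \x{100}{3}
matches six bytes of subject.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
 * Encodes one code point as UTF-8 into out[] and returns the number of
 * bytes written, or 0 if cp is a surrogate or greater than 0x10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* up to octal \177: one byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* includes \200 to \777 */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate area */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp > 0x10FFFF) return 0;           /* beyond the Unicode range */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Calling utf8_encode(0777, buf) yields the two bytes C7 BF, and
utf8_encode(0x100, buf) yields C4 80.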
1.1       misho    8030: 
                   8031: 
                   8032: AUTHOR
                   8033: 
                   8034:        Philip Hazel
                   8035:        University Computing Service
                   8036:        Cambridge CB2 3QH, England.
                   8037: 
                   8038: 
                   8039: REVISION
                   8040: 
1.1.1.4 ! misho    8041:        Last updated: 27 February 2013
        !          8042:        Copyright (c) 1997-2013 University of Cambridge.
1.1       misho    8043: ------------------------------------------------------------------------------
                   8044: 
                   8045: 
1.1.1.4 ! misho    8046: PCREJIT(3)                 Library Functions Manual                 PCREJIT(3)
        !          8047: 
1.1       misho    8048: 
                   8049: 
                   8050: NAME
                   8051:        PCRE - Perl-compatible regular expressions
                   8052: 
                   8053: PCRE JUST-IN-TIME COMPILER SUPPORT
                   8054: 
                   8055:        Just-in-time  compiling  is a heavyweight optimization that can greatly
                   8056:        speed up pattern matching. However, it comes at the cost of extra  pro-
                   8057:        cessing before the match is performed. Therefore, it is of most benefit
                   8058:        when the same pattern is going to be matched many times. This does  not
1.1.1.2   misho    8059:        necessarily  mean  many calls of a matching function; if the pattern is
                   8060:        not anchored, matching attempts may take place many  times  at  various
                   8061:        positions  in  the  subject, even for a single call.  Therefore, if the
1.1       misho    8062:        subject string is very long, it may still pay to use  JIT  for  one-off
                   8063:        matches.
                   8064: 
1.1.1.2   misho    8065:        JIT  support  applies  only to the traditional Perl-compatible matching
                   8066:        function.  It does not apply when the DFA matching  function  is  being
                   8067:        used. The code for this support was written by Zoltan Herczeg.
                   8068: 
                   8069: 
1.1.1.4 ! misho    8070: 8-BIT, 16-BIT AND 32-BIT SUPPORT
1.1.1.2   misho    8071: 
1.1.1.4 ! misho    8072:        JIT  support  is available for all of the 8-bit, 16-bit and 32-bit PCRE
        !          8073:        libraries. To keep this documentation simple, only the 8-bit  interface
        !          8074:        is described in what follows. If you are using the 16-bit library, sub-
        !          8075:        stitute the  16-bit  functions  and  16-bit  structures  (for  example,
        !          8076:        pcre16_jit_stack  instead  of  pcre_jit_stack).  If  you  are using the
        !          8077:        32-bit library, substitute the 32-bit functions and  32-bit  structures
        !          8078:        (for example, pcre32_jit_stack instead of pcre_jit_stack).
1.1       misho    8079: 
                   8080: 
                   8081: AVAILABILITY OF JIT SUPPORT
                   8082: 
                   8083:        JIT  support  is  an  optional  feature of PCRE. The "configure" option
                   8084:        --enable-jit (or equivalent CMake option) must  be  set  when  PCRE  is
                   8085:        built  if  you want to use JIT. The support is limited to the following
                   8086:        hardware platforms:
                   8087: 
                   8088:          ARM v5, v7, and Thumb2
                   8089:          Intel x86 32-bit and 64-bit
                   8090:          MIPS 32-bit
1.1.1.2   misho    8091:          Power PC 32-bit and 64-bit
1.1.1.4 ! misho    8092:          SPARC 32-bit (experimental)
1.1       misho    8093: 
1.1.1.3   misho    8094:        If --enable-jit is set on an unsupported platform, compilation fails.
1.1       misho    8095: 
                   8096:        A program that is linked with PCRE 8.20 or later can tell if  JIT  sup-
                   8097:        port  is  available  by  calling pcre_config() with the PCRE_CONFIG_JIT
                   8098:        option. The result is 1 when JIT is available, and  0  otherwise.  How-
                   8099:        ever, a simple program does not need to check this in order to use JIT.
1.1.1.4 ! misho    8100:        The normal API is implemented in a way that falls back to the interpre-
        !          8101:        tive code if JIT is not available. For programs that need the best pos-
        !          8102:        sible performance, there is also a "fast path"  API  that  is  JIT-spe-
        !          8103:        cific.
1.1       misho    8104: 
                   8105:        If  your program may sometimes be linked with versions of PCRE that are
                   8106:        older than 8.20, but you want to use JIT when it is available, you  can
                   8107:        test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT
                   8108:        macro such as PCRE_CONFIG_JIT, for compile-time control of your code.
                   8109: 
                   8110: 
                   8111: SIMPLE USE OF JIT
                   8112: 
                   8113:        You have to do two things to make use of the JIT support  in  the  sim-
                   8114:        plest way:
                   8115: 
                   8116:          (1) Call pcre_study() with the PCRE_STUDY_JIT_COMPILE option for
                   8117:              each compiled pattern, and pass the resulting pcre_extra block to
                   8118:              pcre_exec().
                   8119: 
                   8120:          (2) Use pcre_free_study() to free the pcre_extra block when it is
1.1.1.4 ! misho    8121:              no longer needed, instead of just freeing it yourself. This
        !          8122:              ensures that any JIT data is also freed.
1.1       misho    8124: 
1.1.1.4 ! misho    8125:        For a program that may be linked with pre-8.20 versions  of  PCRE,  you
1.1       misho    8126:        can insert
                   8127: 
                   8128:          #ifndef PCRE_STUDY_JIT_COMPILE
                   8129:          #define PCRE_STUDY_JIT_COMPILE 0
                   8130:          #endif
                   8131: 
1.1.1.4 ! misho    8132:        so  that  no  option  is passed to pcre_study(), and then use something
1.1       misho    8133:        like this to free the study data:
                   8134: 
                   8135:          #ifdef PCRE_CONFIG_JIT
                   8136:              pcre_free_study(study_ptr);
                   8137:          #else
                   8138:              pcre_free(study_ptr);
                   8139:          #endif
                   8140: 
1.1.1.4 ! misho    8141:        PCRE_STUDY_JIT_COMPILE requests the JIT compiler to generate  code  for
        !          8142:        complete  matches.  If  you  want  to  run  partial  matches  using the
        !          8143:        PCRE_PARTIAL_HARD or  PCRE_PARTIAL_SOFT  options  of  pcre_exec(),  you
        !          8144:        should  set  one  or  both  of the following options in addition to, or
1.1.1.3   misho    8145:        instead of, PCRE_STUDY_JIT_COMPILE when you call pcre_study():
                   8146: 
                   8147:          PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
                   8148:          PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
                   8149: 
1.1.1.4 ! misho    8150:        The JIT compiler generates different optimized code  for  each  of  the
        !          8151:        three  modes  (normal, soft partial, hard partial). When pcre_exec() is
        !          8152:        called, the appropriate code is run if it is available. Otherwise,  the
1.1.1.3   misho    8153:        pattern is matched using interpretive code.
                   8154: 
1.1.1.4 ! misho    8155:        In  some circumstances you may need to call additional functions. These
        !          8156:        are described in the  section  entitled  "Controlling  the  JIT  stack"
1.1       misho    8157:        below.
                   8158: 
1.1.1.4 ! misho    8159:        If  JIT  support  is  not  available,  PCRE_STUDY_JIT_COMPILE  etc. are
1.1.1.3   misho    8160:        ignored, and no JIT data is created. Otherwise, the compiled pattern is
1.1.1.4 ! misho    8161:        passed  to the JIT compiler, which turns it into machine code that exe-
        !          8162:        cutes much faster than the normal interpretive code.  When  pcre_exec()
        !          8163:        is  passed  a  pcre_extra block containing a pointer to JIT code of the
        !          8164:        appropriate mode (normal or hard/soft  partial),  it  obeys  that  code
        !          8165:        instead  of  running  the interpreter. The result is identical, but the
1.1.1.3   misho    8166:        compiled JIT code runs much faster.
1.1       misho    8167: 
1.1.1.4 ! misho    8168:        There are some pcre_exec() options that are not supported for JIT  exe-
        !          8169:        cution.  There  are  also  some  pattern  items that JIT cannot handle.
        !          8170:        Details are given below. In both cases, execution  automatically  falls
        !          8171:        back  to  the  interpretive  code.  If you want to know whether JIT was
        !          8172:        actually used for a particular match, you  should  arrange  for  a  JIT
        !          8173:        callback  function  to  be  set up as described in the section entitled
        !          8174:        "Controlling the JIT stack" below, even if you do not need to supply  a
        !          8175:        non-default  JIT stack. Such a callback function is called whenever JIT
        !          8176:        code is about to be obeyed. If the execution options are not right  for
1.1.1.3   misho    8177:        JIT execution, the callback function is not obeyed.
1.1       misho    8178: 
1.1.1.4 ! misho    8179:        If  the  JIT  compiler finds an unsupported item, no JIT data is gener-
        !          8180:        ated. You can find out if JIT execution is available after  studying  a
        !          8181:        pattern  by  calling  pcre_fullinfo()  with the PCRE_INFO_JIT option. A
        !          8182:        result of 1 means that JIT compilation was successful. A  result  of  0
1.1       misho    8183:        means that JIT support is not available, or the pattern was not studied
1.1.1.4 ! misho    8184:        with PCRE_STUDY_JIT_COMPILE etc., or the JIT compiler was not  able  to
1.1.1.3   misho    8185:        handle the pattern.
1.1       misho    8186: 
                   8187:        Once a pattern has been studied, with or without JIT, it can be used as
                   8188:        many times as you like for matching different subject strings.
                   8189: 
                   8190: 
                   8191: UNSUPPORTED OPTIONS AND PATTERN ITEMS
                   8192: 
1.1.1.4 ! misho    8193:        The only pcre_exec() options that are supported for JIT  execution  are
        !          8194:        PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK, PCRE_NO_UTF32_CHECK, PCRE_NOT-
        !          8195:        BOL,  PCRE_NOTEOL,  PCRE_NOTEMPTY,   PCRE_NOTEMPTY_ATSTART,   PCRE_PAR-
        !          8196:        TIAL_HARD, and PCRE_PARTIAL_SOFT.
        !          8197: 
        !          8198:        The  only  unsupported  pattern items are \C (match a single data unit)
        !          8199:        when running in a UTF mode, and a callout immediately before an  asser-
        !          8200:        tion condition in a conditional group.
1.1       misho    8201: 
                   8202: 
                   8203: RETURN VALUES FROM JIT EXECUTION
                   8204: 
1.1.1.4 ! misho    8205:        When  a  pattern  is matched using JIT execution, the return values are
        !          8206:        the same as those given by the interpretive pcre_exec() code, with  the
        !          8207:        addition  of  one new error code: PCRE_ERROR_JIT_STACKLIMIT. This means
        !          8208:        that the memory used for the JIT stack was insufficient. See  "Control-
1.1       misho    8209:        ling the JIT stack" below for a discussion of JIT stack usage. For com-
1.1.1.4 ! misho    8210:        patibility with the interpretive pcre_exec() code, no  more  than  two-
        !          8211:        thirds  of  the ovector argument is used for passing back captured sub-
1.1       misho    8212:        strings.
                   8213: 
1.1.1.4 ! misho    8214:        The error code PCRE_ERROR_MATCHLIMIT is returned by  the  JIT  code  if
        !          8215:        searching  a  very large pattern tree goes on for too long, as it is in
        !          8216:        the same circumstance when JIT is not used, but the details of  exactly
        !          8217:        what  is  counted are not the same. The PCRE_ERROR_RECURSIONLIMIT error
1.1       misho    8218:        code is never returned by JIT execution.
                   8219: 
                   8220: 
                   8221: SAVING AND RESTORING COMPILED PATTERNS
                   8222: 
1.1.1.4 ! misho    8223:        The code that is generated by the  JIT  compiler  is  architecture-spe-
        !          8224:        cific,  and  is also position dependent. For those reasons it cannot be
        !          8225:        saved (in a file or database) and restored later like the bytecode  and
        !          8226:        other  data  of  a compiled pattern. Saving and restoring compiled pat-
        !          8227:        terns is not something many people do. More detail about this  facility
        !          8228:        is  given in the pcreprecompile documentation. It should be possible to
        !          8229:        run pcre_study() on a saved and restored pattern, and thereby  recreate
        !          8230:        the  JIT  data, but because JIT compilation uses significant resources,
        !          8231:        it is probably not worth doing this; you might as  well  recompile  the
1.1       misho    8232:        original pattern.
                   8233: 
                   8234: 
                   8235: CONTROLLING THE JIT STACK
                   8236: 
                   8237:        When the compiled JIT code runs, it needs a block of memory to use as a
1.1.1.4 ! misho    8238:        stack.  By default, it uses 32K on the  machine  stack.  However,  some
        !          8239:        large   or   complicated  patterns  need  more  than  this.  The  error
        !          8240:        PCRE_ERROR_JIT_STACKLIMIT is given when  there  is  not  enough  stack.
        !          8241:        Three  functions  are provided for managing blocks of memory for use as
        !          8242:        JIT stacks. There is further discussion about the use of JIT stacks  in
1.1       misho    8243:        the section entitled "JIT stack FAQ" below.
                   8244: 
1.1.1.4 ! misho    8245:        The  pcre_jit_stack_alloc() function creates a JIT stack. Its arguments
        !          8246:        are a starting size and a maximum size, and it returns a pointer to  an
        !          8247:        opaque  structure of type pcre_jit_stack, or NULL if there is an error.
        !          8248:        The pcre_jit_stack_free() function can be used to free a stack that  is
        !          8249:        no  longer  needed.  (For  the technically minded: the address space is
1.1       misho    8250:        allocated by mmap or VirtualAlloc.)
                   8251: 
1.1.1.4 ! misho    8252:        JIT uses far less memory for recursion than the interpretive code,  and
        !          8253:        a  maximum  stack size of 512K to 1M should be more than enough for any
1.1       misho    8254:        pattern.
                   8255: 
1.1.1.4 ! misho    8256:        The pcre_assign_jit_stack() function specifies  which  stack  JIT  code
1.1       misho    8257:        should use. Its arguments are as follows:
                   8258: 
                   8259:          pcre_extra         *extra
                   8260:          pcre_jit_callback  callback
                   8261:          void               *data
                   8262: 
1.1.1.4 ! misho    8263:        The  extra  argument  must  be  the  result  of studying a pattern with
1.1.1.3   misho    8264:        PCRE_STUDY_JIT_COMPILE etc. There are three cases for the values of the
1.1       misho    8265:        other two options:
                   8266: 
                   8267:          (1) If callback is NULL and data is NULL, an internal 32K block
                   8268:              on the machine stack is used.
                   8269: 
                   8270:          (2) If callback is NULL and data is not NULL, data must be
                   8271:              a valid JIT stack, the result of calling pcre_jit_stack_alloc().
                   8272: 
1.1.1.3   misho    8273:          (3) If callback is not NULL, it must point to a function that is
                   8274:              called with data as an argument at the start of matching, in
                   8275:              order to set up a JIT stack. If the return from the callback
                   8276:              function is NULL, the internal 32K stack is used; otherwise the
                   8277:              return value must be a valid JIT stack, the result of calling
                   8278:              pcre_jit_stack_alloc().
                   8279: 
1.1.1.4 ! misho    8280:        A  callback function is obeyed whenever JIT code is about to be run; it
        !          8281:        is not obeyed when pcre_exec() is called with options that  are  incom-
1.1.1.3   misho    8282:        patible for JIT execution. A callback function can therefore be used to
1.1.1.4 ! misho    8283:        determine whether a match operation was  executed  by  JIT  or  by  the
1.1.1.3   misho    8284:        interpreter.
                   8285: 
                   8286:        You may safely use the same JIT stack for more than one pattern (either
1.1.1.4 ! misho    8287:        by assigning directly or by callback), as long as the patterns are  all
        !          8288:        matched  sequentially in the same thread. In a multithread application,
        !          8289:        if you do not specify a JIT stack, or if you assign or pass  back  NULL
        !          8290:        from  a  callback, that is thread-safe, because each thread has its own
        !          8291:        machine stack. However, if you assign  or  pass  back  a  non-NULL  JIT
        !          8292:        stack,  this  must  be  a  different  stack for each thread so that the
1.1.1.3   misho    8293:        application is thread-safe.
                   8294: 
1.1.1.4 ! misho    8295:        Strictly speaking, even more is allowed. You can assign the  same  non-
        !          8296:        NULL  stack  to any number of patterns as long as they are not used for
        !          8297:        matching by multiple threads at the same time.  For  example,  you  can
        !          8298:        assign  the same stack to all compiled patterns, and use a global mutex
        !          8299:        in the callback to wait until the stack is available for use.  However,
1.1.1.3   misho    8300:        this is an inefficient solution, and not recommended.
1.1       misho    8301: 
1.1.1.4 ! misho    8302:        This  is a suggestion for how a multithreaded program that needs to set
1.1.1.3   misho    8303:        up non-default JIT stacks might operate:
1.1       misho    8304: 
                   8305:          During thread initialization
                   8306:            thread_local_var = pcre_jit_stack_alloc(...)
                   8307: 
                   8308:          During thread exit
                   8309:            pcre_jit_stack_free(thread_local_var)
                   8310: 
                   8311:          Use a one-line callback function
                   8312:            return thread_local_var
                   8313: 
1.1.1.4 ! misho    8314:        All the functions described in this section do nothing if  JIT  is  not
        !          8315:        available,  and  pcre_assign_jit_stack()  does nothing unless the extra
        !          8316:        argument is non-NULL and points to  a  pcre_extra  block  that  is  the
1.1.1.3   misho    8317:        result of a successful study with PCRE_STUDY_JIT_COMPILE etc.
1.1       misho    8318: 
                   8319: 
                   8320: JIT STACK FAQ
                   8321: 
                   8322:        (1) Why do we need JIT stacks?
                   8323: 
1.1.1.4 ! misho    8324:        PCRE  (and JIT) is a recursive, depth-first engine, so it needs a stack
        !          8325:        where the local data of the current node is pushed before checking  its
1.1       misho    8326:        child nodes.  Allocating real machine stack on some platforms is diffi-
                   8327:        cult. For example, the stack chain needs to be updated every time if we
1.1.1.4 ! misho    8328:        extend  the  stack  on  PowerPC.  Although it is possible, its updating
1.1       misho    8329:        time overhead decreases performance. So we do the recursion in memory.
                   8330: 
                   8331:        (2) Why don't we simply allocate blocks of memory with malloc()?
                   8332: 
1.1.1.4 ! misho    8333:        Modern operating systems have a  nice  feature:  they  can  reserve  an
1.1       misho    8334:        address space instead of allocating memory. We can safely allocate mem-
1.1.1.4 ! misho    8335:        ory pages inside this address space, so the stack  could  grow  without
1.1       misho    8336:        moving memory data (this is important because of pointers). Thus we can
1.1.1.4 ! misho    8337:        allocate 1M address space, and use only a single memory  page  (usually
        !          8338:        4K)  if  that is enough. However, we can still grow up to 1M anytime if
1.1       misho    8339:        needed.
                   8340: 
                   8341:        (3) Who "owns" a JIT stack?
                   8342: 
                   8343:        The owner of the stack is the user program, not the JIT studied pattern
1.1.1.4 ! misho    8344:        or anything else. The user program must ensure that while  a  stack  is
        !          8345:        in use by pcre_exec() (that is, it is assigned to the pattern currently
1.1       misho    8346:        running), it is not used by any other thread (to avoid overwriting  the
                   8347:        same memory area). The best practice for multithreaded programs  is  to
1.1.1.4 ! misho    8348:        allocate  a  stack  for  each  thread,  and  return  this  stack
1.1       misho    8349:        through the JIT callback function.
                   8350: 
                   8351:        (4) When should a JIT stack be freed?
                   8352: 
                   8353:        You can free a JIT stack at any time, as long as it will not be used by
1.1.1.4 ! misho    8354:        pcre_exec()  again.  When  you  assign  the  stack to a pattern, only a
        !          8355:        pointer is set. There is no reference counting or any other magic.  You
        !          8356:        can  free  the  patterns  and stacks in any order, anytime. Just do not
        !          8357:        call pcre_exec() with a pattern pointing to an already freed stack,  as
        !          8358:        that will cause a segmentation fault. (Also, do not free a stack in use by
        !          8359:        pcre_exec() in another thread.) You can also replace the  stack  for  a
        !          8360:        pattern  at  any  time.  You  can  even  free the previous stack before
1.1       misho    8361:        assigning a replacement.
                   8362: 
1.1.1.4 ! misho    8363:        (5) Should I allocate/free a  stack  every  time  before/after  calling
1.1       misho    8364:        pcre_exec()?
                   8365: 
1.1.1.4 ! misho    8366:        No,  because  this  is  too  costly in terms of resources. However, you
        !          8367:        could implement some clever idea which releases the stack if it is  not
        !          8368:        used  in  let's  say  two minutes. The JIT callback can help to achieve
        !          8369:        this without keeping a list of the currently JIT studied patterns.
1.1       misho    8370: 
1.1.1.4 ! misho    8371:        (6) OK, the stack is for long term memory allocation. But what  happens
        !          8372:        if  a pattern causes stack overflow with a stack of 1M? Is that 1M kept
1.1       misho    8373:        until the stack is freed?
                   8374: 
1.1.1.4 ! misho    8375:        Especially on embedded systems, it might be a good idea to release mem-
        !          8376:        ory  sometimes  without  freeing the stack. There is no API for this at
        !          8377:        the moment. A function that returns the amount of  memory  currently
        !          8378:        allocated for a stack, and another that releases memory (shrinking  the
1.1       misho    8379:        stack), would probably be a good idea if someone needs this.
                   8380: 
                   8381:        (7) This is too much of a headache. Isn't there any better solution for
                   8382:        JIT stack handling?
                   8383: 
1.1.1.4 ! misho    8384:        No,  thanks to Windows. If POSIX threads were used everywhere, we could
1.1       misho    8385:        throw out this complicated API.
                   8386: 
                   8387: 
                   8388: EXAMPLE CODE
                   8389: 
1.1.1.4 ! misho    8390:        This is a single-threaded example that specifies a  JIT  stack  without
1.1       misho    8391:        using a callback.
                   8392: 
                   8393:          int rc;
                   8394:          int ovector[30];
                   8395:          pcre *re;
                   8396:          pcre_extra *extra;
                   8397:          pcre_jit_stack *jit_stack;
                   8398: 
                   8399:          re = pcre_compile(pattern, 0, &error, &erroffset, NULL);
                   8400:          /* Check for errors */
                   8401:          extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &error);
                   8402:          jit_stack = pcre_jit_stack_alloc(32*1024, 512*1024);
                   8403:          /* Check for error (NULL) */
                   8404:          pcre_assign_jit_stack(extra, NULL, jit_stack);
                   8405:          rc = pcre_exec(re, extra, subject, length, 0, 0, ovector, 30);
                   8406:          /* Check results */
                   8407:          pcre_free(re);
                   8408:          pcre_free_study(extra);
                   8409:          pcre_jit_stack_free(jit_stack);
                   8410: 
                   8411: 
1.1.1.4 ! misho    8412: JIT FAST PATH API
        !          8413: 
        !          8414:        Because  the  API  described  above falls back to interpreted execution
        !          8415:        when JIT is not available, it is convenient for programs that are writ-
        !          8416:        ten  for  general  use  in  many environments. However, calling JIT via
        !          8417:        pcre_exec() does have a performance impact. Programs that  are  written
        !          8418:        for  use  where  JIT  is known to be available, and which need the best
        !          8419:        possible performance, can instead use a "fast path"  API  to  call  JIT
        !          8420:        execution  directly  instead of calling pcre_exec() (obviously only for
        !          8421:        patterns that have been successfully studied by JIT).
        !          8422: 
        !          8423:        The fast path function is called pcre_jit_exec(), and it takes  exactly
        !          8424:        the  same  arguments  as pcre_exec(), plus one additional argument that
        !          8425:        must point to a JIT stack. The JIT stack arrangements  described  above
        !          8426:        do not apply. The return values are the same as for pcre_exec().
        !          8427: 
        !          8428:        When  you  call  pcre_exec(), as well as testing for invalid options, a
        !          8429:        number of other sanity checks are performed on the arguments. For exam-
        !          8430:        ple,  if  the  subject  pointer  is NULL, or its length is negative, an
        !          8431:        immediate error is given. Also, unless PCRE_NO_UTF[8|16|32] is  set,  a
        !          8432:        UTF  subject  string is tested for validity. In the interests of speed,
        !          8433:        these checks do not happen on the JIT fast path, and if invalid data is
        !          8434:        passed, the result is undefined.
        !          8435: 
        !          8436:        Bypassing  the  sanity  checks  and  the  pcre_exec() wrapping can give
        !          8437:        speedups of more than 10%.
        !          8438: 
        !          8439: 
1.1       misho    8440: SEE ALSO
                   8441: 
                   8442:        pcreapi(3)
                   8443: 
                   8444: 
                   8445: AUTHOR
                   8446: 
                   8447:        Philip Hazel (FAQ by Zoltan Herczeg)
                   8448:        University Computing Service
                   8449:        Cambridge CB2 3QH, England.
                   8450: 
                   8451: 
                   8452: REVISION
                   8453: 
1.1.1.4 ! misho    8454:        Last updated: 17 March 2013
        !          8455:        Copyright (c) 1997-2013 University of Cambridge.
1.1       misho    8456: ------------------------------------------------------------------------------
                   8457: 
                   8458: 
1.1.1.4 ! misho    8459: PCREPARTIAL(3)             Library Functions Manual             PCREPARTIAL(3)
        !          8460: 
1.1       misho    8461: 
                   8462: 
                   8463: NAME
                   8464:        PCRE - Perl-compatible regular expressions
                   8465: 
                   8466: PARTIAL MATCHING IN PCRE
                   8467: 
1.1.1.2   misho    8468:        In normal use of PCRE, if the subject string that is passed to a match-
                   8469:        ing function matches as far as it goes, but is too short to  match  the
                   8470:        entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances
                   8471:        where it might be helpful to distinguish this case from other cases  in
                   8472:        which there is no match.
1.1       misho    8473: 
                   8474:        Consider, for example, an application where a human is required to type
                   8475:        in data for a field with specific formatting requirements.  An  example
                   8476:        might be a date in the form ddmmmyy, defined by this pattern:
                   8477: 
                   8478:          ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
                   8479: 
                   8480:        If the application sees the user's keystrokes one by one, and can check
                   8481:        that what has been typed so far is potentially valid,  it  is  able  to
                   8482:        raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
                   8483:        reflecting the character that has been typed, for example. This immedi-
                   8484:        ate  feedback is likely to be a better user interface than a check that
                   8485:        is delayed until the entire string has been entered.  Partial  matching
                   8486:        can  also be useful when the subject string is very long and is not all
                   8487:        available at once.
                   8488: 
                   8489:        PCRE supports partial matching by means of  the  PCRE_PARTIAL_SOFT  and
1.1.1.2   misho    8490:        PCRE_PARTIAL_HARD  options,  which  can  be set when calling any of the
                   8491:        matching functions. For backwards compatibility, PCRE_PARTIAL is a syn-
                   8492:        onym  for  PCRE_PARTIAL_SOFT.  The essential difference between the two
                   8493:        options is whether or not a partial match is preferred to  an  alterna-
                   8494:        tive complete match, though the details differ between the two types of
                   8495:        matching function. If both options  are  set,  PCRE_PARTIAL_HARD  takes
                   8496:        precedence.
                   8497: 
1.1.1.3   misho    8498:        If  you  want to use partial matching with just-in-time optimized code,
1.1.1.4 ! misho    8499:        you must call pcre_study(), pcre16_study() or  pcre32_study() with  one
        !          8500:        or both of these options:
1.1.1.3   misho    8501: 
                   8502:          PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
                   8503:          PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
                   8504: 
                   8505:        PCRE_STUDY_JIT_COMPILE  should also be set if you are going to run non-
                   8506:        partial matches on the same pattern. If the appropriate JIT study  mode
                   8507:        has not been set for a match, the interpretive matching code is used.
                   8508: 
                   8509:        Setting a partial matching option disables two of PCRE's standard opti-
                   8510:        mizations. PCRE remembers the last literal data unit in a pattern,  and
                   8511:        abandons  matching  immediately  if  it  is  not present in the subject
1.1.1.2   misho    8512:        string. This optimization cannot be used  for  a  subject  string  that
                   8513:        might  match only partially. If the pattern was studied, PCRE knows the
                   8514:        minimum length of a matching string, and does not  bother  to  run  the
                   8515:        matching  function  on  shorter strings. This optimization is also dis-
1.1       misho    8516:        abled for partial matching.
                   8517: 
                   8518: 
1.1.1.4 ! misho    8519: PARTIAL MATCHING USING pcre_exec() OR pcre[16|32]_exec()
1.1       misho    8520: 
1.1.1.4 ! misho    8521:        A  partial   match   occurs   during   a   call   to   pcre_exec()   or
        !          8522:        pcre[16|32]_exec()  when  the end of the subject string is reached suc-
        !          8523:        cessfully, but matching cannot continue  because  more  characters  are
        !          8524:        needed.   However, at least one character in the subject must have been
        !          8525:        inspected. This character need not  form  part  of  the  final  matched
        !          8526:        string;  lookbehind  assertions and the \K escape sequence provide ways
        !          8527:        of inspecting characters before the start of a matched  substring.  The
        !          8528:        requirement  for  inspecting  at  least one character exists because an
        !          8529:        empty string can always be matched; without such  a  restriction  there
        !          8530:        would  always  be  a partial match of an empty string at the end of the
        !          8531:        subject.
1.1.1.2   misho    8532: 
1.1.1.4 ! misho    8533:        If there are at least two slots in the offsets vector  when  a  partial
        !          8534:        match  is returned, the first slot is set to the offset of the earliest
1.1.1.2   misho    8535:        character that was inspected. For convenience, the second offset points
                   8536:        to the end of the subject so that a substring can easily be identified.
1.1.1.4 ! misho    8537:        If there are at least three slots in the offsets vector, the third slot
        !          8538:        is set to the offset of the character where matching started.
1.1       misho    8539: 
1.1.1.4 ! misho    8540:        For the majority of patterns, the contents of the first and third slots
        !          8541:        will be the same. However, for patterns that contain lookbehind  asser-
        !          8542:        tions, or begin with \b or \B, characters before the one where matching
        !          8543:        started may have been inspected while carrying out the match. For exam-
        !          8544:        ple, consider this pattern:
1.1       misho    8545: 
                   8546:          /(?<=abc)123/
                   8547: 
                   8548:        This pattern matches "123", but only if it is preceded by "abc". If the
1.1.1.4 ! misho    8549:        subject string is "xyzabc12", the first two  offsets  after  a  partial
        !          8550:        match  are for the substring "abc12", because all these characters were
        !          8551:        inspected. However, the third offset is set to 6, because that  is  the
        !          8552:        offset where matching began.
1.1       misho    8553: 
                   8554:        What happens when a partial match is identified depends on which of the
                   8555:        two partial matching options is set.
                   8556: 
1.1.1.4 ! misho    8557:    PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre[16|32]_exec()
1.1       misho    8558: 
1.1.1.4 ! misho    8559:        If PCRE_PARTIAL_SOFT is  set  when  pcre_exec()  or  pcre[16|32]_exec()
        !          8560:        identifies a partial match, the partial match is remembered, but match-
        !          8561:        ing continues as normal, and other  alternatives  in  the  pattern  are
        !          8562:        tried.  If  no  complete  match  can  be  found,  PCRE_ERROR_PARTIAL is
        !          8563:        returned instead of PCRE_ERROR_NOMATCH.
        !          8564: 
        !          8565:        This option is "soft" because it prefers a complete match over  a  par-
        !          8566:        tial  match.   All the various matching items in a pattern behave as if
        !          8567:        the subject string is potentially complete. For example, \z, \Z, and  $
        !          8568:        match  at  the end of the subject, as normal, and for \b and \B the end
1.1       misho    8569:        of the subject is treated as a non-alphanumeric.
                   8570: 
1.1.1.4 ! misho    8571:        If there is more than one partial match, the first one that  was  found
1.1       misho    8572:        provides the data that is returned. Consider this pattern:
                   8573: 
                   8574:          /123\w+X|dogY/
                   8575: 
1.1.1.4 ! misho    8576:        If  this is matched against the subject string "abc123dog", both alter-
        !          8577:        natives fail to match, but the end of the  subject  is  reached  during
        !          8578:        matching,  so  PCRE_ERROR_PARTIAL is returned. The offsets are set to 3
        !          8579:        and 9, identifying "123dog" as the first partial match that was  found.
        !          8580:        (In  this  example, there are two partial matches, because "dog" on its
1.1       misho    8581:        own partially matches the second alternative.)
                   8582: 
1.1.1.4 ! misho    8583:    PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre[16|32]_exec()
1.1       misho    8584: 
1.1.1.4 ! misho    8585:        If PCRE_PARTIAL_HARD is  set  for  pcre_exec()  or  pcre[16|32]_exec(),
        !          8586:        PCRE_ERROR_PARTIAL  is  returned  as  soon as a partial match is found,
1.1.1.2   misho    8587:        without continuing to search for possible complete matches. This option
                   8588:        is "hard" because it prefers an earlier partial match over a later com-
1.1.1.4 ! misho    8589:        plete match. For this reason, the assumption is made that  the  end  of
        !          8590:        the  supplied  subject  string may not be the true end of the available
1.1.1.2   misho    8591:        data, and so, if \z, \Z, \b, \B, or $ are encountered at the end of the
1.1.1.4 ! misho    8592:        subject,  the  result is PCRE_ERROR_PARTIAL, provided that at least one
1.1.1.2   misho    8593:        character in the subject has been inspected.
                   8594: 
                   8595:        Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16 subject
1.1.1.4 ! misho    8596:        strings  are checked for validity. Normally, an invalid sequence causes
        !          8597:        the error PCRE_ERROR_BADUTF8 or PCRE_ERROR_BADUTF16.  However,  in  the
        !          8598:        special  case  of  a  truncated  character  at  the end of the subject,
        !          8599:        PCRE_ERROR_SHORTUTF8  or   PCRE_ERROR_SHORTUTF16   is   returned   when
1.1.1.2   misho    8600:        PCRE_PARTIAL_HARD is set.
1.1       misho    8601: 
                   8602:    Comparing hard and soft partial matching
                   8603: 
1.1.1.4 ! misho    8604:        The  difference  between the two partial matching options can be illus-
1.1       misho    8605:        trated by a pattern such as:
                   8606: 
                   8607:          /dog(sbody)?/
                   8608: 
1.1.1.4 ! misho    8609:        This matches either "dog" or "dogsbody", greedily (that is, it  prefers
        !          8610:        the  longer  string  if  possible). If it is matched against the string
        !          8611:        "dog" with PCRE_PARTIAL_SOFT, it yields a  complete  match  for  "dog".
1.1       misho    8612:        However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
1.1.1.4 ! misho    8613:        On the other hand, if the pattern is made ungreedy the result  is  dif-
1.1       misho    8614:        ferent:
                   8615: 
                   8616:          /dog(sbody)??/
                   8617: 
1.1.1.4 ! misho    8618:        In  this  case  the  result  is always a complete match because that is
        !          8619:        found first, and matching never  continues  after  finding  a  complete
1.1.1.2   misho    8620:        match. It might be easier to follow this explanation by thinking of the
                   8621:        two patterns like this:
1.1       misho    8622: 
                   8623:          /dog(sbody)?/    is the same as  /dogsbody|dog/
                   8624:          /dog(sbody)??/   is the same as  /dog|dogsbody/
                   8625: 
1.1.1.4 ! misho    8626:        The second pattern will never match "dogsbody", because it will  always
1.1.1.2   misho    8627:        find the shorter match first.
1.1       misho    8628: 
                   8629: 
1.1.1.4 ! misho    8630: PARTIAL MATCHING USING pcre_dfa_exec() OR pcre[16|32]_dfa_exec()
1.1       misho    8631: 
1.1.1.2   misho    8632:        The DFA functions move along the subject string character by character,
1.1.1.4 ! misho    8633:        without backtracking, searching for  all  possible  matches  simultane-
        !          8634:        ously.  If the end of the subject is reached before the end of the pat-
        !          8635:        tern, there is the possibility of a partial match, again provided  that
1.1.1.2   misho    8636:        at least one character has been inspected.
1.1       misho    8637: 
1.1.1.4 ! misho    8638:        When  PCRE_PARTIAL_SOFT  is set, PCRE_ERROR_PARTIAL is returned only if
        !          8639:        there have been no complete matches. Otherwise,  the  complete  matches
        !          8640:        are  returned.   However,  if PCRE_PARTIAL_HARD is set, a partial match
        !          8641:        takes precedence over any complete matches. The portion of  the  string
        !          8642:        that  was  inspected when the longest partial match was found is set as
1.1       misho    8643:        the first matching string, provided there are at least two slots in the
                   8644:        offsets vector.
                   8645: 
1.1.1.4 ! misho    8646:        Because  the  DFA functions always search for all possible matches, and
        !          8647:        there is no difference between greedy and  ungreedy  repetition,  their
        !          8648:        behaviour  is  different  from  the  standard  functions when PCRE_PAR-
        !          8649:        TIAL_HARD is  set.  Consider  the  string  "dog"  matched  against  the
1.1.1.2   misho    8650:        ungreedy pattern shown above:
1.1       misho    8651: 
                   8652:          /dog(sbody)??/
                   8653: 
1.1.1.4 ! misho    8654:        Whereas  the  standard functions stop as soon as they find the complete
        !          8655:        match for "dog", the DFA functions also  find  the  partial  match  for
1.1.1.2   misho    8656:        "dogsbody", and so return that when PCRE_PARTIAL_HARD is set.
1.1       misho    8657: 
                   8658: 
                   8659: PARTIAL MATCHING AND WORD BOUNDARIES
                   8660: 
1.1.1.4 ! misho    8661:        If a pattern ends with one of the sequences \b or \B, which test for word
        !          8662:        boundaries, partial matching with PCRE_PARTIAL_SOFT can  give  counter-
1.1       misho    8663:        intuitive results. Consider this pattern:
                   8664: 
                   8665:          /\bcat\b/
                   8666: 
                   8667:        This matches "cat", provided there is a word boundary at either end. If
                   8668:        the subject string is "the cat", the comparison of the final "t" with a
1.1.1.4 ! misho    8669:        following  character  cannot  take  place, so a partial match is found.
        !          8670:        However, normal matching carries on, and \b matches at the end  of  the
        !          8671:        subject  when  the  last  character is a letter, so a complete match is
        !          8672:        found.  The  result,  therefore,  is  not   PCRE_ERROR_PARTIAL.   Using
        !          8673:        PCRE_PARTIAL_HARD  in  this case does yield PCRE_ERROR_PARTIAL, because
1.1.1.2   misho    8674:        then the partial match takes precedence.
1.1       misho    8675: 
                   8676: 
                   8677: FORMERLY RESTRICTED PATTERNS
                   8678: 
                   8679:        For releases of PCRE prior to 8.00, because of the way certain internal
1.1.1.4 ! misho    8680:        optimizations   were  implemented  in  the  pcre_exec()  function,  the
        !          8681:        PCRE_PARTIAL option (predecessor of  PCRE_PARTIAL_SOFT)  could  not  be
        !          8682:        used  with all patterns. From release 8.00 onwards, the restrictions no
        !          8683:        longer apply, and partial matching can  be  requested  for  any  pat-
1.1.1.2   misho    8684:        tern.
1.1       misho    8685: 
                   8686:        Items that were formerly restricted were repeated single characters and
1.1.1.4 ! misho    8687:        repeated metasequences. If PCRE_PARTIAL was set for a pattern that  did
        !          8688:        not  conform  to  the restrictions, pcre_exec() returned the error code
        !          8689:        PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in  use.  The
        !          8690:        PCRE_INFO_OKPARTIAL  call  to pcre_fullinfo() to find out if a compiled
1.1       misho    8691:        pattern can be used for partial matching now always returns 1.
                   8692: 
                   8693: 
                   8694: EXAMPLE OF PARTIAL MATCHING USING PCRETEST
                   8695: 
1.1.1.4 ! misho    8696:        If the escape sequence \P is present  in  a  pcretest  data  line,  the
        !          8697:        PCRE_PARTIAL_SOFT  option  is  used  for  the  match.  Here is a run of
1.1       misho    8698:        pcretest that uses the date example quoted above:
                   8699: 
                   8700:            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
                   8701:          data> 25jun04\P
                   8702:           0: 25jun04
                   8703:           1: jun
                   8704:          data> 25dec3\P
                   8705:          Partial match: 25dec3
                   8706:          data> 3ju\P
                   8707:          Partial match: 3ju
                   8708:          data> 3juj\P
                   8709:          No match
                   8710:          data> j\P
                   8711:          No match
                   8712: 
1.1.1.4 ! misho    8713:        The first data string is matched  completely,  so  pcretest  shows  the
        !          8714:        matched  substrings.  The  remaining four strings do not match the com-
1.1       misho    8715:        plete pattern, but the first two are partial matches. Similar output is
1.1.1.2   misho    8716:        obtained if DFA matching is used.
1.1       misho    8717: 
1.1.1.4 ! misho    8718:        If  the escape sequence \P is present more than once in a pcretest data
1.1       misho    8719:        line, the PCRE_PARTIAL_HARD option is set for the match.
                   8720: 
                   8721: 
1.1.1.4 ! misho    8722: MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre[16|32]_dfa_exec()
1.1       misho    8723: 
1.1.1.4 ! misho    8724:        When a partial match has been found using a DFA matching  function,  it
        !          8725:        is  possible to continue the match by providing additional subject data
        !          8726:        and calling the function again with the same compiled  regular  expres-
        !          8727:        sion,  this time setting the PCRE_DFA_RESTART option. You must pass the
1.1       misho    8728:        same working space as before, because this is where details of the pre-
1.1.1.4 ! misho    8729:        vious  partial  match  are  stored.  Here is an example using pcretest,
        !          8730:        using the \R escape sequence to set  the  PCRE_DFA_RESTART  option  (\D
1.1.1.2   misho    8731:        specifies the use of the DFA matching function):
1.1       misho    8732: 
                   8733:            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
                   8734:          data> 23ja\P\D
                   8735:          Partial match: 23ja
                   8736:          data> n05\R\D
                   8737:           0: n05
                   8738: 
1.1.1.4 ! misho    8739:        The  first  call has "23ja" as the subject, and requests partial match-
        !          8740:        ing; the second call  has  "n05"  as  the  subject  for  the  continued
        !          8741:        (restarted)  match.   Notice  that when the match is complete, only the
        !          8742:        last part is shown; PCRE does  not  retain  the  previously  partially-
        !          8743:        matched  string. It is up to the calling program to do that if it needs
1.1       misho    8744:        to.
                   8745: 
1.1.1.4 ! misho    8746:        You can set the PCRE_PARTIAL_SOFT  or  PCRE_PARTIAL_HARD  options  with
        !          8747:        PCRE_DFA_RESTART  to  continue partial matching over multiple segments.
        !          8748:        This facility can be used to pass very long subject strings to the  DFA
1.1.1.2   misho    8749:        matching functions.
                   8750: 
                   8751: 
1.1.1.4 ! misho    8752: MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre[16|32]_exec()
1.1.1.2   misho    8753: 
1.1.1.4 ! misho    8754:        From  release 8.00, the standard matching functions can also be used to
1.1.1.2   misho    8755:        do multi-segment matching. Unlike the DFA functions, it is not possible
1.1.1.4 ! misho    8756:        to  restart the previous match with a new segment of data. Instead, new
1.1.1.2   misho    8757:        data must be added to the previous subject string, and the entire match
1.1.1.4 ! misho    8758:        re-run,  starting from the point where the partial match occurred. Ear-
1.1.1.2   misho    8759:        lier data can be discarded.
                   8760: 
1.1.1.4 ! misho    8761:        It is best to use PCRE_PARTIAL_HARD in this situation, because it  does
        !          8762:        not  treat the end of a segment as the end of the subject when matching
        !          8763:        \z, \Z, \b, \B, and $. Consider  an  unanchored  pattern  that  matches
1.1.1.2   misho    8764:        dates:
1.1       misho    8765: 
                   8766:            re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
                   8767:          data> The date is 23ja\P\P
                   8768:          Partial match: 23ja
                   8769: 
1.1.1.4 ! misho    8770:        At  this stage, an application could discard the text preceding "23ja",
        !          8771:        add on text from the next  segment,  and  call  the  matching  function
        !          8772:        again.  Unlike  the  DFA matching functions, the entire matching string
        !          8773:        must always be available, and the complete matching process occurs  for
1.1.1.2   misho    8774:        each call, so more memory and more processing time are needed.
                   8775: 
1.1.1.4 ! misho    8776:        Note:  If  the pattern contains lookbehind assertions, or \K, or starts
1.1.1.2   misho    8777:        with \b or \B, the string that is returned for a partial match includes
1.1.1.4 ! misho    8778:        characters  that precede the start of what would be returned for a com-
        !          8779:        plete match, because it contains all the characters that were inspected
        !          8780:        during the partial match.
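
       The buffer handling between calls can be sketched in plain C. This is
       only a minimal sketch under stated assumptions, not part of PCRE: the
       name carry_over() and its parameters are hypothetical, and keep_from
       stands for the offset at which the text to be retained begins (for
       example, the first offset returned with PCRE_ERROR_PARTIAL). The
       function discards the earlier text, appends the next segment, and
       returns the new subject length for the next call to the matching
       function.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE function.
 * buf holds len bytes of subject text in a buffer of capacity cap.
 * Bytes before keep_from are discarded, the retained tail is moved to
 * the front of buf, and the next segment is appended after it.
 * Returns the new subject length, or (size_t)-1 if the arguments are
 * inconsistent or the result would not fit in the buffer. */
size_t carry_over(char *buf, size_t len, size_t cap, size_t keep_from,
                  const char *seg, size_t seglen)
{
    size_t kept;

    if (len > cap || keep_from > len)
        return (size_t)-1;
    kept = len - keep_from;
    if (seglen > cap - kept)
        return (size_t)-1;

    memmove(buf, buf + keep_from, kept);   /* regions may overlap */
    memcpy(buf + kept, seg, seglen);
    return kept + seglen;
}
```

       With the example above, the subject "The date is 23ja" (16 bytes) and
       keep_from set to 12 leave "23ja" at the front of the buffer; appending
       "n05 and" then gives the 11-byte subject "23jan05 and".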
1.1       misho    8781: 
                   8782: 
                   8783: ISSUES WITH MULTI-SEGMENT MATCHING
                   8784: 
                   8785:        Certain types of pattern may give problems with multi-segment matching,
                   8786:        whichever matching function is used.
                   8787: 
                   8788:        1. If the pattern contains a test for the beginning of a line, you need
1.1.1.3   misho    8789:        to  pass  the  PCRE_NOTBOL  option when the subject string for any call
                   8790:        does not start at the beginning of a line. There is also a  PCRE_NOTEOL
1.1       misho    8791:        option, but in practice when doing multi-segment matching you should be
                   8792:        using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.
                   8793: 
1.1.1.3   misho    8794:        2. Lookbehind assertions that have already been obeyed are catered  for
                   8795:        in the offsets that are returned for a partial match. However a lookbe-
                   8796:        hind assertion later in the pattern could require even earlier  charac-
                   8797:        ters   to  be  inspected.  You  can  handle  this  case  by  using  the
                   8798:        PCRE_INFO_MAXLOOKBEHIND    option    of    the    pcre_fullinfo()    or
1.1.1.4 ! misho    8799:        pcre[16|32]_fullinfo()  functions  to  obtain the length of the longest
        !          8800:        lookbehind in the pattern. This length  is  given  in  characters,  not
        !          8801:        bytes.  If  you  always retain at least that many characters before the
        !          8802:        partially matched string, all should be  well.  (Of  course,  near  the
        !          8803:        start of the subject, fewer characters may be present; in that case all
        !          8804:        characters should be retained.)
        !          8805: 
        !          8806:        From release 8.33, there is a more accurate way of deciding which char-
        !          8807:        acters  to  retain.  Instead  of  subtracting the length of the longest
        !          8808:        lookbehind from the  earliest  inspected  character  (offsets[0]),  the
        !          8809:        match  start  position  (offsets[2]) should be used, and the next match
        !          8810:        attempt started at the offsets[2] character by setting the  startoffset
        !          8811:        argument of pcre_exec() or pcre_dfa_exec().
        !          8812: 
        !          8813:        For  example, if the pattern "(?<=123)abc" is partially matched against
        !          8814:        the string "xx123a", the three offset values returned are 2, 6, and  5.
        !          8815:        This  indicates  that  the  matching  process that gave a partial match
        !          8816:        started at offset 5, but the characters "123a" were all inspected.  The
        !          8817:        maximum  lookbehind  for  that pattern is 3, so taking that away from 5
        !          8818:        shows that we need only keep "123a", and the next match attempt can  be
        !          8819:        started at offset 3 (that is, at "a") when further characters have been
        !          8820:        added. When the match start is not the  earliest  inspected  character,
        !          8821:        pcretest shows it explicitly:
        !          8822: 
        !          8823:            re> "(?<=123)abc"
        !          8824:          data> xx123a\P\P
        !          8825:          Partial match at offset 5: 123a
1.1.1.3   misho    8826: 
1.1.1.4 ! misho    8827:        3.  Because a partial match must always contain at least one character,
        !          8828:        what might be considered a partial match of an  empty  string  actually
1.1.1.3   misho    8829:        gives a "no match" result. For example:
                   8830: 
                   8831:            re> /c(?<=abc)x/
                   8832:          data> ab\P
                   8833:          No match
                   8834: 
                   8835:        If the next segment begins "cx", a match should be found, but this will
1.1.1.4 ! misho    8836:        only happen if characters from the previous segment are  retained.  For
        !          8837:        this  reason,  a  "no  match"  result should be interpreted as "partial
1.1.1.3   misho    8838:        match of an empty string" when the pattern contains lookbehinds.
1.1       misho    8839: 
1.1.1.4 ! misho    8840:        4. Matching a subject string that is split into multiple  segments  may
        !          8841:        not  always produce exactly the same result as matching over one single
        !          8842:        long string, especially when PCRE_PARTIAL_SOFT  is  used.  The  section
        !          8843:        "Partial  Matching  and  Word Boundaries" above describes an issue that
        !          8844:        arises if the pattern ends with \b or \B. Another  kind  of  difference
        !          8845:        may  occur when there are multiple matching possibilities, because (for
        !          8846:        PCRE_PARTIAL_SOFT) a partial match result is given only when there  are
1.1       misho    8847:        no completed matches. This means that as soon as the shortest match has
1.1.1.4 ! misho    8848:        been found, continuation to a new subject segment is no  longer  possi-
1.1       misho    8849:        ble. Consider again this pcretest example:
                   8850: 
                   8851:            re> /dog(sbody)?/
                   8852:          data> dogsb\P
                   8853:           0: dog
                   8854:          data> do\P\D
                   8855:          Partial match: do
                   8856:          data> gsb\R\P\D
                   8857:           0: g
                   8858:          data> dogsbody\D
                   8859:           0: dogsbody
                   8860:           1: dog
                   8861: 
1.1.1.4 ! misho    8862:        The  first  data  line passes the string "dogsb" to a standard matching
        !          8863:        function, setting the PCRE_PARTIAL_SOFT option. Although the string  is
        !          8864:        a  partial  match for "dogsbody", the result is not PCRE_ERROR_PARTIAL,
        !          8865:        because the shorter string "dog" is a complete match.  Similarly,  when
        !          8866:        the  subject  is  presented to a DFA matching function in several parts
        !          8867:        ("do" and "gsb" being the first two) the match  stops  when  "dog"  has
        !          8868:        been  found, and it is not possible to continue.  On the other hand, if
        !          8869:        "dogsbody" is presented as a single string,  a  DFA  matching  function
1.1.1.2   misho    8870:        finds both matches.
1.1       misho    8871: 
1.1.1.4 ! misho    8872:        Because  of  these  problems,  it is best to use PCRE_PARTIAL_HARD when
        !          8873:        matching multi-segment data. The example  above  then  behaves  differ-
1.1       misho    8874:        ently:
                   8875: 
                   8876:            re> /dog(sbody)?/
                   8877:          data> dogsb\P\P
                   8878:          Partial match: dogsb
                   8879:          data> do\P\D
                   8880:          Partial match: do
                   8881:          data> gsb\R\P\P\D
                   8882:          Partial match: gsb
                   8883: 
1.1.1.3   misho    8884:        5. Patterns that contain alternatives at the top level which do not all
1.1.1.4 ! misho    8885:        start with the  same  pattern  item  may  not  work  as  expected  when
1.1.1.2   misho    8886:        PCRE_DFA_RESTART is used. For example, consider this pattern:
1.1       misho    8887: 
                   8888:          1234|3789
                   8889: 
1.1.1.4 ! misho    8890:        If  the  first  part of the subject is "ABC123", a partial match of the
        !          8891:        first alternative is found at offset 3. There is no partial  match  for
1.1       misho    8892:        the second alternative, because such a match does not start at the same
1.1.1.4 ! misho    8893:        point in the subject string. Attempting to  continue  with  the  string
        !          8894:        "7890"  does  not  yield  a  match because only those alternatives that
        !          8895:        match at one point in the subject are remembered.  The  problem  arises
        !          8896:        because  the  start  of the second alternative matches within the first
        !          8897:        alternative. There is no problem with  anchored  patterns  or  patterns
1.1       misho    8898:        such as:
                   8899: 
                   8900:          1234|ABCD
                   8901: 
1.1.1.4 ! misho    8902:        where  no  string can be a partial match for both alternatives. This is
        !          8903:        not a problem if a standard matching  function  is  used,  because  the
1.1.1.2   misho    8904:        entire match has to be rerun each time:
1.1       misho    8905: 
                   8906:            re> /1234|3789/
                   8907:          data> ABC123\P\P
                   8908:          Partial match: 123
                   8909:          data> 1237890
                   8910:           0: 3789
                   8911: 
                   8912:        Of course, instead of using PCRE_DFA_RESTART, the same technique of re-
1.1.1.4 ! misho    8913:        running the entire match can also be used with the DFA  matching  func-
        !          8914:        tions.  Another  possibility  is to work with two buffers. If a partial
        !          8915:        match at offset n in the first buffer is followed by  "no  match"  when
        !          8916:        PCRE_DFA_RESTART  is  used on the second buffer, you can then try a new
1.1.1.2   misho    8917:        match starting at offset n+1 in the first buffer.
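
       The retention rule described in item 2 above (keep at least as many
       characters before the match start as the longest lookbehind) can be
       written as a small calculation. This is a sketch only: retain_plan and
       plan_retention() are hypothetical names, not PCRE functions. It also
       assumes one byte per character, because PCRE_INFO_MAXLOOKBEHIND is
       given in characters; in UTF mode the caller would have to convert the
       count to bytes.

```c
#include <stddef.h>

/* Hypothetical helper type, not part of PCRE. */
typedef struct {
    size_t keep_from;    /* first subject offset that must be retained */
    size_t startoffset;  /* match start, relative to the retained text */
} retain_plan;

/* match_start is the position at which the partial match attempt began;
 * max_lookbehind is the value reported for PCRE_INFO_MAXLOOKBEHIND. */
retain_plan plan_retention(size_t match_start, size_t max_lookbehind)
{
    retain_plan p;

    /* Near the start of the subject fewer characters are available,
     * so everything from offset 0 is kept. */
    p.keep_from = (match_start > max_lookbehind)
                      ? match_start - max_lookbehind
                      : 0;
    p.startoffset = match_start - p.keep_from;
    return p;
}
```

       For the pattern "(?<=123)abc" and the subject "xx123a", the match
       start is 5 and the maximum lookbehind is 3, so keep_from is 2 (the
       retained text is "123a") and the next attempt starts at offset 3.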
1.1       misho    8918: 
                   8919: 
                   8920: AUTHOR
                   8921: 
                   8922:        Philip Hazel
                   8923:        University Computing Service
                   8924:        Cambridge CB2 3QH, England.
                   8925: 
                   8926: 
                   8927: REVISION
                   8928: 
1.1.1.4 ! misho    8929:        Last updated: 20 February 2013
        !          8930:        Copyright (c) 1997-2013 University of Cambridge.
1.1       misho    8931: ------------------------------------------------------------------------------
                   8932: 
                   8933: 
1.1.1.4 ! misho    8934: PCREPRECOMPILE(3)          Library Functions Manual          PCREPRECOMPILE(3)
        !          8935: 
1.1       misho    8936: 
                   8937: 
                   8938: NAME
                   8939:        PCRE - Perl-compatible regular expressions
                   8940: 
                   8941: SAVING AND RE-USING PRECOMPILED PCRE PATTERNS
                   8942: 
                   8943:        If  you  are running an application that uses a large number of regular
                   8944:        expression patterns, it may be useful to store them  in  a  precompiled
                   8945:        form  instead  of  having to compile them every time the application is
                   8946:        run.  If you are not  using  any  private  character  tables  (see  the
                   8947:        pcre_maketables()  documentation),  this is relatively straightforward.
                   8948:        If you are using private tables, it is a little bit  more  complicated.
1.1.1.2   misho    8949:        However,  if you are using the just-in-time optimization feature, it is
                   8950:        not possible to save and reload the JIT data.
1.1       misho    8951: 
                   8952:        If you save compiled patterns to a file, you can copy them to a differ-
1.1.1.2   misho    8953:        ent host and run them there. If the two hosts have different endianness
1.1.1.4 ! misho    8954:        (byte    order),    you     should     run     the     pcre[16|32]_pat-
        !          8955:        tern_to_host_byte_order()  function  on  the  new host before trying to
        !          8956:        match the pattern. The matching functions return  PCRE_ERROR_BADENDIAN-
        !          8957:        NESS if they detect a pattern with the wrong endianness.
1.1.1.2   misho    8958: 
                   8959:        Compiling  regular  expressions with one version of PCRE for use with a
                   8960:        different version is not guaranteed to work and may cause crashes,  and
                   8961:        saving  and  restoring  a  compiled  pattern loses any JIT optimization
                   8962:        data.
1.1       misho    8963: 
                   8964: 
                   8965: SAVING A COMPILED PATTERN
                   8966: 
1.1.1.4 ! misho    8967:        The value returned by pcre[16|32]_compile() points to a single block of
1.1.1.2   misho    8968:        memory  that  holds  the  compiled pattern and associated data. You can
1.1.1.4 ! misho    8969:        find   the   length   of   this   block    in    bytes    by    calling
        !          8970:        pcre[16|32]_fullinfo() with an argument of PCRE_INFO_SIZE. You can then
        !          8971:        save the data in any appropriate manner. Here is sample  code  for  the
        !          8972:        8-bit  library  that  compiles  a  pattern  and writes it to a file. It
        !          8973:        assumes that the variable fd is a FILE pointer that is open for output:
1.1       misho    8974: 
                   8975:          int erroroffset, rc;
                   8976:          size_t size;      /* PCRE_INFO_SIZE stores a size_t */
                   8977:          const char *error;
                   8978:          pcre *re;
                   8979: 
                   8980:          re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
                   8981:          if (re == NULL) { ... handle errors ... }
                   8982:          rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
                   8983:          if (rc < 0) { ... handle errors ... }
                   8984:          if (fwrite(re, 1, size, fd) != size) { ... handle errors ... }
                   8985: 
1.1.1.2   misho    8986:        In this example, the bytes  that  comprise  the  compiled  pattern  are
                   8987:        copied  exactly.  Note that this is binary data that may contain any of
                   8988:        the 256 possible byte  values.  On  systems  that  make  a  distinction
1.1       misho    8989:        between binary and non-binary data, be sure that the file is opened for
                   8990:        binary output.
                   8991: 
1.1.1.2   misho    8992:        If you want to write more than one pattern to a file, you will have  to
                   8993:        devise  a  way of separating them. For binary data, preceding each pat-
                   8994:        tern with its length is probably  the  most  straightforward  approach.
                   8995:        Another  possibility is to write out the data in hexadecimal instead of
1.1       misho    8996:        binary, one pattern to a line.
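
       The length-prefix approach might look like the following sketch. The
       helpers write_record() and read_record() are hypothetical, not PCRE
       functions, and the record layout (a 4-byte length in host byte order,
       then the data) is an assumption made for illustration. The data would
       be the block returned by the compile function, with the size obtained
       from PCRE_INFO_SIZE, assuming that size fits in 32 bits.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helpers, not PCRE functions. Each record is a 4-byte
 * length in host byte order followed by that many bytes of data. */

/* Writes one record. Returns 0 on success, -1 on a write error. */
int write_record(FILE *f, const void *data, uint32_t size)
{
    if (fwrite(&size, sizeof size, 1, f) != 1)
        return -1;
    if (size > 0 && fwrite(data, 1, size, f) != size)
        return -1;
    return 0;
}

/* Reads one record into buf, which has room for cap bytes. Returns the
 * record length, or -1 at end of file, on a read error, or if the
 * record is larger than cap. */
long read_record(FILE *f, void *buf, uint32_t cap)
{
    uint32_t size;

    if (fread(&size, sizeof size, 1, f) != 1)
        return -1;
    if (size > cap)
        return -1;
    if (size > 0 && fread(buf, 1, size, f) != size)
        return -1;
    return (long)size;
}
```

       Because the length is stored in host byte order, a file written this
       way has the same endianness dependence as the compiled patterns it
       contains. The stream must be opened in binary mode on systems that
       make the distinction.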
                   8997: 
1.1.1.2   misho    8998:        Saving compiled patterns in a file is only one possible way of  storing
                   8999:        them  for later use. They could equally well be saved in a database, or
                   9000:        in the memory of some daemon process that passes them  via  sockets  to
1.1       misho    9001:        the processes that want them.
                   9002: 
                   9003:        If the pattern has been studied, it is also possible to save the normal
                   9004:        study data in a similar way to the compiled pattern itself. However, if
                   9005:        the PCRE_STUDY_JIT_COMPILE option was used, the just-in-time data  that
1.1.1.2   misho    9006:        is created cannot be saved because it is too dependent on  the  current
                   9007:        environment.   When   studying   generates   additional    information,
1.1.1.4 ! misho    9008:        pcre[16|32]_study() returns  a  pointer  to  a  pcre[16|32]_extra  data
        !          9009:        block.  Its  format  is defined in the section on matching a pattern in
        !          9010:        the pcreapi documentation. The study_data field points  to  the  binary
        !          9011:        study  data,  and this is what you must save (not the pcre[16|32]_extra
        !          9012:        block itself). The length of the study data can be obtained by  calling
        !          9013:        pcre[16|32]_fullinfo()  with an argument of PCRE_INFO_STUDYSIZE. Remem-
        !          9014:        ber to check that  pcre[16|32]_study()  did  return  a  non-NULL  value
        !          9015:        before trying to save the study data.
1.1       misho    9016: 
                   9017: 
                   9018: RE-USING A PRECOMPILED PATTERN
                   9019: 
                   9020:        Re-using  a  precompiled pattern is straightforward. Having reloaded it
1.1.1.4 ! misho    9021:        into main memory,  called  pcre[16|32]_pattern_to_host_byte_order()  if
        !          9022:        necessary,    you   pass   its   pointer   to   pcre[16|32]_exec()   or
        !          9023:        pcre[16|32]_dfa_exec() in the usual way.
1.1.1.2   misho    9024: 
                   9025:        However, if you passed a pointer to custom character  tables  when  the
1.1.1.4 ! misho    9026:        pattern  was compiled (the tableptr argument of pcre[16|32]_compile()),
        !          9027:        you  must  now  pass  a  similar  pointer  to   pcre[16|32]_exec()   or
        !          9028:        pcre[16|32]_dfa_exec(),  because the value saved with the compiled pat-
        !          9029:        tern will obviously be nonsense. A field in a  pcre[16|32]_extra  block
        !          9030:        is  used  to  pass this data, as described in the section on matching a
        !          9031:        pattern in the pcreapi documentation.
1.1.1.2   misho    9032: 
                   9033:        If you did not provide custom character tables  when  the  pattern  was
                   9034:        compiled, the pointer in the compiled pattern is NULL, which causes the
                   9035:        matching functions to use PCRE's internal tables. Thus, you do not need
                   9036:        to take any special action at run time in this case.
                   9037: 
                   9038:        If  you  saved study data with the compiled pattern, you need to create
1.1.1.4 ! misho    9039:        your own pcre[16|32]_extra data block and set the study_data  field  to
1.1.1.2   misho    9040:        point   to   the   reloaded   study   data.   You  must  also  set  the
                   9041:        PCRE_EXTRA_STUDY_DATA bit in the flags field  to  indicate  that  study
1.1.1.4 ! misho    9042:        data  is present. Then pass the pcre[16|32]_extra block to the matching
1.1.1.2   misho    9043:        function in the usual way. If the pattern was studied for  just-in-time
                   9044:        optimization,  that  data  cannot  be  saved,  and  so  is  lost  by  a
                   9045:        save/restore cycle.
1.1       misho    9046: 
                   9047: 
                   9048: COMPATIBILITY WITH DIFFERENT PCRE RELEASES
                   9049: 
                   9050:        In general, it is safest to  recompile  all  saved  patterns  when  you
                   9051:        update  to  a new PCRE release, though not all updates actually require
                   9052:        this.
                   9053: 
                   9054: 
                   9055: AUTHOR
                   9056: 
                   9057:        Philip Hazel
                   9058:        University Computing Service
                   9059:        Cambridge CB2 3QH, England.
                   9060: 
                   9061: 
                   9062: REVISION
                   9063: 
1.1.1.4 ! misho    9064:        Last updated: 24 June 2012
1.1.1.2   misho    9065:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    9066: ------------------------------------------------------------------------------
                   9067: 
                   9068: 
1.1.1.4 ! misho    9069: PCREPERFORM(3)             Library Functions Manual             PCREPERFORM(3)
        !          9070: 
1.1       misho    9071: 
                   9072: 
                   9073: NAME
                   9074:        PCRE - Perl-compatible regular expressions
                   9075: 
                   9076: PCRE PERFORMANCE
                   9077: 
                   9078:        Two  aspects  of performance are discussed below: memory usage and pro-
                   9079:        cessing time. The way you express your pattern as a regular  expression
                   9080:        can affect both of them.
                   9081: 
                   9082: 
                   9083: COMPILED PATTERN MEMORY USAGE
                   9084: 
1.1.1.2   misho    9085:        Patterns  are compiled by PCRE into a reasonably efficient interpretive
                   9086:        code, so that most simple patterns do not  use  much  memory.  However,
                   9087:        there  is  one case where the memory usage of a compiled pattern can be
                   9088:        unexpectedly large. If a parenthesized subpattern has a quantifier with
                   9089:        a minimum greater than 1 and/or a limited maximum, the whole subpattern
                   9090:        is repeated in the compiled code. For example, the pattern
1.1       misho    9091: 
                   9092:          (abc|def){2,4}
                   9093: 
                   9094:        is compiled as if it were
                   9095: 
                   9096:          (abc|def)(abc|def)((abc|def)(abc|def)?)?
                   9097: 
                   9098:        (Technical aside: It is done this way so that backtrack  points  within
                   9099:        each of the repetitions can be independently maintained.)
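
       To see how quickly the repeated form grows, the textual expansion
       shown above can be generated mechanically. The sketch below is purely
       illustrative: expand_repeat() is a hypothetical function that builds
       the equivalent pattern text, and it says nothing about the actual
       layout of PCRE's compiled code.

```c
#include <stddef.h>
#include <string.h>

/* Appends s to out at *pos, keeping out NUL-terminated.
 * Returns 0 on success, -1 if there is not enough room. */
static int append(char *out, size_t outsize, size_t *pos, const char *s)
{
    size_t n = strlen(s);

    if (n >= outsize - *pos)
        return -1;
    memcpy(out + *pos, s, n + 1);
    *pos += n;
    return 0;
}

/* Hypothetical illustration, not a PCRE function: writes the expanded
 * text of group{min,max} into out. group must include its own
 * parentheses. Returns 0 on success, -1 on bad bounds or overflow. */
int expand_repeat(const char *group, unsigned min, unsigned max,
                  char *out, size_t outsize)
{
    size_t pos = 0;
    unsigned i, optional;

    if (outsize == 0 || max < min || max == 0)
        return -1;
    out[0] = '\0';
    optional = max - min;

    for (i = 0; i < min; i++)            /* mandatory copies */
        if (append(out, outsize, &pos, group) != 0)
            return -1;
    for (i = 1; i < optional; i++)       /* open nested optional groups */
        if (append(out, outsize, &pos, "(") != 0 ||
            append(out, outsize, &pos, group) != 0)
            return -1;
    if (optional > 0)                    /* innermost optional copy */
        if (append(out, outsize, &pos, group) != 0 ||
            append(out, outsize, &pos, "?") != 0)
            return -1;
    for (i = 1; i < optional; i++)       /* close nested optional groups */
        if (append(out, outsize, &pos, ")?") != 0)
            return -1;
    return 0;
}
```

       Calling expand_repeat("(abc|def)", 2, 4, ...) produces the expansion
       shown above. The output length grows in proportion to the maximum
       count, which is why large or nested counts are expensive.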
                   9100: 
                   9101:        For  regular expressions whose quantifiers use only small numbers, this
                   9102:        is not usually a problem. However, if the numbers are large,  and  par-
                   9103:        ticularly  if  such repetitions are nested, the memory usage can become
                   9104:        an embarrassment. For example, the very simple pattern
                   9105: 
                   9106:          ((ab){1,1000}c){1,3}
                   9107: 
1.1.1.2   misho    9108:        uses 51K bytes when compiled using the 8-bit library. When PCRE is com-
                   9109:        piled  with  its  default  internal pointer size of two bytes, the size
                   9110:        limit on a compiled pattern is 64K data units, and this is reached with
                   9111:        the  above  pattern  if  the outer repetition is increased from 3 to 4.
                   9112:        PCRE can be compiled to use larger internal pointers  and  thus  handle
                   9113:        larger  compiled patterns, but it is better to try to rewrite your pat-
                   9114:        tern to use less memory if you can.
1.1       misho    9115: 
1.1.1.2   misho    9116:        One way of reducing the memory usage for such patterns is to  make  use
1.1       misho    9117:        of PCRE's "subroutine" facility. Re-writing the above pattern as
                   9118: 
                   9119:          ((ab)(?2){0,999}c)(?1){0,2}
                   9120: 
                   9121:        reduces the memory requirements to 18K, and indeed it remains under 20K
1.1.1.2   misho    9122:        even with the outer repetition increased to 100. However, this  pattern
                   9123:        is  not  exactly equivalent, because the "subroutine" calls are treated
                   9124:        as atomic groups into which there can be no backtracking if there is  a
                   9125:        subsequent  matching  failure.  Therefore,  PCRE cannot do this kind of
                   9126:        rewriting automatically.  Furthermore, there is a  noticeable  loss  of
                   9127:        speed  when executing the modified pattern. Nevertheless, if the atomic
                   9128:        grouping is not a problem and the loss of  speed  is  acceptable,  this
                   9129:        kind  of  rewriting will allow you to process patterns that PCRE cannot
1.1       misho    9130:        otherwise handle.
                   9131: 
                   9132: 
                   9133: STACK USAGE AT RUN TIME
                   9134: 
1.1.1.4 ! misho    9135:        When pcre_exec() or pcre[16|32]_exec() is used  for  matching,  certain
        !          9136:        kinds  of  pattern  can  cause  it  to use large amounts of the process
        !          9137:        stack. In some environments the default process stack is  quite  small,
        !          9138:        and  if it runs out the result is often SIGSEGV. This issue is probably
        !          9139:        the most frequently raised problem with PCRE.  Rewriting  your  pattern
        !          9140:        can  often  help.  The  pcrestack documentation discusses this issue in
        !          9141:        detail.
1.1       misho    9142: 
                   9143: 
                   9144: PROCESSING TIME
                   9145: 
1.1.1.4 ! misho    9146:        Certain items in regular expression patterns are processed  more  effi-
1.1       misho    9147:        ciently than others. It is more efficient to use a character class like
1.1.1.4 ! misho    9148:        [aeiou]  than  a  set  of   single-character   alternatives   such   as
        !          9149:        (a|e|i|o|u).  In  general,  the simplest construction that provides the
1.1       misho    9150:        required behaviour is usually the most efficient. Jeffrey Friedl's book
1.1.1.4 ! misho    9151:        contains  a  lot  of useful general discussion about optimizing regular
        !          9152:        expressions for efficient performance. This  document  contains  a  few
1.1       misho    9153:        observations about PCRE.
                   9154: 
1.1.1.4 ! misho    9155:        Using  Unicode  character  properties  (the  \p, \P, and \X escapes) is
        !          9156:        slow, because PCRE has to use a multi-stage table  lookup  whenever  it
        !          9157:        needs  a  character's  property. If you can find an alternative pattern
        !          9158:        that does not use character properties, it will probably be faster.
1.1       misho    9159: 
1.1.1.2   misho    9160:        By default, the escape sequences \b, \d, \s,  and  \w,  and  the  POSIX
                   9161:        character  classes  such  as  [:alpha:]  do not use Unicode properties,
1.1       misho    9162:        partly for backwards compatibility, and partly for performance reasons.
1.1.1.2   misho    9163:        However,  you can set PCRE_UCP if you want Unicode character properties
                   9164:        to be used. This can double the matching time for  items  such  as  \d,
                   9165:        when matched with a traditional matching function; the performance loss
                   9166:        is less with a DFA matching function, and in both cases  there  is  not
                   9167:        much difference for \b.
1.1       misho    9168: 
                   9169:        When  a  pattern  begins  with .* not in parentheses, or in parentheses
                   9170:        that are not the subject of a backreference, and the PCRE_DOTALL option
                   9171:        is  set, the pattern is implicitly anchored by PCRE, since it can match
                   9172:        only at the start of a subject string. However, if PCRE_DOTALL  is  not
                   9173:        set,  PCRE  cannot  make this optimization, because the . metacharacter
                   9174:        does not then match a newline, and if the subject string contains  new-
                   9175:        lines,  the  pattern may match from the character immediately following
                   9176:        one of them instead of from the very start. For example, the pattern
                   9177: 
                   9178:          .*second
                   9179: 
                   9180:        matches the subject "first\nand second" (where \n stands for a  newline
                   9181:        character),  with the match starting at the seventh character. In order
                   9182:        to do this, PCRE has to retry the match starting after every newline in
                   9183:        the subject.
                   9184: 
                   9185:        If  you  are using such a pattern with subject strings that do not con-
                   9186:        tain newlines, the best performance is obtained by setting PCRE_DOTALL,
                   9187:        or  starting  the pattern with ^.* or ^.*? to indicate explicit anchor-
                   9188:        ing. That saves PCRE from having to scan along the subject looking  for
                   9189:        a newline to restart at.
                   9190: 
                   9191:        Beware  of  patterns  that contain nested indefinite repeats. These can
                   9192:        take a long time to run when applied to a string that does  not  match.
                   9193:        Consider the pattern fragment
                   9194: 
                   9195:          ^(a+)*
                   9196: 
                   9197:        This  can  match "aaaa" in 16 different ways, and this number increases
                   9198:        very rapidly as the string gets longer. (The * repeat can match  0,  1,
                   9199:        2,  3, or 4 times, and for each of those cases other than 0 or 4, the +
                   9200:        repeats can match different numbers of times.) When  the  remainder  of
                   9201:        the pattern is such that the entire match is going to fail, PCRE has in
                   9202:        principle to try  every  possible  variation,  and  this  can  take  an
                   9203:        extremely long time, even for relatively short strings.
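
The count of 16 quoted above can be reproduced with a short sketch. It does
not call PCRE; it only enumerates the ways the two repeats can divide a run
of "a" characters between them. The function names are hypothetical and
chosen for illustration.

```c
#include <stdio.h>

/* Number of distinct ways ^(a+)* can match at the start of a run of n "a"
   characters: every stopping point and every split of the consumed
   characters into a+ iterations is counted once. */
unsigned long count_ways(unsigned n)
{
    unsigned long total = 1;              /* the * repeat stops here */
    for (unsigned len = 1; len <= n; len++)
        total += count_ways(n - len);     /* next a+ consumes len characters */
    return total;
}

/* Print the count for run lengths 1..max_n to show how fast it grows. */
void print_growth(unsigned max_n)
{
    for (unsigned n = 1; n <= max_n; n++)
        printf("%2u a's: %lu ways\n", n, count_ways(n));
}
```

The count doubles with each extra character, which is why a failing match
against even a short subject can require a very large number of attempts.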
                   9204: 
                   9205:        An optimization catches some of the more simple cases such as
                   9206: 
                   9207:          (a+)*b
                   9208: 
                   9209:        where  a  literal  character  follows. Before embarking on the standard
                   9210:        matching procedure, PCRE checks that there is a "b" later in  the  sub-
                   9211:        ject  string, and if there is not, it fails the match immediately. How-
                   9212:        ever, when there is no following literal this  optimization  cannot  be
                   9213:        used. You can see the difference by comparing the behaviour of
                   9214: 
                   9215:          (a+)*\d
                   9216: 
                   9217:        with the earlier pattern (a+)*b. That earlier pattern fails almost
                   9218:        instantly when applied to a whole line of "a" characters, whereas
                   9219:        (a+)*\d takes an appreciable time with strings longer than about 20 characters.
                   9220: 
                   9221:        In many cases, the solution to this kind of performance issue is to use
                   9222:        an atomic group or a possessive quantifier.
                   9223: 
                   9224: 
                   9225: AUTHOR
                   9226: 
                   9227:        Philip Hazel
                   9228:        University Computing Service
                   9229:        Cambridge CB2 3QH, England.
                   9230: 
                   9231: 
                   9232: REVISION
                   9233: 
1.1.1.4 ! misho    9234:        Last updated: 25 August 2012
1.1.1.2   misho    9235:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    9236: ------------------------------------------------------------------------------
                   9237: 
                   9238: 
1.1.1.4 ! misho    9239: PCREPOSIX(3)               Library Functions Manual               PCREPOSIX(3)
        !          9240: 
1.1       misho    9241: 
                   9242: 
                   9243: NAME
                   9244:        PCRE - Perl-compatible regular expressions.
                   9245: 
                   9246: SYNOPSIS OF POSIX API
                   9247: 
                   9248:        #include <pcreposix.h>
                   9249: 
                   9250:        int regcomp(regex_t *preg, const char *pattern,
                   9251:             int cflags);
                   9252: 
                   9253:        int regexec(regex_t *preg, const char *string,
                   9254:             size_t nmatch, regmatch_t pmatch[], int eflags);
                   9255: 
                   9256:        size_t regerror(int errcode, const regex_t *preg,
                   9257:             char *errbuf, size_t errbuf_size);
                   9258: 
                   9259:        void regfree(regex_t *preg);
                   9260: 
                   9261: 
                   9262: DESCRIPTION
                   9263: 
1.1.1.2   misho    9264:        This  set  of functions provides a POSIX-style API for the PCRE regular
                   9265:        expression 8-bit library. See the pcreapi documentation for a
                   9266:        description of PCRE's native API, which contains much additional
1.1.1.4 ! misho    9267:        functionality. There is no POSIX-style wrapper for PCRE's 16-bit and
        !          9268:        32-bit libraries.
1.1       misho    9269: 
                   9270:        The functions described here are just wrapper functions that ultimately
                   9271:        call  the  PCRE  native  API.  Their  prototypes  are  defined  in  the
1.1.1.4 ! misho    9272:        pcreposix.h  header  file,  and  on  Unix systems the library itself is
        !          9273:        called pcreposix.a, so can be accessed by  adding  -lpcreposix  to  the
        !          9274:        command  for  linking  an application that uses them. Because the POSIX
1.1       misho    9275:        functions call the native ones, it is also necessary to add -lpcre.
                   9276: 
1.1.1.4 ! misho    9277:        I have implemented only those POSIX option bits that can be  reasonably
        !          9278:        mapped  to PCRE native options. In addition, the option REG_EXTENDED is
        !          9279:        defined with the value zero. This has no  effect,  but  since  programs
        !          9280:        that  are  written  to  the POSIX interface often use it, this makes it
        !          9281:        easier to slot in PCRE as a replacement library.  Other  POSIX  options
1.1       misho    9282:        are not even defined.
                   9283: 
1.1.1.4 ! misho    9284:        There  are also some other options that are not defined by POSIX. These
1.1       misho    9285:        have been added at the request of users who want to make use of certain
                   9286:        PCRE-specific features via the POSIX calling interface.
                   9287: 
1.1.1.4 ! misho    9288:        When  PCRE  is  called  via these functions, it is only the API that is
        !          9289:        POSIX-like in style. The syntax and semantics of  the  regular  expres-
        !          9290:        sions  themselves  are  still  those of Perl, subject to the setting of
        !          9291:        various PCRE options, as described below. "POSIX-like in  style"  means
        !          9292:        that  the  API  approximates  to  the POSIX definition; it is not fully
        !          9293:        POSIX-compatible, and in multi-byte encoding  domains  it  is  probably
1.1       misho    9294:        even less compatible.
                   9295: 
1.1.1.4 ! misho    9296:        The  header for these functions is supplied as pcreposix.h to avoid any
        !          9297:        potential clash with other POSIX  libraries.  It  can,  of  course,  be
1.1       misho    9298:        renamed or aliased as regex.h, which is the "correct" name. It provides
1.1.1.4 ! misho    9299:        two structure types, regex_t for  compiled  internal  forms,  and  reg-
        !          9300:        match_t  for  returning  captured substrings. It also defines some con-
        !          9301:        stants whose names start  with  "REG_";  these  are  used  for  setting
1.1       misho    9302:        options and identifying error codes.
                   9303: 
                   9304: 
                   9305: COMPILING A PATTERN
                   9306: 
1.1.1.4 ! misho    9307:        The  function regcomp() is called to compile a pattern into an internal
        !          9308:        form. The pattern is a C string terminated by a  binary  zero,  and  is
        !          9309:        passed  in  the  argument  pattern. The preg argument is a pointer to a
        !          9310:        regex_t structure that is used as a base for storing information  about
1.1       misho    9311:        the compiled regular expression.
                   9312: 
                   9313:        The argument cflags is either zero, or contains one or more of the bits
                   9314:        defined by the following macros:
                   9315: 
                   9316:          REG_DOTALL
                   9317: 
                   9318:        The PCRE_DOTALL option is set when the regular expression is passed for
                   9319:        compilation to the native function. Note that REG_DOTALL is not part of
                   9320:        the POSIX standard.
                   9321: 
                   9322:          REG_ICASE
                   9323: 
1.1.1.4 ! misho    9324:        The PCRE_CASELESS option is set when the regular expression  is  passed
1.1       misho    9325:        for compilation to the native function.
                   9326: 
                   9327:          REG_NEWLINE
                   9328: 
1.1.1.4 ! misho    9329:        The  PCRE_MULTILINE option is set when the regular expression is passed
        !          9330:        for compilation to the native function. Note that this does  not  mimic
        !          9331:        the  defined  POSIX  behaviour  for REG_NEWLINE (see the following sec-
1.1       misho    9332:        tion).
                   9333: 
                   9334:          REG_NOSUB
                   9335: 
1.1.1.4 ! misho    9336:        The PCRE_NO_AUTO_CAPTURE option is set when the regular  expression  is
1.1       misho    9337:        passed for compilation to the native function. In addition, when a pat-
1.1.1.4 ! misho    9338:        tern that is compiled with this flag is passed to regexec() for  match-
        !          9339:        ing,  the  nmatch  and  pmatch  arguments  are ignored, and no captured
1.1       misho    9340:        strings are returned.
                   9341: 
                   9342:          REG_UCP
                   9343: 
1.1.1.4 ! misho    9344:        The PCRE_UCP option is set when the regular expression  is  passed  for
        !          9345:        compilation  to  the  native  function. This causes PCRE to use Unicode
        !          9346:        properties when matching \d, \w, etc., instead of just recognizing
1.1       misho    9347:        ASCII values. Note that REG_UCP is not part of the POSIX standard.
                   9348: 
                   9349:          REG_UNGREEDY
                   9350: 
1.1.1.4 ! misho    9351:        The  PCRE_UNGREEDY  option is set when the regular expression is passed
        !          9352:        for compilation to the native function. Note that REG_UNGREEDY  is  not
1.1       misho    9353:        part of the POSIX standard.
                   9354: 
                   9355:          REG_UTF8
                   9356: 
1.1.1.4 ! misho    9357:        The  PCRE_UTF8  option is set when the regular expression is passed for
        !          9358:        compilation to the native function. This causes the pattern itself  and
        !          9359:        all  data  strings used for matching it to be treated as UTF-8 strings.
1.1       misho    9360:        Note that REG_UTF8 is not part of the POSIX standard.
                   9361: 
1.1.1.4 ! misho    9362:        In the absence of these flags, no options  are  passed  to  the  native
        !          9363:        function. This means that the regex is compiled with PCRE default
        !          9364:        semantics. In particular, the way it handles newline characters in  the
        !          9365:        subject  string  is  the Perl way, not the POSIX way. Note that setting
        !          9366:        PCRE_MULTILINE has only some of the effects specified for  REG_NEWLINE.
        !          9367:        It  does not affect the way newlines are matched by . (they are not) or
1.1       misho    9368:        by a negative class such as [^a] (they are).
                   9369: 
1.1.1.4 ! misho    9370:        The yield of regcomp() is zero on success, and non-zero otherwise.  The
1.1       misho    9371:        preg structure is filled in on success, and one member of the structure
1.1.1.4 ! misho    9372:        is public: re_nsub contains the number of capturing subpatterns in  the
1.1       misho    9373:        regular expression. Various error codes are defined in the header file.
                   9374: 
1.1.1.4 ! misho    9375:        NOTE:  If  the  yield of regcomp() is non-zero, you must not attempt to
1.1       misho    9376:        use the contents of the preg structure. If, for example, you pass it to
                   9377:        regexec(), the result is undefined and your program is likely to crash.
                   9378: 
                   9379: 
                   9380: MATCHING NEWLINE CHARACTERS
                   9381: 
                   9382:        This area is not simple, because POSIX and Perl take different views of
1.1.1.4 ! misho    9383:        things.  It is not possible to get PCRE to obey  POSIX  semantics,  but
        !          9384:        then  PCRE was never intended to be a POSIX engine. The following table
        !          9385:        lists the different possibilities for matching  newline  characters  in
1.1       misho    9386:        PCRE:
                   9387: 
                   9388:                                  Default   Change with
                   9389: 
                   9390:          . matches newline          no     PCRE_DOTALL
                   9391:          newline matches [^a]       yes    not changeable
                   9392:          $ matches \n at end        yes    PCRE_DOLLAR_ENDONLY
                   9393:          $ matches \n in middle     no     PCRE_MULTILINE
                   9394:          ^ matches \n in middle     no     PCRE_MULTILINE
                   9395: 
                   9396:        This is the equivalent table for POSIX:
                   9397: 
                   9398:                                  Default   Change with
                   9399: 
                   9400:          . matches newline          yes    REG_NEWLINE
                   9401:          newline matches [^a]       yes    REG_NEWLINE
                   9402:          $ matches \n at end        no     REG_NEWLINE
                   9403:          $ matches \n in middle     no     REG_NEWLINE
                   9404:          ^ matches \n in middle     no     REG_NEWLINE
                   9405: 
                   9406:        PCRE's behaviour is the same as Perl's, except that there is no equiva-
1.1.1.4 ! misho    9407:        lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl,  there  is
1.1       misho    9408:        no way to stop newline from matching [^a].
                   9409: 
1.1.1.4 ! misho    9410:        The   default  POSIX  newline  handling  can  be  obtained  by  setting
        !          9411:        PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to  make  PCRE
1.1       misho    9412:        behave exactly as for the REG_NEWLINE action.
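
The two tables can be probed with a small helper. This is a sketch under
stated assumptions: it includes the system <regex.h> so that it builds
without PCRE installed, in which case the results follow the POSIX table
(for example, "a.b" matches "a\nb" unless REG_NEWLINE is set). Built with
<pcreposix.h> instead, and linked with -lpcreposix -lpcre, the same calls
should follow the PCRE table. The helper name is hypothetical.

```c
#include <regex.h>   /* assumption: system header; with PCRE use <pcreposix.h> */
#include <stddef.h>

/* Returns 1 if pattern matches somewhere in subject under cflags,
   0 if it does not, and -1 if the pattern fails to compile. */
int newline_probe(const char *pattern, const char *subject, int cflags)
{
    regex_t re;
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;                       /* re must not be used after a failure */
    int rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

Calling newline_probe("a.b", "a\nb", REG_EXTENDED) and then repeating the
call with REG_EXTENDED | REG_NEWLINE exercises the first row of each table;
"^b" against the same subject exercises the last row.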
                   9413: 
                   9414: 
                   9415: MATCHING A PATTERN
                   9416: 
1.1.1.4 ! misho    9417:        The  function  regexec()  is  called  to  match a compiled pattern preg
        !          9418:        against a given string, which is by default terminated by a  zero  byte
        !          9419:        (but  see  REG_STARTEND below), subject to the options in eflags. These
1.1       misho    9420:        can be:
                   9421: 
                   9422:          REG_NOTBOL
                   9423: 
                   9424:        The PCRE_NOTBOL option is set when calling the underlying PCRE matching
                   9425:        function.
                   9426: 
                   9427:          REG_NOTEMPTY
                   9428: 
                   9429:        The PCRE_NOTEMPTY option is set when calling the underlying PCRE match-
                   9430:        ing function. Note that REG_NOTEMPTY is not part of the POSIX standard.
                   9431:        However, setting this option can give more POSIX-like behaviour in some
                   9432:        situations.
                   9433: 
                   9434:          REG_NOTEOL
                   9435: 
                   9436:        The PCRE_NOTEOL option is set when calling the underlying PCRE matching
                   9437:        function.
                   9438: 
                   9439:          REG_STARTEND
                   9440: 
1.1.1.4 ! misho    9441:        The  string  is  considered to start at string + pmatch[0].rm_so and to
        !          9442:        have a terminating NUL located at string + pmatch[0].rm_eo (there  need
        !          9443:        not  actually  be  a  NUL at that location), regardless of the value of
        !          9444:        nmatch. This is a BSD extension, compatible with but not  specified  by
        !          9445:        IEEE  Standard  1003.2  (POSIX.2),  and  should be used with caution in
1.1       misho    9446:        software intended to be portable to other systems. Note that a non-zero
                   9447:        rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
                   9448:        of the string, not how it is matched.
                   9449: 
1.1.1.4 ! misho    9450:        If the pattern was compiled with the REG_NOSUB flag, no data about  any
        !          9451:        matched  strings  is  returned.  The  nmatch  and  pmatch  arguments of
1.1       misho    9452:        regexec() are ignored.
                   9453: 
                   9454:        If the value of nmatch is zero, or if the value of pmatch is NULL, no data
                   9455:        about any matched strings is returned.
                   9456: 
                   9457:        Otherwise, the portion of the string that was matched, and also any
                   9458:        captured substrings, are returned via the pmatch argument, which points to
1.1.1.4 ! misho    9459:        an  array  of nmatch structures of type regmatch_t, containing the mem-
        !          9460:        bers rm_so and rm_eo. These contain the offset to the  first  character
        !          9461:        of  each  substring and the offset to the first character after the end
        !          9462:        of each substring, respectively. The 0th element of the vector  relates
        !          9463:        to  the  entire portion of string that was matched; subsequent elements
        !          9464:        relate to the capturing subpatterns of the regular  expression.  Unused
1.1       misho    9465:        entries in the array have both structure members set to -1.
                   9466: 
1.1.1.4 ! misho    9467:        A  successful  match  yields  a  zero  return;  various error codes are
        !          9468:        defined in the header file, of  which  REG_NOMATCH  is  the  "expected"
1.1       misho    9469:        failure code.
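
A minimal sketch of reading rm_so and rm_eo after a successful match is
shown below. It assumes the system <regex.h> so that it builds without
PCRE; with PCRE, include <pcreposix.h> instead and link with -lpcreposix
-lpcre. The pattern is limited to syntax that both POSIX extended
expressions and Perl accept, and the helper name is hypothetical.

```c
#include <regex.h>   /* assumption: system header; with PCRE use <pcreposix.h> */
#include <stddef.h>

/* Match "name:digits" anywhere in subject and report the offsets of the
   digits (the second capture). Returns 0 on a match, otherwise the
   non-zero code from regcomp() or regexec(). */
int find_number(const char *subject, int *start, int *end)
{
    regex_t re;
    regmatch_t pm[3];   /* [0] whole match, [1] and [2] the two captures */

    int rc = regcomp(&re, "([a-z]+):([0-9]+)", REG_EXTENDED);
    if (rc != 0)
        return rc;      /* re must not be used after a failed compile */

    rc = regexec(&re, subject, 3, pm, 0);
    if (rc == 0) {
        *start = (int)pm[2].rm_so;   /* offset of first captured character */
        *end = (int)pm[2].rm_eo;     /* offset just past the capture */
    }
    regfree(&re);
    return rc;
}
```

For the subject "ruby:1234" the second capture covers offsets 5 to 9, so
the captured text is the four bytes starting at subject + 5.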
                   9470: 
                   9471: 
                   9472: ERROR MESSAGES
                   9473: 
                   9474:        The regerror() function maps a non-zero errorcode from either regcomp()
1.1.1.4 ! misho    9475:        or regexec() to a printable message. If preg is  not  NULL,  the  error
1.1       misho    9476:        should have arisen from the use of that structure. A message terminated
1.1.1.4 ! misho    9477:        by a binary zero is placed  in  errbuf.  The  length  of  the  message,
        !          9478:        including  the  zero, is limited to errbuf_size. The yield of the func-
1.1       misho    9479:        tion is the size of buffer needed to hold the whole message.
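
Because the yield is the buffer size needed, regerror() can be called once
with a zero-length buffer to size the message and again to fetch it. The
sketch below assumes the system <regex.h> so that it builds without PCRE;
with PCRE, include <pcreposix.h> instead. The helper name is hypothetical.

```c
#include <regex.h>   /* assumption: system header; with PCRE use <pcreposix.h> */
#include <stdlib.h>

/* Compile pattern. On success, free the compiled form and return NULL.
   On failure, return a malloc'ed error message that the caller must
   free (or NULL if the allocation itself fails). */
char *compile_error_message(const char *pattern)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc == 0) {
        regfree(&re);
        return NULL;
    }

    /* First call: nothing is written; the yield is the size needed,
       including the terminating zero. */
    size_t need = regerror(rc, &re, NULL, 0);
    char *msg = malloc(need);
    if (msg != NULL)
        regerror(rc, &re, msg, need);    /* second call fills the buffer */
    return msg;
}
```

A pattern with an unmatched parenthesis, such as "(ab", is a convenient way
to trigger a compile error when trying this out.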
                   9480: 
                   9481: 
                   9482: MEMORY USAGE
                   9483: 
1.1.1.4 ! misho    9484:        Compiling a regular expression causes memory to be allocated and  asso-
        !          9485:        ciated  with  the preg structure. The function regfree() frees all such
        !          9486:        memory, after which preg may no longer be used as  a  compiled  expres-
1.1       misho    9487:        sion.
                   9488: 
                   9489: 
                   9490: AUTHOR
                   9491: 
                   9492:        Philip Hazel
                   9493:        University Computing Service
                   9494:        Cambridge CB2 3QH, England.
                   9495: 
                   9496: 
                   9497: REVISION
                   9498: 
1.1.1.2   misho    9499:        Last updated: 09 January 2012
                   9500:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    9501: ------------------------------------------------------------------------------
                   9502: 
                   9503: 
1.1.1.4 ! misho    9504: PCRECPP(3)                 Library Functions Manual                 PCRECPP(3)
        !          9505: 
1.1       misho    9506: 
                   9507: 
                   9508: NAME
                   9509:        PCRE - Perl-compatible regular expressions.
                   9510: 
                   9511: SYNOPSIS OF C++ WRAPPER
                   9512: 
                   9513:        #include <pcrecpp.h>
                   9514: 
                   9515: 
                   9516: DESCRIPTION
                   9517: 
                   9518:        The  C++  wrapper  for PCRE was provided by Google Inc. Some additional
                   9519:        functionality was added by Giuseppe Maxia. This brief man page was con-
                   9520:        structed  from  the  notes  in the pcrecpp.h file, which should be con-
1.1.1.2   misho    9521:        sulted for further details. Note that the C++ wrapper supports only the
1.1.1.4 ! misho    9522:        original  8-bit  PCRE  library. There is no 16-bit or 32-bit support at
        !          9523:        present.
1.1       misho    9524: 
                   9525: 
                   9526: MATCHING INTERFACE
                   9527: 
1.1.1.4 ! misho    9528:        The "FullMatch" operation checks that supplied text matches a  supplied
        !          9529:        pattern  exactly.  If pointer arguments are supplied, it copies matched
1.1       misho    9530:        sub-strings that match sub-patterns into them.
                   9531: 
                   9532:          Example: successful match
                   9533:             pcrecpp::RE re("h.*o");
                   9534:             re.FullMatch("hello");
                   9535: 
                   9536:          Example: unsuccessful match (requires full match):
                   9537:             pcrecpp::RE re("e");
                   9538:             !re.FullMatch("hello");
                   9539: 
                   9540:          Example: creating a temporary RE object:
                   9541:             pcrecpp::RE("h.*o").FullMatch("hello");
                   9542: 
1.1.1.4 ! misho    9543:        You can pass in a "const char*" or a "string" for "text". The  examples
        !          9544:        below  tend to use a const char*. You can, as in the different examples
        !          9545:        above, store the RE object explicitly in a variable or use a  temporary
        !          9546:        RE  object.  The  examples below use one mode or the other arbitrarily.
1.1       misho    9547:        Either could correctly be used for any of these examples.
                   9548: 
                   9549:        You must supply extra pointer arguments to extract matched subpieces.
                   9550: 
                   9551:          Example: extracts "ruby" into "s" and 1234 into "i"
                   9552:             int i;
                   9553:             string s;
                   9554:             pcrecpp::RE re("(\\w+):(\\d+)");
                   9555:             re.FullMatch("ruby:1234", &s, &i);
                   9556: 
                   9557:          Example: does not try to extract any extra sub-patterns
                   9558:             re.FullMatch("ruby:1234", &s);
                   9559: 
                   9560:          Example: does not try to extract into NULL
                   9561:             re.FullMatch("ruby:1234", NULL, &i);
                   9562: 
                   9563:          Example: integer overflow causes failure
                   9564:             !re.FullMatch("ruby:1234567891234", NULL, &i);
                   9565: 
                   9566:          Example: fails because there aren't enough sub-patterns:
                   9567:             !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
                   9568: 
                   9569:          Example: fails because string cannot be stored in integer
                   9570:             !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
                   9571: 
1.1.1.4 ! misho    9572:        The provided pointer arguments can be pointers to  any  scalar  numeric
1.1       misho    9573:        type, or one of:
                   9574: 
                   9575:           string        (matched piece is copied to string)
                   9576:           StringPiece   (StringPiece is mutated to point to matched piece)
                   9577:           T             (where "bool T::ParseFrom(const char*, int)" exists)
                   9578:           NULL          (the corresponding matched sub-pattern is not copied)
                   9579: 
1.1.1.4 ! misho    9580:        The  function returns true iff all of the following conditions are sat-
1.1       misho    9581:        isfied:
                   9582: 
                   9583:          a. "text" matches "pattern" exactly;
                   9584: 
                   9585:          b. The number of matched sub-patterns is >= number of supplied
                   9586:             pointers;
                   9587: 
                   9588:          c. The "i"th argument has a suitable type for holding the
                   9589:             string captured as the "i"th sub-pattern. If you pass in
                   9590:             void * NULL for the "i"th argument, or a non-void * NULL
                   9591:             of the correct type, or pass fewer arguments than the
                   9592:             number of sub-patterns, "i"th captured sub-pattern is
                   9593:             ignored.
                   9594: 
1.1.1.4 ! misho    9595:        CAVEAT: An optional sub-pattern that does  not  exist  in  the  matched
        !          9596:        string  is  assigned  the  empty  string. Therefore, the following will
1.1       misho    9597:        return false (because the empty string is not a valid number):
                   9598: 
                   9599:           int number;
                   9600:           pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
                   9601: 
1.1.1.4 ! misho    9602:        The matching interface supports at most 16 arguments per call.  If  you
        !          9603:        need    more,    consider    using    the    more   general   interface
1.1       misho    9604:        pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.
                   9605: 
1.1.1.4 ! misho    9606:        NOTE: Do not use no_arg, which is used internally to mark the end of  a
        !          9607:        list  of optional arguments, as a placeholder for missing arguments, as
1.1       misho    9608:        this can lead to segfaults.
                   9609: 
                   9610: 
                   9611: QUOTING METACHARACTERS
                   9612: 
1.1.1.4 ! misho    9613:        You can use the "QuoteMeta" operation to insert backslashes before  all
        !          9614:        potentially  meaningful  characters  in  a string. The returned string,
1.1       misho    9615:        used as a regular expression, will exactly match the original string.
                   9616: 
                   9617:          Example:
                   9618:             string quoted = RE::QuoteMeta(unquoted);
                   9619: 
1.1.1.4 ! misho    9620:        Note that it's legal to escape a character even if it  has  no  special
        !          9621:        meaning  in  a  regular expression -- so this function does that. (This
        !          9622:        also makes it identical to the perl function  of  the  same  name;  see
        !          9623:        "perldoc    -f    quotemeta".)    For   example,   "1.5-2.0?"   becomes
1.1       misho    9624:        "1\.5\-2\.0\?".
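
       The escaping rule described above can be sketched with the standard
       library alone. This is an illustration, not pcrecpp's implementation:
       the name quote_meta_sketch is hypothetical, and the sketch assumes
       that every byte other than an ASCII letter, digit, or underscore is
       escaped. It does not treat NUL or non-ASCII bytes specially.

```cpp
#include <string>

// Hypothetical helper (not part of pcrecpp): put a backslash before every
// byte that is not an ASCII letter, digit, or underscore.
std::string quote_meta_sketch(const std::string& unquoted) {
    std::string quoted;
    quoted.reserve(unquoted.size() * 2);  // worst case: every byte escaped
    for (unsigned char c : unquoted) {
        const bool is_word = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                             (c >= '0' && c <= '9') || c == '_';
        if (!is_word) quoted += '\\';
        quoted += static_cast<char>(c);
    }
    return quoted;
}
```

       With this rule, "1.5-2.0?" becomes "1\.5\-2\.0\?", as in the example
       above, while a string such as "abc_9" is returned unchanged.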
                   9625: 
                   9626: 
                   9627: PARTIAL MATCHES
                   9628: 
1.1.1.4 ! misho    9629:        You can use the "PartialMatch" operation when you want the  pattern  to
1.1       misho    9630:        match any substring of the text.
                   9631: 
                   9632:          Example: simple search for a string:
                   9633:             pcrecpp::RE("ell").PartialMatch("hello");
                   9634: 
                   9635:          Example: find first number in a string:
                   9636:             int number;
                   9637:             pcrecpp::RE re("(\\d+)");
                   9638:             re.PartialMatch("x*100 + 20", &number);
                   9639:             assert(number == 100);
                   9640: 
                   9641: 
                   9642: UTF-8 AND THE MATCHING INTERFACE
                   9643: 
1.1.1.4 ! misho    9644:        By  default,  pattern  and text are plain text, one byte per character.
        !          9645:        The UTF8 flag, passed to  the  constructor,  causes  both  pattern  and
1.1       misho    9646:        string to be treated as UTF-8 text, still a byte stream but potentially
1.1.1.4 ! misho    9647:        multiple bytes per character. In practice, the text is likelier  to  be
        !          9648:        UTF-8  than  the pattern, but the match returned may depend on the UTF8
        !          9649:        flag, so always use it when matching UTF8 text. For example,  "."  will
        !          9650:        match  one  byte normally but with UTF8 set may match up to four  bytes
1.1       misho    9651:        of a multi-byte character.
                   9652: 
                   9653:          Example:
                   9654:             pcrecpp::RE_Options options;
                   9655:             options.set_utf8();
                   9656:             pcrecpp::RE re(utf8_pattern, options);
                   9657:             re.FullMatch(utf8_string);
                   9658: 
                   9659:          Example: using the convenience function UTF8():
                   9660:             pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
                   9661:             re.FullMatch(utf8_string);
                   9662: 
                   9663:        NOTE: The UTF8 flag is ignored if pcre was not configured with the
                   9664:              --enable-utf8 flag.
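
       To see why the UTF8 flag changes what counts as one character, the
       following standard-library sketch classifies UTF-8 lead bytes and
       counts characters rather than bytes. The function names are
       hypothetical and nothing here calls PCRE; the sketch inspects only the
       lead-byte bit pattern and does not validate the sequence.

```cpp
#include <cstddef>
#include <string>

// Number of bytes in the UTF-8 sequence introduced by lead byte b.
// Returns 0 for a continuation byte or a byte that cannot start a sequence.
// No further validation (overlong forms, surrogates) is attempted.
std::size_t utf8_sequence_length(unsigned char b) {
    if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Count characters by counting every byte that is not a continuation byte.
std::size_t count_utf8_characters(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

       For the six-byte string "h\xC3\xA9llo" the second function returns 5,
       which is the number of characters a UTF-8-aware "." would step over
       one at a time.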
                   9665: 
                   9666: 
                   9667: PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE
                   9668: 
1.1.1.4 ! misho    9669:        PCRE defines some modifiers to  change  the  behavior  of  the  regular
        !          9670:        expression   engine.  The  C++  wrapper  defines  an  auxiliary  class,
        !          9671:        RE_Options, as a vehicle to pass such modifiers to  a  RE  class.  Cur-
1.1       misho    9672:        rently, the following modifiers are supported:
                   9673: 
                   9674:           modifier              description               Perl equivalent
                   9675: 
                   9676:           PCRE_CASELESS         case insensitive match      /i
                   9677:           PCRE_MULTILINE        multiple lines match        /m
                   9678:           PCRE_DOTALL           dot matches newlines        /s
                   9679:           PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
                   9680:           PCRE_EXTRA            strict escape parsing       N/A
1.1.1.3   misho    9681:           PCRE_EXTENDED         ignore whitespace           /x
1.1       misho    9682:           PCRE_UTF8             handles UTF8 chars          built-in
                   9683:           PCRE_UNGREEDY         reverses * and *?           N/A
                   9684:           PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
                   9685: 
1.1.1.4 ! misho    9686:        (*)  Both Perl and PCRE allow non capturing parentheses by means of the
        !          9687:        "?:" modifier within the pattern itself. e.g. (?:ab|cd) does  not  cap-
1.1       misho    9688:        ture, while (ab|cd) does.
                   9689: 
1.1.1.4 ! misho    9690:        For  a  full  account of how each modifier works, please check the PCRE
1.1       misho    9691:        API reference page.
                   9692: 
1.1.1.4 ! misho    9693:        For each modifier, there are two member functions whose  name  is  made
        !          9694:        out  of  the  modifier  in  lowercase,  without the "PCRE_" prefix. For
1.1       misho    9695:        instance, PCRE_CASELESS is handled by
                   9696: 
                   9697:          bool caseless()
                   9698: 
                   9699:        which returns true if the modifier is set, and
                   9700: 
                   9701:          RE_Options & set_caseless(bool)
                   9702: 
                   9703:        which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
1.1.1.4 ! misho    9704:        be  accessed  through  the  set_match_limit()  and match_limit() member
        !          9705:        functions. Setting match_limit to a non-zero value will limit the  exe-
        !          9706:        cution  of pcre to keep it from doing bad things like blowing the stack
        !          9707:        or taking an eternity to return a result.  A  value  of  5000  is  good
        !          9708:        enough  to stop stack blowup in a 2MB thread stack. Setting match_limit
        !          9709:        to  zero  disables  match  limiting.  Alternatively,   you   can   call
        !          9710:        match_limit_recursion()  which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to
        !          9711:        limit how much  PCRE  recurses.  match_limit()  limits  the  number  of
1.1       misho    9712:        internal match() calls; match_limit_recursion() limits the depth of that
                   9713:        recursion, and therefore the amount of stack that is used.
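
       The difference between the two limits can be illustrated with a toy
       backtracking matcher. This is a sketch under stated assumptions, not
       PCRE's matcher: toy_match, Limits, and Outcome are hypothetical names,
       and the pattern language is a glob in which '*' matches any run of
       characters. One counter caps the total number of calls (the role
       played by match_limit) and the other caps the nesting depth (the role
       played by match_limit_recursion).

```cpp
enum class Outcome { Match, NoMatch, CallLimit, DepthLimit };

struct Limits {
    long max_calls;  // cap on total calls, analogous to match_limit
    long max_depth;  // cap on nesting depth, analogous to match_limit_recursion
    long calls;      // running total; initialise to 0
};

// Toy glob matcher: '*' matches any run of characters (including none),
// every other pattern character matches itself.
Outcome toy_match(const char* p, const char* s, Limits& lim, long depth = 0) {
    if (++lim.calls > lim.max_calls) return Outcome::CallLimit;
    if (depth > lim.max_depth) return Outcome::DepthLimit;
    if (*p == '\0') return *s == '\0' ? Outcome::Match : Outcome::NoMatch;
    if (*p == '*') {
        // Try each possible length for the star, shortest first.
        for (const char* t = s;; ++t) {
            Outcome r = toy_match(p + 1, t, lim, depth + 1);
            if (r != Outcome::NoMatch) return r;  // match or a limit was hit
            if (*t == '\0') return Outcome::NoMatch;
        }
    }
    if (*s != '\0' && *p == *s) return toy_match(p + 1, s + 1, lim, depth + 1);
    return Outcome::NoMatch;
}
```

       Matching "a*b" against "aXXb" takes six calls and reaches depth 3 in
       this toy, so a call cap of 5 or a depth cap of 2 stops it before it
       can report the match.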
                   9714: 
1.1.1.4 ! misho    9715:        Normally, to pass one or more modifiers to a RE class,  you  declare  a
1.1       misho    9716:        RE_Options object, set the appropriate options, and pass this object to
                   9717:        a RE constructor. Example:
                   9718: 
                   9719:           RE_Options opt;
                   9720:           opt.set_caseless(true);
                   9721:           if (RE("HELLO", opt).PartialMatch("hello world")) ...
                   9722: 
                   9723:        RE_Options has two constructors. The default constructor  takes  no
1.1.1.4 ! misho    9724:        arguments and creates a set of flags that are all off. The second takes
        !          9725:        an option_flags parameter, to facilitate transfer of legacy  code  from
1.1       misho    9726:        C programs. This lets you do
                   9727: 
                   9728:           RE(pattern,
                   9729:             RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
                   9730: 
                   9731:        However, new code is better off doing
                   9732: 
                   9733:           RE(pattern,
                   9734:             RE_Options().set_caseless(true).set_multiline(true))
                   9735:               .PartialMatch(str);
                   9736: 
                   9737:        If you are going to pass one of the most used modifiers, there are some
                   9738:        convenience functions that return a RE_Options class with the appropri-
1.1.1.4 ! misho    9739:        ate  modifier  already  set: CASELESS(), UTF8(), MULTILINE(), DOTALL(),
1.1       misho    9740:        and EXTENDED().
                   9741: 
1.1.1.4 ! misho    9742:        If you need to set several options at once, and you don't  want  to  go
        !          9743:        through  the pains of declaring a RE_Options object and setting several
        !          9744:        options, there is a parallel method that gives you this ability on  the
        !          9745:        fly.  You  can  concatenate several set_xxxxx() member functions, since
        !          9746:        each of them returns a reference to its class object. For  example,  to
        !          9747:        pass  PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one
1.1       misho    9748:        statement, you may write:
                   9749: 
                   9750:           RE(" ^ xyz \\s+ .* blah$",
                   9751:             RE_Options()
                   9752:               .set_caseless(true)
                   9753:               .set_extended(true)
                   9754:               .set_multiline(true)).PartialMatch(sometext);
                   9755: 
                   9756: 
                   9757: SCANNING TEXT INCREMENTALLY
                   9758: 
1.1.1.4 ! misho    9759:        The "Consume" operation may be useful if you want to  repeatedly  match
1.1       misho    9760:        regular expressions at the front of a string and skip over them as they
1.1.1.4 ! misho    9761:        match. This requires use of the "StringPiece" type, which represents  a
        !          9762:        sub-range  of  a  real  string.  Like RE, StringPiece is defined in the
1.1       misho    9763:        pcrecpp namespace.
                   9764: 
                   9765:          Example: read lines of the form "var = value" from a string.
                   9766:             string contents = ...;                 // Fill string somehow
                   9767:             pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
                   9768: 
                   9769:             string var;
                   9770:             int value;
                   9771:             pcrecpp::RE re("(\\w+) = (\\d+)\n");
                   9772:             while (re.Consume(&input, &var, &value)) {
                   9773:               ...;
                   9774:             }
                   9775: 
1.1.1.4 ! misho    9776:        Each successful call to "Consume" will set "var" and "value", and  also
1.1       misho    9777:        advance "input" so it points past the matched text.
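
       The same anchored, advance-as-you-go loop can be sketched with
       std::regex for readers who want to experiment without PCRE. This is an
       analogue, not pcrecpp: parse_assignments is a hypothetical name, the
       match_continuous flag plays the role of the anchoring that "Consume"
       performs, and an iterator takes the place of the StringPiece.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Read consecutive "var = value" lines from the front of contents, stopping
// at the first position where the pattern no longer matches.
std::vector<std::pair<std::string, int>> parse_assignments(
        const std::string& contents) {
    static const std::regex re("(\\w+) = (\\d+)\n");
    std::vector<std::pair<std::string, int>> out;
    std::string::const_iterator pos = contents.cbegin();
    std::smatch m;
    // match_continuous: the match must begin exactly at pos.
    while (std::regex_search(pos, contents.cend(), m, re,
                             std::regex_constants::match_continuous)) {
        out.emplace_back(m[1].str(), std::stoi(m[2].str()));
        pos = m[0].second;  // advance past the matched text
    }
    return out;
}
```

       Dropping the match_continuous flag gives the unanchored behaviour that
       the text attributes to "FindAndConsume".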
                   9778: 
1.1.1.4 ! misho    9779:        The  "FindAndConsume"  operation  is  similar to "Consume" but does not
        !          9780:        anchor your match at the beginning of  the  string.  For  example,  you
1.1       misho    9781:        could extract all words from a string by repeatedly calling
                   9782: 
                   9783:          pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
                   9784: 
                   9785: 
                   9786: PARSING HEX/OCTAL/C-RADIX NUMBERS
                   9787: 
                   9788:        By default, if you pass a pointer to a numeric value, the corresponding
1.1.1.4 ! misho    9789:        text is interpreted as a base-10  number.  You  can  instead  wrap  the
1.1       misho    9790:        pointer with a call to one of the operators Hex(), Octal(), or CRadix()
1.1.1.4 ! misho    9791:        to interpret the text in another base. The CRadix  operator  interprets
        !          9792:        C-style  "0"  (base-8)  and  "0x"  (base-16)  prefixes, but defaults to
1.1       misho    9793:        base-10.
                   9794: 
                   9795:          Example:
                   9796:            int a, b, c, d;
                   9797:            pcrecpp::RE re("(.*) (.*) (.*) (.*)");
                   9798:            re.FullMatch("100 40 0100 0x40",
                   9799:                         pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                   9800:                         pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
                   9801: 
                   9802:        will leave 64 in a, b, c, and d.
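
       The arithmetic in this example can be checked with std::strtol, whose
       base-0 mode applies the same C-style "0" and "0x" prefix rules that
       are described for CRadix. The wrapper names below are hypothetical and
       the sketch does not call pcrecpp; it also performs no error checking.

```cpp
#include <cstdlib>

// Thin wrappers over std::strtol, one per radix rule described above.
long parse_octal(const char* text)   { return std::strtol(text, nullptr, 8); }
long parse_hex(const char* text)     { return std::strtol(text, nullptr, 16); }
// Base 0: a leading "0x" selects hex, a leading "0" selects octal,
// anything else is read as decimal.
long parse_c_radix(const char* text) { return std::strtol(text, nullptr, 0); }
```

       Applied to "100", "40", "0100", and "0x40" respectively, these return
       64 in each case, which agrees with the result stated above.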
                   9803: 
                   9804: 
                   9805: REPLACING PARTS OF STRINGS
                   9806: 
1.1.1.4 ! misho    9807:        You can replace the first match of "pattern" in "str"  with  "rewrite".
        !          9808:        Within  "rewrite",  backslash-escaped  digits (\1 to \9) can be used to
        !          9809:        insert text matching corresponding parenthesized group  from  the  pat-
1.1       misho    9810:        tern. \0 in "rewrite" refers to the entire matching text. For example:
                   9811: 
                   9812:          string s = "yabba dabba doo";
                   9813:          pcrecpp::RE("b+").Replace("d", &s);
                   9814: 
1.1.1.4 ! misho    9815:        will  leave  "s" containing "yada dabba doo". The result is true if the
1.1       misho    9816:        pattern matches and a replacement occurs, false otherwise.
                   9817: 
1.1.1.4 ! misho    9818:        GlobalReplace is like Replace except that it replaces  all  occurrences
        !          9819:        of  the  pattern  in  the string with the rewrite. Replacements are not
1.1       misho    9820:        subject to re-matching. For example:
                   9821: 
                   9822:          string s = "yabba dabba doo";
                   9823:          pcrecpp::RE("b+").GlobalReplace("d", &s);
                   9824: 
1.1.1.4 ! misho    9825:        will leave "s" containing "yada dada doo". It  returns  the  number  of
1.1       misho    9826:        replacements made.
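
       The first-match and all-matches behaviours can be reproduced with
       std::regex_replace, which may help when checking expected results
       without PCRE. This is an analogue rather than pcrecpp itself:
       replace_first and replace_all are hypothetical names, they return a
       new string instead of modifying one in place, and std::regex writes
       group references as $1 rather than \1.

```cpp
#include <regex>
#include <string>

// Replace only the first match of pattern in s.
std::string replace_first(const std::string& s, const std::string& pattern,
                          const std::string& rewrite) {
    return std::regex_replace(s, std::regex(pattern), rewrite,
                              std::regex_constants::format_first_only);
}

// Replace every non-overlapping match of pattern in s.
std::string replace_all(const std::string& s, const std::string& pattern,
                        const std::string& rewrite) {
    return std::regex_replace(s, std::regex(pattern), rewrite);
}
```

       With the pattern "b+" and the rewrite "d", replace_first turns
       "yabba dabba doo" into "yada dabba doo" and replace_all turns it into
       "yada dada doo", matching the two examples above.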
                   9827: 
1.1.1.4 ! misho    9828:        Extract  is like Replace, except that if the pattern matches, "rewrite"
        !          9829:        is copied into "out" (an additional argument) with substitutions.   The
        !          9830:        non-matching  portions  of "text" are ignored. Returns true iff a match
1.1       misho    9831:        occurred and the extraction happened successfully;  if no match occurs,
                   9832:        the string is left unaffected.
                   9833: 
                   9834: 
                   9835: AUTHOR
                   9836: 
                   9837:        The C++ wrapper was contributed by Google Inc.
                   9838:        Copyright (c) 2007 Google Inc.
                   9839: 
                   9840: 
                   9841: REVISION
                   9842: 
1.1.1.2   misho    9843:        Last updated: 08 January 2012
1.1       misho    9844: ------------------------------------------------------------------------------
                   9845: 
                   9846: 
1.1.1.4 ! misho    9847: PCRESAMPLE(3)              Library Functions Manual              PCRESAMPLE(3)
        !          9848: 
1.1       misho    9849: 
                   9850: 
                   9851: NAME
                   9852:        PCRE - Perl-compatible regular expressions
                   9853: 
                   9854: PCRE SAMPLE PROGRAM
                   9855: 
                   9856:        A simple, complete demonstration program, to get you started with using
                   9857:        PCRE, is supplied in the file pcredemo.c in the  PCRE  distribution.  A
                   9858:        listing  of this program is given in the pcredemo documentation. If you
                   9859:        do not have a copy of the PCRE distribution, you can save this  listing
                   9860:        to re-create pcredemo.c.
                   9861: 
1.1.1.2   misho    9862:        The  demonstration program, which uses the original PCRE 8-bit library,
                   9863:        compiles the regular expression that is its first argument, and matches
                   9864:        it  against  the subject string in its second argument. No PCRE options
                   9865:        are set, and default character tables are used. If  matching  succeeds,
                   9866:        the  program  outputs the portion of the subject that matched, together
                   9867:        with the contents of any captured substrings.
1.1       misho    9868: 
                   9869:        If the -g option is given on the command line, the program then goes on
                   9870:        to check for further matches of the same regular expression in the same
1.1.1.2   misho    9871:        subject string. The logic is a little bit tricky because of the  possi-
                   9872:        bility  of  matching an empty string. Comments in the code explain what
1.1       misho    9873:        is going on.
                   9874: 
1.1.1.2   misho    9875:        If PCRE is installed in the standard include  and  library  directories
1.1       misho    9876:        for your operating system, you should be able to compile the demonstra-
                   9877:        tion program using this command:
                   9878: 
                   9879:          gcc -o pcredemo pcredemo.c -lpcre
                   9880: 
1.1.1.2   misho    9881:        If PCRE is installed elsewhere, you may need to add additional  options
                   9882:        to  the  command line. For example, on a Unix-like system that has PCRE
                   9883:        installed in /usr/local, you  can  compile  the  demonstration  program
1.1       misho    9884:        using a command like this:
                   9885: 
                   9886:          gcc -o pcredemo -I/usr/local/include pcredemo.c \
                   9887:              -L/usr/local/lib -lpcre
                   9888: 
1.1.1.2   misho    9889:        In  a  Windows  environment, if you want to statically link the program
1.1       misho    9890:        against a non-dll pcre.a file, you must uncomment the line that defines
1.1.1.2   misho    9891:        PCRE_STATIC  before  including  pcre.h, because otherwise the pcre_mal-
1.1       misho    9892:        loc()   and   pcre_free()   exported   functions   will   be   declared
                   9893:        __declspec(dllimport), with unwanted results.
                   9894: 
1.1.1.2   misho    9895:        Once  you  have  compiled and linked the demonstration program, you can
1.1       misho    9896:        run simple tests like this:
                   9897: 
                   9898:          ./pcredemo 'cat|dog' 'the cat sat on the mat'
                   9899:          ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
                   9900: 
1.1.1.2   misho    9901:        Note that there is a  much  more  comprehensive  test  program,  called
                   9902:        pcretest,  which  supports  many  more  facilities  for testing regular
                   9903:        expressions and all the PCRE libraries. The pcredemo program is provided
                   9904:        as a simple coding example.
1.1       misho    9905: 
1.1.1.2   misho    9906:        If  you  try to run pcredemo when PCRE is not installed in the standard
                   9907:        library directory, you may get an error like  this  on  some  operating
1.1       misho    9908:        systems (e.g. Solaris):
                   9909: 
1.1.1.2   misho    9910:          ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or
1.1       misho    9911:            directory
                   9912: 
1.1.1.2   misho    9913:        This is caused by the way shared library support works  on  those  sys-
1.1       misho    9914:        tems. You need to add
                   9915: 
                   9916:          -R/usr/local/lib
                   9917: 
                   9918:        (for example) to the compile command to get round this problem.
                   9919: 
                   9920: 
                   9921: AUTHOR
                   9922: 
                   9923:        Philip Hazel
                   9924:        University Computing Service
                   9925:        Cambridge CB2 3QH, England.
                   9926: 
                   9927: 
                   9928: REVISION
                   9929: 
1.1.1.2   misho    9930:        Last updated: 10 January 2012
                   9931:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    9932: ------------------------------------------------------------------------------
1.1.1.4 ! misho    9933: PCRELIMITS(3)              Library Functions Manual              PCRELIMITS(3)
        !          9934: 
1.1       misho    9935: 
                   9936: 
                   9937: NAME
                   9938:        PCRE - Perl-compatible regular expressions
                   9939: 
                   9940: SIZE AND OTHER LIMITATIONS
                   9941: 
                   9942:        There  are some size limitations in PCRE but it is hoped that they will
                   9943:        never in practice be relevant.
                   9944: 
1.1.1.2   misho    9945:        The maximum length of a compiled  pattern  is  approximately  64K  data
1.1.1.4 ! misho    9946:        units  (bytes  for  the  8-bit  library,  16-bit  units  for the 16-bit
        !          9947:        library, and 32-bit units for the 32-bit library) if PCRE  is  compiled
        !          9948:        with  the  default  internal  linkage  size  of 2 bytes. If you want to
        !          9949:        process regular expressions that are truly enormous,  you  can  compile
        !          9950:        PCRE  with an internal linkage size of 3 or 4 (when building the 16-bit
        !          9951:        or 32-bit library, 3 is rounded up to 4). See the README  file  in  the
        !          9952:        source  distribution  and  the  pcrebuild documentation for details. In
        !          9953:        these cases the limit is substantially larger.  However, the  speed  of
        !          9954:        execution is slower.
1.1       misho    9955: 
                   9956:        All values in repeating quantifiers must be less than 65536.
                   9957: 
                   9958:        There is no limit to the number of parenthesized subpatterns, but there
                   9959:        can be no more than 65535 capturing subpatterns.
                   9960: 
                   9961:        There is a limit to the number of forward references to subsequent sub-
1.1.1.4 ! misho    9962:        patterns  of  around  200,000.  Repeated  forward references with fixed
        !          9963:        upper limits, for example, (?2){0,100} when subpattern number 2  is  to
        !          9964:        the  right,  are included in the count. There is no limit to the number
1.1       misho    9965:        of backward references.
                   9966: 
                   9967:        The maximum length of a name for a named subpattern is  32  characters,
                   9968:        and the maximum number of named subpatterns is 10000.
                   9969: 
1.1.1.4 ! misho    9970:        The  maximum  length  of  a  name  in  a (*MARK), (*PRUNE), (*SKIP), or
        !          9971:        (*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit  and
        !          9972:        32-bit libraries.
1.1.1.3   misho    9973: 
1.1.1.4 ! misho    9974:        The  maximum  length of a subject string is the largest positive number
        !          9975:        that an integer variable can hold. However, when using the  traditional
1.1       misho    9976:        matching function, PCRE uses recursion to handle subpatterns and indef-
1.1.1.4 ! misho    9977:        inite repetition.  This means that the available stack space may  limit
1.1       misho    9978:        the size of a subject string that can be processed by certain patterns.
                   9979:        For a discussion of stack issues, see the pcrestack documentation.
                   9980: 
                   9981: 
                   9982: AUTHOR
                   9983: 
                   9984:        Philip Hazel
                   9985:        University Computing Service
                   9986:        Cambridge CB2 3QH, England.
                   9987: 
                   9988: 
                   9989: REVISION
                   9990: 
1.1.1.3   misho    9991:        Last updated: 04 May 2012
1.1.1.2   misho    9992:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    9993: ------------------------------------------------------------------------------
                   9994: 
                   9995: 
1.1.1.4 ! misho    9996: PCRESTACK(3)               Library Functions Manual               PCRESTACK(3)
        !          9997: 
1.1       misho    9998: 
                   9999: 
                   10000: NAME
                   10001:        PCRE - Perl-compatible regular expressions
                   10002: 
                   10003: PCRE DISCUSSION OF STACK USAGE
                   10004: 
1.1.1.4 ! misho    10005:        When  you call pcre[16|32]_exec(), it makes use of an internal function
1.1.1.2   misho    10006:        called match(). This calls itself recursively at branch points  in  the
                   10007:        pattern,  in  order  to  remember the state of the match so that it can
                   10008:        back up and try a different alternative if  the  first  one  fails.  As
                   10009:        matching proceeds deeper and deeper into the tree of possibilities, the
                   10010:        recursion depth increases. The match() function is also called in other
                   10011:        circumstances,  for  example,  whenever  a parenthesized sub-pattern is
                   10012:        entered, and in certain cases of repetition.
1.1       misho    10013: 
                   10014:        Not all calls of match() increase the recursion depth; for an item such
                   10015:        as  a* it may be called several times at the same level, after matching
                   10016:        different numbers of a's. Furthermore, in a number of cases  where  the
                   10017:        result  of  the  recursive call would immediately be passed back as the
                   10018:        result of the current call (a "tail recursion"), the function  is  just
                   10019:        restarted instead.
                   10020: 
1.1.1.4 ! misho    10021:        The  above  comments apply when pcre[16|32]_exec() is run in its normal
1.1.1.2   misho    10022:        interpretive  manner.   If   the   pattern   was   studied   with   the
                   10023:        PCRE_STUDY_JIT_COMPILE  option, and just-in-time compiling was success-
1.1.1.4 ! misho    10024:        ful, and the options passed to pcre[16|32]_exec() were  not  incompati-
        !          10025:        ble,  the  matching  process  uses the JIT-compiled code instead of the
        !          10026:        match() function. In this case, the  memory  requirements  are  handled
        !          10027:        entirely differently. See the pcrejit documentation for details.
        !          10028: 
        !          10029:        The  pcre[16|32]_dfa_exec()  function operates in an entirely different
        !          10030:        way, and uses recursion only when there is a regular expression  recur-
        !          10031:        sion or subroutine call in the pattern. This includes the processing of
        !          10032:        assertion and "once-only" subpatterns, which are handled  like  subrou-
        !          10033:        tine  calls.  Normally, these are never very deep, and the limit on the
        !          10034:        complexity of pcre[16|32]_dfa_exec() is controlled  by  the  amount  of
        !          10035:        workspace  it is given.  However, it is possible to write patterns with
        !          10036:        runaway    infinite    recursions;    such    patterns    will    cause
        !          10037:        pcre[16|32]_dfa_exec()  to  run  out  of stack. At present, there is no
        !          10038:        protection against this.
        !          10039: 
        !          10040:        The comments that follow do NOT apply to  pcre[16|32]_dfa_exec();  they
        !          10041:        are relevant only for pcre[16|32]_exec() without the JIT optimization.
        !          10042: 
        !          10043:    Reducing pcre[16|32]_exec()'s stack usage
        !          10044: 
        !          10045:        Each  time  that match() is actually called recursively, it uses memory
        !          10046:        from the process stack. For certain kinds of  pattern  and  data,  very
        !          10047:        large  amounts of stack may be needed, despite the recognition of "tail
        !          10048:        recursion".  You can often reduce the amount of recursion,  and  there-
        !          10049:        fore  the  amount of stack used, by modifying the pattern that is being
1.1       misho    10050:        matched. Consider, for example, this pattern:
                   10051: 
                   10052:          ([^<]|<(?!inet))+
                   10053: 
1.1.1.4 ! misho    10054:        It matches from wherever it starts until it encounters "<inet"  or  the
        !          10055:        end  of  the  data,  and is the kind of pattern that might be used when
1.1       misho    10056:        processing an XML file. Each iteration of the outer parentheses matches
1.1.1.4 ! misho    10057:        either  one  character that is not "<" or a "<" that is not followed by
        !          10058:        "inet". However, each time a  parenthesis  is  processed,  a  recursion
1.1       misho    10059:        occurs, so this formulation uses a stack frame for each matched charac-
1.1.1.4 ! misho    10060:        ter. For a long string, a lot of stack is required. Consider  now  this
1.1       misho    10061:        rewritten pattern, which matches exactly the same strings:
                   10062: 
                   10063:          ([^<]++|<(?!inet))+
                   10064: 
1.1.1.4 ! misho    10065:        This  uses very much less stack, because runs of characters that do not
        !          10066:        contain "<" are "swallowed" in one item inside the parentheses.  Recur-
        !          10067:        sion  happens  only when a "<" character that is not followed by "inet"
        !          10068:        is encountered (and we assume this is relatively  rare).  A  possessive
        !          10069:        quantifier  is  used  to stop any backtracking into the runs of non-"<"
1.1       misho    10070:        characters, but that is not related to stack usage.
                   10071: 
1.1.1.4 ! misho    10072:        This example shows that one way of avoiding stack problems when  match-
1.1       misho    10073:        ing long subject strings is to write repeated parenthesized subpatterns
                   10074:        to match more than one character whenever possible.
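
                                 The effect of the rewrite can be illustrated  without  calling  PCRE
                                 at all. The sketch below is only a model, not PCRE code: the function
                                 name count_group_iterations() and its swallow_runs  flag  are  hypo-
                                 thetical,  and  the number of iterations of the outer group is used as
                                 a stand-in for the number of recursions described above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count iterations of the outer group over a subject, stopping at "<inet"
   or at the end of the data.
   swallow_runs == 0 models ([^<]|<(?!inet))+   (one character per iteration)
   swallow_runs != 0 models ([^<]++|<(?!inet))+ (one run per iteration) */
size_t count_group_iterations(const char *subject, int swallow_runs)
{
    size_t iterations = 0;
    const char *p = subject;

    assert(subject != NULL);
    while (*p != '\0') {
        if (*p == '<') {
            if (strncmp(p, "<inet", 5) == 0)
                break;               /* (?!inet) fails, so the match ends */
            p++;                     /* a lone "<" is one iteration */
        } else if (swallow_runs) {
            p += strcspn(p, "<");    /* [^<]++ takes the whole run at once */
        } else {
            p++;                     /* [^<] takes a single character */
        }
        iterations++;
    }
    return iterations;
}
```

                                 For the subject "aaa<b>ccc", this model gives  9  iterations  for  the
                                 first  pattern  and  3 for the second; the gap grows with the length of
                                 the runs between "<" characters.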
                   10075: 
1.1.1.4 ! misho    10076:    Compiling PCRE to use heap instead of stack for pcre[16|32]_exec()
1.1       misho    10077: 
1.1.1.4 ! misho    10078:        In environments where stack memory is constrained, you  might  want  to
        !          10079:        compile  PCRE to use heap memory instead of stack for remembering back-
        !          10080:        up points when pcre[16|32]_exec() is running. This makes it run  a  lot
        !          10081:        more slowly, however.  Details of how to do this are given in the pcre-
        !          10082:        build documentation. When built in  this  way,  instead  of  using  the
        !          10083:        stack,  PCRE obtains and frees memory by calling the functions that are
        !          10084:        pointed to by the pcre[16|32]_stack_malloc  and  pcre[16|32]_stack_free
        !          10085:        variables.  By default, these point to malloc() and free(), but you can
        !          10086:        replace the pointers to cause PCRE to use your own functions. Since the
        !          10087:        block sizes are always the same, and are always freed in reverse order,
        !          10088:        it may be possible to implement customized  memory  handlers  that  are
        !          10089:        more efficient than the standard functions.
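
                                 As a minimal sketch of such customized handlers, the code below keeps
                                 freed blocks on a list and hands them out again  instead  of  returning
                                 them  to  the system. The names cached_stack_malloc(), cached_stack_-
                                 free() and cached_stack_release() are hypothetical, the code relies on
                                 the  statement  above that all block sizes are the same, and it is not
                                 thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A freed block is reused to hold the link to the next free block. */
struct free_block {
    struct free_block *next;
};

static struct free_block *free_list = NULL;  /* most recently freed first */
static size_t block_size = 0;                /* fixed by the first request */

/* Hypothetical replacement for the function behind pcre_stack_malloc. */
void *cached_stack_malloc(size_t size)
{
    if (size < sizeof(struct free_block))
        size = sizeof(struct free_block);    /* leave room for the link */
    if (block_size == 0)
        block_size = size;
    assert(size == block_size);              /* relies on equal block sizes */

    if (free_list != NULL) {                 /* reuse a cached block */
        struct free_block *block = free_list;
        free_list = block->next;
        return block;
    }
    return malloc(size);                     /* cache empty: ask the system */
}

/* Hypothetical replacement for the function behind pcre_stack_free. */
void cached_stack_free(void *ptr)
{
    struct free_block *block = ptr;

    if (block == NULL)
        return;
    block->next = free_list;                 /* push onto the cache */
    free_list = block;
}

/* Return every cached block to the system, for example at program exit. */
void cached_stack_release(void)
{
    while (free_list != NULL) {
        struct free_block *next = free_list->next;
        free(free_list);
        free_list = next;
    }
    block_size = 0;
}
```

                                 In the 8-bit library, handlers like these would be installed by assign-
                                 ing  them  to  pcre_stack_malloc  and pcre_stack_free before the first
                                 call to pcre_exec().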
        !          10090: 
        !          10091:    Limiting pcre[16|32]_exec()'s stack usage
        !          10092: 
        !          10093:        You  can set limits on the number of times that match() is called, both
        !          10094:        in total and recursively. If a limit  is  exceeded,  pcre[16|32]_exec()
        !          10095:        returns  an  error code. Setting suitable limits should prevent it from
        !          10096:        running out of stack. The default values of the limits are very  large,
        !          10097:        and  unlikely  ever to operate. They can be changed when PCRE is built,
        !          10098:        and they can also be set when pcre[16|32]_exec() is called. For details
        !          10099:        of these interfaces, see the pcrebuild documentation and the section on
        !          10100:        extra data for pcre[16|32]_exec() in the pcreapi documentation.
1.1       misho    10101: 
                   10102:        As a very rough rule of thumb, you should reckon on about 500 bytes per
1.1.1.4 ! misho    10103:        recursion.  Thus,  if  you  want  to limit your stack usage to 8MB, you
        !          10104:        should set the limit at 16000 recursions. A 64MB stack,  on  the  other
1.1       misho    10105:        hand, can support around 128000 recursions.
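
                                 The arithmetic behind this rule of thumb is a single  division.  The
                                 helper  below  is  a  sketch  with  a  hypothetical name; it is not a
                                 PCRE function, and the frame size passed to it is only an estimate.

```c
#include <assert.h>

/* Largest number of recursions that fit in stack_bytes when each
   recursion is assumed to use frame_bytes of stack. */
unsigned long recursion_limit_for_stack(unsigned long stack_bytes,
                                        unsigned long frame_bytes)
{
    assert(frame_bytes > 0);          /* a zero frame size is meaningless */
    return stack_bytes / frame_bytes;
}
```

                                 With 500 bytes per recursion, 8*1024*1024  bytes  gives  16777,  which
                                 the  text  above rounds down to 16000. If the frame size were 640 bytes
                                 instead, the same stack would allow only 13107 recursions.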
                   10106: 
                   10107:        In Unix-like environments, the pcretest test program has a command line
                   10108:        option (-S) that can be used to increase the size of its stack. As long
1.1.1.4 ! misho    10109:        as  the  stack is large enough, another option (-M) can be used to find
        !          10110:        the smallest limits that allow a particular pattern to  match  a  given
        !          10111:        subject  string.  This is done by calling pcre[16|32]_exec() repeatedly
        !          10112:        with different limits.
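
                                 One way such a repeated search could be organized is sketched  below.
                                 This is an assumption about the general technique, not a description of
                                 pcretest's actual code: the limit is doubled until a match succeeds and
                                 the  range is then bisected. The names limit_probe, find_smallest_lim-
                                 it() and probe_threshold() are hypothetical; a real probe  would  call
                                 pcre[16|32]_exec() with the given limit and report whether it matched.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: nonzero if the match succeeds with this limit. */
typedef int (*limit_probe)(unsigned long limit, void *context);

/* Smallest limit in 1..max_limit for which probe() succeeds, or 0 if even
   max_limit fails. Assumes that once a limit is large enough, every
   larger limit also succeeds. */
unsigned long find_smallest_limit(limit_probe probe, void *context,
                                  unsigned long max_limit)
{
    unsigned long lo = 1, hi = 1;

    assert(probe != NULL && max_limit >= 1);

    /* Phase 1: double the limit until the match succeeds. */
    while (!probe(hi, context)) {
        if (hi >= max_limit)
            return 0;                         /* no limit is large enough */
        lo = hi + 1;                          /* everything up to hi fails */
        hi = (hi > max_limit / 2) ? max_limit : hi * 2;
    }

    /* Phase 2: bisect; probe(hi) succeeds and all limits below lo fail. */
    while (lo < hi) {
        unsigned long mid = lo + (hi - lo) / 2;
        if (probe(mid, context))
            hi = mid;
        else
            lo = mid + 1;
    }
    return hi;
}

/* Stand-in probe for demonstration: succeeds once the limit reaches the
   threshold that context points to. */
int probe_threshold(unsigned long limit, void *context)
{
    return limit >= *(unsigned long *)context;
}
```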
1.1       misho    10113: 
1.1.1.2   misho    10114:    Obtaining an estimate of stack usage
                   10115: 
1.1.1.4 ! misho    10116:        The actual amount of stack used per recursion can  vary  quite  a  lot,
1.1.1.2   misho    10117:        depending on the compiler that was used to build PCRE and the optimiza-
                   10118:        tion or debugging options that were set for it. The rule of thumb value
1.1.1.4 ! misho    10119:        of  500  bytes  mentioned  above  may be larger or smaller than what is
1.1.1.2   misho    10120:        actually needed. A better approximation can be obtained by running this
                   10121:        command:
                   10122: 
                   10123:          pcretest -m -C
                   10124: 
1.1.1.4 ! misho    10125:        The  -C  option causes pcretest to output information about the options
1.1.1.2   misho    10126:        with which PCRE was compiled. When -m is also given (before -C), infor-
                   10127:        mation about stack use is given in a line like this:
                   10128: 
                   10129:          Match recursion uses stack: approximate frame size = 640 bytes
                   10130: 
                   10131:        The value is approximate because some recursions need a bit more (up to
                   10132:        perhaps 16 more bytes).
                   10133: 
1.1.1.4 ! misho    10134:        If the above command is given when PCRE is compiled  to  use  the  heap
        !          10135:        instead  of  the  stack  for recursion, the value that is output is the
1.1.1.2   misho    10136:        size of each block that is obtained from the heap.
                   10137: 
1.1       misho    10138:    Changing stack size in Unix-like systems
                   10139: 
1.1.1.4 ! misho    10140:        In Unix-like environments, there is not often a problem with the  stack
        !          10141:        unless  very  long  strings  are  involved, though the default limit on
        !          10142:        stack size varies from system to system. Values from 8MB  to  64MB  are
1.1       misho    10143:        common. You can find your default limit by running the command:
                   10144: 
                   10145:          ulimit -s
                   10146: 
1.1.1.4 ! misho    10147:        Unfortunately,  the  effect  of  running out of stack is often SIGSEGV,
        !          10148:        though sometimes a more explicit error message is given. You  can  nor-
1.1       misho    10149:        mally increase the limit on stack size by code such as this:
                   10150: 
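                                   #include <sys/resource.h>

                   10151:          struct rlimit rlim;
                   10152:          getrlimit(RLIMIT_STACK, &rlim);   /* read soft and hard limits */
                   10153:          rlim.rlim_cur = 100*1024*1024;    /* ask for a 100MB soft limit */
                   10154:          setrlimit(RLIMIT_STACK, &rlim);   /* fails if above the hard limit */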
                   10155: 
1.1.1.4 ! misho    10156:        This  reads  the current limits (soft and hard) using getrlimit(), then
        !          10157:        attempts to increase the soft limit to  100MB  using  setrlimit().  You
        !          10158:        must do this before calling pcre[16|32]_exec().
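
                                 A slightly more defensive variant of the same idea is  sketched  below.
                                 The function name raise_stack_soft_limit() is hypothetical. It checks
                                 the return values, never lowers an existing limit, and clamps  the  re-
                                 quest to the hard limit so that setrlimit() is not asked for more than
                                 an unprivileged process may have.

```c
#include <assert.h>
#include <sys/resource.h>

/* Raise the soft stack limit towards wanted_bytes. The request is clamped
   to the hard limit, and an existing larger limit is left alone.
   Returns 0 on success (even if clamped), -1 if a system call fails. */
int raise_stack_soft_limit(rlim_t wanted_bytes)
{
    struct rlimit rlim;

    assert(wanted_bytes > 0);
    if (getrlimit(RLIMIT_STACK, &rlim) != 0)
        return -1;

    /* Nothing to do if the soft limit is already large enough. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted_bytes)
        return 0;

    /* The soft limit may not exceed the hard limit. */
    if (rlim.rlim_max != RLIM_INFINITY && wanted_bytes > rlim.rlim_max)
        wanted_bytes = rlim.rlim_max;

    rlim.rlim_cur = wanted_bytes;
    return setrlimit(RLIMIT_STACK, &rlim) == 0 ? 0 : -1;
}
```

                                 A program would call, for example,  raise_stack_soft_limit((rlim_t)100
                                 *  1024 * 1024) early in main(), and then read the limit back with get-
                                 rlimit() if it needs to know whether the request was clamped.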
1.1       misho    10159: 
                   10160:    Changing stack size in Mac OS X
                   10161: 
                   10162:        Using setrlimit(), as described above, should also work on Mac OS X. It
                   10163:        is also possible to set a stack size when linking a program. There is a
1.1.1.4 ! misho    10164:        discussion   about   stack  sizes  in  Mac  OS  X  at  this  web  site:
1.1       misho    10165:        http://developer.apple.com/qa/qa2005/qa1419.html.
                   10166: 
                   10167: 
                   10168: AUTHOR
                   10169: 
                   10170:        Philip Hazel
                   10171:        University Computing Service
                   10172:        Cambridge CB2 3QH, England.
                   10173: 
                   10174: 
                   10175: REVISION
                   10176: 
1.1.1.4 ! misho    10177:        Last updated: 24 June 2012
1.1.1.2   misho    10178:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    10179: ------------------------------------------------------------------------------
                   10180: 
                   10181: