Annotation of embedaddon/pcre/doc/pcre.txt, revision 1.1.1.3

1.1       misho       1: -----------------------------------------------------------------------------
                      2: This file contains a concatenation of the PCRE man pages, converted to plain
                      3: text format for ease of searching with a text editor, or for use on systems
                      4: that do not have a man page processor. The small individual files that give
                      5: synopses of each function in the library have not been included. Neither has
                      6: the pcredemo program. There are separate text files for the pcregrep and
                      7: pcretest commands.
                      8: -----------------------------------------------------------------------------
                      9: 
                     10: 
                     11: PCRE(3)                                                                PCRE(3)
                     12: 
                     13: 
                     14: NAME
                     15:        PCRE - Perl-compatible regular expressions
                     16: 
                     17: 
                     18: INTRODUCTION
                     19: 
                     20:        The  PCRE  library is a set of functions that implement regular expres-
                     21:        sion pattern matching using the same syntax and semantics as Perl, with
                     22:        just  a few differences. Some features that appeared in Python and PCRE
                     23:        before they appeared in Perl are also available using the  Python  syn-
                     24:        tax,  there  is  some  support for one or two .NET and Oniguruma syntax
                     25:        items, and there is an option for requesting some  minor  changes  that
                     26:        give better JavaScript compatibility.
                     27: 
1.1.1.2   misho      28:        Starting with release 8.30, it is possible to compile two separate PCRE
                     29:        libraries:  the  original,  which  supports  8-bit  character   strings
                     30:        (including  UTF-8  strings),  and a second library that supports 16-bit
                     31:        character strings (including UTF-16 strings). The build process  allows
                     32:        either  one  or both to be built. The majority of the work to make this
                     33:        possible was done by Zoltan Herczeg.
                     34: 
                     35:        The two libraries contain identical sets of functions, except that  the
                     36:        names  in  the  16-bit  library start with pcre16_ instead of pcre_. To
                     37:        avoid over-complication and reduce the documentation maintenance  load,
                     38:        most of the documentation describes the 8-bit library, with the differ-
                     39:        ences for the 16-bit library described separately in the  pcre16  page.
                     40:        References  to  functions or structures of the form pcre[16]_xxx should
                     41:        be  read  as  meaning  "pcre_xxx  when  using  the  8-bit  library  and
                     42:        pcre16_xxx when using the 16-bit library".
                     43: 
1.1       misho      44:        The  current implementation of PCRE corresponds approximately with Perl
1.1.1.2   misho      45:        5.12, including support for UTF-8/16 encoded strings and  Unicode  gen-
                     46:        eral  category properties. However, UTF-8/16 and Unicode support has to
                     47:        be explicitly enabled; it is not the default. The Unicode tables corre-
1.1       misho      48:        spond to Unicode release 6.0.0.
                     49: 
                     50:        In  addition to the Perl-compatible matching function, PCRE contains an
                     51:        alternative function that matches the same compiled patterns in a  dif-
                     52:        ferent way. In certain circumstances, the alternative function has some
                     53:        advantages.  For a discussion of the two matching algorithms,  see  the
                     54:        pcrematching page.
                     55: 
                     56:        PCRE  is  written  in C and released as a C library. A number of people
                     57:        have written wrappers and interfaces of various kinds.  In  particular,
1.1.1.2   misho      58:        Google  Inc.   have  provided a comprehensive C++ wrapper for the 8-bit
                     59:        library. This is now included as part of  the  PCRE  distribution.  The
                     60:        pcrecpp  page  has  details of this interface. Other people's contribu-
                     61:        tions can be found in the Contrib directory at the  primary  FTP  site,
                     62:        which is:
1.1       misho      63: 
                     64:        ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
                     65: 
1.1.1.2   misho      66:        Details  of  exactly which Perl regular expression features are and are
1.1       misho      67:        not supported by PCRE are given in separate documents. See the pcrepat-
1.1.1.2   misho      68:        tern  and pcrecompat pages. There is a syntax summary in the pcresyntax
1.1       misho      69:        page.
                     70: 
1.1.1.2   misho      71:        Some features of PCRE can be included, excluded, or  changed  when  the
                     72:        library  is  built.  The pcre_config() function makes it possible for a
                     73:        client to discover which features are  available.  The  features  them-
                     74:        selves  are described in the pcrebuild page. Documentation about build-
                     75:        ing PCRE for various operating systems can be found in the  README  and
1.1       misho      76:        NON-UNIX-USE files in the source distribution.
                     77: 
1.1.1.2   misho      78:        The  libraries  contain a number of undocumented internal functions and
                     79:        data tables that are used by more than one  of  the  exported  external
                     80:        functions,  but  which  are  not  intended for use by external callers.
                     81:        Their names all begin with "_pcre_" or "_pcre16_", which hopefully will
                     82:        not  provoke  any name clashes. In some environments, it is possible to
                     83:        control which external symbols are exported when a  shared  library  is
                     84:        built, and in these cases the undocumented symbols are not exported.
1.1       misho      85: 
                     86: 
                     87: USER DOCUMENTATION
                     88: 
1.1.1.2   misho      89:        The  user  documentation  for PCRE comprises a number of different sec-
                     90:        tions. In the "man" format, each of these is a separate "man page".  In
                     91:        the  HTML  format, each is a separate page, linked from the index page.
                     92:        In the plain text format, all the sections, except  the  pcredemo  sec-
1.1       misho      93:        tion, are concatenated, for ease of searching. The sections are as fol-
                     94:        lows:
                     95: 
                     96:          pcre              this document
1.1.1.2   misho      97:          pcre16            details of the 16-bit library
1.1       misho      98:          pcre-config       show PCRE installation configuration information
                     99:          pcreapi           details of PCRE's native C API
                    100:          pcrebuild         options for building PCRE
                    101:          pcrecallout       details of the callout feature
                    102:          pcrecompat        discussion of Perl compatibility
1.1.1.2   misho     103:          pcrecpp           details of the C++ wrapper for the 8-bit library
1.1       misho     104:          pcredemo          a demonstration C program that uses PCRE
1.1.1.2   misho     105:          pcregrep          description of the pcregrep command (8-bit only)
1.1       misho     106:          pcrejit           discussion of the just-in-time optimization support
                    107:          pcrelimits        details of size and other limits
                    108:          pcrematching      discussion of the two matching algorithms
                    109:          pcrepartial       details of the partial matching facility
                    110:          pcrepattern       syntax and semantics of supported
                    111:                              regular expressions
                    112:          pcreperform       discussion of performance issues
1.1.1.2   misho     113:          pcreposix         the POSIX-compatible C API for the 8-bit library
1.1       misho     114:          pcreprecompile    details of saving and re-using precompiled patterns
                    115:          pcresample        discussion of the pcredemo program
                    116:          pcrestack         discussion of stack usage
                    117:          pcresyntax        quick syntax reference
                    118:          pcretest          description of the pcretest testing command
1.1.1.2   misho     119:          pcreunicode       discussion of Unicode and UTF-8/16 support
1.1       misho     120: 
1.1.1.2   misho     121:        In addition, in the "man" and HTML formats, there is a short  page  for
                    122:        each 8-bit C library function, listing its arguments and results.
1.1       misho     123: 
                    124: 
                    125: AUTHOR
                    126: 
                    127:        Philip Hazel
                    128:        University Computing Service
                    129:        Cambridge CB2 3QH, England.
                    130: 
1.1.1.2   misho     131:        Putting  an actual email address here seems to have been a spam magnet,
                    132:        so I've taken it away. If you want to email me, use  my  two  initials,
1.1       misho     133:        followed by the two digits 10, at the domain cam.ac.uk.
                    134: 
                    135: 
                    136: REVISION
                    137: 
1.1.1.2   misho     138:        Last updated: 10 January 2012
                    139:        Copyright (c) 1997-2012 University of Cambridge.
                    140: ------------------------------------------------------------------------------
                    141: 
                    142: 
                    143: PCRE(3)                                                                PCRE(3)
                    144: 
                    145: 
                    146: NAME
                    147:        PCRE - Perl-compatible regular expressions
                    148: 
                    149:        #include <pcre.h>
                    150: 
                    151: 
                    152: PCRE 16-BIT API BASIC FUNCTIONS
                    153: 
                    154:        pcre16 *pcre16_compile(PCRE_SPTR16 pattern, int options,
                    155:             const char **errptr, int *erroffset,
                    156:             const unsigned char *tableptr);
                    157: 
                    158:        pcre16 *pcre16_compile2(PCRE_SPTR16 pattern, int options,
                    159:             int *errorcodeptr,
                    160:             const char **errptr, int *erroffset,
                    161:             const unsigned char *tableptr);
                    162: 
                    163:        pcre16_extra *pcre16_study(const pcre16 *code, int options,
                    164:             const char **errptr);
                    165: 
                    166:        void pcre16_free_study(pcre16_extra *extra);
                    167: 
                    168:        int pcre16_exec(const pcre16 *code, const pcre16_extra *extra,
                    169:             PCRE_SPTR16 subject, int length, int startoffset,
                    170:             int options, int *ovector, int ovecsize);
                    171: 
                    172:        int pcre16_dfa_exec(const pcre16 *code, const pcre16_extra *extra,
                    173:             PCRE_SPTR16 subject, int length, int startoffset,
                    174:             int options, int *ovector, int ovecsize,
                    175:             int *workspace, int wscount);
                    176: 
                    177: 
                    178: PCRE 16-BIT API STRING EXTRACTION FUNCTIONS
                    179: 
                    180:        int pcre16_copy_named_substring(const pcre16 *code,
                    181:             PCRE_SPTR16 subject, int *ovector,
                    182:             int stringcount, PCRE_SPTR16 stringname,
                    183:             PCRE_UCHAR16 *buffer, int buffersize);
                    184: 
                    185:        int pcre16_copy_substring(PCRE_SPTR16 subject, int *ovector,
                    186:             int stringcount, int stringnumber, PCRE_UCHAR16 *buffer,
                    187:             int buffersize);
                    188: 
                    189:        int pcre16_get_named_substring(const pcre16 *code,
                    190:             PCRE_SPTR16 subject, int *ovector,
                    191:             int stringcount, PCRE_SPTR16 stringname,
                    192:             PCRE_SPTR16 *stringptr);
                    193: 
                    194:        int pcre16_get_stringnumber(const pcre16 *code,
                    195:             PCRE_SPTR16 name);
                    196: 
                    197:        int pcre16_get_stringtable_entries(const pcre16 *code,
                    198:             PCRE_SPTR16 name, PCRE_UCHAR16 **first, PCRE_UCHAR16 **last);
                    199: 
                    200:        int pcre16_get_substring(PCRE_SPTR16 subject, int *ovector,
                    201:             int stringcount, int stringnumber,
                    202:             PCRE_SPTR16 *stringptr);
                    203: 
                    204:        int pcre16_get_substring_list(PCRE_SPTR16 subject,
                    205:             int *ovector, int stringcount, PCRE_SPTR16 **listptr);
                    206: 
                    207:        void pcre16_free_substring(PCRE_SPTR16 stringptr);
                    208: 
                    209:        void pcre16_free_substring_list(PCRE_SPTR16 *stringptr);
                    210: 
                    211: 
                    212: PCRE 16-BIT API AUXILIARY FUNCTIONS
                    213: 
                    214:        pcre16_jit_stack *pcre16_jit_stack_alloc(int startsize, int maxsize);
                    215: 
                    216:        void pcre16_jit_stack_free(pcre16_jit_stack *stack);
                    217: 
                    218:        void pcre16_assign_jit_stack(pcre16_extra *extra,
                    219:             pcre16_jit_callback callback, void *data);
                    220: 
                    221:        const unsigned char *pcre16_maketables(void);
                    222: 
                    223:        int pcre16_fullinfo(const pcre16 *code, const pcre16_extra *extra,
                    224:             int what, void *where);
                    225: 
                    226:        int pcre16_refcount(pcre16 *code, int adjust);
                    227: 
                    228:        int pcre16_config(int what, void *where);
                    229: 
                    230:        const char *pcre16_version(void);
                    231: 
                    232:        int pcre16_pattern_to_host_byte_order(pcre16 *code,
                    233:             pcre16_extra *extra, const unsigned char *tables);
                    234: 
                    235: 
                    236: PCRE 16-BIT API INDIRECTED FUNCTIONS
                    237: 
                    238:        void *(*pcre16_malloc)(size_t);
                    239: 
                    240:        void (*pcre16_free)(void *);
                    241: 
                    242:        void *(*pcre16_stack_malloc)(size_t);
                    243: 
                    244:        void (*pcre16_stack_free)(void *);
                    245: 
                    246:        int (*pcre16_callout)(pcre16_callout_block *);
                    247: 
                    248: 
                    249: PCRE 16-BIT API 16-BIT-ONLY FUNCTION
                    250: 
                    251:        int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output,
                    252:             PCRE_SPTR16 input, int length, int *byte_order,
                    253:             int keep_boms);
                    254: 
                    255: 
                    256: THE PCRE 16-BIT LIBRARY
                    257: 
                    258:        Starting  with  release  8.30, it is possible to compile a PCRE library
                    259:        that supports 16-bit character strings, including  UTF-16  strings,  as
                    260:        well  as  or instead of the original 8-bit library. The majority of the
                    261:        work to make  this  possible  was  done  by  Zoltan  Herczeg.  The  two
                    262:        libraries contain identical sets of functions, used in exactly the same
                    263:        way. Only the names of the functions and the data types of their  argu-
                    264:        ments  and results are different. To avoid over-complication and reduce
                    265:        the documentation maintenance load,  most  of  the  PCRE  documentation
                    266:        describes  the  8-bit  library,  with only occasional references to the
                    267:        16-bit library. This page describes what is different when you use  the
                    268:        16-bit library.
                    269: 
                    270:        WARNING:  A  single  application can be linked with both libraries, but
                    271:        you must take care when processing any particular pattern to use  func-
                    272:        tions  from  just one library. For example, if you want to study a pat-
                    273:        tern that was compiled with  pcre16_compile(),  you  must  do  so  with
                    274:        pcre16_study(), not pcre_study(), and you must free the study data with
                    275:        pcre16_free_study().
                    276: 
                    277: 
                    278: THE HEADER FILE
                    279: 
                    280:        There is only one header file, pcre.h. It contains prototypes  for  all
                    281:        the  functions  in  both  libraries,  as  well as definitions of flags,
                    282:        structures, error codes, etc.
                    283: 
                    284: 
                    285: THE LIBRARY NAME
                    286: 
                    287:        In Unix-like systems, the 16-bit library is called libpcre16,  and  can
                    288:        normally  be  accessed by adding -lpcre16 to the command for linking an
                    289:        application that uses PCRE.
                    290: 
                    291: 
                    292: STRING TYPES
                    293: 
                    294:        In the 8-bit library, strings are passed to PCRE library  functions  as
                    295:        vectors  of  bytes  with  the  C  type "char *". In the 16-bit library,
                    296:        strings are passed as vectors of unsigned 16-bit quantities. The  macro
                    297:        PCRE_UCHAR16  specifies  an  appropriate  data type, and PCRE_SPTR16 is
                    298:        defined as "const PCRE_UCHAR16 *". In very  many  environments,  "short
                    299:        int" is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16
                    300:        as "short int", but checks that it really is a 16-bit data type. If  it
                    301:        is not, the build fails with an error message telling the maintainer to
                    302:        modify the definition appropriately.
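
The string-type rule above can be sketched without pcre.h. This is a minimal
illustration, not PCRE's own code: the names sketch_uchar16 and
sketch_ascii_to_16 are hypothetical, and "unsigned short" is an assumed
stand-in for whatever PCRE_UCHAR16 expands to on a given build. It shows a
compile-time check that the type is 16 bits wide, and one way to widen an
ASCII string into the kind of 16-bit vector the 16-bit functions expect.

```c
#include <limits.h>
#include <stddef.h>

/* Assumption: a stand-in for PCRE_UCHAR16 so the sketch builds without pcre.h. */
typedef unsigned short sketch_uchar16;

/* Compile-time check that the type really is 16 bits wide: the array size
   becomes -1, and compilation fails, if it is not. */
typedef char sketch_uchar16_is_16_bits[
    (sizeof(sketch_uchar16) * CHAR_BIT == 16) ? 1 : -1];

/* Widen a zero-terminated ASCII string into a zero-terminated vector of
   16-bit units. Returns the number of units written, excluding the
   terminator, or -1 if the output buffer is too small. */
int sketch_ascii_to_16(const char *in, sketch_uchar16 *out, size_t outsize)
{
    size_t i;
    for (i = 0; in[i] != '\0'; i++) {
        if (i + 1 >= outsize) return -1;   /* no room for unit + terminator */
        out[i] = (sketch_uchar16)(unsigned char)in[i];
    }
    if (outsize == 0) return -1;
    out[i] = 0;
    return (int)i;
}
```

A buffer filled this way could then be passed wherever a PCRE_SPTR16 is
expected, provided the real PCRE_UCHAR16 has the same width and signedness as
the stand-in type.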
                    303: 
                    304: 
                    305: STRUCTURE TYPES
                    306: 
                    307:        The types of the opaque structures that are used  for  compiled  16-bit
                    308:        patterns  and  JIT stacks are pcre16 and pcre16_jit_stack respectively.
                    309:        The  type  of  the  user-accessible  structure  that  is  returned   by
                    310:        pcre16_study()  is  pcre16_extra, and the type of the structure that is
                    311:        used for passing data to a callout  function  is  pcre16_callout_block.
                    312:        These structures contain the same fields, with the same names, as their
                    313:        8-bit counterparts. The only difference is that pointers  to  character
                    314:        strings are 16-bit instead of 8-bit types.
                    315: 
                    316: 
                    317: 16-BIT FUNCTIONS
                    318: 
                    319:        For  every function in the 8-bit library there is a corresponding func-
                    320:        tion in the 16-bit library with a name that starts with pcre16_ instead
                    321:        of  pcre_.  The  prototypes are listed above. In addition, there is one
                    322:        extra function, pcre16_utf16_to_host_byte_order(). This  is  a  utility
                    323:        function  that converts a UTF-16 character string to host byte order if
                    324:        necessary. The other 16-bit  functions  expect  the  strings  they  are
                    325:        passed to be in host byte order.
                    326: 
                    327:        The input and output arguments of pcre16_utf16_to_host_byte_order() may
                    328:        point to the same address, that is, conversion in place  is  supported.
                    329:        The output buffer must be at least as long as the input.
                    330: 
                    331:        The  length  argument  specifies the number of 16-bit data units in the
                    332:        input string; a negative value specifies a zero-terminated string.
                    333: 
                    334:        If byte_order is NULL, it is assumed that the string starts off in host
                    335:        byte  order. This may be changed by byte-order marks (BOMs) anywhere in
                    336:        the string (commonly as the first character).
                    337: 
                    338:        If byte_order is not NULL, a non-zero value of the integer to which  it
                    339:        points  means  that  the input starts off in host byte order, otherwise
                    340:        the opposite order is assumed. Again, BOMs in  the  string  can  change
                    341:        this. The final byte order is passed back at the end of processing.
                    342: 
                    343:        If  keep_boms  is  not  zero,  byte-order  mark characters (0xfeff) are
                    344:        copied into the output string. Otherwise they are discarded.
                    345: 
                    346:        The result of the function is the number of 16-bit  units  placed  into
                    347:        the  output  buffer,  including  the  zero terminator if the string was
                    348:        zero-terminated.
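
The behaviour described above can be modelled in a few lines. The function
below is a hypothetical re-statement of the documented rules (the name
sketch_utf16_to_host and the stand-in type are assumptions); it is not the
library's source and is meant only to make the argument semantics concrete.

```c
#include <stddef.h>

typedef unsigned short sketch_uchar16;  /* assumed to be 16 bits wide */

/* Model of the documented conversion rules. Returns the number of 16-bit
   units placed in the output, including the terminator when length < 0. */
int sketch_utf16_to_host(sketch_uchar16 *output, const sketch_uchar16 *input,
                         int length, int *byte_order, int keep_boms)
{
    /* NULL byte_order: assume the string starts in host byte order. */
    int host_order = (byte_order != NULL) ? (*byte_order != 0) : 1;
    int i, count = 0;

    if (length < 0) {
        /* Zero-terminated input: count the units, terminator included.
           Zero is zero in either byte order, so this works before swapping. */
        length = 0;
        while (input[length] != 0) length++;
        length++;
    }

    for (i = 0; i < length; i++) {
        unsigned int c = input[i] & 0xffffu;
        if (c == 0xfeffu || c == 0xfffeu) {
            /* A BOM that reads as 0xfeff is in host order; 0xfffe is swapped. */
            host_order = (c == 0xfeffu);
            if (keep_boms != 0) output[count++] = 0xfeffu;
        } else {
            output[count++] = (sketch_uchar16)
                (host_order ? c : (((c >> 8) | (c << 8)) & 0xffffu));
        }
    }

    if (byte_order != NULL) *byte_order = host_order;  /* final byte order */
    return count;
}
```

Because each unit is read before the corresponding output slot is written, and
the output index never runs ahead of the input index, the sketch also works
when output and input point to the same buffer.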
                    349: 
                    350: 
                    351: SUBJECT STRING OFFSETS
                    352: 
                    353:        The offsets within subject strings that are returned  by  the  matching
                    354:        functions are in 16-bit units rather than bytes.
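
As a small illustration of what "16-bit units rather than bytes" means in
practice, the hypothetical helper below converts one pair of offsets to a byte
offset and byte length. It assumes the usual layout in which the start and end
offsets of substring n are stored at ovector[2*n] and ovector[2*n+1], that the
pair is set (not -1), and that the stand-in type is two bytes wide.

```c
#include <stddef.h>

typedef unsigned short sketch_uchar16;  /* assumed to be 16 bits wide */

/* Convert the offsets of one captured substring from 16-bit units to bytes.
   'pair' is the substring number; the pair must be set (offsets >= 0). */
void sketch_span_in_bytes(const int *ovector, int pair,
                          size_t *byte_offset, size_t *byte_length)
{
    int start = ovector[2 * pair];
    int end = ovector[2 * pair + 1];
    *byte_offset = (size_t)start * sizeof(sketch_uchar16);
    *byte_length = (size_t)(end - start) * sizeof(sketch_uchar16);
}
```

The byte values are what a call such as memcpy() needs; indexing the subject
as an array of 16-bit units uses the offsets unchanged.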
                    355: 
                    356: 
                    357: NAMED SUBPATTERNS
                    358: 
                    359:        The  name-to-number translation table that is maintained for named sub-
                    360:        patterns uses 16-bit characters.  The  pcre16_get_stringtable_entries()
                    361:        function returns the length of each entry in the table as the number of
                    362:        16-bit data units.
                    363: 
                    364: 
                    365: OPTION NAMES
                    366: 
                    367:        There   are   two   new   general   option   names,   PCRE_UTF16    and
                    368:        PCRE_NO_UTF16_CHECK,     which     correspond    to    PCRE_UTF8    and
                    369:        PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options
1.1.1.3 ! misho     370:        define  the  same bits in the options word. There is a discussion about
        !           371:        the validity of UTF-16 strings in the pcreunicode page.
1.1.1.2   misho     372: 
1.1.1.3 ! misho     373:        For the pcre16_config() function there is an  option  PCRE_CONFIG_UTF16
        !           374:        that  returns  1  if UTF-16 support is configured, otherwise 0. If this
        !           375:        option is given to pcre_config(), or if the PCRE_CONFIG_UTF8 option  is
1.1.1.2   misho     376:        given to pcre16_config(), the result is the PCRE_ERROR_BADOPTION error.
                    377: 
                    378: 
                    379: CHARACTER CODES
                    380: 
1.1.1.3 ! misho     381:        In  16-bit  mode,  when  PCRE_UTF16  is  not  set, character values are
1.1.1.2   misho     382:        treated in the same way as in 8-bit, non-UTF-8 mode, except, of course,
1.1.1.3 ! misho     383:        that  they  can  range from 0 to 0xffff instead of 0 to 0xff. Character
        !           384:        types for characters less than 0xff can therefore be influenced by  the
        !           385:        locale  in  the  same way as before.  Characters greater than 0xff have
1.1.1.2   misho     386:        only one case, and no "type" (such as letter or digit).
                    387: 
1.1.1.3 ! misho     388:        In UTF-16 mode, the character code  is  Unicode,  in  the  range  0  to
        !           389:        0x10ffff,  with  the  exception of values in the range 0xd800 to 0xdfff
        !           390:        because those are "surrogate" values that are used in pairs  to  encode
1.1.1.2   misho     391:        values greater than 0xffff.
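
The surrogate mechanism mentioned above follows the standard UTF-16 rules and
can be sketched as below. The function names are hypothetical and the code is
independent of PCRE; it only shows how a code point above 0xffff maps to a
high/low surrogate pair and back.

```c
/* Encode one Unicode code point as UTF-16. Returns the number of 16-bit
   units written (1 or 2), or 0 if the value cannot be encoded (a surrogate
   value or anything above 0x10ffff). */
int sketch_utf16_encode(unsigned long cp, unsigned short out[2])
{
    if (cp > 0x10fffful || (cp >= 0xd800ul && cp <= 0xdffful)) return 0;
    if (cp <= 0xfffful) {
        out[0] = (unsigned short)cp;
        return 1;
    }
    cp -= 0x10000ul;                                       /* 20 bits remain */
    out[0] = (unsigned short)(0xd800ul | (cp >> 10));      /* high surrogate */
    out[1] = (unsigned short)(0xdc00ul | (cp & 0x3fful));  /* low surrogate  */
    return 2;
}

/* Combine a high and a low surrogate into the code point they encode.
   The caller is assumed to have checked that both are in range. */
unsigned long sketch_utf16_decode_pair(unsigned short high, unsigned short low)
{
    return 0x10000ul + ((((unsigned long)high & 0x3fful) << 10)
                        | ((unsigned long)low & 0x3fful));
}
```

For example, U+1F600 is encoded as the pair 0xd83d, 0xde00, which is why a
single character can occupy two 16-bit units in a subject string.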
                    392: 
1.1.1.3 ! misho     393:        A UTF-16 string can indicate its endianness by a special code known as a
1.1.1.2   misho     394:        byte-order mark (BOM). The PCRE functions do not handle this, expecting
1.1.1.3 ! misho     395:        strings   to   be  in  host  byte  order.  A  utility  function  called
        !           396:        pcre16_utf16_to_host_byte_order() is provided to help  with  this  (see
1.1.1.2   misho     397:        above).
                    398: 
                    399: 
                    400: ERROR NAMES
                    401: 
1.1.1.3 ! misho     402:        The  errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 corre-
        !           403:        spond to their 8-bit  counterparts.  The  error  PCRE_ERROR_BADMODE  is
        !           404:        given  when  a  compiled pattern is passed to a function that processes
        !           405:        patterns in the other mode, for example, if  a  pattern  compiled  with
1.1.1.2   misho     406:        pcre_compile() is passed to pcre16_exec().
                    407: 
1.1.1.3 ! misho     408:        There  are  new  error  codes whose names begin with PCRE_UTF16_ERR for
        !           409:        invalid UTF-16 strings, corresponding to the  PCRE_UTF8_ERR  codes  for
        !           410:        UTF-8  strings that are described in the section entitled "Reason codes
        !           411:        for invalid UTF-8 strings" in the main pcreapi page. The UTF-16  errors
1.1.1.2   misho     412:        are:
                    413: 
                    414:          PCRE_UTF16_ERR1  Missing low surrogate at end of string
                    415:          PCRE_UTF16_ERR2  Invalid low surrogate follows high surrogate
                    416:          PCRE_UTF16_ERR3  Isolated low surrogate
                    417:          PCRE_UTF16_ERR4  Invalid character 0xfffe
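
To make the four conditions concrete, here is a minimal validity checker that
reports the first problem it finds. It is an illustrative sketch only: the
function name and the SKETCH_ constants are hypothetical, the numbering simply
mirrors the list above, and the code is not taken from PCRE.

```c
#include <stddef.h>

enum {
    SKETCH_UTF16_OK   = 0,
    SKETCH_UTF16_ERR1 = 1,  /* missing low surrogate at end of string */
    SKETCH_UTF16_ERR2 = 2,  /* invalid low surrogate follows high surrogate */
    SKETCH_UTF16_ERR3 = 3,  /* isolated low surrogate */
    SKETCH_UTF16_ERR4 = 4   /* invalid character 0xfffe */
};

/* Scan n 16-bit units (host byte order) and return the first error found,
   or SKETCH_UTF16_OK if the string is valid under the rules listed above. */
int sketch_check_utf16(const unsigned short *s, size_t n)
{
    size_t i = 0;
    while (i < n) {
        unsigned int c = s[i] & 0xffffu;
        if (c >= 0xd800u && c <= 0xdbffu) {            /* high surrogate */
            unsigned int next;
            if (i + 1 >= n) return SKETCH_UTF16_ERR1;
            next = s[i + 1] & 0xffffu;
            if (next < 0xdc00u || next > 0xdfffu) return SKETCH_UTF16_ERR2;
            i += 2;                                    /* consume the pair */
        } else if (c >= 0xdc00u && c <= 0xdfffu) {     /* low with no high */
            return SKETCH_UTF16_ERR3;
        } else if (c == 0xfffeu) {
            return SKETCH_UTF16_ERR4;
        } else {
            i++;
        }
    }
    return SKETCH_UTF16_OK;
}
```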
                    418: 
                    419: 
                    420: ERROR TEXTS
                    421: 
1.1.1.3 ! misho     422:        If  there is an error while compiling a pattern, the error text that is
        !           423:        passed back by pcre16_compile() or pcre16_compile2() is still an  8-bit
1.1.1.2   misho     424:        character string, zero-terminated.
                    425: 
                    426: 
                    427: CALLOUTS
                    428: 
1.1.1.3 ! misho     429:        The  subject  and  mark fields in the callout block that is passed to a
1.1.1.2   misho     430:        callout function point to 16-bit vectors.
                    431: 
                    432: 
                    433: TESTING
                    434: 
1.1.1.3 ! misho     435:        The pcretest program continues to operate with 8-bit input  and  output
        !           436:        files,  but it can be used for testing the 16-bit library. If it is run
1.1.1.2   misho     437:        with the command line option -16, patterns and subject strings are con-
                    438:        verted from 8-bit to 16-bit before being passed to PCRE, and the 16-bit
1.1.1.3 ! misho     439:        library functions are used instead of the 8-bit ones.  Returned  16-bit
1.1.1.2   misho     440:        strings are converted to 8-bit for output. If the 8-bit library was not
                    441:        compiled, pcretest defaults to 16-bit and the -16 option is ignored.
                    442: 
1.1.1.3 ! misho     443:        When PCRE is being built, the RunTest script that is  called  by  "make
        !           444:        check"  uses  the pcretest -C option to discover which of the 8-bit and
1.1.1.2   misho     445:        16-bit libraries has been built, and runs the tests appropriately.
                    446: 
                    447: 
                    448: NOT SUPPORTED IN 16-BIT MODE
                    449: 
                    450:        Not all the features of the 8-bit library are available with the 16-bit
1.1.1.3 ! misho     451:        library.  The  C++  and  POSIX wrapper functions support only the 8-bit
1.1.1.2   misho     452:        library, and the pcregrep program is at present 8-bit only.
                    453: 
                    454: 
                    455: AUTHOR
                    456: 
                    457:        Philip Hazel
                    458:        University Computing Service
                    459:        Cambridge CB2 3QH, England.
                    460: 
                    461: 
                    462: REVISION
                    463: 
1.1.1.3 ! misho     464:        Last updated: 14 April 2012
1.1.1.2   misho     465:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho     466: ------------------------------------------------------------------------------
                    467: 
                    468: 
                    469: PCREBUILD(3)                                                      PCREBUILD(3)
                    470: 
                    471: 
                    472: NAME
                    473:        PCRE - Perl-compatible regular expressions
                    474: 
                    475: 
                    476: PCRE BUILD-TIME OPTIONS
                    477: 
                    478:        This  document  describes  the  optional  features  of PCRE that can be
                    479:        selected when the library is compiled. It assumes use of the  configure
                    480:        script,  where the optional features are selected or deselected by pro-
                    481:        viding options to configure before running the make  command.  However,
                    482:        the  same  options  can be selected in both Unix-like and non-Unix-like
                    483:        environments using the GUI facility of cmake-gui if you are using CMake
                    484:        instead of configure to build PCRE.
                    485: 
                    486:        There  is  a  lot more information about building PCRE in non-Unix-like
                    487:        environments in the file called NON_UNIX_USE, which is part of the PCRE
                    488:        distribution.  You  should consult this file as well as the README file
                    489:        if you are building in a non-Unix-like environment.
                    490: 
                    491:        The complete list of options for configure (which includes the standard
                    492:        ones  such  as  the  selection  of  the  installation directory) can be
                    493:        obtained by running
                    494: 
                    495:          ./configure --help
                    496: 
                    497:        The following sections include  descriptions  of  options  whose  names
                    498:        begin with --enable or --disable. These settings specify changes to the
                    499:        defaults for the configure command. Because of the way  that  configure
                    500:        works,  --enable  and --disable always come in pairs, so the complemen-
                    501:        tary option always exists as well, but as it specifies the default,  it
                    502:        is not described.
                    503: 
                    504: 
1.1.1.2   misho     505: BUILDING 8-BIT AND 16-BIT LIBRARIES
                    506: 
                    507:        By  default,  a  library  called libpcre is built, containing functions
                    508:        that take string arguments contained in vectors  of  bytes,  either  as
                    509:        single-byte  characters,  or interpreted as UTF-8 strings. You can also
                    510:        build a separate library, called libpcre16, in which strings  are  con-
                    511:        tained  in  vectors of 16-bit data units and interpreted either as sin-
                    512:        gle-unit characters or UTF-16 strings, by adding
                    513: 
                    514:          --enable-pcre16
                    515: 
                    516:        to the configure command. If you do not want the 8-bit library, add
                    517: 
                    518:          --disable-pcre8
                    519: 
                    520:        as well. At least one of the two libraries must be built. Note that the
                    521:        C++  and  POSIX wrappers are for the 8-bit library only, and that pcre-
                    522:        grep is an 8-bit program. None of these are built if  you  select  only
                    523:        the 16-bit library.
                    524: 
                    525: 
1.1       misho     526: BUILDING SHARED AND STATIC LIBRARIES
                    527: 
                    528:        The  PCRE building process uses libtool to build both shared and static
                    529:        Unix libraries by default. You can suppress one of these by adding  one
                    530:        of
                    531: 
                    532:          --disable-shared
                    533:          --disable-static
                    534: 
                    535:        to the configure command, as required.
                    536: 
                    537: 
                    538: C++ SUPPORT
                    539: 
1.1.1.2   misho     540:        By  default,  if the 8-bit library is being built, the configure script
                    541:        will search for a C++ compiler and C++ header files. If it finds  them,
                    542:        it  automatically  builds  the C++ wrapper library (which supports only
                    543:        8-bit strings). You can disable this by adding
1.1       misho     544: 
                    545:          --disable-cpp
                    546: 
                    547:        to the configure command.
                    548: 
                    549: 
1.1.1.2   misho     550: UTF-8 AND UTF-16 SUPPORT
1.1       misho     551: 
1.1.1.2   misho     552:        To build PCRE with support for UTF Unicode character strings, add
1.1       misho     553: 
1.1.1.2   misho     554:          --enable-utf
1.1       misho     555: 
1.1.1.2   misho     556:        to the configure command.  This  setting  applies  to  both  libraries,
                    557:        adding support for UTF-8 to the 8-bit library and support for UTF-16 to
                    558:        the 16-bit library. There are no separate options  for  enabling  UTF-8
                    559:        and  UTF-16  independently because that would allow ridiculous settings
                    560:        such as  requesting  UTF-16  support  while  building  only  the  8-bit
                    561:        library.  It  is not possible to build one library with UTF support and
                    562:        the other without in the same configuration. (For backwards compatibil-
                    563:        ity, --enable-utf8 is a synonym of --enable-utf.)
                    564: 
                    565:        Of  itself,  this  setting does not make PCRE treat strings as UTF-8 or
                    566:        UTF-16. As well as compiling PCRE with this option, you also have
                    567:        to set the PCRE_UTF8 or PCRE_UTF16 option when you call one of the pat-
                    568:        tern compiling functions.
1.1       misho     569: 
1.1.1.2   misho     570:        If you set --enable-utf when compiling in an EBCDIC  environment,  PCRE
1.1.1.3 ! misho     571:        expects  its  input  to be either ASCII or UTF-8 (depending on the run-
        !           572:        time option). It is not possible to support both EBCDIC and UTF-8 codes
        !           573:        in  the  same  version  of  the library. Consequently, --enable-utf and
1.1       misho     574:        --enable-ebcdic are mutually exclusive.
                    575: 
                    576: 
                    577: UNICODE CHARACTER PROPERTY SUPPORT
                    578: 
1.1.1.2   misho     579:        UTF support allows the libraries to process character codepoints up  to
                    580:        0x10ffff  in the strings that they handle. On its own, however, it does
                    581:        not provide any facilities for accessing the properties of such charac-
                    582:        ters. If you want to be able to use the pattern escapes \P, \p, and \X,
                    583:        which refer to Unicode character properties, you must add
1.1       misho     584: 
                    585:          --enable-unicode-properties
                    586: 
1.1.1.2   misho     587:        to the configure command. This implies UTF support, even  if  you  have
1.1       misho     588:        not explicitly requested it.
                    589: 
                    590:        Including  Unicode  property  support  adds around 30K of tables to the
                    591:        PCRE library. Only the general category properties such as  Lu  and  Nd
                    592:        are supported. Details are given in the pcrepattern documentation.
                    593: 
                    594: 
                    595: JUST-IN-TIME COMPILER SUPPORT
                    596: 
                    597:        Just-in-time compiler support is included in the build by specifying
                    598: 
                    599:          --enable-jit
                    600: 
                    601:        This  support  is available only for certain hardware architectures. If
                    602:        this option is set for an  unsupported  architecture,  a  compile  time
                    603:        error  occurs.   See  the pcrejit documentation for a discussion of JIT
                    604:        usage. When JIT support is enabled, pcregrep automatically makes use of
                    605:        it, unless you add
                    606: 
                    607:          --disable-pcregrep-jit
                    608: 
                    609:        to the "configure" command.
                    610: 
                    611: 
                    612: CODE VALUE OF NEWLINE
                    613: 
                    614:        By  default,  PCRE interprets the linefeed (LF) character as indicating
                    615:        the end of a line. This is the normal newline  character  on  Unix-like
                    616:        systems.  You  can compile PCRE to use carriage return (CR) instead, by
                    617:        adding
                    618: 
                    619:          --enable-newline-is-cr
                    620: 
                    621:        to the  configure  command.  There  is  also  a  --enable-newline-is-lf
                    622:        option, which explicitly specifies linefeed as the newline character.
                    623: 
                    624:        Alternatively, you can specify that line endings are to be indicated by
                    625:        the two character sequence CRLF. If you want this, add
                    626: 
                    627:          --enable-newline-is-crlf
                    628: 
                    629:        to the configure command. There is a fourth option, specified by
                    630: 
                    631:          --enable-newline-is-anycrlf
                    632: 
                    633:        which causes PCRE to recognize any of the three sequences  CR,  LF,  or
                    634:        CRLF as indicating a line ending. Finally, a fifth option, specified by
                    635: 
                    636:          --enable-newline-is-any
                    637: 
                    638:        causes PCRE to recognize any Unicode newline sequence.
                    639: 
                    640:        Whatever  line  ending convention is selected when PCRE is built can be
                    641:        overridden when the library functions are called. At build time  it  is
                    642:        conventional to use the standard for your operating system.
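
The first four conventions can be pictured with a small stand-alone C
sketch. It is not taken from PCRE: the type and function names are
invented for illustration, and the fifth ("any Unicode newline")
convention is left out to keep it short.

```c
#include <stddef.h>

/* Illustrative names only; PCRE's own internals are organised differently. */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF } nl_convention;

/* Returns the length (0, 1 or 2) of the line ending that starts at p under
   the given convention. p must point into a NUL-terminated string. */
static size_t newline_length(nl_convention conv, const char *p)
{
    switch (conv) {
    case NL_CR:
        return p[0] == '\r';
    case NL_LF:
        return p[0] == '\n';
    case NL_CRLF:
        return (p[0] == '\r' && p[1] == '\n') ? 2 : 0;
    case NL_ANYCRLF:
        /* CRLF is checked first so that it counts as one line ending. */
        if (p[0] == '\r') return p[1] == '\n' ? 2 : 1;
        return p[0] == '\n';
    }
    return 0;
}
```

Under this sketch a lone CR is a line ending for NL_CR and NL_ANYCRLF
but not for NL_LF or NL_CRLF, which matches the descriptions above.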
                    643: 
                    644: 
                    645: WHAT \R MATCHES
                    646: 
                    647:        By  default,  the  sequence \R in a pattern matches any Unicode newline
                    648:        sequence, whatever has been selected as the line  ending  sequence.  If
                    649:        you specify
                    650: 
                    651:          --enable-bsr-anycrlf
                    652: 
                    653:        the  default  is changed so that \R matches only CR, LF, or CRLF. What-
                    654:        ever is selected when PCRE is built can be overridden when the  library
                    655:        functions are called.
                    656: 
                    657: 
                    658: POSIX MALLOC USAGE
                    659: 
1.1.1.2   misho     660:        When  the  8-bit library is called through the POSIX interface (see the
                    661:        pcreposix documentation), additional working storage  is  required  for
                    662:        holding  the  pointers  to  capturing substrings, because PCRE requires
                    663:        three integers per substring, whereas the POSIX interface provides only
                    664:        two.  If  the number of expected substrings is small, the wrapper func-
                    665:        tion uses space on the stack, because this is faster  than  using  mal-
                    666:        loc()  for each call. The default threshold above which the stack is no
                    667:        longer used is 10; it can be changed by adding a setting such as
1.1       misho     668: 
                    669:          --with-posix-malloc-threshold=20
                    670: 
                    671:        to the configure command.
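
The stack-or-heap decision described above might look roughly like the
following C sketch. It is an assumption-laden illustration, not the
wrapper's actual source: the macro and function names are hypothetical,
and the "matching" step is replaced by a loop that only initialises the
working storage.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the configured threshold (documented default). */
#define POSIX_MALLOC_THRESHOLD 10

/* Returns 1 if a wrapper following this scheme would use stack storage. */
static int uses_stack(size_t nmatch)
{
    return nmatch <= POSIX_MALLOC_THRESHOLD;
}

/* Toy wrapper: obtains three ints per expected substring, touches them,
   and releases them. Returns the number of ints used, or -1 if malloc()
   fails. */
static long with_workspace(size_t nmatch)
{
    int small_vector[POSIX_MALLOC_THRESHOLD * 3];
    int *vector = small_vector;
    size_t count = nmatch * 3;
    size_t i;

    if (!uses_stack(nmatch)) {
        vector = malloc(count * sizeof *vector);
        if (vector == NULL) return -1;
    }

    for (i = 0; i < count; i++) vector[i] = -1;   /* placeholder for matching */

    if (vector != small_vector) free(vector);
    return (long)count;
}
```

With the default of 10, a request for 10 substrings stays on the stack
and a request for 11 goes through malloc().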
                    672: 
                    673: 
                    674: HANDLING VERY LARGE PATTERNS
                    675: 
                    676:        Within a compiled pattern, offset values are used  to  point  from  one
                    677:        part  to another (for example, from an opening parenthesis to an alter-
                    678:        nation metacharacter). By default, two-byte values are used  for  these
                    679:        offsets,  leading  to  a  maximum size for a compiled pattern of around
                    680:        64K. This is sufficient to handle all but the most  gigantic  patterns.
1.1.1.2   misho     681:        Nevertheless,  some  people do want to process truly enormous patterns,
1.1       misho     682:        so it is possible to compile PCRE to use three-byte or  four-byte  off-
                    683:        sets by adding a setting such as
                    684: 
                    685:          --with-link-size=3
                    686: 
1.1.1.2   misho     687:        to  the  configure command. The value given must be 2, 3, or 4. For the
                    688:        16-bit library, a value of 3 is rounded up to 4. Using  longer  offsets
                    689:        slows down the operation of PCRE because it has to load additional data
                    690:        when handling them.
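
As a rough illustration of the arithmetic, the following C sketch shows
how many distinct values an offset of a given width can hold, and the
rounding rule for the 16-bit library. The helper names are hypothetical,
and the figures are upper bounds on offset values rather than exact PCRE
limits.

```c
/* Hypothetical helper: the link size actually used by a library whose
   code units are unit_bits wide (8 or 16). */
static int effective_link_size(int requested, int unit_bits)
{
    if (unit_bits == 16 && requested == 3) return 4;
    return requested;
}

/* Number of distinct values that an offset of link_size bytes can hold. */
static unsigned long long offset_range(int link_size)
{
    return 1ULL << (8 * link_size);
}
```

For a link size of 2 this gives 65536, which is where the figure of
around 64K comes from; a link size of 3 gives 16777216.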
1.1       misho     691: 
                    692: 
                    693: AVOIDING EXCESSIVE STACK USAGE
                    694: 
                    695:        When matching with the pcre_exec() function, PCRE implements backtrack-
1.1.1.2   misho     696:        ing  by  making recursive calls to an internal function called match().
                    697:        In environments where the size of the stack is limited,  this  can  se-
                    698:        verely  limit  PCRE's operation. (The Unix environment does not usually
1.1       misho     699:        suffer from this problem, but it may sometimes be necessary to increase
1.1.1.2   misho     700:        the  maximum  stack size.  There is a discussion in the pcrestack docu-
                    701:        mentation.) An alternative approach to recursion that uses memory  from
                    702:        the  heap  to remember data, instead of using recursive function calls,
                    703:        has been implemented to work round the problem of limited  stack  size.
1.1       misho     704:        If you want to build a version of PCRE that works this way, add
                    705: 
                    706:          --disable-stack-for-recursion
                    707: 
1.1.1.2   misho     708:        to  the  configure  command. With this configuration, PCRE will use the
                    709:        pcre_stack_malloc and pcre_stack_free variables to call memory  manage-
                    710:        ment  functions. By default these point to malloc() and free(), but you
1.1       misho     711:        can replace the pointers so that your own functions are used instead.
                    712: 
1.1.1.2   misho     713:        Separate functions are  provided  rather  than  using  pcre_malloc  and
                    714:        pcre_free  because  the  usage  is  very  predictable:  the block sizes
                    715:        requested are always the same, and  the  blocks  are  always  freed  in
                    716:        reverse  order.  A calling program might be able to implement optimized
                    717:        functions that perform better  than  malloc()  and  free().  PCRE  runs
1.1       misho     718:        noticeably more slowly when built in this way. This option affects only
                    719:        the pcre_exec() function; it is not relevant for pcre_dfa_exec().
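
Because the blocks have one size and are freed in reverse order, a
caller-supplied allocator can simply keep freed blocks on a list and
hand them out again. The C sketch below shows one way this might be
done. All names are hypothetical, it is not thread-safe, and it assumes
that alignment suitable for long double is sufficient for the caller's
data.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header stored in front of every block; the union forces generous
   alignment for the payload that follows it. */
typedef union frame_header {
    struct {
        size_t size;                 /* payload size of this block */
        union frame_header *next;    /* next cached block, if any */
    } info;
    long double force_alignment;
} frame_header;

static frame_header *free_frames = NULL;   /* most recently freed first */

/* Same shape as malloc(): reuse the newest cached block when it is large
   enough, otherwise fall back to malloc(). */
static void *frame_alloc(size_t size)
{
    frame_header *h = free_frames;
    if (h != NULL && h->info.size >= size) {
        free_frames = h->info.next;
        return h + 1;
    }
    h = malloc(sizeof *h + size);
    if (h == NULL) return NULL;
    h->info.size = size;
    h->info.next = NULL;
    return h + 1;
}

/* Same shape as free(): keep the block for later reuse. */
static void frame_free(void *block)
{
    frame_header *h;
    if (block == NULL) return;
    h = (frame_header *)block - 1;
    h->info.next = free_frames;
    free_frames = h;
}

/* Returns all cached blocks to the system. */
static void frame_cache_release(void)
{
    while (free_frames != NULL) {
        frame_header *next = free_frames->info.next;
        free(free_frames);
        free_frames = next;
    }
}
```

In a program linked with a PCRE library built in this way, functions of
this shape could be installed by assigning them to pcre_stack_malloc and
pcre_stack_free before matching starts.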
                    720: 
                    721: 
                    722: LIMITING PCRE RESOURCE USAGE
                    723: 
1.1.1.2   misho     724:        Internally, PCRE has a function called match(), which it calls  repeat-
                    725:        edly   (sometimes   recursively)  when  matching  a  pattern  with  the
                    726:        pcre_exec() function. By controlling the maximum number of  times  this
                    727:        function  may be called during a single matching operation, a limit can
                    728:        be placed on the resources used by a single call  to  pcre_exec().  The
                    729:        limit  can be changed at run time, as described in the pcreapi documen-
                    730:        tation. The default is 10 million, but this can be changed by adding  a
1.1       misho     731:        setting such as
                    732: 
                    733:          --with-match-limit=500000
                    734: 
1.1.1.2   misho     735:        to   the   configure  command.  This  setting  has  no  effect  on  the
1.1       misho     736:        pcre_dfa_exec() matching function.
                    737: 
1.1.1.2   misho     738:        In some environments it is desirable to limit the  depth  of  recursive
1.1       misho     739:        calls of match() more strictly than the total number of calls, in order
1.1.1.2   misho     740:        to restrict the maximum amount of stack (or heap,  if  --disable-stack-
1.1       misho     741:        for-recursion is specified) that is used. A second limit controls this;
1.1.1.2   misho     742:        it defaults to the value that  is  set  for  --with-match-limit,  which
                    743:        imposes  no  additional constraints. However, you can set a lower limit
1.1       misho     744:        by adding, for example,
                    745: 
                    746:          --with-match-limit-recursion=10000
                    747: 
1.1.1.2   misho     748:        to the configure command. This value can  also  be  overridden  at  run
1.1       misho     749:        time.
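
The two limits can be illustrated with a toy backtracking matcher in C.
This is only a sketch of the idea and is far simpler than PCRE's match()
function: it handles literal characters, "." and "*" only, matches the
whole subject, and uses invented names and return codes throughout.

```c
#define TOY_NOMATCH       0
#define TOY_MATCH         1
#define TOY_LIMIT       (-1)   /* total number of calls exceeded */
#define TOY_DEPTH_LIMIT (-2)   /* recursion depth exceeded */

typedef struct {
    unsigned long calls;        /* calls made so far; start at 0 */
    unsigned long match_limit;  /* cap on the total number of calls */
    unsigned long depth_limit;  /* cap on the recursion depth */
} toy_limits;

/* Matches all of sub against pat, counting every call and checking both
   limits on entry. */
static int toy_match(const char *pat, const char *sub,
                     unsigned long depth, toy_limits *lim)
{
    if (++lim->calls > lim->match_limit) return TOY_LIMIT;
    if (depth > lim->depth_limit) return TOY_DEPTH_LIMIT;

    if (*pat == '\0') return *sub == '\0' ? TOY_MATCH : TOY_NOMATCH;

    if (pat[1] == '*') {
        /* Greedy repeat: take as much as possible, then give back one
           character at a time. */
        const char *end = sub;
        while (*end != '\0' && (*pat == '.' || *pat == *end)) end++;
        for (;;) {
            int rc = toy_match(pat + 2, end, depth + 1, lim);
            if (rc != TOY_NOMATCH) return rc;   /* match or limit error */
            if (end == sub) return TOY_NOMATCH;
            end--;
        }
    }

    if (*sub != '\0' && (*pat == '.' || *pat == *sub))
        return toy_match(pat + 1, sub + 1, depth + 1, lim);
    return TOY_NOMATCH;
}
```

A pattern such as a*a*a*b applied to a run of "a" characters backtracks
heavily before failing, so a small call limit stops it early, whereas
the depth limit bounds how deeply the calls nest regardless of how many
are made in total.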
                    750: 
                    751: 
                    752: CREATING CHARACTER TABLES AT BUILD TIME
                    753: 
1.1.1.2   misho     754:        PCRE  uses fixed tables for processing characters whose code values are
                    755:        less than 256. By default, PCRE is built with a set of tables that  are
                    756:        distributed  in  the  file pcre_chartables.c.dist. These tables are for
1.1       misho     757:        ASCII codes only. If you add
                    758: 
                    759:          --enable-rebuild-chartables
                    760: 
1.1.1.2   misho     761:        to the configure command, the distributed tables are  no  longer  used.
                    762:        Instead,  a  program  called dftables is compiled and run. This outputs
1.1       misho     763:        the source for a new set of tables, created in the default locale of your
1.1.1.3 ! misho     764:        C  run-time  system. (This method of replacing the tables does not work
        !           765:        if you are cross compiling, because dftables is run on the local  host.
        !           766:        If you need to create alternative tables when cross compiling, you will
1.1       misho     767:        have to do so "by hand".)
                    768: 
                    769: 
                    770: USING EBCDIC CODE
                    771: 
1.1.1.2   misho     772:        PCRE assumes by default that it will run in an  environment  where  the
                    773:        character  code  is  ASCII  (or Unicode, which is a superset of ASCII).
                    774:        This is the case for most computer operating systems.  PCRE  can,  how-
1.1       misho     775:        ever, be compiled to run in an EBCDIC environment by adding
                    776: 
                    777:          --enable-ebcdic
                    778: 
                    779:        to the configure command. This setting implies --enable-rebuild-charta-
1.1.1.2   misho     780:        bles. You should only use it if you know that  you  are  in  an  EBCDIC
                    781:        environment  (for  example,  an  IBM  mainframe  operating system). The
                    782:        --enable-ebcdic option is incompatible with --enable-utf.
1.1       misho     783: 
                    784: 
                    785: PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
                    786: 
                    787:        By default, pcregrep reads all files as plain text. You can build it so
                    788:        that it recognizes files whose names end in .gz or .bz2, and reads them
                    789:        with libz or libbz2, respectively, by adding one or both of
                    790: 
                    791:          --enable-pcregrep-libz
                    792:          --enable-pcregrep-libbz2
                    793: 
                    794:        to the configure command. These options naturally require that the rel-
1.1.1.2   misho     795:        evant  libraries  are installed on your system. Configuration will fail
1.1       misho     796:        if they are not.
                    797: 
                    798: 
                    799: PCREGREP BUFFER SIZE
                    800: 
1.1.1.2   misho     801:        pcregrep uses an internal buffer to hold a "window" on the file  it  is
1.1       misho     802:        scanning, in order to be able to output "before" and "after" lines when
1.1.1.2   misho     803:        it finds a match. The size of the buffer is controlled by  a  parameter
1.1       misho     804:        whose default value is 20K. The buffer itself is three times this size,
                    805:        but because of the way it is used for holding "before" lines, the long-
1.1.1.2   misho     806:        est  line  that  is guaranteed to be processable is the parameter size.
1.1       misho     807:        You can change the default parameter value by adding, for example,
                    808: 
                    809:          --with-pcregrep-bufsize=50K
                    810: 
                    811:        to the configure command. The caller of pcregrep can, however, override
                    812:        this value by specifying a run-time option.
                    813: 
                    814: 
                    815: PCRETEST OPTION FOR LIBREADLINE SUPPORT
                    816: 
                    817:        If you add
                    818: 
                    819:          --enable-pcretest-libreadline
                    820: 
1.1.1.2   misho     821:        to  the  configure  command,  pcretest  is  linked with the libreadline
                    822:        library, and when its input is from a terminal, it reads it  using  the
1.1       misho     823:        readline() function. This provides line-editing and history facilities.
                    824:        Note that libreadline is GPL-licensed, so if you distribute a binary of
                    825:        pcretest linked in this way, there may be licensing issues.
                    826: 
1.1.1.2   misho     827:        Setting  this  option  causes  the -lreadline option to be added to the
                    828:        pcretest build. In many operating environments with  a  system-installed
1.1       misho     829:        libreadline this is sufficient. However, in some environments (e.g.  if
1.1.1.2   misho     830:        an unmodified distribution version of readline is in use),  some  extra
                    831:        configuration  may  be necessary. The INSTALL file for libreadline says
1.1       misho     832:        this:
                    833: 
                    834:          "Readline uses the termcap functions, but does not link with the
                    835:          termcap or curses library itself, allowing applications which link
                    836:          with readline the to choose an appropriate library."
                    837: 
1.1.1.2   misho     838:        If your environment has not been set up so that an appropriate  library
1.1       misho     839:        is automatically included, you may need to add something like
                    840: 
                    841:          LIBS="-lncurses"
                    842: 
                    843:        immediately before the configure command.
                    844: 
                    845: 
                    846: SEE ALSO
                    847: 
1.1.1.2   misho     848:        pcreapi(3), pcre16(3), pcre_config(3).
1.1       misho     849: 
                    850: 
                    851: AUTHOR
                    852: 
                    853:        Philip Hazel
                    854:        University Computing Service
                    855:        Cambridge CB2 3QH, England.
                    856: 
                    857: 
                    858: REVISION
                    859: 
1.1.1.2   misho     860:        Last updated: 07 January 2012
                    861:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho     862: ------------------------------------------------------------------------------
                    863: 
                    864: 
                    865: PCREMATCHING(3)                                                PCREMATCHING(3)
                    866: 
                    867: 
                    868: NAME
                    869:        PCRE - Perl-compatible regular expressions
                    870: 
                    871: 
                    872: PCRE MATCHING ALGORITHMS
                    873: 
                    874:        This document describes the two different algorithms that are available
                    875:        in PCRE for matching a compiled regular expression against a given sub-
                    876:        ject  string.  The  "standard"  algorithm  is  the  one provided by the
1.1.1.2   misho     877:        pcre_exec() and pcre16_exec() functions. These work in the same way  as
                    878:        Perl's matching function, and provide a Perl-compatible matching opera-
                    879:        tion. The just-in-time (JIT) optimization  that  is  described  in  the
                    880:        pcrejit documentation is compatible with these functions.
                    881: 
                    882:        An  alternative  algorithm  is  provided  by  the  pcre_dfa_exec()  and
                    883:        pcre16_dfa_exec() functions; they operate in a different way,  and  are
                    884:        not  Perl-compatible. This alternative has advantages and disadvantages
                    885:        compared with the standard algorithm, and these are described below.
1.1       misho     886: 
                    887:        When there is only one possible way in which a given subject string can
                    888:        match  a pattern, the two algorithms give the same answer. A difference
                    889:        arises, however, when there are multiple possibilities. For example, if
                    890:        the pattern
                    891: 
                    892:          ^<.*>
                    893: 
                    894:        is matched against the string
                    895: 
                    896:          <something> <something else> <something further>
                    897: 
                    898:        there are three possible answers. The standard algorithm finds only one
                    899:        of them, whereas the alternative algorithm finds all three.
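
The three answers can be enumerated by hand with a short C sketch. It
does not call PCRE at all: it hard-codes the pattern ^<.*> and uses a
hypothetical function name, purely to show where the three matches end.

```c
#include <stddef.h>
#include <string.h>

/* Collects the length of every match of ^<.*> at the start of subject,
   longest first. Returns the number of lengths stored (at most max). */
static size_t all_match_lengths(const char *subject, size_t *lengths,
                                size_t max)
{
    size_t count = 0;
    size_t limit;
    size_t i;

    if (subject[0] != '<') return 0;

    /* By default "." does not match a newline, so ".*" cannot extend past
       the first one. */
    limit = strcspn(subject, "\n");

    /* Every '>' after the opening '<' ends one possible match. */
    for (i = limit; i >= 2 && count < max; i--) {
        if (subject[i - 1] == '>') lengths[count++] = i;
    }
    return count;
}
```

For the subject string shown above this yields lengths 48, 28 and 11.
Because .* is greedy, the single match reported by the standard
algorithm for this pattern is the longest one, of length 48.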
                    900: 
                    901: 
                    902: REGULAR EXPRESSIONS AS TREES
                    903: 
                    904:        The set of strings that are matched by a regular expression can be rep-
                    905:        resented  as  a  tree structure. An unlimited repetition in the pattern
                    906:        makes the tree of infinite size, but it is still a tree.  Matching  the
                    907:        pattern  to a given subject string (from a given starting point) can be
                    908:        thought of as a search of the tree.  There are two  ways  to  search  a
                    909:        tree:  depth-first  and  breadth-first, and these correspond to the two
                    910:        matching algorithms provided by PCRE.
                    911: 
                    912: 
                    913: THE STANDARD MATCHING ALGORITHM
                    914: 
                    915:        In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
                    916:        sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
                    917:        depth-first search of the pattern tree. That is, it  proceeds  along  a
                    918:        single path through the tree, checking that the subject matches what is
                    919:        required. When there is a mismatch, the algorithm  tries  any  alterna-
                    920:        tives  at  the  current point, and if they all fail, it backs up to the
                    921:        previous branch point in the  tree,  and  tries  the  next  alternative
                    922:        branch  at  that  level.  This often involves backing up (moving to the
                    923:        left) in the subject string as well.  The  order  in  which  repetition
                    924:        branches  are  tried  is controlled by the greedy or ungreedy nature of
                    925:        the quantifier.
                    926: 
                    927:        If a leaf node is reached, a matching string has  been  found,  and  at
                    928:        that  point the algorithm stops. Thus, if there is more than one possi-
                    929:        ble match, this algorithm returns the first one that it finds.  Whether
                    930:        this  is the shortest, the longest, or some intermediate length depends
                    931:        on the way the greedy and ungreedy repetition quantifiers are specified
                    932:        in the pattern.
                    933: 
                    934:        Because  it  ends  up  with a single path through the tree, it is rela-
                    935:        tively straightforward for this algorithm to keep  track  of  the  sub-
                    936:        strings  that  are  matched  by portions of the pattern in parentheses.
                    937:        This provides support for capturing parentheses and back references.
                    938: 
                    939: 
                    940: THE ALTERNATIVE MATCHING ALGORITHM
                    941: 
                    942:        This algorithm conducts a breadth-first search of  the  tree.  Starting
                    943:        from  the  first  matching  point  in the subject, it scans the subject
                    944:        string from left to right, once, character by character, and as it does
                    945:        this,  it remembers all the paths through the tree that represent valid
                    946:        matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
                    947:        though  it is not implemented as a traditional finite state machine (it
                    948:        keeps multiple states active simultaneously).
                    949: 
                    950:        Although the general principle of this matching algorithm  is  that  it
                    951:        scans  the subject string only once, without backtracking, there is one
                    952:        exception: when a lookaround assertion is encountered,  the  characters
                    953:        following  or  preceding  the  current  point  have to be independently
                    954:        inspected.
                    955: 
                    956:        The scan continues until either the end of the subject is  reached,  or
                    957:        there  are  no more unterminated paths. At this point, terminated paths
                    958:        represent the different matching possibilities (if there are none,  the
                    959:        match  has  failed).   Thus,  if there is more than one possible match,
                    960:        this algorithm finds all of them, and in particular, it finds the long-
                    961:        est.  The  matches are returned in decreasing order of length. There is
                    962:        an option to stop the algorithm after the first match (which is  neces-
                    963:        sarily the shortest) is found.
                    964: 
                    965:        Note that all the matches that are found start at the same point in the
                    966:        subject. If the pattern
                    967: 
                    968:          cat(er(pillar)?)?
                    969: 
                    970:        is matched against the string "the caterpillar catchment",  the  result
                    971:        will  be the three strings "caterpillar", "cater", and "cat" that start
                    972:        at the fifth character of the subject. The algorithm does not automati-
                    973:        cally move on to find matches that start at later positions.
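
       The single left-to-right scan described above can be sketched for  this
       one pattern. The code below is a toy, not PCRE's implementation: it
       assumes the pattern cat(er(pillar)?)? has been flattened by hand into a
       literal with three accepting positions, and the names toy_literal,
       toy_accept, and toy_all_matches are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy: cat(er(pillar)?)? flattened into one literal whose
   accepting positions fall after "cat", "cater", and "caterpillar". */
#define TOY_ACCEPT_COUNT 3
static const char   toy_literal[] = "caterpillar";
static const size_t toy_accept[TOY_ACCEPT_COUNT] = { 3, 5, 11 };

/* Scans a NUL-terminated subject once, starting at offset 'start' (which
   must not exceed the subject length), and records every terminated path.
   Match lengths are stored longest first; the return value is their count. */
size_t toy_all_matches(const char *subject, size_t start,
                       size_t lengths[TOY_ACCEPT_COUNT])
{
    size_t found[TOY_ACCEPT_COUNT];
    size_t nfound = 0, pos = 0, i;

    assert(subject != NULL && lengths != NULL);

    /* One pass over the subject; no character is examined twice. */
    while (toy_literal[pos] != '\0' &&
           subject[start + pos] == toy_literal[pos]) {
        pos++;
        if (nfound < TOY_ACCEPT_COUNT && toy_accept[nfound] == pos)
            found[nfound++] = pos;   /* a path terminated at this point */
    }

    /* Report the matches in decreasing order of length. */
    for (i = 0; i < nfound; i++)
        lengths[i] = found[nfound - 1 - i];
    return nfound;
}
```

       Called with the subject "the caterpillar catchment" and a start  offset
       of 4, the function reports lengths 11, 5, and 3. It reports the single
       "cat" of "catchment" only when the caller restarts it at offset 16.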
                    974: 
                    975:        Several features of PCRE regular expressions are either unsupported  or
                    976:        handled differently by the alternative algorithm. They are as follows:
                    977: 
                    978:        1. Because the algorithm finds all  possible  matches,  the  greedy  or
                    979:        ungreedy  nature  of repetition quantifiers is not relevant. Greedy and
                    980:        ungreedy quantifiers are treated in exactly the same way. However, pos-
                    981:        sessive  quantifiers can make a difference when what follows could also
                    982:        match what is quantified, for example in a pattern like this:
                    983: 
                    984:          ^a++\w!
                    985: 
                    986:        This pattern matches "aaab!" but not "aaa!", which would be matched  by
                    987:        a  non-possessive quantifier. Similarly, if an atomic group is present,
                    988:        it is matched as if it were a standalone pattern at the current  point,
                    989:        and  the  longest match is then "locked in" for the rest of the overall
                    990:        pattern.
                    991: 
                    992:        2. When dealing with multiple paths through the tree simultaneously, it
                    993:        is  not  straightforward  to  keep track of captured substrings for the
                    994:        different matching possibilities, and  PCRE's  implementation  of  this
                    995:        algorithm does not attempt to do this. This means that no captured sub-
                    996:        strings are available.
                    997: 
                    998:        3. Because no substrings are captured, back references within the  pat-
                    999:        tern are not supported, and cause errors if encountered.
                   1000: 
                   1001:        4.  For  the same reason, conditional expressions that use a backrefer-
                   1002:        ence as the condition or test for a specific group  recursion  are  not
                   1003:        supported.
                   1004: 
                   1005:        5.  Because  many  paths  through the tree may be active, the \K escape
                   1006:        sequence, which resets the start of the match when encountered (but may
                   1007:        be  on  some  paths  and not on others), is not supported. It causes an
                   1008:        error if encountered.
                   1009: 
                   1010:        6. Callouts are supported, but the value of the  capture_top  field  is
                   1011:        always 1, and the value of the capture_last field is always -1.
                   1012: 
1.1.1.2   misho    1013:        7.  The  \C  escape  sequence, which (in the standard algorithm) always
                   1014:        matches a single data unit, even in UTF-8 or UTF-16 modes, is not  sup-
                   1015:        ported  in these modes, because the alternative algorithm moves through
                   1016:        the subject string one character (not data unit) at  a  time,  for  all
                   1017:        active paths through the tree.
1.1       misho    1018: 
1.1.1.2   misho    1019:        8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
                   1020:        are not supported. (*FAIL) is supported, and  behaves  like  a  failing
1.1       misho    1021:        negative assertion.
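
       The possessive quantifier behaviour in item 1 can be shown with two
       hand-written matchers. This is a minimal sketch that does not use the
       PCRE library: match_possessive models ^a++\w! and match_greedy models
       ^a+\w!. Both names are hypothetical, and \w is assumed to mean an ASCII
       letter, digit, or underscore.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* \w, assumed here to be an ASCII letter, digit, or underscore. */
static int is_word_char(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Does "\w!" match exactly at s? */
static int tail_matches(const char *s)
{
    return is_word_char(s[0]) && s[1] == '!';
}

/* ^a++\w! : the possessive a++ never gives back an 'a' it has consumed. */
int match_possessive(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] == 'a')
        n++;
    return n >= 1 && tail_matches(s + n);
}

/* ^a+\w! : the non-possessive a+ gives back one 'a' at a time until the
   rest of the pattern matches or only one 'a' is left. */
int match_greedy(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] == 'a')
        n++;
    for (; n >= 1; n--)
        if (tail_matches(s + n))
            return 1;
    return 0;
}
```

       Both functions accept "aaab!". Only match_greedy accepts "aaa!",
       because it lets \w take the last "a".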
                   1022: 
                   1023: 
                   1024: ADVANTAGES OF THE ALTERNATIVE ALGORITHM
                   1025: 
1.1.1.2   misho    1026:        Using  the alternative matching algorithm provides the following advan-
1.1       misho    1027:        tages:
                   1028: 
                   1029:        1. All possible matches (at a single point in the subject) are automat-
1.1.1.2   misho    1030:        ically  found,  and  in particular, the longest match is found. To find
1.1       misho    1031:        more than one match using the standard algorithm, you have to resort to
                   1032:        workarounds that use callouts.
                   1033: 
1.1.1.2   misho    1034:        2.  Because  the  alternative  algorithm  scans the subject string just
                   1035:        once, and never needs to backtrack (except for lookbehinds), it is pos-
                   1036:        sible  to  pass  very  long subject strings to the matching function in
                   1037:        several pieces, checking for partial matching each time. Although it is
                   1038:        possible  to  do multi-segment matching using the standard algorithm by
                   1039:        retaining partially matched substrings, it  is  more  complicated.  The
                   1040:        pcrepartial  documentation  gives  details of partial matching and dis-
                   1041:        cusses multi-segment matching.
1.1       misho    1042: 
                   1043: 
                   1044: DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
                   1045: 
                   1046:        The alternative algorithm suffers from a number of disadvantages:
                   1047: 
1.1.1.2   misho    1048:        1. It is substantially slower than  the  standard  algorithm.  This  is
                   1049:        partly  because  it has to search for all possible matches, but is also
1.1       misho    1050:        because it is less susceptible to optimization.
                   1051: 
                   1052:        2. Capturing parentheses and back references are not supported.
                   1053: 
                   1054:        3. Although atomic groups are supported, their use does not provide the
                   1055:        performance advantage that it does for the standard algorithm.
                   1056: 
                   1057: 
                   1058: AUTHOR
                   1059: 
                   1060:        Philip Hazel
                   1061:        University Computing Service
                   1062:        Cambridge CB2 3QH, England.
                   1063: 
                   1064: 
                   1065: REVISION
                   1066: 
1.1.1.2   misho    1067:        Last updated: 08 January 2012
                   1068:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    1069: ------------------------------------------------------------------------------
                   1070: 
                   1071: 
                   1072: PCREAPI(3)                                                          PCREAPI(3)
                   1073: 
                   1074: 
                   1075: NAME
                   1076:        PCRE - Perl-compatible regular expressions
                   1077: 
1.1.1.2   misho    1078:        #include <pcre.h>
1.1       misho    1079: 
                   1080: 
1.1.1.2   misho    1081: PCRE NATIVE API BASIC FUNCTIONS
1.1       misho    1082: 
                   1083:        pcre *pcre_compile(const char *pattern, int options,
                   1084:             const char **errptr, int *erroffset,
                   1085:             const unsigned char *tableptr);
                   1086: 
                   1087:        pcre *pcre_compile2(const char *pattern, int options,
                   1088:             int *errorcodeptr,
                   1089:             const char **errptr, int *erroffset,
                   1090:             const unsigned char *tableptr);
                   1091: 
                   1092:        pcre_extra *pcre_study(const pcre *code, int options,
                   1093:             const char **errptr);
                   1094: 
                   1095:        void pcre_free_study(pcre_extra *extra);
                   1096: 
                   1097:        int pcre_exec(const pcre *code, const pcre_extra *extra,
                   1098:             const char *subject, int length, int startoffset,
                   1099:             int options, int *ovector, int ovecsize);
                   1100: 
                   1101:        int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
                   1102:             const char *subject, int length, int startoffset,
                   1103:             int options, int *ovector, int ovecsize,
                   1104:             int *workspace, int wscount);
                   1105: 
1.1.1.2   misho    1106: 
                   1107: PCRE NATIVE API STRING EXTRACTION FUNCTIONS
                   1108: 
1.1       misho    1109:        int pcre_copy_named_substring(const pcre *code,
                   1110:             const char *subject, int *ovector,
                   1111:             int stringcount, const char *stringname,
                   1112:             char *buffer, int buffersize);
                   1113: 
                   1114:        int pcre_copy_substring(const char *subject, int *ovector,
                   1115:             int stringcount, int stringnumber, char *buffer,
                   1116:             int buffersize);
                   1117: 
                   1118:        int pcre_get_named_substring(const pcre *code,
                   1119:             const char *subject, int *ovector,
                   1120:             int stringcount, const char *stringname,
                   1121:             const char **stringptr);
                   1122: 
                   1123:        int pcre_get_stringnumber(const pcre *code,
                   1124:             const char *name);
                   1125: 
                   1126:        int pcre_get_stringtable_entries(const pcre *code,
                   1127:             const char *name, char **first, char **last);
                   1128: 
                   1129:        int pcre_get_substring(const char *subject, int *ovector,
                   1130:             int stringcount, int stringnumber,
                   1131:             const char **stringptr);
                   1132: 
                   1133:        int pcre_get_substring_list(const char *subject,
                   1134:             int *ovector, int stringcount, const char ***listptr);
                   1135: 
                   1136:        void pcre_free_substring(const char *stringptr);
                   1137: 
                   1138:        void pcre_free_substring_list(const char **stringptr);
                   1139: 
1.1.1.2   misho    1140: 
                   1141: PCRE NATIVE API AUXILIARY FUNCTIONS
                   1142: 
                   1143:        pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
                   1144: 
                   1145:        void pcre_jit_stack_free(pcre_jit_stack *stack);
                   1146: 
                   1147:        void pcre_assign_jit_stack(pcre_extra *extra,
                   1148:             pcre_jit_callback callback, void *data);
                   1149: 
1.1       misho    1150:        const unsigned char *pcre_maketables(void);
                   1151: 
                   1152:        int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
                   1153:             int what, void *where);
                   1154: 
                   1155:        int pcre_refcount(pcre *code, int adjust);
                   1156: 
                   1157:        int pcre_config(int what, void *where);
                   1158: 
1.1.1.2   misho    1159:        const char *pcre_version(void);
                   1160: 
                   1161:        int pcre_pattern_to_host_byte_order(pcre *code,
                   1162:             pcre_extra *extra, const unsigned char *tables);
1.1       misho    1163: 
                   1164: 
                   1165: PCRE NATIVE API INDIRECTED FUNCTIONS
                   1166: 
                   1167:        void *(*pcre_malloc)(size_t);
                   1168: 
                   1169:        void (*pcre_free)(void *);
                   1170: 
                   1171:        void *(*pcre_stack_malloc)(size_t);
                   1172: 
                   1173:        void (*pcre_stack_free)(void *);
                   1174: 
                   1175:        int (*pcre_callout)(pcre_callout_block *);
                   1176: 
                   1177: 
1.1.1.2   misho    1178: PCRE 8-BIT AND 16-BIT LIBRARIES
                   1179: 
                   1180:        From  release  8.30,  PCRE  can  be  compiled as a library for handling
                   1181:        16-bit character strings as  well  as,  or  instead  of,  the  original
                   1182:        library that handles 8-bit character strings. To avoid too much compli-
                   1183:        cation, this document describes the 8-bit versions  of  the  functions,
                   1184:        with only occasional references to the 16-bit library.
                   1185: 
                   1186:        The  16-bit  functions  operate in the same way as their 8-bit counter-
                   1187:        parts; they just use different  data  types  for  their  arguments  and
                   1188:        results, and their names start with pcre16_ instead of pcre_. For every
                   1189:        option that has UTF8 in its name (for example, PCRE_UTF8), there  is  a
                   1190:        corresponding 16-bit name with UTF8 replaced by UTF16. This facility is
                   1191:        in fact just cosmetic; the 16-bit option names define the same bit val-
                   1192:        ues.
                   1193: 
                   1194:        References to bytes and UTF-8 in this document should be read as refer-
                   1195:        ences to 16-bit data  quantities  and  UTF-16  when  using  the  16-bit
                   1196:        library,  unless specified otherwise. More details of the specific dif-
                   1197:        ferences for the 16-bit library are given in the pcre16 page.
                   1198: 
                   1199: 
1.1       misho    1200: PCRE API OVERVIEW
                   1201: 
                   1202:        PCRE has its own native API, which is described in this document. There
1.1.1.2   misho    1203:        are  also some wrapper functions (for the 8-bit library only) that cor-
                   1204:        respond to the POSIX regular expression  API,  but  they  do  not  give
                   1205:        access  to  all  the functionality. They are described in the pcreposix
                   1206:        documentation. Both of these APIs define a set of C function  calls.  A
                   1207:        C++ wrapper (again for the 8-bit library only) is also distributed with
                   1208:        PCRE. It is documented in the pcrecpp page.
1.1       misho    1209: 
                   1210:        The native API C function prototypes are defined  in  the  header  file
1.1.1.2   misho    1211:        pcre.h,  and  on Unix-like systems the (8-bit) library itself is called
                   1212:        libpcre. It can normally be accessed by adding -lpcre  to  the  command
                   1213:        for  linking an application that uses PCRE. The header file defines the
                   1214:        macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
                   1215:        numbers  for the library. Applications can use these to include support
1.1       misho    1216:        for different releases of PCRE.
                   1217: 
                   1218:        In a Windows environment, if you want to statically link an application
                   1219:        program  against  a  non-dll  pcre.a  file, you must define PCRE_STATIC
                   1220:        before including pcre.h or pcrecpp.h, because otherwise  the  pcre_mal-
                   1221:        loc()   and   pcre_free()   exported   functions   will   be   declared
                   1222:        __declspec(dllimport), with unwanted results.
                   1223: 
                   1224:        The  functions  pcre_compile(),  pcre_compile2(),   pcre_study(),   and
                   1225:        pcre_exec()  are used for compiling and matching regular expressions in
                   1226:        a Perl-compatible manner. A sample program that demonstrates  the  sim-
                   1227:        plest  way  of  using them is provided in the file called pcredemo.c in
                   1228:        the PCRE source distribution. A listing of this program is given in the
                   1229:        pcredemo  documentation, and the pcresample documentation describes how
                   1230:        to compile and run it.
                   1231: 
                   1232:        Just-in-time compiler support is an optional feature of PCRE  that  can
                   1233:        be built in appropriate hardware environments. It greatly speeds up the
                   1234:        matching performance of  many  patterns.  Simple  programs  can  easily
                   1235:        request  that  it  be  used  if available, by setting an option that is
                   1236:        ignored when it is not relevant. More complicated programs  might  need
                   1237:        to     make    use    of    the    functions    pcre_jit_stack_alloc(),
                   1238:        pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to  control
                   1239:        the  JIT  code's  memory  usage.   These functions are discussed in the
                   1240:        pcrejit documentation.
                   1241: 
                   1242:        A second matching function, pcre_dfa_exec(), which is not Perl-compati-
                   1243:        ble,  is  also provided. This uses a different algorithm for the match-
                   1244:        ing. The alternative algorithm finds all possible matches (at  a  given
                   1245:        point  in  the  subject), and scans the subject just once (unless there
                   1246:        are lookbehind assertions). However, this  algorithm  does  not  return
                   1247:        captured  substrings.  A description of the two matching algorithms and
                   1248:        their advantages and disadvantages is given in the  pcrematching  docu-
                   1249:        mentation.
                   1250: 
                   1251:        In  addition  to  the  main compiling and matching functions, there are
                   1252:        convenience functions for extracting captured substrings from a subject
                   1253:        string that is matched by pcre_exec(). They are:
                   1254: 
                   1255:          pcre_copy_substring()
                   1256:          pcre_copy_named_substring()
                   1257:          pcre_get_substring()
                   1258:          pcre_get_named_substring()
                   1259:          pcre_get_substring_list()
                   1260:          pcre_get_stringnumber()
                   1261:          pcre_get_stringtable_entries()
                   1262: 
                   1263:        pcre_free_substring() and pcre_free_substring_list() are also provided,
                   1264:        to free the memory used for extracted strings.
                   1265: 
                   1266:        The function pcre_maketables() is used to  build  a  set  of  character
                   1267:        tables   in   the   current   locale  for  passing  to  pcre_compile(),
                   1268:        pcre_exec(), or pcre_dfa_exec(). This is an optional facility  that  is
                   1269:        provided  for  specialist  use.  Most  commonly,  no special tables are
                   1270:        passed, in which case internal tables that are generated when  PCRE  is
                   1271:        built are used.
                   1272: 
                   1273:        The  function  pcre_fullinfo()  is used to find out information about a
1.1.1.2   misho    1274:        compiled pattern. The function pcre_version() returns a  pointer  to  a
                   1275:        string containing the version of PCRE and its date of release.
1.1       misho    1276: 
                   1277:        The  function  pcre_refcount()  maintains  a  reference count in a data
                   1278:        block containing a compiled pattern. This is provided for  the  benefit
                   1279:        of object-oriented applications.
                   1280: 
                   1281:        The  global  variables  pcre_malloc and pcre_free initially contain the
                   1282:        entry points of the standard malloc()  and  free()  functions,  respec-
                   1283:        tively. PCRE calls the memory management functions via these variables,
                   1284:        so a calling program can replace them if it  wishes  to  intercept  the
                   1285:        calls. This should be done before calling any PCRE functions.
                   1286: 
                   1287:        The  global  variables  pcre_stack_malloc  and pcre_stack_free are also
                   1288:        indirections to memory management functions.  These  special  functions
                   1289:        are  used  only  when  PCRE is compiled to use the heap for remembering
                   1290:        data, instead of recursive function calls, when running the pcre_exec()
                   1291:        function.  See  the  pcrebuild  documentation  for details of how to do
                   1292:        this. It is a non-standard way of building PCRE, for  use  in  environ-
                   1293:        ments  that  have  limited stacks. Because of the greater use of memory
                   1294:        management, it runs more slowly. Separate  functions  are  provided  so
                   1295:        that  special-purpose  external  code  can  be used for this case. When
                   1296:        used, these functions are always called in a  stack-like  manner  (last
                   1297:        obtained,  first freed), and always for memory blocks of the same size.
                   1298:        There is a discussion about PCRE's stack usage in the  pcrestack  docu-
                   1299:        mentation.
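
       As an illustration of special-purpose code for these indirections,  the
       sketch below wraps malloc() and free() to record how many blocks are
       live and the peak reached. All names are hypothetical. The counters are
       not thread-safe, and the assignments that install the functions are
       shown only in a comment so that the block compiles without pcre.h.

```c
#include <assert.h>
#include <stdlib.h>

static size_t live_blocks = 0;   /* blocks obtained and not yet freed */
static size_t peak_blocks = 0;   /* highest value live_blocks has reached */

/* Replacement allocator: same signature as the pcre_stack_malloc target. */
void *counting_stack_malloc(size_t size)
{
    void *block = malloc(size);

    if (block != NULL) {
        live_blocks++;
        if (live_blocks > peak_blocks)
            peak_blocks = live_blocks;
    }
    return block;
}

/* Replacement deallocator: same signature as the pcre_stack_free target. */
void counting_stack_free(void *block)
{
    if (block != NULL) {
        assert(live_blocks > 0);
        live_blocks--;
    }
    free(block);
}

size_t stack_live_blocks(void) { return live_blocks; }
size_t stack_peak_blocks(void) { return peak_blocks; }

/* To install (before calling any PCRE function, in a program that
   includes pcre.h):
       pcre_stack_malloc = counting_stack_malloc;
       pcre_stack_free   = counting_stack_free;                        */
```

       After a match, stack_peak_blocks() gives a rough indication of how deep
       the heap-based recursion went.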
                   1300: 
                   1301:        The global variable pcre_callout initially contains NULL. It can be set
                   1302:        by the caller to a "callout" function, which PCRE  will  then  call  at
                   1303:        specified  points during a matching operation. Details are given in the
                   1304:        pcrecallout documentation.
                   1305: 
                   1306: 
                   1307: NEWLINES
                   1308: 
                   1309:        PCRE supports five different conventions for indicating line breaks  in
                   1310:        strings:  a  single  CR (carriage return) character, a single LF (line-
                   1311:        feed) character, the two-character sequence CRLF, any of the three pre-
                   1312:        ceding,  or any Unicode newline sequence. The Unicode newline sequences
                   1313:        are the three just mentioned, plus the single characters  VT  (vertical
1.1.1.3 ! misho    1314:        tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
1.1       misho    1315:        separator, U+2028), and PS (paragraph separator, U+2029).
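
       The first four conventions can be sketched as a small recognizer. This
       is an illustration only, not PCRE's code: the enum and function names
       are hypothetical, the input is assumed to be a NUL-terminated 8-bit
       string, and the Unicode convention is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the first four conventions. */
enum newline_kind { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/* Returns the length in bytes of the newline that starts at s[0] under
   the given convention, or 0 if there is none. */
size_t newline_length(const char *s, enum newline_kind kind)
{
    assert(s != NULL);

    switch (kind) {
    case NL_CR:
        return s[0] == '\r' ? 1 : 0;
    case NL_LF:
        return s[0] == '\n' ? 1 : 0;
    case NL_CRLF:
        return (s[0] == '\r' && s[1] == '\n') ? 2 : 0;
    case NL_ANYCRLF:
        if (s[0] == '\r')                  /* CRLF counts as one newline */
            return s[1] == '\n' ? 2 : 1;
        return s[0] == '\n' ? 1 : 0;
    }
    return 0;
}
```

       The same bytes give different answers under different conventions: at
       the start of "\r\nx" the CR convention sees a one-byte newline, the
       CRLF convention a two-byte newline, and the LF convention none.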
                   1316: 
                   1317:        Each of the first three conventions is used by at least  one  operating
                   1318:        system  as its standard newline sequence. When PCRE is built, a default
                   1319:        can be specified. If none is specified, the default is LF, which is the
                   1320:        Unix standard. When PCRE is run, the default can be overridden,  either
                   1321:        when a pattern is compiled, or when it is matched.
                   1322: 
                   1323:        At compile time, the newline convention can be specified by the options
                   1324:        argument  of  pcre_compile(), or it can be specified by special text at
                   1325:        the start of the pattern itself; this overrides any other settings. See
                   1326:        the pcrepattern page for details of the special character sequences.
                   1327: 
                   1328:        In the PCRE documentation the word "newline" is used to mean "the char-
                   1329:        acter or pair of characters that indicate a line break". The choice  of
                   1330:        newline  convention  affects  the  handling of the dot, circumflex, and
                   1331:        dollar metacharacters, the handling of #-comments in /x mode, and, when
                   1332:        CRLF  is a recognized line ending sequence, the match position advance-
                   1333:        ment for a non-anchored pattern. There is more detail about this in the
                   1334:        section on pcre_exec() options below.
                   1335: 
                   1336:        The  choice of newline convention does not affect the interpretation of
                   1337:        the \n or \r escape sequences, nor does  it  affect  what  \R  matches,
                   1338:        which is controlled in a similar way, but by separate options.
                   1339: 
                   1340: 
                   1341: MULTITHREADING
                   1342: 
                   1343:        The  PCRE  functions  can be used in multi-threading applications, with
                   1344:        the  proviso  that  the  memory  management  functions  pointed  to  by
                   1345:        pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
                   1346:        callout function pointed to by pcre_callout, are shared by all threads.
                   1347: 
                   1348:        The compiled form of a regular expression is not altered during  match-
                   1349:        ing, so the same compiled pattern can safely be used by several threads
                   1350:        at once.
                   1351: 
                   1352:        If the just-in-time optimization feature is being used, it needs  sepa-
                   1353:        rate  memory stack areas for each thread. See the pcrejit documentation
                   1354:        for more details.
                   1355: 
                   1356: 
                   1357: SAVING PRECOMPILED PATTERNS FOR LATER USE
                   1358: 
                   1359:        The compiled form of a regular expression can be saved and re-used at a
                   1360:        later  time,  possibly by a different program, and even on a host other
                   1361:        than the one on which  it  was  compiled.  Details  are  given  in  the
1.1.1.2   misho    1362:        pcreprecompile  documentation,  which  includes  a  description  of the
                   1363:        pcre_pattern_to_host_byte_order() function. However, compiling a  regu-
                   1364:        lar  expression  with one version of PCRE for use with a different ver-
                   1365:        sion is not guaranteed to work and may cause crashes.
1.1       misho    1366: 
                   1367: 
                   1368: CHECKING BUILD-TIME OPTIONS
                   1369: 
                   1370:        int pcre_config(int what, void *where);
                   1371: 
1.1.1.2   misho    1372:        The function pcre_config() makes it possible for a PCRE client to  dis-
1.1       misho    1373:        cover which optional features have been compiled into the PCRE library.
1.1.1.2   misho    1374:        The pcrebuild documentation has more details about these optional  fea-
1.1       misho    1375:        tures.
                   1376: 
1.1.1.2   misho    1377:        The  first  argument  for pcre_config() is an integer, specifying which
1.1       misho    1378:        information is required; the second argument is a pointer to a variable
1.1.1.2   misho    1379:        into  which  the  information  is placed. The returned value is zero on
                   1380:        success, or the negative error code PCRE_ERROR_BADOPTION if  the  value
                   1381:        in  the  first argument is not recognized. The following information is
1.1       misho    1382:        available:
                   1383: 
                   1384:          PCRE_CONFIG_UTF8
                   1385: 
1.1.1.2   misho    1386:        The output is an integer that is set to one if UTF-8 support is  avail-
                   1387:        able;  otherwise  it  is  set  to  zero. If this option is given to the
                   1388:        16-bit  version  of  this  function,  pcre16_config(),  the  result  is
                   1389:        PCRE_ERROR_BADOPTION.
                   1390: 
                   1391:          PCRE_CONFIG_UTF16
                   1392: 
                   1393:        The output is an integer that is set to one if UTF-16 support is avail-
                   1394:        able; otherwise it is set to zero. This value should normally be  given
                   1395:        to the 16-bit version of this function, pcre16_config(). If it is given
                   1396:        to the 8-bit version of this function, the result is  PCRE_ERROR_BADOP-
                   1397:        TION.
1.1       misho    1398: 
                   1399:          PCRE_CONFIG_UNICODE_PROPERTIES
                   1400: 
1.1.1.2   misho    1401:        The  output  is  an  integer  that is set to one if support for Unicode
1.1       misho    1402:        character properties is available; otherwise it is set to zero.
                   1403: 
                   1404:          PCRE_CONFIG_JIT
                   1405: 
                   1406:        The output is an integer that is set to one if support for just-in-time
                   1407:        compiling is available; otherwise it is set to zero.
                   1408: 
1.1.1.2   misho    1409:          PCRE_CONFIG_JITTARGET
                   1410: 
                   1411:        The  output is a pointer to a zero-terminated "const char *" string. If
                   1412:        JIT support is available, the string contains the name of the architec-
                   1413:        ture  for  which the JIT compiler is configured, for example "x86 32bit
                   1414:        (little endian + unaligned)". If JIT  support  is  not  available,  the
                   1415:        result is NULL.
                   1416: 
1.1       misho    1417:          PCRE_CONFIG_NEWLINE
                   1418: 
1.1.1.2   misho    1419:        The  output  is  an integer whose value specifies the default character
                   1420:        sequence that is recognized as meaning "newline". The five values  that
1.1       misho    1421:        are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
1.1.1.2   misho    1422:        and -1 for ANY.  Though they are derived from ASCII,  the  same  values
1.1       misho    1423:        are returned in EBCDIC environments. The default should normally corre-
                   1424:        spond to the standard sequence for your operating system.
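
The values listed above can be turned into readable names with a small lookup. The following is a minimal sketch: newline_name() is a hypothetical helper, not a PCRE function, and it assumes that the caller has already obtained the integer via PCRE_CONFIG_NEWLINE. The CRLF value is simply the two character codes combined (13 * 256 + 10).

```c
#include <stddef.h>

/* Map the integer reported for PCRE_CONFIG_NEWLINE to a readable name.
   The values are the ones listed in the text; newline_name() itself is
   a hypothetical helper and is not part of the PCRE API. */
const char *newline_name(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";     /* 13 * 256 + 10 */
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return NULL;       /* not a documented value */
    }
}
```

A caller would typically treat a NULL result as "unknown" rather than as an error, since later releases could add further values.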
                   1425: 
                   1426:          PCRE_CONFIG_BSR
                   1427: 
                   1428:        The output is an integer whose value indicates what character sequences
1.1.1.2   misho    1429:        the  \R  escape sequence matches by default. A value of 0 means that \R
                   1430:        matches any Unicode line ending sequence; a value of 1  means  that  \R
1.1       misho    1431:        matches only CR, LF, or CRLF. The default can be overridden when a pat-
                   1432:        tern is compiled or matched.
                   1433: 
                   1434:          PCRE_CONFIG_LINK_SIZE
                   1435: 
1.1.1.2   misho    1436:        The output is an integer that contains the number  of  bytes  used  for
                   1437:        internal  linkage  in  compiled  regular  expressions.  For  the  8-bit
                   1438:        library, the value can be 2, 3, or 4. For the 16-bit library, the value
                   1439:        is either 2 or 4 and is still a number of bytes. The default value of 2
                   1440:        is sufficient for all but the most massive patterns,  since  it  allows
                   1441:        the  compiled  pattern  to  be  up to 64K in size.  Larger values allow
                   1442:        larger regular expressions to be compiled, at  the  expense  of  slower
                   1443:        matching.
1.1       misho    1444: 
                   1445:          PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
                   1446: 
1.1.1.2   misho    1447:        The  output  is  an integer that contains the threshold above which the
                   1448:        POSIX interface uses malloc() for output vectors. Further  details  are
1.1       misho    1449:        given in the pcreposix documentation.
                   1450: 
                   1451:          PCRE_CONFIG_MATCH_LIMIT
                   1452: 
1.1.1.2   misho    1453:        The  output is a long integer that gives the default limit for the num-
                   1454:        ber of internal matching function calls  in  a  pcre_exec()  execution.
1.1       misho    1455:        Further details are given with pcre_exec() below.
                   1456: 
                   1457:          PCRE_CONFIG_MATCH_LIMIT_RECURSION
                   1458: 
                   1459:        The output is a long integer that gives the default limit for the depth
1.1.1.2   misho    1460:        of  recursion  when  calling  the  internal  matching  function  in   a
                   1461:        pcre_exec()  execution.  Further  details  are  given  with pcre_exec()
1.1       misho    1462:        below.
                   1463: 
                   1464:          PCRE_CONFIG_STACKRECURSE
                   1465: 
1.1.1.2   misho    1466:        The output is an integer that is set to one if internal recursion  when
1.1       misho    1467:        running pcre_exec() is implemented by recursive function calls that use
1.1.1.2   misho    1468:        the stack to remember their state. This is the usual way that  PCRE  is
1.1       misho    1469:        compiled. The output is zero if PCRE was compiled to use blocks of data
1.1.1.2   misho    1470:        on the  heap  instead  of  recursive  function  calls.  In  this  case,
                   1471:        pcre_stack_malloc  and  pcre_stack_free  are  called  to  manage memory
1.1       misho    1472:        blocks on the heap, thus avoiding the use of the stack.
                   1473: 
                   1474: 
                   1475: COMPILING A PATTERN
                   1476: 
                   1477:        pcre *pcre_compile(const char *pattern, int options,
                   1478:             const char **errptr, int *erroffset,
                   1479:             const unsigned char *tableptr);
                   1480: 
                   1481:        pcre *pcre_compile2(const char *pattern, int options,
                   1482:             int *errorcodeptr,
                   1483:             const char **errptr, int *erroffset,
                   1484:             const unsigned char *tableptr);
                   1485: 
                   1486:        Either of the functions pcre_compile() or pcre_compile2() can be called
                   1487:        to compile a pattern into an internal form. The only difference between
1.1.1.2   misho    1488:        the two interfaces is that pcre_compile2() has an additional  argument,
                   1489:        errorcodeptr,  via  which  a  numerical  error code can be returned. To
                   1490:        avoid too much repetition, we refer just to pcre_compile()  below,  but
1.1       misho    1491:        the information applies equally to pcre_compile2().
                   1492: 
                   1493:        The pattern is a C string terminated by a binary zero, and is passed in
1.1.1.2   misho    1494:        the pattern argument. A pointer to a single block  of  memory  that  is
                   1495:        obtained  via  pcre_malloc is returned. This contains the compiled code
1.1       misho    1496:        and related data. The pcre type is defined for the returned block; this
                   1497:        is a typedef for a structure whose contents are not externally defined.
                   1498:        It is up to the caller to free the memory (via pcre_free) when it is no
                   1499:        longer required.
                   1500: 
1.1.1.2   misho    1501:        Although  the compiled code of a PCRE regex is relocatable, that is, it
1.1       misho    1502:        does not depend on memory location, the complete pcre data block is not
1.1.1.2   misho    1503:        fully  relocatable, because it may contain a copy of the tableptr argu-
1.1       misho    1504:        ment, which is an address (see below).
                   1505: 
                   1506:        The options argument contains various bit settings that affect the com-
1.1.1.2   misho    1507:        pilation.  It  should be zero if no options are required. The available
                   1508:        options are described below. Some of them (in  particular,  those  that
                   1509:        are  compatible with Perl, but some others as well) can also be set and
                   1510:        unset from within the pattern (see  the  detailed  description  in  the
                   1511:        pcrepattern  documentation). For those options that can be different in
                   1512:        different parts of the pattern, the contents of  the  options  argument
1.1       misho    1513:        specify  their settings at the start of compilation and execution. The
1.1.1.2   misho    1514:        PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK,  and
1.1.1.3 ! misho    1515:        PCRE_NO_START_OPTIMIZE  options  can  be set at the time of matching as
        !          1516:        well as at compile time.
1.1       misho    1517: 
                   1518:        If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
1.1.1.2   misho    1519:        if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and
1.1       misho    1520:        sets the variable pointed to by errptr to point to a textual error mes-
                   1521:        sage. This is a static string that is part of the library. You must not
1.1.1.2   misho    1522:        try to free it. Normally, the offset from the start of the  pattern  to
                   1523:        the  byte  that  was  being  processed when the error was discovered is
                   1524:        placed in the variable pointed to by erroffset, which must not be  NULL
                   1525:        (if  it is, an immediate error is given). However, for an invalid UTF-8
                   1526:        string, the offset is that of the first byte of the failing character.
1.1       misho    1527: 
1.1.1.2   misho    1528:        Some errors are not detected until the whole pattern has been  scanned;
                   1529:        in  these  cases,  the offset passed back is the length of the pattern.
1.1       misho    1530:        Note that the offset is in bytes, not characters, even in  UTF-8  mode.
                   1531:        It may sometimes point into the middle of a UTF-8 character.
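
Because the offset counts bytes, a program that wants to show the position in characters has to convert it. The sketch below is one way to do that, assuming the pattern is valid UTF-8; utf8_char_index() is a hypothetical helper, not a PCRE function. It also copes with an offset that points into the middle of a character or at the end of the pattern.

```c
#include <stddef.h>

/* Return the 0-based index of the UTF-8 character that contains the byte
   at `byte_offset` in `pattern`. An offset equal to the pattern length
   yields the number of characters. Hypothetical helper; it assumes that
   `pattern` is a valid, zero-terminated UTF-8 string. */
size_t utf8_char_index(const char *pattern, size_t byte_offset)
{
    size_t index = 0;
    size_t i;

    for (i = 1; i <= byte_offset; i++) {
        if (pattern[i - 1] == '\0')
            break;                  /* offset lies beyond the terminator */
        /* Bytes of the form 10xxxxxx continue the current character;
           any other byte starts a new one (or is the terminator). */
        if (((unsigned char)pattern[i] & 0xC0) != 0x80)
            index++;
    }
    return index;
}
```

For the pattern "a\xC3\xA9(" (the letter a, a two-byte character, and an opening parenthesis), byte offset 3 corresponds to character index 2.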
                   1532: 
                   1533:        If  pcre_compile2()  is  used instead of pcre_compile(), and the error-
                   1534:        codeptr argument is not NULL, a non-zero error code number is  returned
                   1535:        via  this argument in the event of an error. This is in addition to the
                   1536:        textual error message. Error codes and messages are listed below.
                   1537: 
                   1538:        If the final argument, tableptr, is NULL, PCRE uses a  default  set  of
                   1539:        character  tables  that  are  built  when  PCRE  is compiled, using the
                   1540:        default C locale. Otherwise, tableptr must be an address  that  is  the
                   1541:        result  of  a  call to pcre_maketables(). This value is stored with the
                   1542:        compiled pattern, and used again by pcre_exec(), unless  another  table
                   1543:        pointer is passed to it. For more discussion, see the section on locale
                   1544:        support below.
                   1545: 
                   1546:        This code fragment shows a typical straightforward  call  to  pcre_com-
                   1547:        pile():
                   1548: 
                   1549:          pcre *re;
                   1550:          const char *error;
                   1551:          int erroffset;
                   1552:          re = pcre_compile(
                   1553:            "^A.*Z",          /* the pattern */
                   1554:            0,                /* default options */
                   1555:            &error,           /* for error message */
                   1556:            &erroffset,       /* for error offset */
                   1557:            NULL);            /* use default character tables */
                   1558: 
                   1559:        The  following  names  for option bits are defined in the pcre.h header
                   1560:        file:
                   1561: 
                   1562:          PCRE_ANCHORED
                   1563: 
                   1564:        If this bit is set, the pattern is forced to be "anchored", that is, it
                   1565:        is  constrained to match only at the first matching point in the string
                   1566:        that is being searched (the "subject string"). This effect can also  be
                   1567:        achieved  by appropriate constructs in the pattern itself, which is the
                   1568:        only way to do it in Perl.
                   1569: 
                   1570:          PCRE_AUTO_CALLOUT
                   1571: 
                   1572:        If this bit is set, pcre_compile() automatically inserts callout items,
                   1573:        all  with  number  255, before each pattern item. For discussion of the
                   1574:        callout facility, see the pcrecallout documentation.
                   1575: 
                   1576:          PCRE_BSR_ANYCRLF
                   1577:          PCRE_BSR_UNICODE
                   1578: 
                   1579:        These options (which are mutually exclusive) control what the \R escape
                   1580:        sequence  matches.  The choice is either to match only CR, LF, or CRLF,
                   1581:        or to match any Unicode newline sequence. The default is specified when
                   1582:        PCRE is built. It can be overridden from within the pattern, or by set-
                   1583:        ting an option when a compiled pattern is matched.
                   1584: 
                   1585:          PCRE_CASELESS
                   1586: 
                   1587:        If this bit is set, letters in the pattern match both upper  and  lower
                   1588:        case  letters.  It  is  equivalent  to  Perl's /i option, and it can be
                   1589:        changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE
                   1590:        always  understands the concept of case for characters whose values are
                   1591:        less than 128, so caseless matching is always possible. For  characters
                   1592:        with  higher  values,  the concept of case is supported if PCRE is com-
                   1593:        piled with Unicode property support, but not otherwise. If you want  to
                   1594:        use  caseless  matching  for  characters 128 and above, you must ensure
                   1595:        that PCRE is compiled with Unicode property support  as  well  as  with
                   1596:        UTF-8 support.
                   1597: 
                   1598:          PCRE_DOLLAR_ENDONLY
                   1599: 
                   1600:        If  this bit is set, a dollar metacharacter in the pattern matches only
                   1601:        at the end of the subject string. Without this option,  a  dollar  also
                   1602:        matches  immediately before a newline at the end of the string (but not
                   1603:        before any other newlines). The PCRE_DOLLAR_ENDONLY option  is  ignored
                   1604:        if  PCRE_MULTILINE  is  set.   There is no equivalent to this option in
                   1605:        Perl, and no way to set it within a pattern.
                   1606: 
                   1607:          PCRE_DOTALL
                   1608: 
                   1609:        If this bit is set, a dot metacharacter in the pattern matches a  char-
                   1610:        acter of any value, including one that indicates a newline. However, it
                   1611:        only ever matches one character, even if newlines are  coded  as  CRLF.
                   1612:        Without  this option, a dot does not match when the current position is
                   1613:        at a newline. This option is equivalent to Perl's /s option, and it can
                   1614:        be  changed within a pattern by a (?s) option setting. A negative class
                   1615:        such as [^a] always matches newline characters, independent of the set-
                   1616:        ting of this option.
                   1617: 
                   1618:          PCRE_DUPNAMES
                   1619: 
                   1620:        If  this  bit is set, names used to identify capturing subpatterns need
                   1621:        not be unique. This can be helpful for certain types of pattern when it
                   1622:        is  known  that  only  one instance of the named subpattern can ever be
                   1623:        matched. There are more details of named subpatterns  below;  see  also
                   1624:        the pcrepattern documentation.
                   1625: 
                   1626:          PCRE_EXTENDED
                   1627: 
1.1.1.3 ! misho    1628:        If  this  bit  is  set,  white space data characters in the pattern are
        !          1629:        totally ignored except when escaped or inside a character class.  White
1.1       misho    1630:        space does not include the VT character (code 11). In addition, charac-
                   1631:        ters between an unescaped # outside a character class and the next new-
                   1632:        line,  inclusive,  are  also  ignored.  This is equivalent to Perl's /x
                   1633:        option, and it can be changed within a pattern by a  (?x)  option  set-
                   1634:        ting.
                   1635: 
                   1636:        Which  characters  are  interpreted  as  newlines  is controlled by the
                   1637:        options passed to pcre_compile() or by a special sequence at the  start
                   1638:        of  the  pattern, as described in the section entitled "Newline conven-
                   1639:        tions" in the pcrepattern documentation. Note that the end of this type
                   1640:        of  comment  is  a  literal  newline  sequence  in  the pattern; escape
                   1641:        sequences that happen to represent a newline do not count.
                   1642: 
                   1643:        This option makes it possible to include  comments  inside  complicated
                   1644:        patterns.   Note,  however,  that this applies only to data characters.
1.1.1.3 ! misho    1645:        White space  characters  may  never  appear  within  special  character
1.1       misho    1646:        sequences in a pattern, for example within the sequence (?( that intro-
                   1647:        duces a conditional subpattern.
                   1648: 
                   1649:          PCRE_EXTRA
                   1650: 
                   1651:        This option was invented in order to turn on  additional  functionality
                   1652:        of  PCRE  that  is  incompatible with Perl, but it is currently of very
                   1653:        little use. When set, any backslash in a pattern that is followed by  a
                   1654:        letter  that  has  no  special  meaning causes an error, thus reserving
                   1655:        these combinations for future expansion. By  default,  as  in  Perl,  a
                   1656:        backslash  followed by a letter with no special meaning is treated as a
                   1657:        literal. (Perl can, however, be persuaded to give an error for this, by
                   1658:        running  it with the -w option.) There are at present no other features
                   1659:        controlled by this option. It can also be set by a (?X) option  setting
                   1660:        within a pattern.
                   1661: 
                   1662:          PCRE_FIRSTLINE
                   1663: 
                   1664:        If  this  option  is  set,  an  unanchored pattern is required to match
                   1665:        before or at the first  newline  in  the  subject  string,  though  the
                   1666:        matched text may continue over the newline.
                   1667: 
                   1668:          PCRE_JAVASCRIPT_COMPAT
                   1669: 
                   1670:        If this option is set, PCRE's behaviour is changed in some ways so that
                   1671:        it is compatible with JavaScript rather than Perl. The changes  are  as
                   1672:        follows:
                   1673: 
                   1674:        (1)  A  lone  closing square bracket in a pattern causes a compile-time
                   1675:        error, because this is illegal in JavaScript (by default it is  treated
                   1676:        as a data character). Thus, the pattern AB]CD becomes illegal when this
                   1677:        option is set.
                   1678: 
                   1679:        (2) At run time, a back reference to an unset subpattern group  matches
                   1680:        an  empty  string (by default this causes the current matching alterna-
                   1681:        tive to fail). A pattern such as (\1)(a) succeeds when this  option  is
                   1682:        set  (assuming  it can find an "a" in the subject), whereas it fails by
                   1683:        default, for Perl compatibility.
                   1684: 
                   1685:        (3) \U matches an upper case "U" character; by default \U causes a com-
                   1686:        pile time error (Perl uses \U to upper case subsequent characters).
                   1687: 
                   1688:        (4) \u matches a lower case "u" character unless it is followed by four
                   1689:        hexadecimal digits, in which case the hexadecimal  number  defines  the
                   1690:        code  point  to match. By default, \u causes a compile time error (Perl
                   1691:        uses it to upper case the following character).
                   1692: 
                   1693:        (5) \x matches a lower case "x" character unless it is followed by  two
                   1694:        hexadecimal  digits,  in  which case the hexadecimal number defines the
                   1695:        code point to match. By default, as in Perl, a  hexadecimal  number  is
                   1696:        always expected after \x, but it may have zero, one, or two digits (so,
                   1697:        for example, \xz matches a binary zero character followed by z).
                   1698: 
                   1699:          PCRE_MULTILINE
                   1700: 
                   1701:        By default, PCRE treats the subject string as consisting  of  a  single
                   1702:        line  of characters (even if it actually contains newlines). The "start
                   1703:        of line" metacharacter (^) matches only at the  start  of  the  string,
                   1704:        while  the  "end  of line" metacharacter ($) matches only at the end of
                   1705:        the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
                   1706:        is set). This is the same as Perl.
                   1707: 
                   1708:        When  PCRE_MULTILINE  is  set,  the "start of line" and "end of line"
                   1709:        constructs match immediately following or immediately  before  internal
                   1710:        newlines  in  the  subject string, respectively, as well as at the very
                   1711:        start and end. This is equivalent to Perl's /m option, and  it  can  be
                   1712:        changed within a pattern by a (?m) option setting. If there are no new-
                   1713:        lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
                   1714:        setting PCRE_MULTILINE has no effect.
                   1715: 
                   1716:          PCRE_NEWLINE_CR
                   1717:          PCRE_NEWLINE_LF
                   1718:          PCRE_NEWLINE_CRLF
                   1719:          PCRE_NEWLINE_ANYCRLF
                   1720:          PCRE_NEWLINE_ANY
                   1721: 
                   1722:        These  options  override the default newline definition that was chosen
                   1723:        when PCRE was built. Setting the first or the second specifies  that  a
                   1724:        newline  is  indicated  by a single character (CR or LF, respectively).
                   1725:        Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
                   1726:        two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
                   1727:        that any of the three preceding sequences should be recognized. Setting
                   1728:        PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
                   1729:        recognized. The Unicode newline sequences are the three just mentioned,
1.1.1.3 ! misho    1730:        plus  the  single  characters VT (vertical tab, U+000B), FF (form feed,
1.1       misho    1731:        U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1.1.1.2   misho    1732:        (paragraph  separator, U+2029). For the 8-bit library, the last two are
                   1733:        recognized only in UTF-8 mode.
1.1       misho    1734: 
                   1735:        The newline setting in the  options  word  uses  three  bits  that  are
                   1736:        treated as a number, giving eight possibilities. Currently only six are
                   1737:        used (default plus the five values above). This means that if  you  set
                   1738:        more  than one newline option, the combination may or may not be sensi-
                   1739:        ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
                   1740:        PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
                   1741:        cause an error.
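
The three-bit field can be inspected as sketched below. The NL_* names are hypothetical; their values are assumed to match the PCRE_NEWLINE_xxx definitions in the pcre.h header of the 8.x series, and are repeated here only so that the fragment compiles on its own. Real code should include pcre.h and use the PCRE_NEWLINE_xxx names.

```c
/* Newline option bits, assumed to match pcre.h (8.x series). The NL_*
   names are hypothetical stand-ins for the PCRE_NEWLINE_xxx macros. */
#define NL_CR       0x00100000
#define NL_LF       0x00200000
#define NL_CRLF     0x00300000
#define NL_ANY      0x00400000
#define NL_ANYCRLF  0x00500000
#define NL_MASK     0x00700000   /* the three bits that hold the field */

/* Extract the newline field as a number from 0 to 7 (0 means default). */
int newline_field(int options)
{
    return (options & NL_MASK) >> 20;
}

/* Numbers 0 to 5 are in use; 6 and 7 are the unused possibilities. */
int newline_field_in_use(int options)
{
    return newline_field(options) <= 5;
}
```

With these values, NL_CR combined with NL_LF gives the same field as NL_CRLF, whereas NL_LF combined with NL_ANY gives 6, one of the unused numbers.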
                   1742: 
                   1743:        The only time that a line break in a pattern  is  specially  recognized
1.1.1.3 ! misho    1744:        when  compiling is when PCRE_EXTENDED is set. CR and LF are white space
1.1       misho    1745:        characters, and so are ignored in this mode. Also, an unescaped #  out-
                   1746:        side  a  character class indicates a comment that lasts until after the
                   1747:        next line break sequence. In other circumstances, line break  sequences
                   1748:        in patterns are treated as literal data.
                   1749: 
                   1750:        The newline option that is set at compile time becomes the default that
                   1751:        is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
                   1752: 
                   1753:          PCRE_NO_AUTO_CAPTURE
                   1754: 
                   1755:        If this option is set, it disables the use of numbered capturing paren-
                   1756:        theses  in the pattern. Any opening parenthesis that is not followed by
                   1757:        ? behaves as if it were followed by ?: but named parentheses can  still
                   1758:        be  used  for  capturing  (and  they acquire numbers in the usual way).
                   1759:        There is no equivalent of this option in Perl.
                   1760: 
                   1761:          PCRE_NO_START_OPTIMIZE
                   1762: 
                   1763:        This is an option that acts at matching time; that is, it is really  an
                   1764:        option  for  pcre_exec()  or  pcre_dfa_exec().  If it is set at compile
                   1765:        time, it is remembered with the compiled pattern and assumed at  match-
                   1766:        ing  time.  For  details  see  the discussion of PCRE_NO_START_OPTIMIZE
                   1767:        below.
                   1768: 
                   1769:          PCRE_UCP
                   1770: 
                   1771:        This option changes the way PCRE processes \B, \b, \D, \d, \S, \s,  \W,
                   1772:        \w,  and  some  of  the POSIX character classes. By default, only ASCII
                   1773:        characters are recognized, but if PCRE_UCP is set,  Unicode  properties
                   1774:        are  used instead to classify characters. More details are given in the
                   1775:        section on generic character types in the pcrepattern page. If you  set
                   1776:        PCRE_UCP,  matching  one of the items it affects takes much longer. The
                   1777:        option is available only if PCRE has been compiled with  Unicode  prop-
                   1778:        erty support.
                   1779: 
                   1780:          PCRE_UNGREEDY
                   1781: 
                   1782:        This  option  inverts  the "greediness" of the quantifiers so that they
                   1783:        are not greedy by default, but become greedy if followed by "?". It  is
                   1784:        not  compatible  with Perl. It can also be set by a (?U) option setting
                   1785:        within the pattern.
                   1786: 
                   1787:          PCRE_UTF8
                   1788: 
                   1789:        This option causes PCRE to regard both the pattern and the  subject  as
1.1.1.2   misho    1790:        strings of UTF-8 characters instead of single-byte strings. However, it
                   1791:        is available only when PCRE is built to include UTF  support.  If  not,
                   1792:        the  use  of  this option provokes an error. Details of how this option
                   1793:        changes the behaviour of PCRE are given in the pcreunicode page.
1.1       misho    1794: 
                   1795:          PCRE_NO_UTF8_CHECK
                   1796: 
                   1797:        When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1.1.1.2   misho    1798:        automatically  checked.  There  is  a  discussion about the validity of
                   1799:        UTF-8 strings in the pcreunicode page. If an invalid UTF-8 sequence  is
                   1800:        found,  pcre_compile()  returns an error. If you already know that your
                   1801:        pattern is valid, and you want to skip this check for performance  rea-
                   1802:        sons,  you  can set the PCRE_NO_UTF8_CHECK option.  When it is set, the
                   1803:        effect of passing an invalid UTF-8 string as a pattern is undefined. It
                   1804:        may  cause  your  program  to  crash. Note that this option can also be
                   1805:        passed to pcre_exec() and pcre_dfa_exec(),  to  suppress  the  validity
                   1806:        checking of subject strings.
1.1       misho    1807: 
                   1808: 
                   1809: COMPILATION ERROR CODES
                   1810: 
1.1.1.2   misho    1811:        The  following  table  lists  the  error  codes that may be returned by
                   1812:        pcre_compile2(), along with the error messages that may be returned  by
                   1813:        both  compiling  functions.  Note  that error messages are always 8-bit
                   1814:        ASCII strings, even in 16-bit mode. As PCRE has developed,  some  error
                   1815:        codes  have  fallen  out of use. To avoid confusion, they have not been
                   1816:        re-used.
1.1       misho    1817: 
                   1818:           0  no error
                   1819:           1  \ at end of pattern
                   1820:           2  \c at end of pattern
                   1821:           3  unrecognized character follows \
                   1822:           4  numbers out of order in {} quantifier
                   1823:           5  number too big in {} quantifier
                   1824:           6  missing terminating ] for character class
                   1825:           7  invalid escape sequence in character class
                   1826:           8  range out of order in character class
                   1827:           9  nothing to repeat
                   1828:          10  [this code is not in use]
                   1829:          11  internal error: unexpected repeat
                   1830:          12  unrecognized character after (? or (?-
                   1831:          13  POSIX named classes are supported only within a class
                   1832:          14  missing )
                   1833:          15  reference to non-existent subpattern
                   1834:          16  erroffset passed as NULL
                   1835:          17  unknown option bit(s) set
                   1836:          18  missing ) after comment
                   1837:          19  [this code is not in use]
                   1838:          20  regular expression is too large
                   1839:          21  failed to get memory
                   1840:          22  unmatched parentheses
                   1841:          23  internal error: code overflow
                   1842:          24  unrecognized character after (?<
                   1843:          25  lookbehind assertion is not fixed length
                   1844:          26  malformed number or name after (?(
                   1845:          27  conditional group contains more than two branches
                   1846:          28  assertion expected after (?(
                   1847:          29  (?R or (?[+-]digits must be followed by )
                   1848:          30  unknown POSIX class name
                   1849:          31  POSIX collating elements are not supported
1.1.1.2   misho    1850:          32  this version of PCRE is compiled without UTF support
1.1       misho    1851:          33  [this code is not in use]
                   1852:          34  character value in \x{...} sequence is too large
                   1853:          35  invalid condition (?(0)
                   1854:          36  \C not allowed in lookbehind assertion
                   1855:          37  PCRE does not support \L, \l, \N{name}, \U, or \u
                   1856:          38  number after (?C is > 255
                   1857:          39  closing ) for (?C expected
                   1858:          40  recursive call could loop indefinitely
                   1859:          41  unrecognized character after (?P
                   1860:          42  syntax error in subpattern name (missing terminator)
                   1861:          43  two named subpatterns have the same name
1.1.1.2   misho    1862:          44  invalid UTF-8 string (specifically UTF-8)
1.1       misho    1863:          45  support for \P, \p, and \X has not been compiled
                   1864:          46  malformed \P or \p sequence
                   1865:          47  unknown property name after \P or \p
                   1866:          48  subpattern name is too long (maximum 32 characters)
                   1867:          49  too many named subpatterns (maximum 10000)
                   1868:          50  [this code is not in use]
1.1.1.2   misho    1869:          51  octal value is greater than \377 in 8-bit non-UTF-8 mode
1.1       misho    1870:          52  internal error: overran compiling workspace
                   1871:          53  internal error: previously-checked referenced subpattern
                   1872:                not found
                   1873:          54  DEFINE group contains more than one branch
                   1874:          55  repeating a DEFINE group is not allowed
                   1875:          56  inconsistent NEWLINE options
                   1876:          57  \g is not followed by a braced, angle-bracketed, or quoted
                   1877:                name/number or by a plain number
                   1878:          58  a numbered reference must not be zero
                   1879:          59  an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
                   1880:          60  (*VERB) not recognized
                   1881:          61  number is too big
                   1882:          62  subpattern name expected
                   1883:          63  digit expected after (?+
                   1884:          64  ] is an invalid data character in JavaScript compatibility mode
                   1885:          65  different names for subpatterns of the same number are
                   1886:                not allowed
                   1887:          66  (*MARK) must have an argument
1.1.1.2   misho    1888:          67  this version of PCRE is not compiled with Unicode property
                   1889:                support
1.1       misho    1890:          68  \c must be followed by an ASCII character
                   1891:          69  \k is not followed by a braced, angle-bracketed, or quoted name
1.1.1.2   misho    1892:          70  internal error: unknown opcode in find_fixedlength()
                   1893:          71  \N is not supported in a class
                   1894:          72  too many forward references
                   1895:          73  disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
                   1896:          74  invalid UTF-16 string (specifically UTF-16)
1.1.1.3 ! misho    1897:          75  name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
        !          1898:          76  character value in \u.... sequence is too large
1.1       misho    1899: 
1.1.1.2   misho    1900:        The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different
1.1       misho    1901:        values may be used if the limits were changed when PCRE was built.
                   1902: 
                   1903: 
                   1904: STUDYING A PATTERN
                   1905: 
                   1906:        pcre_extra *pcre_study(const pcre *code, int options,
                   1907:             const char **errptr);
                   1908: 
1.1.1.2   misho    1909:        If  a  compiled  pattern is going to be used several times, it is worth
1.1       misho    1910:        spending more time analyzing it in order to speed up the time taken for
1.1.1.2   misho    1911:        matching.  The function pcre_study() takes a pointer to a compiled pat-
1.1       misho    1912:        tern as its first argument. If studying the pattern produces additional
1.1.1.2   misho    1913:        information  that  will  help speed up matching, pcre_study() returns a
                   1914:        pointer to a pcre_extra block, in which the study_data field points  to
1.1       misho    1915:        the results of the study.
                   1916: 
                   1917:        The  returned  value  from  pcre_study()  can  be  passed  directly  to
1.1.1.2   misho    1918:        pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block  also  con-
                   1919:        tains  other  fields  that can be set by the caller before the block is
1.1       misho    1920:        passed; these are described below in the section on matching a pattern.
                   1921: 
1.1.1.2   misho    1922:        If studying the  pattern  does  not  produce  any  useful  information,
1.1       misho    1923:        pcre_study() returns NULL. In that circumstance, if the calling program
1.1.1.2   misho    1924:        wants  to  pass  any  of   the   other   fields   to   pcre_exec()   or
1.1       misho    1925:        pcre_dfa_exec(), it must set up its own pcre_extra block.
                   1926: 
1.1.1.3 ! misho    1927:        The  second  argument  of  pcre_study() contains option bits. There are
        !          1928:        three options:
        !          1929: 
        !          1930:          PCRE_STUDY_JIT_COMPILE
        !          1931:          PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
        !          1932:          PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
        !          1933: 
        !          1934:        If any of these are set, and the just-in-time  compiler  is  available,
        !          1935:        the  pattern  is  further compiled into machine code that executes much
        !          1936:        faster than the pcre_exec()  interpretive  matching  function.  If  the
        !          1937:        just-in-time  compiler is not available, these options are ignored. All
        !          1938:        other bits in the options argument must be zero.
1.1       misho    1939: 
1.1.1.2   misho    1940:        JIT compilation is a heavyweight optimization. It can  take  some  time
                   1941:        for  patterns  to  be analyzed, and for one-off matches and simple pat-
                   1942:        terns the benefit of faster execution might be offset by a much  slower
1.1       misho    1943:        study time.  Not all patterns can be optimized by the JIT compiler. For
1.1.1.2   misho    1944:        those that cannot be handled, matching automatically falls back to  the
                   1945:        pcre_exec()  interpreter.  For more details, see the pcrejit documenta-
1.1       misho    1946:        tion.
                   1947: 
1.1.1.2   misho    1948:        The third argument for pcre_study() is a pointer for an error  message.
                   1949:        If  studying  succeeds  (even  if no data is returned), the variable it
                   1950:        points to is set to NULL. Otherwise it is set to  point  to  a  textual
1.1       misho    1951:        error message. This is a static string that is part of the library. You
1.1.1.2   misho    1952:        must not try to free it. You should test the  error  pointer  for  NULL
1.1       misho    1953:        after calling pcre_study(), to be sure that it has run successfully.
                   1954: 
1.1.1.2   misho    1955:        When  you are finished with a pattern, you can free the memory used for
1.1       misho    1956:        the study data by calling pcre_free_study(). This function was added to
1.1.1.2   misho    1957:        the  API  for  release  8.20. For earlier versions, the memory could be
                   1958:        freed with pcre_free(), just like the pattern itself. This  will  still
1.1.1.3 ! misho    1959:        work  in  cases where JIT optimization is not used, but it is advisable
        !          1960:        to change to the new function when convenient.
1.1       misho    1961: 
1.1.1.2   misho    1962:        This is a typical way in which pcre_study() is used (except that  in  a
1.1       misho    1963:        real application there should be tests for errors):
                   1964: 
                   1965:          int rc;
                   1966:          pcre *re;
                   1967:          pcre_extra *sd;
                   1968:          re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
                   1969:          sd = pcre_study(
                   1970:            re,             /* result of pcre_compile() */
                   1971:            0,              /* no options */
                   1972:            &error);        /* set to NULL or points to a message */
                   1973:          rc = pcre_exec(   /* see below for details of pcre_exec() options */
                   1974:            re, sd, "subject", 7, 0, 0, ovector, 30);
                   1975:          ...
                   1976:          pcre_free_study(sd);
                   1977:          pcre_free(re);
                   1978: 
                   1979:        Studying a pattern does two things: first, a lower bound for the length
                   1980:        of subject string that is needed to match the pattern is computed. This
                   1981:        does not mean that there are any strings of that length that match, but
1.1.1.2   misho    1982:        it does guarantee that no shorter strings match. The value is  used  by
                   1983:        pcre_exec()  and  pcre_dfa_exec()  to  avoid  wasting time by trying to
                   1984:        match strings that are shorter than the lower bound. You can  find  out
1.1       misho    1985:        the value in a calling program via the pcre_fullinfo() function.
                   1986: 
                   1987:        Studying a pattern is also useful for non-anchored patterns that do not
1.1.1.2   misho    1988:        have a single fixed starting character. A bitmap of  possible  starting
                   1989:        bytes  is  created. This speeds up finding a position in the subject at
                   1990:        which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
                   1991:        values less than 256.)
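
The way a starting-byte bitmap can speed up the search for a match position may be easier to see in code. The following is only a sketch, not PCRE's implementation: the function names are hypothetical, and the bit layout (bit c%8 of byte c/8) is an assumption made for the example, since the text above does not specify it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: mark byte value c as a possible starting byte.
   Assumed layout: 256 bits in 32 bytes, bit (c % 8) of byte (c / 8). */
void bitmap_set(unsigned char *bitmap, unsigned char c)
{
    assert(bitmap != NULL);
    bitmap[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* Hypothetical helper: nonzero if byte value c is marked in the bitmap. */
int bitmap_has(const unsigned char *bitmap, unsigned char c)
{
    assert(bitmap != NULL);
    return (bitmap[c / 8] >> (c % 8)) & 1;
}

/* Return the offset of the first byte in the subject that could start a
   match, or length if no byte of the subject is marked in the bitmap. */
size_t first_candidate(const unsigned char *bitmap,
                       const unsigned char *subject, size_t length)
{
    size_t i;
    for (i = 0; i < length; i++) {
        if (bitmap_has(bitmap, subject[i]))
            return i;
    }
    return length;
}
```

A matcher built this way would attempt a full match only at offsets returned by first_candidate(), skipping every position whose byte cannot begin a match.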
1.1       misho    1992: 
1.1.1.3 ! misho    1993:        These  two optimizations apply to both pcre_exec() and pcre_dfa_exec(),
        !          1994:        and the information is also used by the JIT  compiler.   The  optimiza-
        !          1995:        tions can be disabled by setting the PCRE_NO_START_OPTIMIZE option when
        !          1996:        calling pcre_exec() or pcre_dfa_exec(), but if this is done, JIT execu-
        !          1997:        tion  is  also disabled. You might want to do this if your pattern con-
        !          1998:        tains callouts or (*MARK) and you want to make use of these  facilities
        !          1999:        in    cases    where    matching   fails.   See   the   discussion   of
        !          2000:        PCRE_NO_START_OPTIMIZE below.
1.1       misho    2001: 
                   2002: 
                   2003: LOCALE SUPPORT
                   2004: 
1.1.1.3 ! misho    2005:        PCRE handles caseless matching, and determines whether  characters  are
        !          2006:        letters,  digits, or whatever, by reference to a set of tables, indexed
        !          2007:        by character value. When running in UTF-8 mode, this  applies  only  to
        !          2008:        characters  with  codes  less than 128. By default, higher-valued codes
1.1       misho    2009:        never match escapes such as \w or \d, but they can be tested with \p if
1.1.1.3 ! misho    2010:        PCRE  is  built with Unicode character property support. Alternatively,
        !          2011:        the PCRE_UCP option can be set at compile  time;  this  causes  \w  and
1.1       misho    2012:        friends to use Unicode property support instead of built-in tables. The
                   2013:        use of locales with Unicode is discouraged. If you are handling charac-
1.1.1.3 ! misho    2014:        ters  with codes greater than 127, you should either use UTF-8 and Uni-
1.1       misho    2015:        code, or use locales, but not try to mix the two.
                   2016: 
1.1.1.3 ! misho    2017:        PCRE contains an internal set of tables that are used  when  the  final
        !          2018:        argument  of  pcre_compile()  is  NULL.  These  are sufficient for many
1.1       misho    2019:        applications.  Normally, the internal tables recognize only ASCII char-
                   2020:        acters. However, when PCRE is built, it is possible to cause the inter-
                   2021:        nal tables to be rebuilt in the default "C" locale of the local system,
                   2022:        which may cause them to be different.
                   2023: 
1.1.1.3 ! misho    2024:        The  internal tables can always be overridden by tables supplied by the
1.1       misho    2025:        application that calls PCRE. These may be created in a different locale
1.1.1.3 ! misho    2026:        from  the  default.  As more and more applications change to using Uni-
1.1       misho    2027:        code, the need for this locale support is expected to die away.
                   2028: 
1.1.1.3 ! misho    2029:        External tables are built by calling  the  pcre_maketables()  function,
        !          2030:        which  has no arguments, in the relevant locale. The result can then be
        !          2031:        passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For
        !          2032:        example,  to  build  and use tables that are appropriate for the French
        !          2033:        locale (where accented characters with  values  greater  than  127  are
1.1       misho    2034:        treated as letters), the following code could be used:
                   2035: 
                   2036:          setlocale(LC_CTYPE, "fr_FR");
                   2037:          tables = pcre_maketables();
                   2038:          re = pcre_compile(..., tables);
                   2039: 
1.1.1.3 ! misho    2040:        The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
1.1       misho    2041:        if you are using Windows, the name for the French locale is "french".
                   2042: 
1.1.1.3 ! misho    2043:        When pcre_maketables() runs, the tables are built  in  memory  that  is
        !          2044:        obtained  via  pcre_malloc. It is the caller's responsibility to ensure
        !          2045:        that the memory containing the tables remains available for as long  as
1.1       misho    2046:        it is needed.
                   2047: 
                   2048:        The pointer that is passed to pcre_compile() is saved with the compiled
1.1.1.3 ! misho    2049:        pattern, and the same tables are used via this pointer by  pcre_study()
1.1       misho    2050:        and normally also by pcre_exec(). Thus, by default, for any single pat-
                   2051:        tern, compilation, studying and matching all happen in the same locale,
                   2052:        but different patterns can be compiled in different locales.
                   2053: 
1.1.1.3 ! misho    2054:        It  is  possible to pass a table pointer or NULL (indicating the use of
        !          2055:        the internal tables) to pcre_exec(). Although  not  intended  for  this
        !          2056:        purpose,  this facility could be used to match a pattern in a different
1.1       misho    2057:        locale from the one in which it was compiled. Passing table pointers at
                   2058:        run time is discussed below in the section on matching a pattern.
                   2059: 
                   2060: 
                   2061: INFORMATION ABOUT A PATTERN
                   2062: 
                   2063:        int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
                   2064:             int what, void *where);
                   2065: 
1.1.1.3 ! misho    2066:        The  pcre_fullinfo() function returns information about a compiled pat-
        !          2067:        tern. It replaces the pcre_info() function, which was removed from  the
1.1.1.2   misho    2068:        library at version 8.30, after more than 10 years of obsolescence.
1.1       misho    2069: 
1.1.1.3 ! misho    2070:        The  first  argument  for  pcre_fullinfo() is a pointer to the compiled
        !          2071:        pattern. The second argument is the result of pcre_study(), or NULL  if
        !          2072:        the  pattern  was not studied. The third argument specifies which piece
        !          2073:        of information is required, and the fourth argument is a pointer  to  a
        !          2074:        variable  to  receive  the  data. The yield of the function is zero for
1.1       misho    2075:        success, or one of the following negative numbers:
                   2076: 
1.1.1.2   misho    2077:          PCRE_ERROR_NULL           the argument code was NULL
                   2078:                                    the argument where was NULL
                   2079:          PCRE_ERROR_BADMAGIC       the "magic number" was not found
                   2080:          PCRE_ERROR_BADENDIANNESS  the pattern was compiled with different
                   2081:                                    endianness
                   2082:          PCRE_ERROR_BADOPTION      the value of what was invalid
1.1       misho    2083: 
1.1.1.3 ! misho    2084:        The "magic number" is placed at the start of each compiled  pattern  as
        !          2085:        a  simple  check against passing an arbitrary memory pointer. The endi-
1.1.1.2   misho    2086:        anness error can occur if a compiled pattern is saved and reloaded on a
1.1.1.3 ! misho    2087:        different  host.  Here  is a typical call of pcre_fullinfo(), to obtain
1.1.1.2   misho    2088:        the length of the compiled pattern:
1.1       misho    2089: 
                   2090:          int rc;
                   2091:          size_t length;
                   2092:          rc = pcre_fullinfo(
                   2093:            re,               /* result of pcre_compile() */
                   2094:            sd,               /* result of pcre_study(), or NULL */
                   2095:            PCRE_INFO_SIZE,   /* what is required */
                   2096:            &length);         /* where to put the data */
                   2097: 
1.1.1.3 ! misho    2098:        The possible values for the third argument are defined in  pcre.h,  and
1.1       misho    2099:        are as follows:
                   2100: 
                   2101:          PCRE_INFO_BACKREFMAX
                   2102: 
1.1.1.3 ! misho    2103:        Return  the  number  of  the highest back reference in the pattern. The
        !          2104:        fourth argument should point to an int variable. Zero  is  returned  if
1.1       misho    2105:        there are no back references.
                   2106: 
                   2107:          PCRE_INFO_CAPTURECOUNT
                   2108: 
1.1.1.3 ! misho    2109:        Return  the  number of capturing subpatterns in the pattern. The fourth
1.1       misho    2110:        argument should point to an int variable.
                   2111: 
                   2112:          PCRE_INFO_DEFAULT_TABLES
                   2113: 
1.1.1.3 ! misho    2114:        Return a pointer to the internal default character tables within  PCRE.
        !          2115:        The  fourth  argument should point to an unsigned char * variable. This
1.1       misho    2116:        information call is provided for internal use by the pcre_study() func-
1.1.1.3 ! misho    2117:        tion.  External  callers  can  cause PCRE to use its internal tables by
1.1       misho    2118:        passing a NULL table pointer.
                   2119: 
                   2120:          PCRE_INFO_FIRSTBYTE
                   2121: 
1.1.1.2   misho    2122:        Return information about the first data unit of any matched string, for
1.1.1.3 ! misho    2123:        a  non-anchored  pattern.  (The name of this option refers to the 8-bit
        !          2124:        library, where data units are bytes.) The fourth argument should  point
1.1.1.2   misho    2125:        to an int variable.
                   2126: 
1.1.1.3 ! misho    2127:        If  there  is  a  fixed first value, for example, the letter "c" from a
        !          2128:        pattern such as (cat|cow|coyote), its value is returned. In  the  8-bit
        !          2129:        library,  the  value is always less than 256; in the 16-bit library the
1.1.1.2   misho    2130:        value can be up to 0xffff.
1.1       misho    2131: 
1.1.1.2   misho    2132:        If there is no fixed first value, and if either
1.1       misho    2133: 
1.1.1.3 ! misho    2134:        (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
1.1       misho    2135:        branch starts with "^", or
                   2136: 
                   2137:        (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
                   2138:        set (if it were set, the pattern would be anchored),
                   2139: 
1.1.1.3 ! misho    2140:        -1 is returned, indicating that the pattern matches only at  the  start
        !          2141:        of  a  subject string or after any newline within the string. Otherwise
1.1       misho    2142:        -2 is returned. For anchored patterns, -2 is returned.
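
The three kinds of value that PCRE_INFO_FIRSTBYTE can yield may be summarized in a small helper. This is a sketch only: the enum and the function name are hypothetical and are not part of the PCRE API. The argument is the int that pcre_fullinfo() stored through its fourth argument.

```c
#include <assert.h>

/* Hypothetical classification of a PCRE_INFO_FIRSTBYTE result. */
enum first_unit_kind {
    FIRST_UNIT_FIXED,       /* value >= 0: a fixed first data unit     */
    FIRST_UNIT_LINE_START,  /* -1: matches only at start of subject or */
                            /*     after a newline within it           */
    FIRST_UNIT_NONE         /* -2: no information (includes anchored)  */
};

enum first_unit_kind classify_first_unit(int value)
{
    assert(value >= -2);    /* the text above describes no other values */
    if (value >= 0)
        return FIRST_UNIT_FIXED;
    if (value == -1)
        return FIRST_UNIT_LINE_START;
    return FIRST_UNIT_NONE;
}
```

For instance, the value returned for a pattern such as (cat|cow|coyote) would be the code for "c", which this helper classifies as FIRST_UNIT_FIXED.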
                   2143: 
                   2144:          PCRE_INFO_FIRSTTABLE
                   2145: 
1.1.1.3 ! misho    2146:        If the pattern was studied, and this resulted in the construction of  a
        !          2147:        256-bit  table indicating a fixed set of values for the first data unit
        !          2148:        in any matching string, a pointer to the table is  returned.  Otherwise
        !          2149:        NULL  is returned. The fourth argument should point to an unsigned char
1.1.1.2   misho    2150:        * variable.
1.1       misho    2151: 
                   2152:          PCRE_INFO_HASCRORLF
                   2153: 
1.1.1.3 ! misho    2154:        Return 1 if the pattern contains any explicit  matches  for  CR  or  LF
        !          2155:        characters,  otherwise  0.  The  fourth argument should point to an int
        !          2156:        variable. An explicit match is either a literal CR or LF character,  or
1.1       misho    2157:        \r or \n.
                   2158: 
                   2159:          PCRE_INFO_JCHANGED
                   2160: 
1.1.1.3 ! misho    2161:        Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
        !          2162:        otherwise 0. The fourth argument should point to an int variable.  (?J)
1.1       misho    2163:        and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
                   2164: 
                   2165:          PCRE_INFO_JIT
                   2166: 
1.1.1.3 ! misho    2167:        Return  1  if  the pattern was studied with one of the JIT options, and
        !          2168:        just-in-time compiling was successful. The fourth argument should point
        !          2169:        to  an  int variable. A return value of 0 means that JIT support is not
        !          2170:        available in this version of PCRE, or that the pattern was not  studied
        !          2171:        with  a JIT option, or that the JIT compiler could not handle this par-
        !          2172:        ticular pattern. See the pcrejit documentation for details of what  can
        !          2173:        and cannot be handled.
1.1       misho    2174: 
                   2175:          PCRE_INFO_JITSIZE
                   2176: 
1.1.1.3 ! misho    2177:        If  the  pattern was successfully studied with a JIT option, return the
        !          2178:        size of the JIT compiled code, otherwise return zero. The fourth  argu-
        !          2179:        ment should point to a size_t variable.
1.1       misho    2180: 
                   2181:          PCRE_INFO_LASTLITERAL
                   2182: 
1.1.1.3 ! misho    2183:        Return  the value of the rightmost literal data unit that must exist in
        !          2184:        any matched string, other than at its start, if such a value  has  been
1.1       misho    2185:        recorded. The fourth argument should point to an int variable. If there
1.1.1.2   misho    2186:        is no such value, -1 is returned. For anchored patterns, a last literal
1.1.1.3 ! misho    2187:        value  is recorded only if it follows something of variable length. For
1.1       misho    2188:        example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
                   2189:        /^a\dz\d/ the returned value is -1.
                   2190: 
1.1.1.3 ! misho    2191:          PCRE_INFO_MAXLOOKBEHIND
        !          2192: 
        !          2193:        Return  the  number of characters (NB not bytes) in the longest lookbe-
        !          2194:        hind assertion in the pattern. Note that the simple assertions  \b  and
        !          2195:        \B  require a one-character lookbehind. This information is useful when
        !          2196:        doing multi-segment matching using the partial matching facilities.
        !          2197: 
1.1       misho    2198:          PCRE_INFO_MINLENGTH
                   2199: 
1.1.1.2   misho    2200:        If the pattern was studied and a minimum length  for  matching  subject
                   2201:        strings  was  computed,  its  value is returned. Otherwise the returned
                   2202:        value is -1. The value is a number of characters, which in  UTF-8  mode
                   2203:        may  be  different from the number of bytes. The fourth argument should
                   2204:        point to an int variable. A non-negative value is a lower bound to  the
                   2205:        length  of  any  matching  string. There may not be any strings of that
                   2206:        length that do actually match, but every string that does match  is  at
                   2207:        least that long.
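
A calling program could use the minimum length in the same way that the matching functions do, to reject short subjects before attempting a match. The helper below is a hypothetical sketch, not part of PCRE. It compares the subject length in bytes with the minimum length in characters; this is safe in the 8-bit library because a subject never contains more characters than bytes, so a subject with too few bytes certainly has too few characters.

```c
#include <assert.h>

/* Hypothetical helper: return 0 if the subject is certainly too short to
   match, 1 if a match attempt is still worthwhile. min_length is the value
   obtained via PCRE_INFO_MINLENGTH (-1 means no minimum was computed);
   subject_bytes is the length of the subject in bytes. */
int worth_matching(int min_length, int subject_bytes)
{
    assert(min_length >= -1 && subject_bytes >= 0);
    if (min_length < 0)
        return 1;                       /* no lower bound is known */
    return subject_bytes >= min_length; /* bytes >= characters     */
}
```

A result of 1 does not mean that the subject matches, only that the lower bound does not rule it out.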
1.1       misho    2208: 
                   2209:          PCRE_INFO_NAMECOUNT
                   2210:          PCRE_INFO_NAMEENTRYSIZE
                   2211:          PCRE_INFO_NAMETABLE
                   2212: 
                   2213:        PCRE  supports the use of named as well as numbered capturing parenthe-
                   2214:        ses. The names are just an additional way of identifying the  parenthe-
                   2215:        ses, which still acquire numbers. Several convenience functions such as
                   2216:        pcre_get_named_substring() are provided for  extracting  captured  sub-
                   2217:        strings  by  name. It is also possible to extract the data directly, by
                   2218:        first converting the name to a number in order to  access  the  correct
                   2219:        pointers in the output vector (described with pcre_exec() below). To do
                   2220:        the conversion, you need  to  use  the  name-to-number  map,  which  is
                   2221:        described by these three values.
                   2222: 
                   2223:        The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
                   2224:        gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
                   2225:        of  each  entry;  both  of  these  return  an int value. The entry size
                   2226:        depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
1.1.1.2   misho    2227:        a pointer to the first entry of the table. This is a pointer to char in
                   2228:        the 8-bit library, where the first two bytes of each entry are the num-
                   2229:        ber  of  the capturing parenthesis, most significant byte first. In the
                   2230:        16-bit library, the pointer points to 16-bit data units, the  first  of
                   2231:        which  contains  the  parenthesis  number. The rest of the entry is the
                   2232:        corresponding name, zero terminated.
1.1       misho    2233: 
                   2234:        The names are in alphabetical order. Duplicate names may appear if  (?|
                   2235:        is used to create multiple groups with the same number, as described in
                   2236:        the section on duplicate subpattern numbers in  the  pcrepattern  page.
                   2237:        Duplicate  names  for  subpatterns with different numbers are permitted
                   2238:        only if PCRE_DUPNAMES is set. In all cases  of  duplicate  names,  they
                   2239:        appear  in  the table in the order in which they were found in the pat-
                   2240:        tern. In the absence of (?| this is the  order  of  increasing  number;
                   2241:        when (?| is used this is not necessarily the case because later subpat-
                   2242:        terns may have lower numbers.
                   2243: 
                   2244:        As a simple example of the name/number table,  consider  the  following
1.1.1.2   misho    2245:        pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is
                   2246:        set, so white space - including newlines - is ignored):
1.1       misho    2247: 
                   2248:          (?<date> (?<year>(\d\d)?\d\d) -
                   2249:          (?<month>\d\d) - (?<day>\d\d) )
                   2250: 
                   2251:        There are four named subpatterns, so the table has  four  entries,  and
                   2252:        each  entry  in the table is eight bytes long. The table is as follows,
                   2253:        with non-printing bytes shown in hexadecimal, and undefined bytes shown
                   2254:        as ??:
                   2255: 
                   2256:          00 01 d  a  t  e  00 ??
                   2257:          00 05 d  a  y  00 ?? ??
                   2258:          00 04 m  o  n  t  h  00
                   2259:          00 02 y  e  a  r  00 ??
                   2260: 
                   2261:        When  writing  code  to  extract  data from named subpatterns using the
                   2262:        name-to-number map, remember that the length of the entries  is  likely
                   2263:        to be different for each compiled pattern.
                   2264: 
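The table layout described above can be walked with a few lines of C. The
following is a minimal sketch for the 8-bit library only. find_group_number()
and example_table are names invented here for illustration; they are not part
of PCRE. In a real program the table pointer, entry count, and entry size
would come from pcre_fullinfo(), and the library's own pcre_get_stringnumber()
function already performs this conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up a group name in an 8-bit name table. Each entry is entry_size
 * bytes long: two bytes of group number (most significant byte first)
 * followed by the zero-terminated name.
 * Returns the number held in the first entry whose name matches, or -1
 * if the name is not in the table.
 * Hypothetical helper, not a PCRE function.
 */
int find_group_number(const unsigned char *table, int name_count,
                      int entry_size, const char *name)
{
    int i;

    assert(table != NULL && name != NULL && entry_size > 2);

    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;

        /* The name starts after the two number bytes. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The four-entry table from the example; the undefined bytes are zeroed. */
const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

A linear scan keeps the sketch short. Because the names are in alphabetical
order, a binary search is also possible, but with duplicate names it would
need extra care to land on the first entry of a run.
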
                   2265:          PCRE_INFO_OKPARTIAL
                   2266: 
                   2267:        Return  1  if  the  pattern  can  be  used  for  partial  matching with
                   2268:        pcre_exec(), otherwise 0. The fourth argument should point  to  an  int
                   2269:        variable.  From  release  8.00,  this  always  returns  1,  because the
                   2270:        restrictions that previously applied  to  partial  matching  have  been
                   2271:        lifted.  The  pcrepartial documentation gives details of partial match-
                   2272:        ing.
                   2273: 
                   2274:          PCRE_INFO_OPTIONS
                   2275: 
                   2276:        Return a copy of the options with which the pattern was  compiled.  The
                   2277:        fourth  argument  should  point to an unsigned long int variable. These
                   2278:        option bits are those specified in the call to pcre_compile(), modified
                   2279:        by any top-level option settings at the start of the pattern itself. In
                   2280:        other words, they are the options that will be in force  when  matching
                   2281:        starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
                   2282:        the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
                   2283:        and PCRE_EXTENDED.
                   2284: 
                   2285:        A  pattern  is  automatically  anchored by PCRE if all of its top-level
                   2286:        alternatives begin with one of the following:
                   2287: 
                   2288:          ^     unless PCRE_MULTILINE is set
                   2289:          \A    always
                   2290:          \G    always
                   2291:          .*    if PCRE_DOTALL is set and there are no back
                   2292:                  references to the subpattern in which .* appears
                   2293: 
                   2294:        For such patterns, the PCRE_ANCHORED bit is set in the options returned
                   2295:        by pcre_fullinfo().
                   2296: 
                   2297:          PCRE_INFO_SIZE
                   2298: 
1.1.1.2   misho    2299:        Return  the size of the compiled pattern in bytes (for both libraries).
                   2300:        The fourth argument should point to a size_t variable. This value  does
                   2301:        not  include  the  size  of  the  pcre  structure  that  is returned by
                   2302:        pcre_compile(). The value that is passed as the argument  to  pcre_mal-
                   2303:        loc()  when pcre_compile() is getting memory in which to place the com-
                   2304:        piled data is the value returned by this option plus the  size  of  the
                   2305:        pcre  structure. Studying a compiled pattern, with or without JIT, does
                   2306:        not alter the value returned by this option.
1.1       misho    2307: 
                   2308:          PCRE_INFO_STUDYSIZE
                   2309: 
1.1.1.2   misho    2310:        Return the size in bytes of the data block pointed to by the study_data
                   2311:        field  in  a  pcre_extra  block.  If pcre_extra is NULL, or there is no
                   2312:        study data, zero is returned. The fourth argument  should  point  to  a
                   2313:        size_t  variable. The study_data field is set by pcre_study() to record
                   2314:        information that will speed  up  matching  (see  the  section  entitled
                   2315:        "Studying a pattern" above). The format of the study_data block is pri-
                   2316:        vate, but its length is made available via this option so that  it  can
                   2317:        be  saved  and  restored  (see  the  pcreprecompile  documentation  for
                   2318:        details).
1.1       misho    2319: 
                   2320: 
                   2321: REFERENCE COUNTS
                   2322: 
                   2323:        int pcre_refcount(pcre *code, int adjust);
                   2324: 
1.1.1.2   misho    2325:        The pcre_refcount() function is used to maintain a reference  count  in
1.1       misho    2326:        the data block that contains a compiled pattern. It is provided for the
1.1.1.2   misho    2327:        benefit of applications that  operate  in  an  object-oriented  manner,
1.1       misho    2328:        where different parts of the application may be using the same compiled
                   2329:        pattern, but you want to free the block when they are all done.
                   2330: 
                   2331:        When a pattern is compiled, the reference count field is initialized to
1.1.1.2   misho    2332:        zero.   It is changed only by calling this function, whose action is to
                   2333:        add the adjust value (which may be positive or  negative)  to  it.  The
1.1       misho    2334:        yield of the function is the new value. However, the value of the count
1.1.1.2   misho    2335:        is constrained to lie between 0 and 65535, inclusive. If the new  value
1.1       misho    2336:        is outside these limits, it is forced to the appropriate limit value.
                   2337: 
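The clamping rule can be modelled in isolation. The function below is a
hypothetical stand-in that reproduces only the arithmetic described above. It
is not the library's code, and model_refcount_adjust() is a name invented for
this sketch.

```c
#include <assert.h>

/*
 * Model of the documented rule: add adjust to the current count and
 * force the result into the range 0..65535. The comparisons are
 * arranged so that count + adjust is never computed when it could
 * overflow an int.
 * Hypothetical helper, not a PCRE function.
 */
int model_refcount_adjust(int count, int adjust)
{
    assert(count >= 0 && count <= 65535);

    if (adjust > 65535 - count)
        return 65535;           /* would exceed the upper limit */
    if (adjust < -count)
        return 0;               /* would fall below zero */
    return count + adjust;
}
```

One consequence of the rule is that a count that has reached either limit no
longer records how many adjustments were made, so an application that relies
on the count should keep it well inside the range.
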
1.1.1.2   misho    2338:        Except  when it is zero, the reference count is not correctly preserved
                   2339:        if a pattern is compiled on one host and then  transferred  to  a  host
1.1       misho    2340:        whose byte-order is different. (This seems a highly unlikely scenario.)
                   2341: 
                   2342: 
                   2343: MATCHING A PATTERN: THE TRADITIONAL FUNCTION
                   2344: 
                   2345:        int pcre_exec(const pcre *code, const pcre_extra *extra,
                   2346:             const char *subject, int length, int startoffset,
                   2347:             int options, int *ovector, int ovecsize);
                   2348: 
1.1.1.2   misho    2349:        The  function pcre_exec() is called to match a subject string against a
                   2350:        compiled pattern, which is passed in the code argument. If the  pattern
                   2351:        was  studied,  the  result  of  the study should be passed in the extra
                   2352:        argument. You can call pcre_exec() with the same code and  extra  argu-
                   2353:        ments  as  many  times as you like, in order to match different subject
1.1       misho    2354:        strings with the same pattern.
                   2355: 
1.1.1.2   misho    2356:        This function is the main matching facility  of  the  library,  and  it
                   2357:        operates  in  a  Perl-like  manner. For specialist use there is also an
                   2358:        alternative matching function, which is described below in the  section
1.1       misho    2359:        about the pcre_dfa_exec() function.
                   2360: 
1.1.1.2   misho    2361:        In  most applications, the pattern will have been compiled (and option-
                   2362:        ally studied) in the same process that calls pcre_exec().  However,  it
1.1       misho    2363:        is possible to save compiled patterns and study data, and then use them
1.1.1.2   misho    2364:        later in different processes, possibly even on different hosts.  For  a
1.1       misho    2365:        discussion about this, see the pcreprecompile documentation.
                   2366: 
                   2367:        Here is an example of a simple call to pcre_exec():
                   2368: 
                   2369:          int rc;
                   2370:          int ovector[30];
                   2371:          rc = pcre_exec(
                   2372:            re,             /* result of pcre_compile() */
                   2373:            NULL,           /* we didn't study the pattern */
                   2374:            "some string",  /* the subject string */
                   2375:            11,             /* the length of the subject string */
                   2376:            0,              /* start at offset 0 in the subject */
                   2377:            0,              /* default options */
                   2378:            ovector,        /* vector of integers for substring information */
                   2379:            30);            /* number of elements (NOT size in bytes) */
                   2380: 
                   2381:    Extra data for pcre_exec()
                   2382: 
1.1.1.2   misho    2383:        If  the  extra argument is not NULL, it must point to a pcre_extra data
                   2384:        block. The pcre_study() function returns such a block (when it  doesn't
                   2385:        return  NULL), but you can also create one for yourself, and pass addi-
                   2386:        tional information in it. The pcre_extra block contains  the  following
1.1       misho    2387:        fields (not necessarily in this order):
                   2388: 
                   2389:          unsigned long int flags;
                   2390:          void *study_data;
                   2391:          void *executable_jit;
                   2392:          unsigned long int match_limit;
                   2393:          unsigned long int match_limit_recursion;
                   2394:          void *callout_data;
                   2395:          const unsigned char *tables;
                   2396:          unsigned char **mark;
                   2397: 
1.1.1.2   misho    2398:        In  the  16-bit  version  of  this  structure,  the mark field has type
                   2399:        "PCRE_UCHAR16 **".
                   2400: 
1.1.1.3 ! misho    2401:        The flags field is used to specify which of the other fields  are  set.
        !          2402:        The flag bits are:
1.1       misho    2403: 
1.1.1.3 ! misho    2404:          PCRE_EXTRA_CALLOUT_DATA
1.1       misho    2405:          PCRE_EXTRA_EXECUTABLE_JIT
1.1.1.3 ! misho    2406:          PCRE_EXTRA_MARK
1.1       misho    2407:          PCRE_EXTRA_MATCH_LIMIT
                   2408:          PCRE_EXTRA_MATCH_LIMIT_RECURSION
1.1.1.3 ! misho    2409:          PCRE_EXTRA_STUDY_DATA
1.1       misho    2410:          PCRE_EXTRA_TABLES
                   2411: 
                   2412:        Other  flag  bits should be set to zero. The study_data field and some-
                   2413:        times the executable_jit field are set in the pcre_extra block that  is
                   2414:        returned  by pcre_study(), together with the appropriate flag bits. You
                   2415:        should not set these yourself, but you may add to the block by  setting
1.1.1.3 ! misho    2416:        other fields and their corresponding flag bits.
1.1       misho    2417: 
                   2418:        The match_limit field provides a means of preventing PCRE from using up
                   2419:        a vast amount of resources when running patterns that are not going  to
                   2420:        match,  but  which  have  a very large number of possibilities in their
                   2421:        search trees. The classic example is a pattern that uses nested  unlim-
                   2422:        ited repeats.
                   2423: 
                   2424:        Internally,  pcre_exec() uses a function called match(), which it calls
                   2425:        repeatedly (sometimes recursively). The limit  set  by  match_limit  is
                   2426:        imposed  on the number of times this function is called during a match,
                   2427:        which has the effect of limiting the amount of  backtracking  that  can
                   2428:        take place. For patterns that are not anchored, the count restarts from
                   2429:        zero for each position in the subject string.
                   2430: 
                   2431:        When pcre_exec() is called with a pattern that was successfully studied
1.1.1.3 ! misho    2432:        with  a  JIT  option, the way that the matching is executed is entirely
        !          2433:        different.  However, there is still the possibility of runaway matching
        !          2434:        that goes on for a very long time, and so the match_limit value is also
        !          2435:        used in this case (but in a different way) to limit how long the match-
        !          2436:        ing can continue.
1.1       misho    2437: 
                   2438:        The  default  value  for  the  limit can be set when PCRE is built; the
                   2439:        default default is 10 million, which handles all but the  most  extreme
                   2440:        cases.  You  can  override  the default by supplying pcre_exec() with a
                   2441:        pcre_extra    block    in    which    match_limit    is    set,     and
                   2442:        PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is
                   2443:        exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
                   2444: 
                   2445:        The match_limit_recursion field is similar to match_limit, but  instead
                   2446:        of limiting the total number of times that match() is called, it limits
                   2447:        the depth of recursion. The recursion depth is a  smaller  number  than
                   2448:        the  total number of calls, because not all calls to match() are recur-
                   2449:        sive.  This limit is of use only if it is set smaller than match_limit.
                   2450: 
                   2451:        Limiting the recursion depth limits the amount of  machine  stack  that
                   2452:        can  be used, or, when PCRE has been compiled to use memory on the heap
                   2453:        instead of the stack, the amount of heap memory that can be used.  This
1.1.1.3 ! misho    2454:        limit  is not relevant, and is ignored, when matching is done using JIT
        !          2455:        compiled code.
1.1       misho    2456: 
                   2457:        The default value for match_limit_recursion can be  set  when  PCRE  is
                   2458:        built;  the  default  default  is  the  same  value  as the default for
                   2459:        match_limit. You can override the default by supplying pcre_exec() with
                   2460:        a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
                   2461:        PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
                   2462:        limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
                   2463: 
                   2464:        The  callout_data  field is used in conjunction with the "callout" fea-
                   2465:        ture, and is described in the pcrecallout documentation.
                   2466: 
                   2467:        The tables field  is  used  to  pass  a  character  tables  pointer  to
                   2468:        pcre_exec();  this overrides the value that is stored with the compiled
                   2469:        pattern. A non-NULL value is stored with the compiled pattern  only  if
                   2470:        custom  tables  were  supplied to pcre_compile() via its tableptr argu-
                   2471:        ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
                   2472:        PCRE's  internal  tables  to be used. This facility is helpful when re-
                   2473:        using patterns that have been saved after compiling  with  an  external
                   2474:        set  of  tables,  because  the  external tables might be at a different
                   2475:        address when pcre_exec() is called. See the  pcreprecompile  documenta-
                   2476:        tion for a discussion of saving compiled patterns for later use.
                   2477: 
                   2478:        If  PCRE_EXTRA_MARK  is  set in the flags field, the mark field must be
1.1.1.2   misho    2479:        set to point to a suitable variable. If the pattern contains any  back-
1.1       misho    2480:        tracking  control verbs such as (*MARK:NAME), and the execution ends up
                   2481:        with a name to pass back, a pointer to the  name  string  (zero  termi-
                   2482:        nated)  is  placed  in  the  variable pointed to by the mark field. The
                   2483:        names are within the compiled pattern; if you wish  to  retain  such  a
                   2484:        name  you must copy it before freeing the memory of a compiled pattern.
                   2485:        If there is no name to pass back, the variable pointed to by  the  mark
1.1.1.2   misho    2486:        field  is  set  to NULL. For details of the backtracking control verbs,
                   2487:        see the section entitled "Backtracking control" in the pcrepattern doc-
                   2488:        umentation.
1.1       misho    2489: 
                   2490:    Option bits for pcre_exec()
                   2491: 
                   2492:        The  unused  bits of the options argument for pcre_exec() must be zero.
                   2493:        The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
                   2494:        PCRE_NOTBOL,    PCRE_NOTEOL,    PCRE_NOTEMPTY,   PCRE_NOTEMPTY_ATSTART,
1.1.1.3 ! misho    2495:        PCRE_NO_START_OPTIMIZE,  PCRE_NO_UTF8_CHECK,   PCRE_PARTIAL_HARD,   and
        !          2496:        PCRE_PARTIAL_SOFT.
1.1       misho    2497: 
1.1.1.3 ! misho    2498:        If  the  pattern  was successfully studied with one of the just-in-time
        !          2499:        (JIT) compile options, the only supported options for JIT execution are
        !          2500:        PCRE_NO_UTF8_CHECK,     PCRE_NOTBOL,     PCRE_NOTEOL,    PCRE_NOTEMPTY,
        !          2501:        PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If  an
        !          2502:        unsupported  option  is  used, JIT execution is disabled and the normal
        !          2503:        interpretive code in pcre_exec() is run.
1.1       misho    2504: 
                   2505:          PCRE_ANCHORED
                   2506: 
                   2507:        The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
                   2508:        matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
                   2509:        turned out to be anchored by virtue of its contents, it cannot be  made
                   2510:        unanchored at matching time.
                   2511: 
                   2512:          PCRE_BSR_ANYCRLF
                   2513:          PCRE_BSR_UNICODE
                   2514: 
                   2515:        These options (which are mutually exclusive) control what the \R escape
                   2516:        sequence matches. The choice is either to match only CR, LF,  or  CRLF,
                   2517:        or  to  match  any Unicode newline sequence. These options override the
                   2518:        choice that was made or defaulted when the pattern was compiled.
                   2519: 
                   2520:          PCRE_NEWLINE_CR
                   2521:          PCRE_NEWLINE_LF
                   2522:          PCRE_NEWLINE_CRLF
                   2523:          PCRE_NEWLINE_ANYCRLF
                   2524:          PCRE_NEWLINE_ANY
                   2525: 
                   2526:        These options override  the  newline  definition  that  was  chosen  or
                   2527:        defaulted  when the pattern was compiled. For details, see the descrip-
                   2528:        tion of pcre_compile()  above.  During  matching,  the  newline  choice
                   2529:        affects  the  behaviour  of the dot, circumflex, and dollar metacharac-
                   2530:        ters. It may also alter the way the match position is advanced after  a
                   2531:        match failure for an unanchored pattern.
                   2532: 
                   2533:        When  PCRE_NEWLINE_CRLF,  PCRE_NEWLINE_ANYCRLF,  or PCRE_NEWLINE_ANY is
                   2534:        set, and a match attempt for an unanchored pattern fails when the  cur-
                   2535:        rent  position  is  at  a  CRLF  sequence,  and the pattern contains no
                   2536:        explicit matches for  CR  or  LF  characters,  the  match  position  is
                   2537:        advanced by two characters instead of one, in other words, to after the
                   2538:        CRLF.
                   2539: 
                   2540:        The above rule is a compromise that makes the most common cases work as
                   2541:        expected.  For  example,  if  the  pattern  is .+A (and the PCRE_DOTALL
                   2542:        option is not set), it does not match the string "\r\nA" because, after
                   2543:        failing  at the start, it skips both the CR and the LF before retrying.
                   2544:        However, the pattern [\r\n]A does match that string,  because  it  con-
                   2545:        tains an explicit CR or LF reference, and so advances only by one char-
                   2546:        acter after the first failure.
                   2547: 
                   2548:        An explicit match for CR or LF is either a literal appearance of one of
                   2549:        those  characters,  or  one  of the \r or \n escape sequences. Implicit
                   2550:        matches such as [^X] do not count, nor does \s (which includes  CR  and
                   2551:        LF in the characters that it matches).
                   2552: 
                   2553:        Notwithstanding  the above, anomalous effects may still occur when CRLF
                   2554:        is a valid newline sequence and explicit \r or \n escapes appear in the
                   2555:        pattern.
                   2556: 
                   2557:          PCRE_NOTBOL
                   2558: 
                   2559:        This option specifies that the first character of the subject string is not
                   2560:        the beginning of a line, so the  circumflex  metacharacter  should  not
                   2561:        match  before it. Setting this without PCRE_MULTILINE (at compile time)
                   2562:        causes circumflex never to match. This option affects only  the  behav-
                   2563:        iour of the circumflex metacharacter. It does not affect \A.
                   2564: 
                   2565:          PCRE_NOTEOL
                   2566: 
                   2567:        This option specifies that the end of the subject string is not the end
                   2568:        of a line, so the dollar metacharacter should not match it nor  (except
                   2569:        in  multiline mode) a newline immediately before it. Setting this with-
                   2570:        out PCRE_MULTILINE (at compile time) causes dollar never to match. This
                   2571:        option  affects only the behaviour of the dollar metacharacter. It does
                   2572:        not affect \Z or \z.
                   2573: 
                   2574:          PCRE_NOTEMPTY
                   2575: 
                   2576:        An empty string is not considered to be a valid match if this option is
                   2577:        set.  If  there are alternatives in the pattern, they are tried. If all
                   2578:        the alternatives match the empty string, the entire  match  fails.  For
                   2579:        example, if the pattern
                   2580: 
                   2581:          a?b?
                   2582: 
                   2583:        is  applied  to  a  string not beginning with "a" or "b", it matches an
                   2584:        empty string at the start of the subject. With PCRE_NOTEMPTY set,  this
                   2585:        match is not valid, so PCRE searches further into the string for occur-
                   2586:        rences of "a" or "b".
                   2587: 
                   2588:          PCRE_NOTEMPTY_ATSTART
                   2589: 
                   2590:        This is like PCRE_NOTEMPTY, except that an empty string match  that  is
                   2591:        not  at  the  start  of  the  subject  is  permitted. If the pattern is
                   2592:        anchored, such a match can occur only if the pattern contains \K.
                   2593: 
                   2594:        Perl    has    no    direct    equivalent    of    PCRE_NOTEMPTY     or
                   2595:        PCRE_NOTEMPTY_ATSTART,  but  it  does  make a special case of a pattern
                   2596:        match of the empty string within its split() function, and  when  using
                   2597:        the  /g  modifier.  It  is  possible  to emulate Perl's behaviour after
                   2598:        matching a null string by first trying the match again at the same off-
                   2599:        set  with  PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED,  and then if that
                   2600:        fails, by advancing the starting offset (see below) and trying an ordi-
                   2601:        nary  match  again. There is some code that demonstrates how to do this
                   2602:        in the pcredemo sample program. In the most general case, you  have  to
                   2603:        check  to  see  if the newline convention recognizes CRLF as a newline,
                   2604:        and if so, and the current character is CR followed by LF, advance  the
                   2605:        starting offset by two characters instead of one.
                   2606: 
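The offset adjustment in the last step of that emulation can be sketched as a
small helper. This is an illustration under stated assumptions, not the code
from pcredemo: next_start_offset() and its crlf_is_newline parameter are
invented here, the caller is assumed to have already worked out whether the
newline convention in force recognizes CRLF, and the subject is assumed not to
be in UTF-8 mode, so one character is one byte.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Compute the start offset for the ordinary retry that follows a failed
 * PCRE_NOTEMPTY_ATSTART | PCRE_ANCHORED attempt at `offset`.
 *
 * crlf_is_newline: non-zero when the newline convention recognizes
 * CRLF as a newline (determined by the caller).
 *
 * The caller must stop looping once offset reaches length, so this
 * function requires offset < length.
 * Hypothetical helper, not a PCRE function.
 */
int next_start_offset(const char *subject, int length, int offset,
                      int crlf_is_newline)
{
    assert(subject != NULL && offset >= 0 && offset < length);

    /* Step over a whole CRLF pair rather than splitting it. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    return offset + 1;
}
```

In UTF-8 mode the single-character advance would also have to skip any
continuation bytes that follow the first byte of the character.
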
                   2607:          PCRE_NO_START_OPTIMIZE
                   2608: 
                   2609:        There  are a number of optimizations that pcre_exec() uses at the start
                   2610:        of a match, in order to speed up the process. For  example,  if  it  is
                   2611:        known that an unanchored match must start with a specific character, it
                   2612:        searches the subject for that character, and fails  immediately  if  it
                   2613:        cannot  find  it,  without actually running the main matching function.
                   2614:        This means that a special item such as (*COMMIT) at the start of a pat-
                   2615:        tern  is  not  considered until after a suitable starting point for the
                   2616:        match has been found. When callouts or (*MARK) items are in use,  these
                   2617:        "start-up" optimizations can cause them to be skipped if the pattern is
                   2618:        never actually used. The start-up optimizations are in  effect  a  pre-
                   2619:        scan of the subject that takes place before the pattern is run.
                   2620: 
                   2621:        The  PCRE_NO_START_OPTIMIZE option disables the start-up optimizations,
                   2622:        possibly causing performance to suffer,  but  ensuring  that  in  cases
                   2623:        where  the  result is "no match", the callouts do occur, and that items
                   2624:        such as (*COMMIT) and (*MARK) are considered at every possible starting
                   2625:        position  in  the  subject  string. If PCRE_NO_START_OPTIMIZE is set at
1.1.1.3 ! misho    2626:        compile time,  it  cannot  be  unset  at  matching  time.  The  use  of
        !          2627:        PCRE_NO_START_OPTIMIZE disables JIT execution; when it is set, matching
        !          2628:        is always done using interpretive code.
1.1       misho    2629: 
                   2630:        Setting PCRE_NO_START_OPTIMIZE can change the  outcome  of  a  matching
                   2631:        operation.  Consider the pattern
                   2632: 
                   2633:          (*COMMIT)ABC
                   2634: 
                   2635:        When  this  is  compiled, PCRE records the fact that a match must start
                   2636:        with the character "A". Suppose the subject  string  is  "DEFABC".  The
                   2637:        start-up  optimization  scans along the subject, finds "A" and runs the
                   2638:        first match attempt from there. The (*COMMIT) item means that the  pat-
                   2639:        tern  must  match the current starting position, which in this case, it
                   2640:        does. However, if the same match  is  run  with  PCRE_NO_START_OPTIMIZE
                   2641:        set,  the  initial  scan  along the subject string does not happen. The
                   2642:        first match attempt is run starting  from  "D"  and  when  this  fails,
                   2643:        (*COMMIT)  prevents  any  further  matches  being tried, so the overall
                   2644:        result is "no match". If the pattern is studied,  more  start-up  opti-
                   2645:        mizations  may  be  used. For example, a minimum length for the subject
                   2646:        may be recorded. Consider the pattern
                   2647: 
                   2648:          (*MARK:A)(X|Y)
                   2649: 
                   2650:        The minimum length for a match is one  character.  If  the  subject  is
                   2651:        "ABC",  there  will  be  attempts  to  match "ABC", "BC", "C", and then
                   2652:        finally an empty string.  If the pattern is studied, the final  attempt
                   2653:        does  not take place, because PCRE knows that the subject is too short,
                   2654:        and so the (*MARK) is never encountered.  In this  case,  studying  the
                   2655:        pattern  does  not  affect the overall match result, which is still "no
                   2656:        match", but it does affect the auxiliary information that is returned.
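The (*COMMIT)ABC example above can be modelled with a small toy function. This
is only a sketch of the described behaviour, not PCRE code: the function name
toy_commit_abc and its start_optimize flag are hypothetical, and the "pre-scan"
is reduced to a single memchr() call.

```c
#include <stddef.h>
#include <string.h>

/* Toy model of the pattern (*COMMIT)ABC; this is NOT how PCRE is implemented.
 * Returns the offset of the match, or -1 for "no match".
 * start_optimize != 0 models the default start-up pre-scan of the subject;
 * start_optimize == 0 models PCRE_NO_START_OPTIMIZE. */
int toy_commit_abc(const char *subject, int start_optimize)
{
    size_t len = strlen(subject);
    size_t start = 0;

    if (start_optimize) {
        /* Pre-scan: a match must begin with 'A', so skip to the first one. */
        const char *a = memchr(subject, 'A', len);
        if (a == NULL)
            return -1;              /* fail without running the matcher */
        start = (size_t)(a - subject);
    }

    /* Single match attempt: once it fails, (*COMMIT) forbids trying any
     * later starting position, so there is no loop here. */
    if (len - start >= 3 && memcmp(subject + start, "ABC", 3) == 0)
        return (int)start;
    return -1;
}
```

With the subject "DEFABC" the model matches at offset 3 when the pre-scan is
enabled and reports "no match" when it is disabled, as described above.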
                   2657: 
                   2658:          PCRE_NO_UTF8_CHECK
                   2659: 
                   2660:        When PCRE_UTF8 is set at compile time, the validity of the subject as a
                   2661:        UTF-8  string is automatically checked when pcre_exec() is subsequently
1.1.1.3 ! misho    2662:        called.  The entire string is checked before any other processing takes
        !          2663:        place.  The  value  of  startoffset  is  also checked to ensure that it
        !          2664:        points to the start of a UTF-8 character. There is a  discussion  about
        !          2665:        the  validity  of  UTF-8 strings in the pcreunicode page. If an invalid
        !          2666:        sequence  of  bytes   is   found,   pcre_exec()   returns   the   error
1.1.1.2   misho    2667:        PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
                   2668:        truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
1.1.1.3 ! misho    2669:        both  cases, information about the precise nature of the error may also
        !          2670:        be returned (see the descriptions of these errors in the section  enti-
        !          2671:        tled  Error return values from pcre_exec() below).  If startoffset con-
1.1.1.2   misho    2672:        tains a value that does not point to the start of a UTF-8 character (or
                   2673:        to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
                   2674: 
1.1.1.3 ! misho    2675:        If  you  already  know that your subject is valid, and you want to skip
        !          2676:        these   checks   for   performance   reasons,   you   can    set    the
        !          2677:        PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
        !          2678:        do this for the second and subsequent calls to pcre_exec() if  you  are
        !          2679:        making  repeated  calls  to  find  all  the matches in a single subject
        !          2680:        string. However, you should be  sure  that  the  value  of  startoffset
        !          2681:        points  to  the  start of a character (or the end of the subject). When
1.1.1.2   misho    2682:        PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
1.1.1.3 ! misho    2683:        subject  or  an invalid value of startoffset is undefined. Your program
1.1.1.2   misho    2684:        may crash.
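Before setting PCRE_NO_UTF8_CHECK, a caller may want to confirm that startoffset
is acceptable. The following is a minimal sketch; utf8_offset_is_char_start is a
hypothetical helper, not a PCRE function, and it assumes the subject has already
been validated as UTF-8 by an earlier checked call.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE.
 * Returns 1 if offset is the end of the subject, or points at a byte that is
 * not a UTF-8 continuation byte (binary 10xxxxxx); returns 0 otherwise.
 * It does not validate the string itself. */
int utf8_offset_is_char_start(const unsigned char *subject, size_t length,
                              size_t offset)
{
    if (offset > length)
        return 0;                   /* beyond the subject */
    if (offset == length)
        return 1;                   /* the end of the subject is allowed */
    return (subject[offset] & 0xC0) != 0x80;
}
```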
1.1       misho    2685: 
                   2686:          PCRE_PARTIAL_HARD
                   2687:          PCRE_PARTIAL_SOFT
                   2688: 
1.1.1.3 ! misho    2689:        These options turn on the partial matching feature. For backwards  com-
        !          2690:        patibility,  PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
        !          2691:        match occurs if the end of the subject string is reached  successfully,
        !          2692:        but  there  are not enough subject characters to complete the match. If
1.1       misho    2693:        this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
1.1.1.3 ! misho    2694:        matching  continues  by  testing any remaining alternatives. Only if no
        !          2695:        complete match can be found is PCRE_ERROR_PARTIAL returned  instead  of
        !          2696:        PCRE_ERROR_NOMATCH.  In  other  words,  PCRE_PARTIAL_SOFT says that the
        !          2697:        caller is prepared to handle a partial match, but only if  no  complete
1.1       misho    2698:        match can be found.
                   2699: 
1.1.1.3 ! misho    2700:        If  PCRE_PARTIAL_HARD  is  set, it overrides PCRE_PARTIAL_SOFT. In this
        !          2701:        case, if a partial match  is  found,  pcre_exec()  immediately  returns
        !          2702:        PCRE_ERROR_PARTIAL,  without  considering  any  other  alternatives. In
        !          2703:        other words, when PCRE_PARTIAL_HARD is set, a partial match is  consid-
1.1       misho    2704:        ered to be more important than an alternative complete match.
                   2705: 
1.1.1.3 ! misho    2706:        In  both  cases,  the portion of the string that was inspected when the
1.1       misho    2707:        partial match was found is set as the first matching string. There is a
1.1.1.3 ! misho    2708:        more  detailed  discussion  of partial and multi-segment matching, with
1.1       misho    2709:        examples, in the pcrepartial documentation.
                   2710: 
                   2711:    The string to be matched by pcre_exec()
                   2712: 
1.1.1.3 ! misho    2713:        The subject string is passed to pcre_exec() as a pointer in subject,  a
        !          2714:        length  in  bytes in length, and a starting byte offset in startoffset.
        !          2715:        If this is  negative  or  greater  than  the  length  of  the  subject,
        !          2716:        pcre_exec()  returns  PCRE_ERROR_BADOFFSET. When the starting offset is
        !          2717:        zero, the search for a match starts at the beginning  of  the  subject,
1.1       misho    2718:        and this is by far the most common case. In UTF-8 mode, the byte offset
1.1.1.3 ! misho    2719:        must point to the start of a UTF-8 character (or the end  of  the  sub-
        !          2720:        ject).  Unlike  the pattern string, the subject may contain binary zero
1.1       misho    2721:        bytes.
                   2722: 
1.1.1.3 ! misho    2723:        A non-zero starting offset is useful when searching for  another  match
        !          2724:        in  the same subject by calling pcre_exec() again after a previous suc-
        !          2725:        cess.  Setting startoffset differs from just passing over  a  shortened
        !          2726:        string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins
1.1       misho    2727:        with any kind of lookbehind. For example, consider the pattern
                   2728: 
                   2729:          \Biss\B
                   2730: 
1.1.1.3 ! misho    2731:        which finds occurrences of "iss" in the middle of  words.  (\B  matches
        !          2732:        only  if  the  current position in the subject is not a word boundary.)
        !          2733:        When applied to the string "Mississipi" the first call  to  pcre_exec()
        !          2734:        finds  the  first  occurrence. If pcre_exec() is called again with just
        !          2735:        the remainder of the subject,  namely  "issipi",  it  does  not  match,
1.1       misho    2736:        because \B is always false at the start of the subject, which is deemed
1.1.1.3 ! misho    2737:        to be a word boundary. However, if pcre_exec()  is  passed  the  entire
1.1       misho    2738:        string again, but with startoffset set to 4, it finds the second occur-
1.1.1.3 ! misho    2739:        rence of "iss" because it is able to look behind the starting point  to
1.1       misho    2740:        discover that it is preceded by a letter.
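The difference between passing a shortened string and passing a non-zero
startoffset can be seen in a hand-written model of \Biss\B. This sketch does not
call PCRE; the function names are hypothetical, and "word character" is assumed
to mean an ASCII letter, digit, or underscore.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* 1 if there is a word character (letter, digit, underscore) at pos. */
int is_word_at(const char *s, size_t len, size_t pos)
{
    return pos < len && (isalnum((unsigned char)s[pos]) || s[pos] == '_');
}

/* Models \B: true when the characters on both sides of pos are both word
 * characters or both non-word. Before the start and after the end of the
 * subject count as non-word. */
int not_word_boundary(const char *s, size_t len, size_t pos)
{
    int before = pos > 0 && is_word_at(s, len, pos - 1);
    int after = is_word_at(s, len, pos);
    return before == after;
}

/* Toy search for \Biss\B beginning at start; returns the match offset
 * within s, or -1 if there is no match. */
long find_inner_iss(const char *s, size_t len, size_t start)
{
    size_t i;
    for (i = start; i + 3 <= len; i++) {
        if (memcmp(s + i, "iss", 3) == 0 &&
            not_word_boundary(s, len, i) &&
            not_word_boundary(s, len, i + 3))
            return (long)i;
    }
    return -1;
}
```

Searching the whole string from offset 4 can look behind the starting point,
whereas searching the remainder "issipi" from offset 0 cannot, so only the
former finds the second occurrence.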
                   2741: 
1.1.1.3 ! misho    2742:        Finding  all  the  matches  in a subject is tricky when the pattern can
1.1       misho    2743:        match an empty string. It is possible to emulate Perl's /g behaviour by
1.1.1.3 ! misho    2744:        first   trying   the   match   again  at  the  same  offset,  with  the
        !          2745:        PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED  options,  and  then  if  that
        !          2746:        fails,  advancing  the  starting  offset  and  trying an ordinary match
1.1       misho    2747:        again. There is some code that demonstrates how to do this in the pcre-
                   2748:        demo sample program. In the most general case, you have to check to see
1.1.1.3 ! misho    2749:        if the newline convention recognizes CRLF as a newline, and if so,  and
1.1       misho    2750:        the current character is CR followed by LF, advance the starting offset
                   2751:        by two characters instead of one.
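The "advance the starting offset" step might be sketched as follows. The helper
advance_start_offset is hypothetical and not part of PCRE. The caller is assumed
to know whether the newline convention in force recognizes CRLF and whether
UTF-8 mode is on. Skipping continuation bytes in UTF-8 mode is an addition
here, made so that the new offset again points to the start of a character.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Called after the anchored,
 * not-empty retry at `offset` has failed; returns the offset at which to
 * try an ordinary match next. The caller must ensure offset < length.
 * crlf_is_newline: non-zero if the newline convention recognizes CRLF.
 * utf8: non-zero if the subject is being matched in UTF-8 mode. */
size_t advance_start_offset(const unsigned char *subject, size_t length,
                            size_t offset, int crlf_is_newline, int utf8)
{
    /* Step over CR LF as a unit when CRLF counts as a newline. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    offset++;
    if (utf8) {
        /* Skip continuation bytes (10xxxxxx) of a multi-byte character. */
        while (offset < length && (subject[offset] & 0xC0) == 0x80)
            offset++;
    }
    return offset;
}
```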
                   2752: 
1.1.1.3 ! misho    2753:        If a non-zero starting offset is passed when the pattern  is  anchored,
1.1       misho    2754:        one attempt to match at the given offset is made. This can only succeed
1.1.1.3 ! misho    2755:        if the pattern does not require the match to be at  the  start  of  the
1.1       misho    2756:        subject.
                   2757: 
                   2758:    How pcre_exec() returns captured substrings
                   2759: 
1.1.1.3 ! misho    2760:        In  general, a pattern matches a certain portion of the subject, and in
        !          2761:        addition, further substrings from the subject  may  be  picked  out  by
        !          2762:        parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
        !          2763:        this is called "capturing" in what follows, and the  phrase  "capturing
        !          2764:        subpattern"  is  used for a fragment of a pattern that picks out a sub-
        !          2765:        string. PCRE supports several other kinds of  parenthesized  subpattern
1.1       misho    2766:        that do not cause substrings to be captured.
                   2767: 
                   2768:        Captured substrings are returned to the caller via a vector of integers
1.1.1.3 ! misho    2769:        whose address is passed in ovector. The number of elements in the  vec-
        !          2770:        tor  is  passed in ovecsize, which must be a non-negative number. Note:
1.1       misho    2771:        this argument is NOT the size of ovector in bytes.
                   2772: 
1.1.1.3 ! misho    2773:        The first two-thirds of the vector is used to pass back  captured  sub-
        !          2774:        strings,  each  substring using a pair of integers. The remaining third
        !          2775:        of the vector is used as workspace by pcre_exec() while  matching  cap-
        !          2776:        turing  subpatterns, and is not available for passing back information.
        !          2777:        The number passed in ovecsize should always be a multiple of three.  If
1.1       misho    2778:        it is not, it is rounded down.
                   2779: 
1.1.1.3 ! misho    2780:        When  a  match  is successful, information about captured substrings is
        !          2781:        returned in pairs of integers, starting at the  beginning  of  ovector,
        !          2782:        and  continuing  up  to two-thirds of its length at the most. The first
        !          2783:        element of each pair is set to the byte offset of the  first  character
        !          2784:        in  a  substring, and the second is set to the byte offset of the first
        !          2785:        character after the end of a substring. Note: these values  are  always
1.1       misho    2786:        byte offsets, even in UTF-8 mode. They are not character counts.
                   2787: 
1.1.1.3 ! misho    2788:        The  first  pair  of  integers, ovector[0] and ovector[1], identify the
        !          2789:        portion of the subject string matched by the entire pattern.  The  next
        !          2790:        pair  is  used for the first capturing subpattern, and so on. The value
1.1       misho    2791:        returned by pcre_exec() is one more than the highest numbered pair that
1.1.1.3 ! misho    2792:        has  been  set.  For example, if two substrings have been captured, the
        !          2793:        returned value is 3. If there are no capturing subpatterns, the  return
1.1       misho    2794:        value from a successful match is 1, indicating that just the first pair
                   2795:        of offsets has been set.
                   2796: 
                   2797:        If a capturing subpattern is matched repeatedly, it is the last portion
                   2798:        of the string that it matched that is returned.
                   2799: 
1.1.1.3 ! misho    2800:        If  the vector is too small to hold all the captured substring offsets,
1.1       misho    2801:        it is used as far as possible (up to two-thirds of its length), and the
1.1.1.3 ! misho    2802:        function  returns a value of zero. If neither the actual string matched
        !          2803:        nor any captured substrings are of interest, pcre_exec() may be  called
        !          2804:        with  ovector passed as NULL and ovecsize as zero. However, if the pat-
        !          2805:        tern contains back references and the ovector  is  not  big  enough  to
        !          2806:        remember  the related substrings, PCRE has to get additional memory for
        !          2807:        use during matching. Thus it is usually advisable to supply an  ovector
1.1       misho    2808:        of reasonable size.
                   2809: 
1.1.1.3 ! misho    2810:        There  are  some  cases where zero is returned (indicating vector over-
        !          2811:        flow) when in fact the vector is exactly the right size for  the  final
1.1       misho    2812:        match. For example, consider the pattern
                   2813: 
                   2814:          (a)(?:(b)c|bd)
                   2815: 
1.1.1.3 ! misho    2816:        If  a  vector of 6 elements (allowing for only 1 captured substring) is
1.1       misho    2817:        given with subject string "abd", pcre_exec() will try to set the second
                   2818:        captured string, thereby recording a vector overflow, before failing to
1.1.1.3 ! misho    2819:        match "c" and backing up  to  try  the  second  alternative.  The  zero
        !          2820:        return,  however,  does  correctly  indicate that the maximum number of
1.1       misho    2821:        slots (namely 2) have been filled. In similar cases where there is tem-
1.1.1.3 ! misho    2822:        porary  overflow,  but  the final number of used slots is actually less
1.1       misho    2823:        than the maximum, a non-zero value is returned.
                   2824: 
                   2825:        The pcre_fullinfo() function can be used to find out how many capturing
1.1.1.3 ! misho    2826:        subpatterns  there  are  in  a  compiled pattern. The smallest size for
        !          2827:        ovector that will allow for n captured substrings, in addition  to  the
1.1       misho    2828:        offsets of the substring matched by the whole pattern, is (n+1)*3.
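The sizing rule can be written down directly. The two function names below are
hypothetical; the arithmetic is the (n+1)*3 formula and the two-thirds rule
stated above.

```c
/* Smallest ovecsize (counted in ints, not bytes) that allows for n captured
 * substrings in addition to the offsets of the whole match. */
int ovector_size_for(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs that can be passed back for a given ovecsize.
 * Integer division also models the rounding down to a multiple of three. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For example, a vector of 6 elements gives 2 usable pairs: one for the whole
match and one for a single captured substring.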
                   2829: 
1.1.1.3 ! misho    2830:        It  is  possible for capturing subpattern number n+1 to match some part
1.1       misho    2831:        of the subject when subpattern n has not been used at all. For example,
1.1.1.3 ! misho    2832:        if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
1.1       misho    2833:        return from the function is 4, and subpatterns 1 and 3 are matched, but
1.1.1.3 ! misho    2834:        2  is  not.  When  this happens, both values in the offset pairs corre-
1.1       misho    2835:        sponding to unused subpatterns are set to -1.
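To illustrate the layout of the offset pairs, here is a small sketch that copies
one captured substring out of the subject. copy_capture is a hypothetical helper
written for this illustration. The convenience functions supplied by the library
should normally be preferred.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE.
 * rc is the positive value returned by a successful pcre_exec() call and
 * ovector is the vector it filled in. Copies capture number n (0 is the
 * whole match) into buf as a NUL-terminated string.
 * Returns the substring length, -1 if capture n was not set, or -2 if buf
 * is too small. */
int copy_capture(const char *subject, const int *ovector, int rc, int n,
                 char *buf, size_t bufsize)
{
    int start, end;
    size_t len;

    if (n < 0 || n >= rc)
        return -1;                  /* pair n is beyond the highest one set */
    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    if (start < 0 || end < start)
        return -1;                  /* unused subpattern: offsets are -1 */
    len = (size_t)(end - start);
    if (len + 1 > bufsize)
        return -2;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

For "abc" matched against (a|(z))(bc), the offsets are {0,3, 0,1, -1,-1, 1,3}
and the return value is 4. Capture 3 yields "bc" and capture 2 is reported as
unset.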
                   2836: 
1.1.1.3 ! misho    2837:        Offset values that correspond to unused subpatterns at the end  of  the
        !          2838:        expression  are  also  set  to  -1. For example, if the string "abc" is
        !          2839:        matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
        !          2840:        matched.  The  return  from the function is 2, because the highest used
        !          2841:        capturing subpattern number is 1, and the offsets for the second
        !          2842:        and  third  capturing subpatterns (assuming the vector is large enough,
1.1       misho    2843:        of course) are set to -1.
                   2844: 
1.1.1.3 ! misho    2845:        Note: Elements in the first two-thirds of ovector that  do  not  corre-
        !          2846:        spond  to  capturing parentheses in the pattern are never changed. That
        !          2847:        is, if a pattern contains n capturing parentheses, no more  than  ovec-
        !          2848:        tor[0]  to ovector[2n+1] are set by pcre_exec(). The other elements (in
1.1       misho    2849:        the first two-thirds) retain whatever values they previously had.
                   2850: 
1.1.1.3 ! misho    2851:        Some convenience functions are provided  for  extracting  the  captured
1.1       misho    2852:        substrings as separate strings. These are described below.
                   2853: 
                   2854:    Error return values from pcre_exec()
                   2855: 
1.1.1.3 ! misho    2856:        If  pcre_exec()  fails, it returns a negative number. The following are
1.1       misho    2857:        defined in the header file:
                   2858: 
                   2859:          PCRE_ERROR_NOMATCH        (-1)
                   2860: 
                   2861:        The subject string did not match the pattern.
                   2862: 
                   2863:          PCRE_ERROR_NULL           (-2)
                   2864: 
1.1.1.3 ! misho    2865:        Either code or subject was passed as NULL,  or  ovector  was  NULL  and
1.1       misho    2866:        ovecsize was not zero.
                   2867: 
                   2868:          PCRE_ERROR_BADOPTION      (-3)
                   2869: 
                   2870:        An unrecognized bit was set in the options argument.
                   2871: 
                   2872:          PCRE_ERROR_BADMAGIC       (-4)
                   2873: 
1.1.1.3 ! misho    2874:        PCRE  stores a 4-byte "magic number" at the start of the compiled code,
1.1       misho    2875:        to catch the case when it is passed a junk pointer and to detect when a
                   2876:        pattern that was compiled in an environment of one endianness is run in
1.1.1.3 ! misho    2877:        an environment with the other endianness. This is the error  that  PCRE
1.1       misho    2878:        gives when the magic number is not present.
                   2879: 
                   2880:          PCRE_ERROR_UNKNOWN_OPCODE (-5)
                   2881: 
                   2882:        While running the pattern match, an unknown item was encountered in the
1.1.1.3 ! misho    2883:        compiled pattern. This error could be caused by a bug  in  PCRE  or  by
1.1       misho    2884:        overwriting of the compiled pattern.
                   2885: 
                   2886:          PCRE_ERROR_NOMEMORY       (-6)
                   2887: 
1.1.1.3 ! misho    2888:        If  a  pattern contains back references, but the ovector that is passed
1.1       misho    2889:        to pcre_exec() is not big enough to remember the referenced substrings,
1.1.1.3 ! misho    2890:        PCRE  gets  a  block of memory at the start of matching to use for this
        !          2891:        purpose. If the call via pcre_malloc() fails, this error is given.  The
1.1       misho    2892:        memory is automatically freed at the end of matching.
                   2893: 
1.1.1.3 ! misho    2894:        This  error  is also given if pcre_stack_malloc() fails in pcre_exec().
        !          2895:        This can happen only when PCRE has been compiled with  --disable-stack-
1.1       misho    2896:        for-recursion.
                   2897: 
                   2898:          PCRE_ERROR_NOSUBSTRING    (-7)
                   2899: 
1.1.1.3 ! misho    2900:        This  error is used by the pcre_copy_substring(), pcre_get_substring(),
1.1       misho    2901:        and  pcre_get_substring_list()  functions  (see  below).  It  is  never
                   2902:        returned by pcre_exec().
                   2903: 
                   2904:          PCRE_ERROR_MATCHLIMIT     (-8)
                   2905: 
1.1.1.3 ! misho    2906:        The  backtracking  limit,  as  specified  by the match_limit field in a
        !          2907:        pcre_extra structure (or defaulted) was reached.  See  the  description
1.1       misho    2908:        above.
                   2909: 
                   2910:          PCRE_ERROR_CALLOUT        (-9)
                   2911: 
                   2912:        This error is never generated by pcre_exec() itself. It is provided for
1.1.1.3 ! misho    2913:        use by callout functions that want to yield a distinctive  error  code.
1.1       misho    2914:        See the pcrecallout documentation for details.
                   2915: 
                   2916:          PCRE_ERROR_BADUTF8        (-10)
                   2917: 
1.1.1.3 ! misho    2918:        A  string  that contains an invalid UTF-8 byte sequence was passed as a
        !          2919:        subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size  of
        !          2920:        the  output  vector  (ovecsize)  is  at least 2, the byte offset to the
        !          2921:        start of the invalid UTF-8 character is placed in the first element,
        !          2922:        and a reason code is placed in the second element. The reason
1.1       misho    2923:        codes are listed in the following section.  For backward compatibility,
1.1.1.3 ! misho    2924:        if  PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
        !          2925:        acter  at  the  end  of  the   subject   (reason   codes   1   to   5),
1.1       misho    2926:        PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
                   2927: 
                   2928:          PCRE_ERROR_BADUTF8_OFFSET (-11)
                   2929: 
1.1.1.3 ! misho    2930:        The  UTF-8  byte  sequence that was passed as a subject was checked and
        !          2931:        found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but  the
        !          2932:        value  of startoffset did not point to the beginning of a UTF-8 charac-
1.1       misho    2933:        ter or the end of the subject.
                   2934: 
                   2935:          PCRE_ERROR_PARTIAL        (-12)
                   2936: 
1.1.1.3 ! misho    2937:        The subject string did not match, but it did match partially.  See  the
1.1       misho    2938:        pcrepartial documentation for details of partial matching.
                   2939: 
                   2940:          PCRE_ERROR_BADPARTIAL     (-13)
                   2941: 
1.1.1.3 ! misho    2942:        This  code  is  no  longer  in  use.  It was formerly returned when the
        !          2943:        PCRE_PARTIAL option was used with a compiled pattern  containing  items
        !          2944:        that  were  not  supported  for  partial  matching.  From  release 8.00
1.1       misho    2945:        onwards, there are no restrictions on partial matching.
                   2946: 
                   2947:          PCRE_ERROR_INTERNAL       (-14)
                   2948: 
1.1.1.3 ! misho    2949:        An unexpected internal error has occurred. This error could  be  caused
1.1       misho    2950:        by a bug in PCRE or by overwriting of the compiled pattern.
                   2951: 
                   2952:          PCRE_ERROR_BADCOUNT       (-15)
                   2953: 
                   2954:        This error is given if the value of the ovecsize argument is negative.
                   2955: 
                   2956:          PCRE_ERROR_RECURSIONLIMIT (-21)
                   2957: 
                   2958:        The internal recursion limit, as specified by the match_limit_recursion
1.1.1.3 ! misho    2959:        field in a pcre_extra structure (or defaulted)  was  reached.  See  the
1.1       misho    2960:        description above.
                   2961: 
                   2962:          PCRE_ERROR_BADNEWLINE     (-23)
                   2963: 
                   2964:        An invalid combination of PCRE_NEWLINE_xxx options was given.
                   2965: 
                   2966:          PCRE_ERROR_BADOFFSET      (-24)
                   2967: 
                   2968:        The value of startoffset was negative or greater than the length of the
                   2969:        subject, that is, the value in length.
                   2970: 
                   2971:          PCRE_ERROR_SHORTUTF8      (-25)
                   2972: 
1.1.1.3 ! misho    2973:        This error is returned instead of PCRE_ERROR_BADUTF8 when  the  subject
        !          2974:        string  ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
        !          2975:        option is set.  Information  about  the  failure  is  returned  as  for
        !          2976:        PCRE_ERROR_BADUTF8. That information is in fact sufficient to detect
        !          2977:        this case, but this special error code for PCRE_PARTIAL_HARD precedes
        !          2978:        the implementation of returned information; it is retained for
1.1       misho    2979:        backwards compatibility.
                   2980: 
                   2981:          PCRE_ERROR_RECURSELOOP    (-26)
                   2982: 
                   2983:        This error is returned when pcre_exec() detects a recursion loop within
1.1.1.3 ! misho    2984:        the  pattern. Specifically, it means that either the whole pattern or a
        !          2985:        subpattern has been called recursively for the second time at the  same
1.1       misho    2986:        position in the subject string. Some simple patterns that might do this
1.1.1.3 ! misho    2987:        are detected and faulted at compile time, but more  complicated  cases,
1.1       misho    2988:        in particular mutual recursions between two different subpatterns, can-
                   2989:        not be detected until run time.
                   2990: 
                   2991:          PCRE_ERROR_JIT_STACKLIMIT (-27)
                   2992: 
1.1.1.3 ! misho    2993:        This error is returned when a pattern  that  was  successfully  studied
        !          2994:        using  a  JIT compile option is being matched, but the memory available
        !          2995:        for the just-in-time processing stack is  not  large  enough.  See  the
        !          2996:        pcrejit documentation for more details.
1.1       misho    2997: 
1.1.1.3 ! misho    2998:          PCRE_ERROR_BADMODE        (-28)
1.1.1.2   misho    2999: 
                   3000:        This error is given if a pattern that was compiled by the 8-bit library
                   3001:        is passed to a 16-bit library function, or vice versa.
                   3002: 
1.1.1.3 ! misho    3003:          PCRE_ERROR_BADENDIANNESS  (-29)
1.1.1.2   misho    3004: 
1.1.1.3 ! misho    3005:        This error is given if  a  pattern  that  was  compiled  and  saved  is
        !          3006:        reloaded  on  a  host  with  different endianness. The utility function
1.1.1.2   misho    3007:        pcre_pattern_to_host_byte_order() can be used to convert such a pattern
                   3008:        so that it runs on the new host.
                   3009: 
1.1.1.3 ! misho    3010:        Error numbers -16 to -20, -22, and -30 are not used by pcre_exec().
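For logging, a negative return value can be turned into its symbolic name. The
sketch below uses the numeric values from the list above so that it stands
alone; exec_error_name is a hypothetical helper, and real code should compare
against the PCRE_ERROR_xxx macros from the header file instead.

```c
/* Hypothetical helper, not part of PCRE. The numbers are the values listed
 * above for the errors that pcre_exec() and its helpers can return. */
const char *exec_error_name(int rc)
{
    if (rc >= 0)
        return "match";             /* zero or positive is not an error */
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    case -24: return "PCRE_ERROR_BADOFFSET";
    case -25: return "PCRE_ERROR_SHORTUTF8";
    case -26: return "PCRE_ERROR_RECURSELOOP";
    case -27: return "PCRE_ERROR_JIT_STACKLIMIT";
    case -28: return "PCRE_ERROR_BADMODE";
    case -29: return "PCRE_ERROR_BADENDIANNESS";
    default:  return "unknown";     /* -16 to -20, -22, -30 and others */
    }
}
```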
1.1       misho    3011: 
                   3012:    Reason codes for invalid UTF-8 strings
                   3013: 
1.1.1.3 ! misho    3014:        This  section  applies  only  to  the  8-bit library. The corresponding
1.1.1.2   misho    3015:        information for the 16-bit library is given in the pcre16 page.
                   3016: 
1.1       misho    3017:        When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-
1.1.1.3 ! misho    3018:        UTF8,  and  the size of the output vector (ovecsize) is at least 2, the
        !          3019:        offset of the start of the invalid UTF-8 character  is  placed  in  the
1.1       misho    3020:        first output vector element (ovector[0]) and a reason code is placed in
1.1.1.3 ! misho    3021:        the second element (ovector[1]). The reason codes are  given  names  in
1.1       misho    3022:        the pcre.h header file:
                   3023: 
                   3024:          PCRE_UTF8_ERR1
                   3025:          PCRE_UTF8_ERR2
                   3026:          PCRE_UTF8_ERR3
                   3027:          PCRE_UTF8_ERR4
                   3028:          PCRE_UTF8_ERR5
                   3029: 
1.1.1.3 ! misho    3030:        The  string  ends  with a truncated UTF-8 character; the code specifies
        !          3031:        how many bytes are missing (1 to 5). Although RFC 3629 restricts  UTF-8
        !          3032:        characters  to  be  no longer than 4 bytes, the encoding scheme (origi-
        !          3033:        nally defined by RFC 2279) allows for  up  to  6  bytes,  and  this  is
1.1       misho    3034:        checked first; hence the possibility of 4 or 5 missing bytes.
                   3035: 
                   3036:          PCRE_UTF8_ERR6
                   3037:          PCRE_UTF8_ERR7
                   3038:          PCRE_UTF8_ERR8
                   3039:          PCRE_UTF8_ERR9
                   3040:          PCRE_UTF8_ERR10
                   3041: 
                   3042:        The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
1.1.1.3 ! misho    3043:        the character do not have the binary value 0b10 (that  is,  either  the
1.1       misho    3044:        most significant bit is 0, or the next bit is 1).
                   3045: 
                   3046:          PCRE_UTF8_ERR11
                   3047:          PCRE_UTF8_ERR12
                   3048: 
1.1.1.3 ! misho    3049:        A  character that is valid by the RFC 2279 rules is either 5 or 6 bytes
1.1       misho    3050:        long; these code points are excluded by RFC 3629.
                   3051: 
                   3052:          PCRE_UTF8_ERR13
                   3053: 
1.1.1.3 ! misho    3054:        A 4-byte character has a value greater than 0x10ffff; these code points
1.1       misho    3055:        are excluded by RFC 3629.
                   3056: 
                   3057:          PCRE_UTF8_ERR14
                   3058: 
1.1.1.3 ! misho    3059:        A  3-byte  character  has  a  value in the range 0xd800 to 0xdfff; this
        !          3060:        range of code points is reserved by RFC 3629 for use with UTF-16, and
1.1       misho    3061:        so is excluded from UTF-8.
                   3062: 
                   3063:          PCRE_UTF8_ERR15
                   3064:          PCRE_UTF8_ERR16
                   3065:          PCRE_UTF8_ERR17
                   3066:          PCRE_UTF8_ERR18
                   3067:          PCRE_UTF8_ERR19
                   3068: 
1.1.1.3 ! misho    3069:        A  2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
        !          3070:        for a value that can be represented by fewer bytes, which  is  invalid.
        !          3071:        For  example,  the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
1.1       misho    3072:        rect coding uses just one byte.
                   3073: 
                   3074:          PCRE_UTF8_ERR20
                   3075: 
                   3076:        The two most significant bits of the first byte of a character have the
1.1.1.3 ! misho    3077:        binary  value 0b10 (that is, the most significant bit is 1 and the sec-
        !          3078:        ond is 0). Such a byte can only validly occur as the second  or  subse-
1.1       misho    3079:        quent byte of a multi-byte character.
                   3080: 
                   3081:          PCRE_UTF8_ERR21
                   3082: 
1.1.1.3 ! misho    3083:        The  first byte of a character has the value 0xfe or 0xff. These values
1.1       misho    3084:        can never occur in a valid UTF-8 string.
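
                                The structural conditions listed above can be illustrated with a
                                small stand-alone check. This is only a sketch: it does not use
                                the PCRE library, the names sk_check_char and the SK_UTF8_xxx
                                values are hypothetical, and the results are not the
                                PCRE_UTF8_ERRn codes. It covers truncation, bad continuation
                                bytes, the 2-byte overlong case, a stray 10xxxxxx first byte,
                                and the bytes 0xfe and 0xff; the range checks are left out.

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch only; they are NOT the
   PCRE_UTF8_ERRn values defined in pcre.h. */
enum sk_utf8_status {
  SK_UTF8_OK,          /* character is structurally plausible */
  SK_UTF8_TRUNCATED,   /* string ends inside the character */
  SK_UTF8_BAD_CONT,    /* a following byte is not of the form 10xxxxxx */
  SK_UTF8_OVERLONG2,   /* 2-byte form used for a value below 0x80 */
  SK_UTF8_STRAY_CONT,  /* first byte is of the form 10xxxxxx */
  SK_UTF8_FE_FF        /* first byte is 0xfe or 0xff */
};

/* Check the single character that starts at p. avail is the number of
   bytes left in the string and must be at least 1. When the result is
   SK_UTF8_TRUNCATED, *missing holds the number of absent bytes. */
enum sk_utf8_status sk_check_char(const unsigned char *p, size_t avail,
                                  int *missing)
{
  unsigned char c = p[0];
  size_t len, i;

  *missing = 0;
  if (c < 0x80) return SK_UTF8_OK;                    /* one-byte character */
  if ((c & 0xc0) == 0x80) return SK_UTF8_STRAY_CONT;  /* top bits are 10 */
  if (c >= 0xfe) return SK_UTF8_FE_FF;

  /* Length implied by the first byte under the RFC 2279 scheme (2 to 6). */
  len = (c < 0xe0) ? 2 : (c < 0xf0) ? 3 : (c < 0xf8) ? 4 : (c < 0xfc) ? 5 : 6;

  if (avail < len) {
    *missing = (int)(len - avail);                    /* 1 to 5 */
    return SK_UTF8_TRUNCATED;
  }

  /* Every following byte must have 10 as its two most significant bits. */
  for (i = 1; i < len; i++)
    if ((p[i] & 0xc0) != 0x80) return SK_UTF8_BAD_CONT;

  /* 0xc0 and 0xc1 can only introduce values that fit in one byte. */
  if (len == 2 && (c & 0x3e) == 0) return SK_UTF8_OVERLONG2;

  return SK_UTF8_OK;
}
```

                                With the overlong example from the text, the bytes 0xc0, 0xae
                                yield SK_UTF8_OVERLONG2 in this sketch.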
                   3085: 
                   3086: 
                   3087: EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
                   3088: 
                   3089:        int pcre_copy_substring(const char *subject, int *ovector,
                   3090:             int stringcount, int stringnumber, char *buffer,
                   3091:             int buffersize);
                   3092: 
                   3093:        int pcre_get_substring(const char *subject, int *ovector,
                   3094:             int stringcount, int stringnumber,
                   3095:             const char **stringptr);
                   3096: 
                   3097:        int pcre_get_substring_list(const char *subject,
                   3098:             int *ovector, int stringcount, const char ***listptr);
                   3099: 
1.1.1.3 ! misho    3100:        Captured substrings can be  accessed  directly  by  using  the  offsets
        !          3101:        returned  by  pcre_exec()  in  ovector.  For convenience, the functions
1.1       misho    3102:        pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
1.1.1.3 ! misho    3103:        string_list()  are  provided for extracting captured substrings as new,
        !          3104:        separate, zero-terminated strings. These functions identify  substrings
        !          3105:        by  number.  The  next section describes functions for extracting named
1.1       misho    3106:        substrings.
                   3107: 
1.1.1.3 ! misho    3108:        A substring that contains a binary zero is correctly extracted and  has
        !          3109:        a  further zero added on the end, but the result is not, of course, a C
        !          3110:        string.  However, you can process such a string  by  referring  to  the
        !          3111:        length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-
1.1       misho    3112:        string().  Unfortunately, the interface to pcre_get_substring_list() is
1.1.1.3 ! misho    3113:        not  adequate for handling strings containing binary zeros, because the
1.1       misho    3114:        end of the final string is not independently indicated.
                   3115: 
1.1.1.3 ! misho    3116:        The first three arguments are the same for all  three  of  these  func-
        !          3117:        tions:  subject  is  the subject string that has just been successfully
1.1       misho    3118:        matched, ovector is a pointer to the vector of integer offsets that was
                   3119:        passed to pcre_exec(), and stringcount is the number of substrings that
1.1.1.3 ! misho    3120:        were captured by the match, including the substring  that  matched  the
1.1       misho    3121:        entire regular expression. This is the value returned by pcre_exec() if
1.1.1.3 ! misho    3122:        it is greater than zero. If pcre_exec() returned zero, indicating  that
        !          3123:        it  ran out of space in ovector, the value passed as stringcount should
1.1       misho    3124:        be the number of elements in the vector divided by three.
                   3125: 
1.1.1.3 ! misho    3126:        The functions pcre_copy_substring() and pcre_get_substring() extract  a
        !          3127:        single  substring,  whose  number  is given as stringnumber. A value of
        !          3128:        zero extracts the substring that matched the  entire  pattern,  whereas
        !          3129:        higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
        !          3130:        string(), the string is placed in buffer,  whose  length  is  given  by
        !          3131:        buffersize,  while  for  pcre_get_substring()  a new block of memory is
        !          3132:        obtained via pcre_malloc, and its address is  returned  via  stringptr.
        !          3133:        The  yield  of  the function is the length of the string, not including
1.1       misho    3134:        the terminating zero, or one of these error codes:
                   3135: 
                   3136:          PCRE_ERROR_NOMEMORY       (-6)
                   3137: 
1.1.1.3 ! misho    3138:        The buffer was too small for pcre_copy_substring(), or the  attempt  to
1.1       misho    3139:        get memory failed for pcre_get_substring().
                   3140: 
                   3141:          PCRE_ERROR_NOSUBSTRING    (-7)
                   3142: 
                   3143:        There is no substring whose number is stringnumber.
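
                                As an illustration of direct access through the offsets, the
                                following stand-alone sketch copies one captured substring out
                                of the subject. It does not call the PCRE library; the function
                                sk_copy_substring and the SK_ERROR_xxx macros are hypothetical
                                names whose values simply mirror the numbers quoted above, and
                                the real pcre_copy_substring() may differ in detail.

```c
#include <string.h>

/* Hypothetical error values for this sketch, mirroring the numbers
   quoted in the text. */
#define SK_ERROR_NOMEMORY    (-6)
#define SK_ERROR_NOSUBSTRING (-7)

/* Copy captured substring number n into buffer as a zero-terminated
   string. ovector holds (start, end) offset pairs as left by a
   successful match; stringcount is the number of valid pairs.
   Returns the length of the copy, or a negative error value. */
int sk_copy_substring(const char *subject, const int *ovector,
                      int stringcount, int n, char *buffer, int buffersize)
{
  int start, len;

  if (n < 0 || n >= stringcount) return SK_ERROR_NOSUBSTRING;

  start = ovector[2 * n];
  len = ovector[2 * n + 1] - start;   /* an unset pair gives a length of 0 */

  if (buffersize < len + 1) return SK_ERROR_NOMEMORY;

  if (len > 0) memcpy(buffer, subject + start, (size_t)len);
  buffer[len] = '\0';
  return len;
}
```

                                For a subject "aaab42" and the offset pairs (0,6), (0,3), (4,6),
                                asking for substring 2 places "42" in the buffer and returns 2.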
                   3144: 
1.1.1.3 ! misho    3145:        The  pcre_get_substring_list()  function  extracts  all  available sub-
        !          3146:        strings and builds a list of pointers to them. All this is  done  in  a
1.1       misho    3147:        single block of memory that is obtained via pcre_malloc. The address of
1.1.1.3 ! misho    3148:        the memory block is returned via listptr, which is also  the  start  of
        !          3149:        the  list  of  string pointers. The end of the list is marked by a NULL
        !          3150:        pointer. The yield of the function is zero if all  went  well,  or  the
1.1       misho    3151:        error code
                   3152: 
                   3153:          PCRE_ERROR_NOMEMORY       (-6)
                   3154: 
                   3155:        if the attempt to get the memory block failed.
                   3156: 
1.1.1.3 ! misho    3157:        When any of these functions encounters a substring that is unset, which
        !          3158:        can happen when capturing subpattern number n+1 matches  some  part  of
        !          3159:        the  subject, but subpattern n has not been used at all, they return an
1.1       misho    3160:        empty string. This can be distinguished from a genuine zero-length sub-
1.1.1.3 ! misho    3161:        string  by inspecting the appropriate offset in ovector, which is nega-
1.1       misho    3162:        tive for unset substrings.
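
                                The distinction between an unset substring and a genuine empty
                                one can be sketched as a check on the offsets alone. The names
                                below are hypothetical and the code does not call the library.

```c
/* Hypothetical classification used only in this sketch. */
enum sk_capture_state {
  SK_CAPTURE_UNSET,     /* the group took no part in the match */
  SK_CAPTURE_EMPTY,     /* the group matched a zero-length string */
  SK_CAPTURE_NONEMPTY   /* the group matched at least one character */
};

/* Classify capturing group n from its (start, end) pair in ovector.
   The caller must ensure that n is less than the number of valid pairs. */
enum sk_capture_state sk_classify_capture(const int *ovector, int n)
{
  int start = ovector[2 * n];
  int end = ovector[2 * n + 1];

  if (start < 0) return SK_CAPTURE_UNSET;   /* negative offset: unset */
  return (end == start) ? SK_CAPTURE_EMPTY : SK_CAPTURE_NONEMPTY;
}
```

                                Both the unset and the empty case give an empty string from the
                                extraction functions; only the offsets tell them apart.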
                   3163: 
1.1.1.3 ! misho    3164:        The two convenience functions pcre_free_substring() and  pcre_free_sub-
        !          3165:        string_list()  can  be  used  to free the memory returned by a previous
1.1       misho    3166:        call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
1.1.1.3 ! misho    3167:        tively.  They  do  nothing  more  than  call the function pointed to by
        !          3168:        pcre_free, which of course could be called directly from a  C  program.
        !          3169:        However,  PCRE is used in some situations where it is linked via a spe-
        !          3170:        cial  interface  to  another  programming  language  that  cannot   use
        !          3171:        pcre_free  directly;  it is for these cases that the functions are pro-
1.1       misho    3172:        vided.
                   3173: 
                   3174: 
                   3175: EXTRACTING CAPTURED SUBSTRINGS BY NAME
                   3176: 
                   3177:        int pcre_get_stringnumber(const pcre *code,
                   3178:             const char *name);
                   3179: 
                   3180:        int pcre_copy_named_substring(const pcre *code,
                   3181:             const char *subject, int *ovector,
                   3182:             int stringcount, const char *stringname,
                   3183:             char *buffer, int buffersize);
                   3184: 
                   3185:        int pcre_get_named_substring(const pcre *code,
                   3186:             const char *subject, int *ovector,
                   3187:             int stringcount, const char *stringname,
                   3188:             const char **stringptr);
                   3189: 
1.1.1.3 ! misho    3190:        To extract a substring by name, you first have to find the associated
1.1       misho    3191:        number. For example, for this pattern
                   3192: 
                   3193:          (a+)b(?<xxx>\d+)...
                   3194: 
                   3195:        the number of the subpattern called "xxx" is 2. If the name is known to
                   3196:        be unique (PCRE_DUPNAMES was not set), you can find the number from the
                   3197:        name by calling pcre_get_stringnumber(). The first argument is the com-
                   3198:        piled pattern, and the second is the name. The yield of the function is
1.1.1.3 ! misho    3199:        the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
1.1       misho    3200:        subpattern of that name.
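
                                The name-to-number translation can be pictured with a
                                stand-alone sketch. It assumes the 8-bit name table layout
                                (fixed-size entries, each a two-byte group number with the most
                                significant byte first, followed by the zero-terminated name)
                                and uses a plain linear scan. The function sk_get_stringnumber
                                and the macro SK_ERROR_NOSUBSTRING are hypothetical; this is not
                                the library's implementation.

```c
#include <string.h>

#define SK_ERROR_NOSUBSTRING (-7)   /* mirrors the number quoted above */

/* Look up name in a table of entrycount entries, each entrysize bytes
   long. Returns the group number, or SK_ERROR_NOSUBSTRING if the name
   is not present. */
int sk_get_stringnumber(const unsigned char *table, int entrycount,
                        int entrysize, const char *name)
{
  int i;

  for (i = 0; i < entrycount; i++) {
    const unsigned char *entry = table + i * entrysize;

    /* Bytes 0 and 1 hold the number; the name starts at byte 2. */
    if (strcmp((const char *)(entry + 2), name) == 0)
      return (entry[0] << 8) | entry[1];
  }
  return SK_ERROR_NOSUBSTRING;
}
```

                                For the pattern above, a one-entry table holding the bytes
                                00 02 'x' 'x' 'x' 00 maps "xxx" to 2.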
                   3201: 
                   3202:        Given the number, you can extract the substring directly, or use one of
                   3203:        the functions described in the previous section. For convenience, there
                   3204:        are also two functions that do the whole job.
                   3205: 
1.1.1.3 ! misho    3206:        Most   of   the   arguments    of    pcre_copy_named_substring()    and
        !          3207:        pcre_get_named_substring()  are  the  same  as  those for the similarly
        !          3208:        named functions that extract by number. As these are described  in  the
        !          3209:        previous  section,  they  are not re-described here. There are just two
1.1       misho    3210:        differences:
                   3211: 
1.1.1.3 ! misho    3212:        First, instead of a substring number, a substring name is  given.  Sec-
1.1       misho    3213:        ond, there is an extra argument, given at the start, which is a pointer
1.1.1.3 ! misho    3214:        to the compiled pattern. This is needed in order to gain access to  the
1.1       misho    3215:        name-to-number translation table.
                   3216: 
1.1.1.3 ! misho    3217:        These  functions call pcre_get_stringnumber(), and if it succeeds, they
        !          3218:        then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
        !          3219:        ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
1.1       misho    3220:        behaviour may not be what you want (see the next section).
                   3221: 
                   3222:        Warning: If the pattern uses the (?| feature to set up multiple subpat-
1.1.1.3 ! misho    3223:        terns  with  the  same number, as described in the section on duplicate
        !          3224:        subpattern numbers in the pcrepattern page, you  cannot  use  names  to
        !          3225:        distinguish  the  different subpatterns, because names are not included
        !          3226:        in the compiled code. The matching process uses only numbers. For  this
        !          3227:        reason,  the  use of different names for subpatterns of the same number
1.1       misho    3228:        causes an error at compile time.
                   3229: 
                   3230: 
                   3231: DUPLICATE SUBPATTERN NAMES
                   3232: 
                   3233:        int pcre_get_stringtable_entries(const pcre *code,
                   3234:             const char *name, char **first, char **last);
                   3235: 
1.1.1.3 ! misho    3236:        When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for
        !          3237:        subpatterns  are not required to be unique. (Duplicate names are always
        !          3238:        allowed for subpatterns with the same number, created by using the  (?|
        !          3239:        feature.  Indeed,  if  such subpatterns are named, they are required to
1.1       misho    3240:        use the same names.)
                   3241: 
                   3242:        Normally, patterns with duplicate names are such that in any one match,
1.1.1.3 ! misho    3243:        only  one of the named subpatterns participates. An example is shown in
1.1       misho    3244:        the pcrepattern documentation.
                   3245: 
1.1.1.3 ! misho    3246:        When   duplicates   are   present,   pcre_copy_named_substring()    and
        !          3247:        pcre_get_named_substring()  return the first substring corresponding to
        !          3248:        the given name that is set. If  none  are  set,  PCRE_ERROR_NOSUBSTRING
        !          3249:        (-7)  is  returned;  no  data  is returned. The pcre_get_stringnumber()
        !          3250:        function returns one of the numbers that are associated with the  name,
1.1       misho    3251:        but it is not defined which it is.
                   3252: 
1.1.1.3 ! misho    3253:        If  you want to get full details of all captured substrings for a given
        !          3254:        name, you must use  the  pcre_get_stringtable_entries()  function.  The
1.1       misho    3255:        first argument is the compiled pattern, and the second is the name. The
1.1.1.3 ! misho    3256:        third and fourth are pointers to variables which  are  updated  by  the
1.1       misho    3257:        function. After it has run, they point to the first and last entries in
1.1.1.3 ! misho    3258:        the name-to-number table  for  the  given  name.  The  function  itself
        !          3259:        returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
        !          3260:        there are none. The format of the table is described above in the
        !          3261:        section entitled Information about a pattern. Given all the relevant
        !          3262:        entries for the name, you can extract each of their numbers, and
1.1       misho    3263:        hence the captured data, if any.
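
                                A sketch of walking the entries for one duplicated name, once
                                the first and last pointers and the entry length are known. It
                                assumes the same 8-bit entry layout (two-byte number, most
                                significant byte first, then the name) and picks the first group
                                that is set. The function sk_first_set_group is a hypothetical
                                helper, not part of the library.

```c
#define SK_ERROR_NOSUBSTRING (-7)   /* mirrors the number quoted above */

/* first and last point at the first and last table entries for one
   name; consecutive entries are entrysize bytes apart. Returns the
   number of the first of those groups that is set in ovector, or
   SK_ERROR_NOSUBSTRING if none of them is set. */
int sk_first_set_group(const unsigned char *first, const unsigned char *last,
                       int entrysize, const int *ovector, int stringcount)
{
  const unsigned char *entry;

  for (entry = first; entry <= last; entry += entrysize) {
    int n = (entry[0] << 8) | entry[1];

    /* Groups beyond stringcount, or with a negative offset, are unset. */
    if (n < stringcount && ovector[2 * n] >= 0) return n;
  }
  return SK_ERROR_NOSUBSTRING;
}
```

                                With two entries named "DN" for groups 1 and 2, and a match in
                                which only group 2 is set, the sketch returns 2.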
                   3264: 
                   3265: 
                   3266: FINDING ALL POSSIBLE MATCHES
                   3267: 
1.1.1.3 ! misho    3268:        The  traditional  matching  function  uses a similar algorithm to Perl,
1.1       misho    3269:        which stops when it finds the first match, starting at a given point in
1.1.1.3 ! misho    3270:        the  subject.  If you want to find all possible matches, or the longest
        !          3271:        possible match, consider using the alternative matching  function  (see
        !          3272:        below)  instead.  If you cannot use the alternative function, but still
        !          3273:        need to find all possible matches, you can kludge it up by  making  use
1.1       misho    3274:        of the callout facility, which is described in the pcrecallout documen-
                   3275:        tation.
                   3276: 
                   3277:        What you have to do is to insert a callout right at the end of the pat-
1.1.1.3 ! misho    3278:        tern.   When your callout function is called, extract and save the cur-
        !          3279:        rent matched substring. Then return  1,  which  forces  pcre_exec()  to
        !          3280:        backtrack  and  try other alternatives. Ultimately, when it runs out of
1.1       misho    3281:        matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
                   3282: 
                   3283: 
1.1.1.2   misho    3284: OBTAINING AN ESTIMATE OF STACK USAGE
                   3285: 
1.1.1.3 ! misho    3286:        Matching certain patterns using pcre_exec() can use a  lot  of  process
        !          3287:        stack,  which  in  certain  environments can be rather limited in size.
        !          3288:        Some users find it helpful to have an estimate of the amount  of  stack
        !          3289:        that  is  used  by  pcre_exec(),  to help them set recursion limits, as
        !          3290:        described in the pcrestack documentation. The estimate that  is  output
1.1.1.2   misho    3291:        by pcretest when called with the -m and -C options is obtained by call-
1.1.1.3 ! misho    3292:        ing pcre_exec with the values NULL, NULL, NULL, -999, and -999 for  its
1.1.1.2   misho    3293:        first five arguments.
                   3294: 
1.1.1.3 ! misho    3295:        Normally,  if  its  first  argument  is  NULL,  pcre_exec() immediately
        !          3296:        returns the negative error code PCRE_ERROR_NULL, but with this  special
        !          3297:        combination  of  arguments,  it returns instead a negative number whose
        !          3298:        absolute value is the approximate stack frame size in bytes.  (A  nega-
        !          3299:        tive  number  is  used so that it is clear that no match has happened.)
        !          3300:        The value is approximate because in  some  cases,  recursive  calls  to
1.1.1.2   misho    3301:        pcre_exec() occur when there are one or two additional variables on the
                   3302:        stack.
                   3303: 
1.1.1.3 ! misho    3304:        If PCRE has been compiled to use the heap  instead  of  the  stack  for
        !          3305:        recursion,  the  value  returned  is  the  size  of  each block that is
1.1.1.2   misho    3306:        obtained from the heap.
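
                                One plausible use of the estimate is to derive a recursion limit
                                from a stack budget. The sketch below only does the arithmetic;
                                the function name sk_recursion_limit and the idea of a fixed
                                stack_budget are assumptions of this example, and because the
                                frame size is approximate a real program would leave a margin.

```c
/* rc is the negative value returned by the special call described
   above. stack_budget is the number of bytes of stack that matching
   may use (an assumption of this sketch). Returns a candidate value
   for a recursion limit, or 0 if rc is not a usable estimate. */
unsigned long sk_recursion_limit(int rc, unsigned long stack_budget)
{
  unsigned long frame_size;

  if (rc >= 0) return 0;                    /* not a frame size estimate */

  frame_size = (unsigned long)(-(long)rc);  /* absolute value, in bytes */
  return stack_budget / frame_size;
}
```

                                For example, an estimate of -512 with a budget of 1048576 bytes
                                gives a limit of 2048.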
                   3307: 
                   3308: 
1.1       misho    3309: MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
                   3310: 
                   3311:        int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
                   3312:             const char *subject, int length, int startoffset,
                   3313:             int options, int *ovector, int ovecsize,
                   3314:             int *workspace, int wscount);
                   3315: 
1.1.1.3 ! misho    3316:        The function pcre_dfa_exec()  is  called  to  match  a  subject  string
        !          3317:        against  a  compiled pattern, using a matching algorithm that scans the
        !          3318:        subject string just once, and does not backtrack.  This  has  different
        !          3319:        characteristics  to  the  normal  algorithm, and is not compatible with
        !          3320:        Perl. Some of the features of PCRE patterns are not  supported.  Never-
        !          3321:        theless,  there are times when this kind of matching can be useful. For
        !          3322:        a discussion of the two matching algorithms, and  a  list  of  features
        !          3323:        that  pcre_dfa_exec() does not support, see the pcrematching documenta-
1.1       misho    3324:        tion.
                   3325: 
1.1.1.3 ! misho    3326:        The arguments for the pcre_dfa_exec() function  are  the  same  as  for
1.1       misho    3327:        pcre_exec(), plus two extras. The ovector argument is used in a differ-
1.1.1.3 ! misho    3328:        ent way, and this is described below. The other  common  arguments  are
        !          3329:        used  in  the  same way as for pcre_exec(), so their description is not
1.1       misho    3330:        repeated here.
                   3331: 
1.1.1.3 ! misho    3332:        The two additional arguments provide workspace for  the  function.  The
        !          3333:        workspace  vector  should  contain at least 20 elements. It is used for
1.1       misho    3334:        keeping  track  of  multiple  paths  through  the  pattern  tree.  More
1.1.1.3 ! misho    3335:        workspace  will  be  needed for patterns and subjects where there are a
1.1       misho    3336:        lot of potential matches.
                   3337: 
                   3338:        Here is an example of a simple call to pcre_dfa_exec():
                   3339: 
                   3340:          int rc;
                   3341:          int ovector[10];
                   3342:          int wspace[20];
                   3343:          rc = pcre_dfa_exec(
                   3344:            re,             /* result of pcre_compile() */
                   3345:            NULL,           /* we didn't study the pattern */
                   3346:            "some string",  /* the subject string */
                   3347:            11,             /* the length of the subject string */
                   3348:            0,              /* start at offset 0 in the subject */
                   3349:            0,              /* default options */
                   3350:            ovector,        /* vector of integers for substring information */
                   3351:            10,             /* number of elements (NOT size in bytes) */
                   3352:            wspace,         /* working space vector */
                   3353:            20);            /* number of elements (NOT size in bytes) */
                   3354: 
                   3355:    Option bits for pcre_dfa_exec()
                   3356: 
1.1.1.3 ! misho    3357:        The unused bits of the options argument  for  pcre_dfa_exec()  must  be
        !          3358:        zero.  The  only  bits  that  may  be  set are PCRE_ANCHORED, PCRE_NEW-
1.1       misho    3359:        LINE_xxx,        PCRE_NOTBOL,        PCRE_NOTEOL,        PCRE_NOTEMPTY,
1.1.1.3 ! misho    3360:        PCRE_NOTEMPTY_ATSTART,       PCRE_NO_UTF8_CHECK,      PCRE_BSR_ANYCRLF,
        !          3361:        PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD,  PCRE_PAR-
        !          3362:        TIAL_SOFT,  PCRE_DFA_SHORTEST,  and PCRE_DFA_RESTART.  All but the last
        !          3363:        four of these are  exactly  the  same  as  for  pcre_exec(),  so  their
1.1       misho    3364:        description is not repeated here.
                   3365: 
                   3366:          PCRE_PARTIAL_HARD
                   3367:          PCRE_PARTIAL_SOFT
                   3368: 
1.1.1.3 ! misho    3369:        These  have the same general effect as they do for pcre_exec(), but the
        !          3370:        details are slightly  different.  When  PCRE_PARTIAL_HARD  is  set  for
        !          3371:        pcre_dfa_exec(),  it  returns PCRE_ERROR_PARTIAL if the end of the sub-
        !          3372:        ject is reached and there is still at least  one  matching  possibility
1.1       misho    3373:        that requires additional characters. This happens even if some complete
                   3374:        matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
                   3375:        code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
1.1.1.3 ! misho    3376:        of the subject is reached, there have been  no  complete  matches,  but
        !          3377:        there  is  still  at least one matching possibility. The portion of the
        !          3378:        string that was inspected when the longest partial match was  found  is
        !          3379:        set  as  the  first  matching  string  in  both cases.  There is a more
        !          3380:        detailed discussion of partial and multi-segment matching,  with  exam-
1.1       misho    3381:        ples, in the pcrepartial documentation.
                   3382: 
                   3383:          PCRE_DFA_SHORTEST
                   3384: 
1.1.1.3 ! misho    3385:        Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to
1.1       misho    3386:        stop as soon as it has found one match. Because of the way the alterna-
1.1.1.3 ! misho    3387:        tive  algorithm  works, this is necessarily the shortest possible match
1.1       misho    3388:        at the first possible matching point in the subject string.
                   3389: 
                   3390:          PCRE_DFA_RESTART
                   3391: 
                   3392:        When pcre_dfa_exec() returns a partial match, it is possible to call it
1.1.1.3 ! misho    3393:        again,  with  additional  subject characters, and have it continue with
        !          3394:        the same match. The PCRE_DFA_RESTART option requests this action;  when
        !          3395:        it is set, the workspace and wscount arguments must reference the same
        !          3396:        vector as before because data about the match so far is  left  in  them
1.1       misho    3397:        after a partial match. There is more discussion of this facility in the
                   3398:        pcrepartial documentation.
                   3399: 
                   3400:    Successful returns from pcre_dfa_exec()
                   3401: 
1.1.1.3 ! misho    3402:        When pcre_dfa_exec() succeeds, it may have matched more than  one  sub-
1.1       misho    3403:        string in the subject. Note, however, that all the matches from one run
1.1.1.3 ! misho    3404:        of the function start at the same point in  the  subject.  The  shorter
        !          3405:        matches  are all initial substrings of the longer matches. For example,
1.1       misho    3406:        if the pattern
                   3407: 
                   3408:          <.*>
                   3409: 
                   3410:        is matched against the string
                   3411: 
                   3412:          This is <something> <something else> <something further> no more
                   3413: 
                   3414:        the three matched strings are
                   3415: 
                   3416:          <something>
                   3417:          <something> <something else>
                   3418:          <something> <something else> <something further>
                   3419: 
1.1.1.3 ! misho    3420:        On success, the yield of the function is a number  greater  than  zero,
        !          3421:        which  is  the  number of matched substrings. The substrings themselves
        !          3422:        are returned in ovector. Each string uses two elements;  the  first  is
        !          3423:        the  offset  to  the start, and the second is the offset to the end. In
        !          3424:        fact, all the strings have the same start  offset.  (Space  could  have
        !          3425:        been  saved by giving this only once, but it was decided to retain some
        !          3426:        compatibility with the way pcre_exec() returns data,  even  though  the
1.1       misho    3427:        meaning of the strings is different.)
                   3428: 
                   3429:        The strings are returned in reverse order of length; that is, the long-
1.1.1.3 ! misho    3430:        est matching string is given first. If there were too many  matches  to
        !          3431:        fit  into ovector, the yield of the function is zero, and the vector is
        !          3432:        filled with the longest matches.  Unlike  pcre_exec(),  pcre_dfa_exec()
1.1       misho    3433:        can use the entire ovector for returning matched strings.
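
                                The layout of the results can be sketched with two small helpers
                                that work on the return value and the output vector only. They
                                do not call the library, and the names sk_dfa_match_count and
                                sk_dfa_match_length are hypothetical.

```c
/* Number of (start, end) pairs that hold matches after a call that
   returned rc with an output vector of ovecsize elements. */
int sk_dfa_match_count(int rc, int ovecsize)
{
  if (rc > 0) return rc;               /* all matches fitted */
  if (rc == 0) return ovecsize / 2;    /* vector filled with the longest */
  return 0;                            /* negative: no match or an error */
}

/* Length of the i-th returned match. Matches are ordered longest
   first, so i == 0 gives the longest one. */
int sk_dfa_match_length(const int *ovector, int i)
{
  return ovector[2 * i + 1] - ovector[2 * i];
}
```

                                For the <.*> example above, the three pairs would be (8,56),
                                (8,36), and (8,19), giving lengths 48, 28, and 11.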
                   3434: 
                   3435:    Error returns from pcre_dfa_exec()
                   3436: 
1.1.1.3 ! misho    3437:        The  pcre_dfa_exec()  function returns a negative number when it fails.
        !          3438:        Many of the errors are the same  as  for  pcre_exec(),  and  these  are
        !          3439:        described  above.   There are in addition the following errors that are
1.1       misho    3440:        specific to pcre_dfa_exec():
                   3441: 
                   3442:          PCRE_ERROR_DFA_UITEM      (-16)
                   3443: 
1.1.1.3 ! misho    3444:        This return is given if pcre_dfa_exec() encounters an item in the  pat-
        !          3445:        tern  that  it  does not support, for instance, the use of \C or a back
1.1       misho    3446:        reference.
                   3447: 
                   3448:          PCRE_ERROR_DFA_UCOND      (-17)
                   3449: 
1.1.1.3 ! misho    3450:        This return is given if pcre_dfa_exec()  encounters  a  condition  item
        !          3451:        that  uses  a back reference for the condition, or a test for recursion
1.1       misho    3452:        in a specific group. These are not supported.
                   3453: 
                   3454:          PCRE_ERROR_DFA_UMLIMIT    (-18)
                   3455: 
1.1.1.3 ! misho    3456:        This return is given if pcre_dfa_exec() is called with an  extra  block
        !          3457:        that  contains  a  setting  of the match_limit or match_limit_recursion
        !          3458:        fields. This is not supported (these fields  are  meaningless  for  DFA
1.1       misho    3459:        matching).
                   3460: 
                   3461:          PCRE_ERROR_DFA_WSSIZE     (-19)
                   3462: 
1.1.1.3 ! misho    3463:        This  return  is  given  if  pcre_dfa_exec()  runs  out of space in the
1.1       misho    3464:        workspace vector.
                   3465: 
                   3466:          PCRE_ERROR_DFA_RECURSE    (-20)
                   3467: 
1.1.1.3 ! misho    3468:        When a recursive subpattern is processed, the matching  function  calls
        !          3469:        itself  recursively,  using  private vectors for ovector and workspace.
        !          3470:        This error is given if the output vector  is  not  large  enough.  This
1.1       misho    3471:        should be extremely rare, as a vector of size 1000 is used.
                   3472: 
1.1.1.3 ! misho    3473:          PCRE_ERROR_DFA_BADRESTART (-30)
        !          3474: 
        !          3475:        When  pcre_dfa_exec()  is called with the PCRE_DFA_RESTART option, some
        !          3476:        plausibility checks are made on the contents of  the  workspace,  which
        !          3477:        should  contain  data about the previous partial match. If any of these
        !          3478:        checks fail, this error is given.
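
The error codes listed above could be turned into readable names with a small
lookup such as the following sketch. The numeric values are the ones given in
this section; the fallback definitions are only there so that the fragment
builds without pcre.h, and the function name dfa_error_name is hypothetical.

```c
#include <stddef.h>

/* Fallback definitions using the values documented above. When pcre.h is
   included first, its own definitions are used instead. */
#ifndef PCRE_ERROR_DFA_UITEM
#define PCRE_ERROR_DFA_UITEM      (-16)
#define PCRE_ERROR_DFA_UCOND      (-17)
#define PCRE_ERROR_DFA_UMLIMIT    (-18)
#define PCRE_ERROR_DFA_WSSIZE     (-19)
#define PCRE_ERROR_DFA_RECURSE    (-20)
#endif
#ifndef PCRE_ERROR_DFA_BADRESTART
#define PCRE_ERROR_DFA_BADRESTART (-30)
#endif

/* Return the symbolic name of a DFA-specific error code, or NULL if rc is
   not one of them (for example, an error shared with pcre_exec()). */
const char *dfa_error_name(int rc)
{
    switch (rc) {
    case PCRE_ERROR_DFA_UITEM:      return "PCRE_ERROR_DFA_UITEM";
    case PCRE_ERROR_DFA_UCOND:      return "PCRE_ERROR_DFA_UCOND";
    case PCRE_ERROR_DFA_UMLIMIT:    return "PCRE_ERROR_DFA_UMLIMIT";
    case PCRE_ERROR_DFA_WSSIZE:     return "PCRE_ERROR_DFA_WSSIZE";
    case PCRE_ERROR_DFA_RECURSE:    return "PCRE_ERROR_DFA_RECURSE";
    case PCRE_ERROR_DFA_BADRESTART: return "PCRE_ERROR_DFA_BADRESTART";
    default:                        return NULL;
    }
}
```

A NULL result signals that the caller should fall back to whatever handling it
uses for the errors common to both matching functions.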
        !          3479: 
1.1       misho    3480: 
                   3481: SEE ALSO
                   3482: 
1.1.1.2   misho    3483:        pcre16(3), pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),
                   3484:        pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3),
                   3485:        pcrestack(3).
1.1       misho    3486: 
                   3487: 
                   3488: AUTHOR
                   3489: 
                   3490:        Philip Hazel
                   3491:        University Computing Service
                   3492:        Cambridge CB2 3QH, England.
                   3493: 
                   3494: 
                   3495: REVISION
                   3496: 
1.1.1.3 ! misho    3497:        Last updated: 17 June 2012
1.1.1.2   misho    3498:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    3499: ------------------------------------------------------------------------------
                   3500: 
                   3501: 
                   3502: PCRECALLOUT(3)                                                  PCRECALLOUT(3)
                   3503: 
                   3504: 
                   3505: NAME
                   3506:        PCRE - Perl-compatible regular expressions
                   3507: 
                   3508: 
                   3509: PCRE CALLOUTS
                   3510: 
                   3511:        int (*pcre_callout)(pcre_callout_block *);
                   3512: 
1.1.1.2   misho    3513:        int (*pcre16_callout)(pcre16_callout_block *);
                   3514: 
1.1       misho    3515:        PCRE provides a feature called "callout", which is a means of temporar-
                   3516:        ily passing control to the caller of PCRE  in  the  middle  of  pattern
                   3517:        matching.  The  caller of PCRE provides an external function by putting
1.1.1.2   misho    3518:        its entry point in the global variable pcre_callout (pcre16_callout for
                   3519:        the  16-bit  library).  By  default, this variable contains NULL, which
                   3520:        disables all calling out.
1.1       misho    3521: 
1.1.1.2   misho    3522:        Within a regular expression, (?C) indicates the  points  at  which  the
                   3523:        external  function  is  to  be  called. Different callout points can be
                   3524:        identified by putting a number less than 256 after the  letter  C.  The
                   3525:        default  value  is  zero.   For  example,  this pattern has two callout
1.1       misho    3526:        points:
                   3527: 
                   3528:          (?C1)abc(?C2)def
                   3529: 
1.1.1.2   misho    3530:        If the PCRE_AUTO_CALLOUT option bit is set when a pattern is  compiled,
                   3531:        PCRE  automatically  inserts callouts, all with number 255, before each
                   3532:        item in the pattern. For example, if PCRE_AUTO_CALLOUT is used with the
                   3533:        pattern
1.1       misho    3534: 
                   3535:          A(\d{2}|--)
                   3536: 
                   3537:        it is processed as if it were
                   3538: 
                   3539:        (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
                   3540: 
1.1.1.2   misho    3541:        Notice  that  there  is a callout before and after each parenthesis and
                   3542:        alternation bar. Automatic  callouts  can  be  used  for  tracking  the
                   3543:        progress  of  pattern matching. The pcretest command has an option that
                   3544:        sets automatic callouts; when it is used, the output indicates how  the
                   3545:        pattern  is  matched. This is useful information when you are trying to
1.1       misho    3546:        optimize the performance of a particular pattern.
                   3547: 
1.1.1.2   misho    3548:        The use of callouts in a pattern makes it ineligible  for  optimization
1.1       misho    3549:        by  the  just-in-time  compiler.  Studying  such  a  pattern  with  the
                   3550:        PCRE_STUDY_JIT_COMPILE option always fails.
                   3551: 
                   3552: 
                   3553: MISSING CALLOUTS
                   3554: 
1.1.1.2   misho    3555:        You should be aware that, because of  optimizations  in  the  way  PCRE
                   3556:        matches  patterns  by  default,  callouts  sometimes do not happen. For
1.1       misho    3557:        example, if the pattern is
                   3558: 
                   3559:          ab(?C4)cd
                   3560: 
                   3561:        PCRE knows that any matching string must contain the letter "d". If the
1.1.1.2   misho    3562:        subject  string  is "abyz", the lack of "d" means that matching doesn't
                   3563:        ever start, and the callout is never  reached.  However,  with  "abyd",
1.1       misho    3564:        though the result is still no match, the callout is obeyed.
                   3565: 
1.1.1.2   misho    3566:        If  the pattern is studied, PCRE knows the minimum length of a matching
                   3567:        string, and will immediately give a "no match" return without  actually
                   3568:        running  a  match if the subject is not long enough, or, for unanchored
1.1       misho    3569:        patterns, if it has been scanned far enough.
                   3570: 
1.1.1.2   misho    3571:        You can disable these optimizations by passing the  PCRE_NO_START_OPTI-
                   3572:        MIZE  option  to the matching function, or by starting the pattern with
                   3573:        (*NO_START_OPT). This slows down the matching process, but does  ensure
                   3574:        that callouts such as the example above are obeyed.
1.1       misho    3575: 
                   3576: 
                   3577: THE CALLOUT INTERFACE
                   3578: 
                   3579:        During  matching, when PCRE reaches a callout point, the external func-
1.1.1.2   misho    3580:        tion defined by pcre_callout or pcre16_callout  is  called  (if  it  is
                   3581:        set).   This applies to both normal and DFA matching. The only argument
                   3582:        to the callout function is a pointer to a pcre_callout or pcre16_callout
                   3583:        block. These structures contain the following fields:
                   3584: 
                   3585:          int           version;
                   3586:          int           callout_number;
                   3587:          int          *offset_vector;
                   3588:          const char   *subject;           (8-bit version)
                   3589:          PCRE_SPTR16   subject;           (16-bit version)
                   3590:          int           subject_length;
                   3591:          int           start_match;
                   3592:          int           current_position;
                   3593:          int           capture_top;
                   3594:          int           capture_last;
                   3595:          void         *callout_data;
                   3596:          int           pattern_position;
                   3597:          int           next_item_length;
                   3598:          const unsigned char *mark;       (8-bit version)
                   3599:          const PCRE_UCHAR16  *mark;       (16-bit version)
1.1       misho    3600: 
                   3601:        The  version  field  is an integer containing the version number of the
                   3602:        block format. The initial version was 0; the current version is 2.  The
                   3603:        version  number  will  change  again in future if additional fields are
                   3604:        added, but the intention is never to remove any of the existing fields.
                   3605: 
                   3606:        The callout_number field contains the number of the  callout,  as  com-
                   3607:        piled  into  the pattern (that is, the number after ?C for manual call-
                   3608:        outs, and 255 for automatically generated callouts).
                   3609: 
                   3610:        The offset_vector field is a pointer to the vector of offsets that  was
1.1.1.2   misho    3611:        passed  by  the  caller  to  the matching function. When pcre_exec() or
                   3612:        pcre16_exec() is used, the contents  can  be  inspected,  in  order  to
                   3613:        extract  substrings  that  have been matched so far, in the same way as
                   3614:        for extracting substrings after a match  has  completed.  For  the  DFA
                   3615:        matching functions, this field is not useful.
1.1       misho    3616: 
                   3617:        The subject and subject_length fields contain copies of the values that
1.1.1.2   misho    3618:        were passed to the matching function.
1.1       misho    3619: 
                   3620:        The start_match field normally contains the offset within  the  subject
                   3621:        at  which  the  current  match  attempt started. However, if the escape
                   3622:        sequence \K has been encountered, this value is changed to reflect  the
                   3623:        modified  starting  point.  If the pattern is not anchored, the callout
                   3624:        function may be called several times from the same point in the pattern
                   3625:        for different starting points in the subject.
                   3626: 
                   3627:        The  current_position  field  contains the offset within the subject of
                   3628:        the current match pointer.
                   3629: 
1.1.1.2   misho    3630:        When pcre_exec() or pcre16_exec()  is  used,  the  capture_top  field
                   3631:        contains one more than the number of the highest numbered captured sub-
                   3632:        string so far. If no substrings have been captured, the value  of  cap-
                   3633:        ture_top  is  one.  This  is always the case when the DFA functions are
                   3634:        used, because they do not support captured substrings.
1.1       misho    3635: 
                   3636:        The capture_last field contains the number of the  most  recently  cap-
                   3637:        tured  substring. If no substrings have been captured, its value is -1.
1.1.1.2   misho    3638:        This is always the case for the DFA matching functions.
1.1       misho    3639: 
1.1.1.2   misho    3640:        The callout_data field contains a value that is passed  to  a  matching
                   3641:        function  specifically so that it can be passed back in callouts. It is
                   3642:        passed in the callout_data field of a pcre_extra or  pcre16_extra  data
1.1       misho    3643:        structure.  If  no such data was passed, the value of callout_data in a
1.1.1.2   misho    3644:        callout block is NULL. There is a description of the pcre_extra  struc-
                   3645:        ture in the pcreapi documentation.
1.1       misho    3646: 
1.1.1.2   misho    3647:        The  pattern_position  field  is  present from version 1 of the callout
                   3648:        structure. It contains the offset to the next item to be matched in the
                   3649:        pattern string.
                   3650: 
                   3651:        The  next_item_length  field  is  present from version 1 of the callout
                   3652:        structure. It contains the length of the next item to be matched in the
                   3653:        pattern  string.  When  the callout immediately precedes an alternation
                   3654:        bar, a closing parenthesis, or the end of the pattern,  the  length  is
                   3655:        zero.  When  the callout precedes an opening parenthesis, the length is
                   3656:        that of the entire subpattern.
1.1       misho    3657: 
                   3658:        The pattern_position and next_item_length fields are intended  to  help
                   3659:        in  distinguishing between different automatic callouts, which all have
                   3660:        the same callout number. However, they are set for all callouts.
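
One possible way to use these two fields is to copy the text of the upcoming
pattern item, for example when logging automatic callouts. The sketch below
assumes that the caller still has the original pattern string available (the
callout block does not carry it, so it would have to be supplied separately,
perhaps through callout_data). The function name next_item_text is
hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Copy the pattern item that follows a callout into buf, given the
   pattern_position and next_item_length values from the callout block.
   Returns the number of characters copied (truncated to fit buf), or -1
   if the values do not lie within the pattern. Hypothetical helper. */
int next_item_text(const char *pattern, int pattern_position,
                   int next_item_length, char *buf, size_t bufsize)
{
    size_t patlen = strlen(pattern);
    size_t pos, len;

    if (bufsize == 0 || pattern_position < 0 || next_item_length < 0)
        return -1;
    pos = (size_t)pattern_position;
    len = (size_t)next_item_length;
    if (pos > patlen || len > patlen - pos)
        return -1;                    /* item would run past the pattern */
    if (len >= bufsize)
        len = bufsize - 1;            /* leave room for the terminator */
    memcpy(buf, pattern + pos, len);
    buf[len] = '\0';
    return (int)len;
}
```

With the pattern A(\d{2}|--), a position of 2 and a length of 5 would yield
the item \d{2}, while a callout just before the alternation bar reports a
length of zero and so yields an empty string.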
                   3661: 
1.1.1.2   misho    3662:        The mark field is present from version 2 of the callout  structure.  In
                   3663:        callouts from pcre_exec() or pcre16_exec() it contains a pointer to the
                   3664:        zero-terminated name of the most recently passed (*MARK), (*PRUNE),  or
                   3665:        (*THEN)  item  in the match, or NULL if no such items have been passed.
                   3666:        Instances of (*PRUNE) or (*THEN) without a name  do  not  obliterate  a
                   3667:        previous  (*MARK).  In  callouts  from  the DFA matching functions this
                   3668:        field always contains NULL.
1.1       misho    3669: 
                   3670: 
                   3671: RETURN VALUES
                   3672: 
                   3673:        The external callout function returns an integer to PCRE. If the  value
                   3674:        is  zero,  matching  proceeds  as  normal. If the value is greater than
                   3675:        zero, matching fails at the current point, but  the  testing  of  other
                   3676:        matching possibilities goes ahead, just as if a lookahead assertion had
1.1.1.2   misho    3677:        failed. If the value is less than zero, the match is abandoned, and the
                   3678:        matching function returns the negative value.
1.1       misho    3679: 
                   3680:        Negative   values   should   normally   be   chosen  from  the  set  of
                   3681:        PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
                   3682:        dard  "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT is
                   3683:        reserved for use by callout functions; it will never be  used  by  PCRE
                   3684:        itself.
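
The three kinds of return value can be illustrated with a small callout
function. This is only a sketch: so that it builds without pcre.h, it uses a
stand-in structure (demo_callout_block) holding a subset of the fields
described earlier, a hypothetical demo_policy structure passed through
callout_data, and a local constant DEMO_ERROR_CALLOUT that is assumed to have
the same value as PCRE_ERROR_CALLOUT (-9). Real code would take a
pcre_callout_block pointer and use the constants from pcre.h.

```c
#include <stddef.h>

/* Stand-in for pcre_callout_block with only the fields used here. */
typedef struct {
    int   version;
    int   callout_number;
    int   subject_length;
    int   current_position;
    void *callout_data;
} demo_callout_block;

/* Hypothetical per-match settings supplied by the application. */
typedef struct {
    int max_position;   /* reject match paths that go beyond this offset */
    int calls_left;     /* abandon the match after this many callouts */
} demo_policy;

#define DEMO_ERROR_CALLOUT (-9)   /* assumed value of PCRE_ERROR_CALLOUT */

int demo_callout(demo_callout_block *block)
{
    demo_policy *policy = block->callout_data;

    if (policy == NULL)
        return 0;                       /* zero: matching proceeds */
    if (policy->calls_left-- <= 0)
        return DEMO_ERROR_CALLOUT;      /* negative: match is abandoned */
    if (block->current_position > policy->max_position)
        return 1;                       /* positive: fail here, backtrack */
    return 0;
}
```

In a real program, a function of this shape would be assigned to the global
variable pcre_callout, and the policy pointer would be supplied in the
callout_data field of the pcre_extra block passed to the matching function.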
                   3685: 
                   3686: 
                   3687: AUTHOR
                   3688: 
                   3689:        Philip Hazel
                   3690:        University Computing Service
                   3691:        Cambridge CB2 3QH, England.
                   3692: 
                   3693: 
                   3694: REVISION
                   3695: 
1.1.1.2   misho    3696:        Last updated: 08 January 2012
                   3697:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    3698: ------------------------------------------------------------------------------
                   3699: 
                   3700: 
                   3701: PCRECOMPAT(3)                                                    PCRECOMPAT(3)
                   3702: 
                   3703: 
                   3704: NAME
                   3705:        PCRE - Perl-compatible regular expressions
                   3706: 
                   3707: 
                   3708: DIFFERENCES BETWEEN PCRE AND PERL
                   3709: 
                   3710:        This  document describes the differences in the ways that PCRE and Perl
                   3711:        handle regular expressions. The differences  described  here  are  with
                   3712:        respect to Perl versions 5.10 and above.
                   3713: 
1.1.1.2   misho    3714:        1. PCRE has only a subset of Perl's Unicode support. Details of what it
                   3715:        does have are given in the pcreunicode page.
1.1       misho    3716: 
                   3717:        2. PCRE allows repeat quantifiers only on parenthesized assertions, but
                   3718:        they  do  not mean what you might think. For example, (?!a){3} does not
                   3719:        assert that the next three characters are not "a". It just asserts that
                   3720:        the next character is not "a" three times (in principle: PCRE optimizes
                   3721:        this to run the assertion just once). Perl allows repeat quantifiers on
                   3722:        other assertions such as \b, but these do not seem to have any use.
                   3723: 
                   3724:        3.  Capturing  subpatterns  that occur inside negative lookahead asser-
                   3725:        tions are counted, but their entries in the offsets  vector  are  never
                   3726:        set.  Perl sets its numerical variables from any such patterns that are
                   3727:        matched before the assertion fails to match something (thereby succeed-
                   3728:        ing),  but  only  if the negative lookahead assertion contains just one
                   3729:        branch.
                   3730: 
                   3731:        4. Though binary zero characters are supported in the  subject  string,
                   3732:        they are not allowed in a pattern string because it is passed as a nor-
                   3733:        mal C string, terminated by zero. The escape sequence \0 can be used in
                   3734:        the pattern to represent a binary zero.
                   3735: 
                   3736:        5.  The  following Perl escape sequences are not supported: \l, \u, \L,
                   3737:        \U, and \N when followed by a character name or Unicode value.  (\N  on
                   3738:        its own, matching a non-newline character, is supported.) In fact these
                   3739:        are implemented by Perl's general string-handling and are not  part  of
                   3740:        its  pattern  matching engine. If any of these are encountered by PCRE,
                   3741:        an error is generated by default. However, if the  PCRE_JAVASCRIPT_COM-
                   3742:        PAT  option  is set, \U and \u are interpreted as JavaScript interprets
                   3743:        them.
                   3744: 
                   3745:        6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
                   3746:        is  built  with Unicode character property support. The properties that
                   3747:        can be tested with \p and \P are limited to the general category  prop-
                   3748:        erties  such  as  Lu and Nd, script names such as Greek or Han, and the
                   3749:        derived properties Any and L&. PCRE does  support  the  Cs  (surrogate)
                   3750:        property,  which  Perl  does  not; the Perl documentation says "Because
                   3751:        Perl hides the need for the user to understand the internal representa-
                   3752:        tion  of Unicode characters, there is no need to implement the somewhat
                   3753:        messy concept of surrogates."
                   3754: 
                   3755:        7. PCRE implements a simpler version of \X than Perl, which changed  to
                   3756:        make  \X  match what Unicode calls an "extended grapheme cluster". This
                   3757:        is more complicated than an extended Unicode sequence,  which  is  what
                   3758:        PCRE matches.
                   3759: 
                   3760:        8. PCRE does support the \Q...\E escape for quoting substrings. Charac-
                   3761:        ters in between are treated as literals.  This  is  slightly  different
                   3762:        from  Perl  in  that  $  and  @ are also handled as literals inside the
                   3763:        quotes. In Perl, they cause variable interpolation (but of course  PCRE
                   3764:        does not have variables). Note the following examples:
                   3765: 
                   3766:            Pattern            PCRE matches      Perl matches
                   3767: 
                   3768:            \Qabc$xyz\E        abc$xyz           abc followed by the
                   3769:                                                   contents of $xyz
                   3770:            \Qabc\$xyz\E       abc\$xyz          abc\$xyz
                   3771:            \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
                   3772: 
                   3773:        The  \Q...\E  sequence  is recognized both inside and outside character
                   3774:        classes.
                   3775: 
                   3776:        9. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
                   3777:        constructions.  However,  there is support for recursive patterns. This
                   3778:        is not available in Perl 5.8, but it is in Perl 5.10.  Also,  the  PCRE
                   3779:        "callout"  feature allows an external function to be called during pat-
                   3780:        tern matching. See the pcrecallout documentation for details.
                   3781: 
                   3782:        10. Subpatterns that are called as subroutines (whether or  not  recur-
                   3783:        sively)  are  always  treated  as  atomic  groups in PCRE. This is like
                   3784:        Python, but unlike Perl. Captured values that are set outside a
                   3785:        subroutine call can be referenced from inside in PCRE, but not in Perl.
                   3786:        There is a discussion that explains these differences in more detail in
                   3787:        the section on recursion differences from Perl in the pcrepattern page.
                   3788: 
1.1.1.3 ! misho    3789:        11.  If  any of the backtracking control verbs are used in an assertion
        !          3790:        or in a subpattern that is called  as  a  subroutine  (whether  or  not
        !          3791:        recursively),  their effect is confined to that subpattern; it does not
        !          3792:        extend to the surrounding pattern. This is not always the case in Perl.
        !          3793:        In  particular,  if  (*THEN)  is present in a group that is called as a
        !          3794:        subroutine, its action is limited to that group, even if the group does
        !          3795:        not  contain any | characters. There is one exception to this: the name
        !          3796:        from a (*MARK), (*PRUNE), or (*THEN) that is encountered in a successful
        !          3797:        positive assertion is passed back when a match succeeds (compare
        !          3798:        capturing parentheses in assertions). Note that  such  subpatterns  are
        !          3799:        processed as anchored at the point where they are tested.
1.1       misho    3800: 
                   3801:        12.  There are some differences that are concerned with the settings of
                   3802:        captured strings when part of  a  pattern  is  repeated.  For  example,
                   3803:        matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
                   3804:        unset, but in PCRE it is set to "b".
                   3805: 
                   3806:        13. PCRE's handling of duplicate subpattern numbers and duplicate  sub-
                   3807:        pattern names is not as general as Perl's. This is a consequence of the
                   3808:        fact that PCRE works internally just with numbers, using an external
                   3809:        table to translate between numbers and names. In particular, a pattern
                   3810:        such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have
                   3811:        the  same  number  but different names, is not supported, and causes an
                   3812:        error at compile time. If it were allowed, it would not be possible  to
                   3813:        distinguish  which  parentheses matched, because both names map to cap-
                   3814:        turing subpattern number 1. To avoid this confusing situation, an error
                   3815:        is given at compile time.
                   3816: 
                   3817:        14.  Perl  recognizes  comments  in some places that PCRE does not, for
                   3818:        example, between the ( and ? at the start of a subpattern.  If  the  /x
1.1.1.3 ! misho    3819:        modifier is set, Perl allows white space between ( and ? but PCRE never
1.1       misho    3820:        does, even if the PCRE_EXTENDED option is set.
                   3821: 
                   3822:        15. PCRE provides some extensions to the Perl regular expression facil-
                   3823:        ities.   Perl  5.10  includes new features that are not in earlier ver-
                   3824:        sions of Perl, some of which (such as named parentheses) have  been  in
                   3825:        PCRE for some time. This list is with respect to Perl 5.10:
                   3826: 
                   3827:        (a)  Although  lookbehind  assertions  in  PCRE must match fixed length
                   3828:        strings, each alternative branch of a lookbehind assertion can match  a
                   3829:        different  length  of  string.  Perl requires them all to have the same
                   3830:        length.
                   3831: 
                   3832:        (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the  $
                   3833:        meta-character matches only at the very end of the string.
                   3834: 
                   3835:        (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
                   3836:        cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
                   3837:        ignored.  (Perl can be made to issue a warning.)
                   3838: 
                   3839:        (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-
                   3840:        fiers is inverted, that is, by default they are not greedy, but if fol-
                   3841:        lowed by a question mark they are.
                   3842: 
                   3843:        (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
                   3844:        tried only at the first matching position in the subject string.
                   3845: 
                   3846:        (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
                   3847:        and  PCRE_NO_AUTO_CAPTURE  options for pcre_exec() have no Perl equiva-
                   3848:        lents.
                   3849: 
                   3850:        (g) The \R escape sequence can be restricted to match only CR,  LF,  or
                   3851:        CRLF by the PCRE_BSR_ANYCRLF option.
                   3852: 
                   3853:        (h) The callout facility is PCRE-specific.
                   3854: 
                   3855:        (i) The partial matching facility is PCRE-specific.
                   3856: 
                   3857:        (j) Patterns compiled by PCRE can be saved and re-used at a later time,
                   3858:        even on different hosts that have the other endianness.  However,  this
                   3859:        does not apply to optimized data created by the just-in-time compiler.
                   3860: 
1.1.1.2   misho    3861:        (k)   The   alternative   matching   functions   (pcre_dfa_exec()   and
                   3862:        pcre16_dfa_exec()) match in a different way and are  not  Perl-compati-
                   3863:        ble.
1.1       misho    3864: 
1.1.1.2   misho    3865:        (l)  PCRE  recognizes some special sequences such as (*CR) at the start
1.1       misho    3866:        of a pattern that set overall options that cannot be changed within the
                   3867:        pattern.
                   3868: 
                   3869: 
                   3870: AUTHOR
                   3871: 
                   3872:        Philip Hazel
                   3873:        University Computing Service
                   3874:        Cambridge CB2 3QH, England.
                   3875: 
                   3876: 
                   3877: REVISION
                   3878: 
1.1.1.3 ! misho    3879:        Last updated: 01 June 2012
1.1.1.2   misho    3880:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    3881: ------------------------------------------------------------------------------
                   3882: 
                   3883: 
                   3884: PCREPATTERN(3)                                                  PCREPATTERN(3)
                   3885: 
                   3886: 
                   3887: NAME
                   3888:        PCRE - Perl-compatible regular expressions
                   3889: 
                   3890: 
                   3891: PCRE REGULAR EXPRESSION DETAILS
                   3892: 
                   3893:        The  syntax and semantics of the regular expressions that are supported
                   3894:        by PCRE are described in detail below. There is a quick-reference  syn-
                   3895:        tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
                   3896:        semantics as closely as it can. PCRE  also  supports  some  alternative
                   3897:        regular  expression  syntax (which does not conflict with the Perl syn-
                   3898:        tax) in order to provide some compatibility with regular expressions in
                   3899:        Python, .NET, and Oniguruma.
                   3900: 
                   3901:        Perl's  regular expressions are described in its own documentation, and
                   3902:        regular expressions in general are covered in a number of  books,  some
                   3903:        of  which  have  copious  examples. Jeffrey Friedl's "Mastering Regular
                   3904:        Expressions", published by  O'Reilly,  covers  regular  expressions  in
                   3905:        great  detail.  This  description  of  PCRE's  regular  expressions  is
                   3906:        intended as reference material.
                   3907: 
                   3908:        The original operation of PCRE was on strings of  one-byte  characters.
1.1.1.2   misho    3909:        However,  there  is  now also support for UTF-8 strings in the original
                   3910:        library, and a second library that supports 16-bit and UTF-16 character
                   3911:        strings. To use these features, PCRE must be built to include appropri-
                   3912:        ate support. When using UTF strings you must either call the  compiling
                   3913:        function  with  the PCRE_UTF8 or PCRE_UTF16 option, or the pattern must
                   3914:        start with one of these special sequences:
1.1       misho    3915: 
                   3916:          (*UTF8)
1.1.1.2   misho    3917:          (*UTF16)
1.1       misho    3918: 
1.1.1.2   misho    3919:        Starting a pattern with such a sequence is equivalent  to  setting  the
                   3920:        relevant option. This feature is not Perl-compatible. How setting a UTF
                   3921:        mode affects pattern matching is mentioned  in  several  places  below.
                   3922:        There is also a summary of features in the pcreunicode page.
1.1       misho    3923: 
1.1.1.2   misho    3924:        Another  special  sequence that may appear at the start of a pattern or
                   3925:        in combination with (*UTF8) or (*UTF16) is:
1.1       misho    3926: 
                   3927:          (*UCP)
                   3928: 
1.1.1.2   misho    3929:        This has the same effect as setting  the  PCRE_UCP  option:  it  causes
                   3930:        sequences  such  as  \d  and  \w to use Unicode properties to determine
1.1       misho    3931:        character types, instead of recognizing only characters with codes less
                   3932:        than 128 via a lookup table.
                   3933: 
1.1.1.2   misho    3934:        If  a  pattern  starts  with (*NO_START_OPT), it has the same effect as
1.1       misho    3935:        setting the PCRE_NO_START_OPTIMIZE option either at compile or matching
                   3936:        time. There are also some more of these special sequences that are con-
                   3937:        cerned with the handling of newlines; they are described below.
                   3938: 
1.1.1.2   misho    3939:        The remainder of this document discusses the  patterns  that  are  sup-
                   3940:        ported  by  PCRE  when one of its main  matching functions, pcre_exec()
                   3941:        (8-bit) or pcre16_exec() (16-bit), is used. PCRE also  has  alternative
                   3942:        matching  functions, pcre_dfa_exec() and pcre16_dfa_exec(), which match
                   3943:        using a different algorithm that is not Perl-compatible.  Some  of  the
                   3944:        features  discussed  below are not available when DFA matching is used.
                   3945:        The advantages and disadvantages of the alternative functions, and  how
                   3946:        they  differ from the normal functions, are discussed in the pcrematch-
                   3947:        ing page.
1.1       misho    3948: 
                   3949: 
                   3950: NEWLINE CONVENTIONS
                   3951: 
                   3952:        PCRE supports five different conventions for indicating line breaks  in
                   3953:        strings:  a  single  CR (carriage return) character, a single LF (line-
                   3954:        feed) character, the two-character sequence CRLF, any of the three pre-
                   3955:        ceding,  or  any Unicode newline sequence. The pcreapi page has further
                   3956:        discussion about newlines, and shows how to set the newline  convention
                   3957:        in the options arguments for the compiling and matching functions.
                   3958: 
                   3959:        It  is also possible to specify a newline convention by starting a pat-
                   3960:        tern string with one of the following five sequences:
                   3961: 
                   3962:          (*CR)        carriage return
                   3963:          (*LF)        linefeed
                   3964:          (*CRLF)      carriage return, followed by linefeed
                   3965:          (*ANYCRLF)   any of the three above
                   3966:          (*ANY)       all Unicode newline sequences
                   3967: 
1.1.1.2   misho    3968:        These override the default and the options given to the compiling func-
                   3969:        tion.  For  example,  on  a Unix system where LF is the default newline
                   3970:        sequence, the pattern
1.1       misho    3971: 
                   3972:          (*CR)a.b
                   3973: 
                   3974:        changes the convention to CR. That pattern matches "a\nb" because LF is
                   3975:        no  longer  a  newline. Note that these special settings, which are not
                   3976:        Perl-compatible, are recognized only at the very start  of  a  pattern,
                   3977:        and  that  they  must  be  in  upper  case. If more than one of them is
                   3978:        present, the last one is used.
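
The recognition rule described above (leading sequences only, upper case
only, last one wins) can be modelled in a few lines of C. This is an
illustrative sketch rather than PCRE's own parser: the names newline_kind and
leading_newline_setting are hypothetical, and the sketch deliberately ignores
the other start-of-pattern sequences such as (*UTF8) and (*UCP).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration; these are not part of the PCRE API. */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY } newline_kind;

/* Scan newline-convention sequences at the very start of a pattern.
   Returns the convention in force (dflt if no sequence is present) and
   stores the number of pattern bytes consumed in *skipped. */
newline_kind leading_newline_setting(const char *pattern, newline_kind dflt,
                                     size_t *skipped)
{
    static const struct { const char *text; newline_kind kind; } table[] = {
        { "(*CR)", NL_CR }, { "(*LF)", NL_LF }, { "(*CRLF)", NL_CRLF },
        { "(*ANYCRLF)", NL_ANYCRLF }, { "(*ANY)", NL_ANY }
    };
    newline_kind result = dflt;
    size_t pos = 0;
    int found = 1;

    while (found) {
        found = 0;
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++) {
            size_t len = strlen(table[i].text);
            /* strncmp is case-sensitive, so "(*cr)" is not recognized. */
            if (strncmp(pattern + pos, table[i].text, len) == 0) {
                result = table[i].kind;  /* a later setting replaces an earlier one */
                pos += len;
                found = 1;
                break;
            }
        }
    }
    *skipped = pos;
    return result;
}
```

With a default of NL_LF, the pattern "(*CR)a.b" yields NL_CR with five bytes
skipped, whereas "a(*CR)" leaves the default in force because the sequence is
not at the very start.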
                   3979: 
                   3980:        The newline convention affects the interpretation of the dot  metachar-
                   3981:        acter  when  PCRE_DOTALL is not set, and also the behaviour of \N. How-
                   3982:        ever, it does not affect  what  the  \R  escape  sequence  matches.  By
                   3983:        default,  this is any Unicode newline sequence, for Perl compatibility.
                   3984:        However, this can be changed; see the description of \R in the  section
                   3985:        entitled  "Newline sequences" below. A change of \R setting can be com-
                   3986:        bined with a change of newline convention.
                   3987: 
                   3988: 
                   3989: CHARACTERS AND METACHARACTERS
                   3990: 
                   3991:        A regular expression is a pattern that is  matched  against  a  subject
                   3992:        string  from  left  to right. Most characters stand for themselves in a
                   3993:        pattern, and match the corresponding characters in the  subject.  As  a
                   3994:        trivial example, the pattern
                   3995: 
                   3996:          The quick brown fox
                   3997: 
                   3998:        matches a portion of a subject string that is identical to itself. When
                   3999:        caseless matching is specified (the PCRE_CASELESS option), letters  are
1.1.1.2   misho    4000:        matched  independently  of case. In a UTF mode, PCRE always understands
1.1       misho    4001:        the concept of case for characters whose values are less than  128,  so
                   4002:        caseless  matching  is always possible. For characters with higher val-
                   4003:        ues, the concept of case is supported if PCRE is compiled with  Unicode
                   4004:        property  support,  but  not  otherwise.   If  you want to use caseless
                   4005:        matching for characters 128 and above, you must  ensure  that  PCRE  is
1.1.1.2   misho    4006:        compiled with Unicode property support as well as with UTF support.
1.1       misho    4007: 
                   4008:        The  power  of  regular  expressions  comes from the ability to include
                   4009:        alternatives and repetitions in the pattern. These are encoded  in  the
                   4010:        pattern by the use of metacharacters, which do not stand for themselves
                   4011:        but instead are interpreted in some special way.
                   4012: 
                   4013:        There are two different sets of metacharacters: those that  are  recog-
                   4014:        nized  anywhere in the pattern except within square brackets, and those
                   4015:        that are recognized within square brackets.  Outside  square  brackets,
                   4016:        the metacharacters are as follows:
                   4017: 
                   4018:          \      general escape character with several uses
                   4019:          ^      assert start of string (or line, in multiline mode)
                   4020:          $      assert end of string (or line, in multiline mode)
                   4021:          .      match any character except newline (by default)
                   4022:          [      start character class definition
                   4023:          |      start of alternative branch
                   4024:          (      start subpattern
                   4025:          )      end subpattern
                   4026:          ?      extends the meaning of (
                   4027:                 also 0 or 1 quantifier
                   4028:                 also quantifier minimizer
                   4029:          *      0 or more quantifier
                   4030:          +      1 or more quantifier
                   4031:                 also "possessive quantifier"
                   4032:          {      start min/max quantifier
                   4033: 
                   4034:        Part  of  a  pattern  that is in square brackets is called a "character
                   4035:        class". In a character class the only metacharacters are:
                   4036: 
                   4037:          \      general escape character
                   4038:          ^      negate the class, but only if the first character
                   4039:          -      indicates character range
                   4040:          [      POSIX character class (only if followed by POSIX
                   4041:                   syntax)
                   4042:          ]      terminates the character class
                   4043: 
                   4044:        The following sections describe the use of each of the metacharacters.
                   4045: 
                   4046: 
                   4047: BACKSLASH
                   4048: 
                   4049:        The backslash character has several uses. Firstly, if it is followed by
                   4050:        a character that is not a number or a letter, it takes away any special
                   4051:        meaning that character may have. This use of  backslash  as  an  escape
                   4052:        character applies both inside and outside character classes.
                   4053: 
                   4054:        For  example,  if  you want to match a * character, you write \* in the
                   4055:        pattern.  This escaping action applies whether  or  not  the  following
                   4056:        character  would  otherwise be interpreted as a metacharacter, so it is
                   4057:        always safe to precede a non-alphanumeric  with  backslash  to  specify
                   4058:        that  it stands for itself. In particular, if you want to match a back-
                   4059:        slash, you write \\.
                   4060: 
1.1.1.2   misho    4061:        In a UTF mode, only ASCII numbers and letters have any special  meaning
1.1       misho    4062:        after  a  backslash.  All  other characters (in particular, those whose
                   4063:        codepoints are greater than 127) are treated as literals.
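
Because a backslash before any non-alphanumeric is always safe, a literal
string can be made pattern-safe by escaping every ASCII character that is not
a letter or digit. The following is a minimal sketch under that assumption;
quote_literal is a hypothetical helper, not a PCRE function, and it assumes an
ASCII-based build. Bytes above 127 are copied unchanged.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE. Copies src to dst, preceding every
   ASCII character that is not a letter or digit with a backslash. Bytes
   above 127 are copied unchanged. Returns 0 on success, or -1 if dst is too
   small to hold the result and its terminating zero. */
int quote_literal(const char *src, char *dst, size_t dstsize)
{
    size_t out = 0;

    if (dstsize == 0) return -1;
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        int alnum = (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
                    (c >= 'a' && c <= 'z');
        size_t need = (alnum || c > 127) ? 1 : 2;

        if (out + need + 1 > dstsize) return -1;
        if (need == 2) dst[out++] = '\\';
        dst[out++] = (char)c;
    }
    dst[out] = '\0';
    return 0;
}
```

For example, the input a*b produces a\*b, and a single backslash produces \\.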
                   4064: 
1.1.1.3 ! misho    4065:        If a pattern is compiled with the PCRE_EXTENDED option, white space  in
1.1       misho    4066:        the  pattern (other than in a character class) and characters between a
                   4067:        # outside a character class and the next newline are ignored. An escap-
1.1.1.3 ! misho    4068:        ing  backslash  can  be used to include a white space or # character as
1.1       misho    4069:        part of the pattern.
                   4070: 
                   4071:        If you want to remove the special meaning from a  sequence  of  charac-
                   4072:        ters,  you can do so by putting them between \Q and \E. This is differ-
                   4073:        ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
                   4074:        sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
                   4075:        tion. Note the following examples:
                   4076: 
                   4077:          Pattern            PCRE matches   Perl matches
                   4078: 
                   4079:          \Qabc$xyz\E        abc$xyz        abc followed by the
                   4080:                                              contents of $xyz
                   4081:          \Qabc\$xyz\E       abc\$xyz       abc\$xyz
                   4082:          \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
                   4083: 
                   4084:        The \Q...\E sequence is recognized both inside  and  outside  character
                   4085:        classes.   An  isolated \E that is not preceded by \Q is ignored. If \Q
                   4086:        is not followed by \E later in the pattern, the literal  interpretation
                   4087:        continues  to  the  end  of  the pattern (that is, \E is assumed at the
                   4088:        end). If the isolated \Q is inside a character class,  this  causes  an
                   4089:        error, because the character class is not terminated.
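
When a literal is wrapped in \Q...\E by a program, the one thing that needs
care is a \E inside the literal itself, which would end the quoted run early.
A common workaround, sketched here with a hypothetical helper named
quote_with_qe (not a PCRE function), is to close the run, emit an escaped
backslash and a plain E, and reopen the run.

```c
#include <stddef.h>
#include <string.h>

/* Append n bytes of s to dst, keeping it zero-terminated.
   Returns -1 if the result would not fit in dstsize bytes. */
static int append(char *dst, size_t dstsize, size_t *out,
                  const char *s, size_t n)
{
    if (*out + n + 1 > dstsize) return -1;
    memcpy(dst + *out, s, n);
    *out += n;
    dst[*out] = '\0';
    return 0;
}

/* Hypothetical helper, not part of PCRE. Wraps src in \Q...\E. Each "\E"
   found in src is written as \E\\E\Q so that it is matched literally. */
int quote_with_qe(const char *src, char *dst, size_t dstsize)
{
    static const char split[] = "\\E\\\\E\\Q";
    size_t out = 0;

    if (append(dst, dstsize, &out, "\\Q", 2) != 0) return -1;
    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == 'E') {
            if (append(dst, dstsize, &out, split, sizeof split - 1) != 0)
                return -1;
            src += 2;
        } else {
            if (append(dst, dstsize, &out, src, 1) != 0) return -1;
            src++;
        }
    }
    return append(dst, dstsize, &out, "\\E", 2);
}
```

The input abc$xyz becomes \Qabc$xyz\E, which PCRE matches literally as shown
in the table above; the input a\Eb becomes \Qa\E\\E\Qb\E.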
                   4090: 
                   4091:    Non-printing characters
                   4092: 
                   4093:        A second use of backslash provides a way of encoding non-printing char-
                   4094:        acters in patterns in a visible manner. There is no restriction on  the
                   4095:        appearance  of non-printing characters, apart from the binary zero that
                   4096:        terminates a pattern, but when a pattern  is  being  prepared  by  text
                   4097:        editing,  it  is  often  easier  to  use  one  of  the following escape
                   4098:        sequences than the binary character it represents:
                   4099: 
                   4100:          \a        alarm, that is, the BEL character (hex 07)
                   4101:          \cx       "control-x", where x is any ASCII character
                   4102:          \e        escape (hex 1B)
1.1.1.3 ! misho    4103:          \f        form feed (hex 0C)
1.1       misho    4104:          \n        linefeed (hex 0A)
                   4105:          \r        carriage return (hex 0D)
                   4106:          \t        tab (hex 09)
                   4107:          \ddd      character with octal code ddd, or back reference
                   4108:          \xhh      character with hex code hh
                   4109:          \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
                   4110:          \uhhhh    character with hex code hhhh (JavaScript mode only)
                   4111: 
                   4112:        The precise effect of \cx is as follows: if x is a lower  case  letter,
                   4113:        it  is converted to upper case. Then bit 6 of the character (hex 40) is
                   4114:        inverted.  Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({
                   4115:        is  7B),  while  \c; becomes hex 7B (; is 3B). If the byte following \c
                   4116:        has a value greater than 127, a compile-time error occurs.  This  locks
1.1.1.2   misho    4117:        out non-ASCII characters in all modes. (When PCRE is compiled in EBCDIC
                   4118:        mode, all byte values are valid. A lower case letter  is  converted  to
                   4119:        upper case, and then the 0xc0 bits are flipped.)
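
The \cx rule for an ASCII build can be written out directly. The function
name control_escape is hypothetical and the code is only a model of the rule
stated above, not code taken from PCRE; the EBCDIC variant is not covered.

```c
/* Hypothetical helper modelling \cx on an ASCII build; not a PCRE function.
   Returns the resulting character code, or -1 where PCRE would report a
   compile-time error (byte value above 127). */
int control_escape(unsigned char x)
{
    if (x > 127) return -1;
    if (x >= 'a' && x <= 'z')
        x = (unsigned char)(x - 'a' + 'A');   /* lower case to upper case */
    return x ^ 0x40;                          /* invert bit 6 */
}
```

This reproduces the three cases given in the text: z gives hex 1A, { gives
hex 3B, and ; gives hex 7B.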
1.1       misho    4120: 
                   4121:        By  default,  after  \x,  from  zero to two hexadecimal digits are read
                   4122:        (letters can be in upper or lower case). Any number of hexadecimal dig-
1.1.1.2   misho    4123:        its may appear between \x{ and }, but the character code is constrained
                   4124:        as follows:
                   4125: 
                   4126:          8-bit non-UTF mode    less than 0x100
                   4127:          8-bit UTF-8 mode      no greater than 0x10ffff and a valid codepoint
                   4128:          16-bit non-UTF mode   less than 0x10000
                   4129:          16-bit UTF-16 mode    no greater than 0x10ffff and a valid codepoint
1.1       misho    4130: 
1.1.1.2   misho    4131:        Invalid Unicode codepoints are the range  0xd800  to  0xdfff  (the  so-
                   4132:        called "surrogate" codepoints).
                   4133: 
                   4134:        If  characters  other than hexadecimal digits appear between \x{ and },
1.1       misho    4135:        or if there is no terminating }, this form of escape is not recognized.
1.1.1.2   misho    4136:        Instead,  the  initial  \x  will  be interpreted as a basic hexadecimal
                   4137:        escape, with no following digits, giving a  character  whose  value  is
1.1       misho    4138:        zero.
                   4139: 
1.1.1.2   misho    4140:        If  the  PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
                   4141:        is as just described only when it is followed by two  hexadecimal  dig-
                   4142:        its.   Otherwise,  it  matches  a  literal "x" character. In JavaScript
1.1       misho    4143:        mode, support for code points greater than 256 is provided by \u, which
1.1.1.2   misho    4144:        must  be  followed  by  four hexadecimal digits; otherwise it matches a
1.1.1.3 ! misho    4145:        literal "u" character.  Character codes specified by \u  in  JavaScript
        !          4146:        mode  are  constrained in the same way as those specified by \x in non-
        !          4147:        JavaScript mode.
1.1       misho    4148: 
                   4149:        Characters whose value is less than 256 can be defined by either of the
1.1.1.2   misho    4150:        two  syntaxes for \x (or by \u in JavaScript mode). There is no differ-
1.1       misho    4151:        ence in the way they are handled. For example, \xdc is exactly the same
                   4152:        as \x{dc} (or \u00dc in JavaScript mode).
                   4153: 
1.1.1.2   misho    4154:        After  \0  up  to two further octal digits are read. If there are fewer
                   4155:        than two digits, just  those  that  are  present  are  used.  Thus  the
1.1       misho    4156:        sequence \0\x\07 specifies two binary zeros followed by a BEL character
1.1.1.2   misho    4157:        (code value 7). Make sure you supply two digits after the initial  zero
1.1       misho    4158:        if the pattern character that follows is itself an octal digit.
                   4159: 
                   4160:        The handling of a backslash followed by a digit other than 0 is compli-
                   4161:        cated.  Outside a character class, PCRE reads it and any following dig-
1.1.1.2   misho    4162:        its  as  a  decimal  number. If the number is less than 10, or if there
1.1       misho    4163:        have been at least that many previous capturing left parentheses in the
1.1.1.2   misho    4164:        expression,  the  entire  sequence  is  taken  as  a  back reference. A
                   4165:        description of how this works is given later, following the  discussion
1.1       misho    4166:        of parenthesized subpatterns.
                   4167: 
1.1.1.2   misho    4168:        Inside  a  character  class, or if the decimal number is greater than 9
                   4169:        and there have not been that many capturing subpatterns, PCRE  re-reads
1.1       misho    4170:        up to three octal digits following the backslash, and uses them to gen-
1.1.1.2   misho    4171:        erate a data character. Any subsequent digits stand for themselves. The
                   4172:        value  of  the  character  is constrained in the same way as characters
                   4173:        specified in hexadecimal.  For example:
1.1       misho    4174: 
                   4175:          \040   is another way of writing a space
                   4176:          \40    is the same, provided there are fewer than 40
                   4177:                    previous capturing subpatterns
                   4178:          \7     is always a back reference
                   4179:          \11    might be a back reference, or another way of
                   4180:                    writing a tab
                   4181:          \011   is always a tab
                   4182:          \0113  is a tab followed by the character "3"
                   4183:          \113   might be a back reference, otherwise the
                   4184:                    character with octal code 113
                   4185:          \377   might be a back reference, otherwise
1.1.1.2   misho    4186:                    the value 255 (decimal)
1.1       misho    4187:          \81    is either a back reference, or a binary zero
                   4188:                    followed by the two characters "8" and "1"
                   4189: 
                   4190:        Note that octal values of 100 or greater must not be  introduced  by  a
                   4191:        leading zero, because no more than three octal digits are ever read.
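
The two-pass reading of a backslash followed by a non-zero digit, outside a
character class, can be modelled as below. This is a sketch of the rule as
documented here, not PCRE's source: the type digit_escape and the function
classify_digit_escape are hypothetical names, the argument is the text that
follows the backslash (assumed to begin with a digit from 1 to 9), and the
limits on the resulting character value are not checked.

```c
#include <stddef.h>

typedef struct {
    int is_backref;   /* 1: back reference, 0: data character */
    int value;        /* group number, or character code */
    size_t consumed;  /* number of digits that belong to the escape */
} digit_escape;

/* digits: the text after the backslash, starting with a digit 1-9.
   groups_so_far: capturing left parentheses seen earlier in the pattern. */
digit_escape classify_digit_escape(const char *digits, int groups_so_far)
{
    digit_escape r = { 0, 0, 0 };
    size_t n = 0;
    long number = 0;

    /* First reading: all following digits as one decimal number. The value
       stops growing once it is far beyond any plausible group count. */
    while (digits[n] >= '0' && digits[n] <= '9') {
        if (number < 100000) number = number * 10 + (digits[n] - '0');
        n++;
    }
    if (n > 0 && (number < 10 || number <= groups_so_far)) {
        r.is_backref = 1;
        r.value = (int)number;
        r.consumed = n;
        return r;
    }

    /* Second reading: up to three octal digits form a data character.
       Any digits after these stand for themselves. */
    while (r.consumed < 3 &&
           digits[r.consumed] >= '0' && digits[r.consumed] <= '7') {
        r.value = r.value * 8 + (digits[r.consumed] - '0');
        r.consumed++;
    }
    return r;
}
```

With no capturing groups, "40" is read as the character 32 (space) and "81"
as a binary zero with no digits consumed, leaving "8" and "1" as literals.
With 40 groups already open, "40" becomes a back reference instead.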
                   4192: 
                   4193:        All the sequences that define a single character value can be used both
                   4194:        inside and outside character classes. In addition, inside  a  character
                   4195:        class, \b is interpreted as the backspace character (hex 08).
                   4196: 
                   4197:        \N  is not allowed in a character class. \B, \R, and \X are not special
                   4198:        inside a character class. Like  other  unrecognized  escape  sequences,
                   4199:        they  are  treated  as  the  literal  characters  "B",  "R", and "X" by
                   4200:        default, but cause an error if the PCRE_EXTRA option is set. Outside  a
                   4201:        character class, these sequences have different meanings.
                   4202: 
                   4203:    Unsupported escape sequences
                   4204: 
                   4205:        In  Perl, the sequences \l, \L, \u, and \U are recognized by its string
                   4206:        handler and used  to  modify  the  case  of  following  characters.  By
                   4207:        default,  PCRE does not support these escape sequences. However, if the
                   4208:        PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U"  character,  and
                   4209:        \u can be used to define a character by code point, as described in the
                   4210:        previous section.
                   4211: 
                   4212:    Absolute and relative back references
                   4213: 
                   4214:        The sequence \g followed by an unsigned or a negative  number,  option-
                   4215:        ally  enclosed  in braces, is an absolute or relative back reference. A
                   4216:        named back reference can be coded as \g{name}. Back references are dis-
                   4217:        cussed later, following the discussion of parenthesized subpatterns.
                   4218: 
                   4219:    Absolute and relative subroutine calls
                   4220: 
                   4221:        For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
                   4222:        name or a number enclosed either in angle brackets or single quotes, is
                   4223:        an  alternative  syntax for referencing a subpattern as a "subroutine".
                   4224:        Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
                   4225:        \g<...>  (Oniguruma  syntax)  are  not synonymous. The former is a back
                   4226:        reference; the latter is a subroutine call.
                   4227: 
                   4228:    Generic character types
                   4229: 
                   4230:        Another use of backslash is for specifying generic character types:
                   4231: 
                   4232:          \d     any decimal digit
                   4233:          \D     any character that is not a decimal digit
1.1.1.3 ! misho    4234:          \h     any horizontal white space character
        !          4235:          \H     any character that is not a horizontal white space character
        !          4236:          \s     any white space character
        !          4237:          \S     any character that is not a white space character
        !          4238:          \v     any vertical white space character
        !          4239:          \V     any character that is not a vertical white space character
1.1       misho    4240:          \w     any "word" character
                   4241:          \W     any "non-word" character
                   4242: 
                   4243:        There is also the single sequence \N, which matches a non-newline char-
                   4244:        acter.   This  is the same as the "." metacharacter when PCRE_DOTALL is
                   4245:        not set. Perl also uses \N to match characters by name; PCRE  does  not
                   4246:        support this.
                   4247: 
                   4248:        Each  pair of lower and upper case escape sequences partitions the com-
                   4249:        plete set of characters into two disjoint  sets.  Any  given  character
                   4250:        matches  one, and only one, of each pair. The sequences can appear both
                   4251:        inside and outside character classes. They each match one character  of
                   4252:        the  appropriate  type.  If the current matching point is at the end of
                   4253:        the subject string, all of them fail, because there is no character  to
                   4254:        match.
                   4255: 
                   4256:        For  compatibility  with Perl, \s does not match the VT character (code
                   4257:        11).  This  makes  it  different  from  the POSIX "space" class. The  \s
                   4258:        characters  are  HT  (9), LF (10), FF (12), CR (13), and space (32). If
                   4259:        "use locale;" is included in a Perl script, \s may match the VT charac-
                   4260:        ter. In PCRE, it never does.
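
The difference from the C library's notion of white space is easy to see in
code. The function below is a hypothetical model of the default \s set listed
above (it is not a PCRE function); the standard isspace() in the "C" locale
additionally accepts VT.

```c
#include <ctype.h>

/* Hypothetical model of the default \s set: HT, LF, FF, CR, and space.
   VT (11) is deliberately absent. */
int pcre_default_space(int c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}
```

For code 11, pcre_default_space() returns zero while isspace() returns
non-zero in the "C" locale.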
                   4261: 
                   4262:        A  "word"  character is an underscore or any character that is a letter
                   4263:        or digit.  By default, the definition of letters  and  digits  is  con-
                   4264:        trolled  by PCRE's low-valued character tables, and may vary if locale-
                   4265:        specific matching is taking place (see "Locale support" in the  pcreapi
                   4266:        page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
                   4267:        systems, or "french" in Windows, some character codes greater than  128
                   4268:        are  used  for  accented letters, and these are then matched by \w. The
                   4269:        use of locales with Unicode is discouraged.
                   4270: 
1.1.1.2   misho    4271:        By default, in a UTF mode, characters  with  values  greater  than  127
1.1       misho    4272:        never  match  \d,  \s,  or  \w,  and always match \D, \S, and \W. These
1.1.1.2   misho    4273:        sequences retain their original meanings from before  UTF  support  was
1.1       misho    4274:        available,  mainly for efficiency reasons. However, if PCRE is compiled
                   4275:        with Unicode property support, and the PCRE_UCP option is set, the  be-
                   4276:        haviour  is  changed  so  that Unicode properties are used to determine
                   4277:        character types, as follows:
                   4278: 
                   4279:          \d  any character that \p{Nd} matches (decimal digit)
                   4280:          \s  any character that \p{Z} matches, plus HT, LF, FF, CR
                   4281:          \w  any character that \p{L} or \p{N} matches, plus underscore
                   4282: 
                   4283:        The upper case escapes match the inverse sets of characters. Note  that
                   4284:        \d  matches  only decimal digits, whereas \w matches any Unicode digit,
                   4285:        as well as any Unicode letter, and underscore. Note also that  PCRE_UCP
                   4286:        affects  \b  and  \B  because  they  are defined in terms of \w and \W.
                   4287:        Matching these sequences is noticeably slower when PCRE_UCP is set.
                   4288: 
                   4289:        The sequences \h, \H, \v, and \V are features that were added  to  Perl
                   4290:        at  release  5.10. In contrast to the other sequences, which match only
                   4291:        ASCII characters by default, these  always  match  certain  high-valued
1.1.1.2   misho    4292:        codepoints,  whether or not PCRE_UCP is set. The horizontal space char-
                   4293:        acters are:
1.1       misho    4294: 
                   4295:          U+0009     Horizontal tab
                   4296:          U+0020     Space
                   4297:          U+00A0     Non-break space
                   4298:          U+1680     Ogham space mark
                   4299:          U+180E     Mongolian vowel separator
                   4300:          U+2000     En quad
                   4301:          U+2001     Em quad
                   4302:          U+2002     En space
                   4303:          U+2003     Em space
                   4304:          U+2004     Three-per-em space
                   4305:          U+2005     Four-per-em space
                   4306:          U+2006     Six-per-em space
                   4307:          U+2007     Figure space
                   4308:          U+2008     Punctuation space
                   4309:          U+2009     Thin space
                   4310:          U+200A     Hair space
                   4311:          U+202F     Narrow no-break space
                   4312:          U+205F     Medium mathematical space
                   4313:          U+3000     Ideographic space
                   4314: 
                   4315:        The vertical space characters are:
                   4316: 
                   4317:          U+000A     Linefeed
                   4318:          U+000B     Vertical tab
1.1.1.3 ! misho    4319:          U+000C     Form feed
1.1       misho    4320:          U+000D     Carriage return
                   4321:          U+0085     Next line
                   4322:          U+2028     Line separator
                   4323:          U+2029     Paragraph separator
                   4324: 
1.1.1.2   misho    4325:        In 8-bit, non-UTF-8 mode, only the characters with codepoints less than
                   4326:        256 are relevant.
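
The two lists above can be written down as plain sets of codepoints. This
is a sketch in Python rather than anything taken from the PCRE sources;
the names HSPACE, VSPACE, matches_h and matches_v are assumptions made for
the example.

```python
# Codepoints matched by \h, copied from the horizontal space list above.
HSPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))        # U+2000 to U+200A inclusive
    + [0x202F, 0x205F, 0x3000]
)

# Codepoints matched by \v, copied from the vertical space list above.
VSPACE = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])

def matches_h(ch):
    return ord(ch) in HSPACE

def matches_v(ch):
    return ord(ch) in VSPACE
```

\H and \V correspond to the negations of these two tests.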
                   4327: 
1.1       misho    4328:    Newline sequences
                   4329: 
1.1.1.2   misho    4330:        Outside  a  character class, by default, the escape sequence \R matches
                   4331:        any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
                   4332:        to the following:
1.1       misho    4333: 
                   4334:          (?>\r\n|\n|\x0b|\f|\r|\x85)
                   4335: 
1.1.1.2   misho    4336:        This  is  an  example  of an "atomic group", details of which are given
1.1       misho    4337:        below.  This particular group matches either the two-character sequence
1.1.1.2   misho    4338:        CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
1.1.1.3 ! misho    4339:        U+000A), VT (vertical tab, U+000B), FF (form feed,  U+000C),  CR  (car-
        !          4340:        riage  return,  U+000D),  or NEL (next line, U+0085). The two-character
        !          4341:        sequence is treated as a single unit that cannot be split.
1.1       misho    4342: 
1.1.1.2   misho    4343:        In other modes, two additional characters whose codepoints are  greater
1.1       misho    4344:        than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
1.1.1.2   misho    4345:        rator, U+2029).  Unicode character property support is not  needed  for
1.1       misho    4346:        these characters to be recognized.
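
The behaviour of \R described above can be sketched as a small function.
This is a Python model, not PCRE code: match_R and its eight_bit parameter
are hypothetical names, and the subject is treated as a string of
codepoints. Because the function returns a single answer and never
retries, it also reflects the atomic group: a CRLF pair is never split.

```python
SINGLE_NEWLINES_8BIT = "\n\x0b\x0c\r\x85"   # LF, VT, FF, CR, NEL
EXTRA_NEWLINES = "\u2028\u2029"             # LS, PS (other modes only)

def match_R(subject, pos, eight_bit=True):
    """Return the position after a newline sequence at pos, else None."""
    # The two-character sequence CRLF is tried first and taken as a unit.
    if subject.startswith("\r\n", pos):
        return pos + 2
    if pos < len(subject):
        ch = subject[pos]
        if ch in SINGLE_NEWLINES_8BIT:
            return pos + 1
        if not eight_bit and ch in EXTRA_NEWLINES:
            return pos + 1
    return None
```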
                   4347: 
                   4348:        It is possible to restrict \R to match only CR, LF, or CRLF (instead of
1.1.1.2   misho    4349:        the complete set  of  Unicode  line  endings)  by  setting  the  option
1.1       misho    4350:        PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
                   4351:        (BSR is an abbreviation for "backslash R".) This can be made the default
1.1.1.2   misho    4352:        when  PCRE  is  built;  if this is the case, the other behaviour can be
                   4353:        requested via the PCRE_BSR_UNICODE option.   It  is  also  possible  to
                   4354:        specify  these  settings  by  starting a pattern string with one of the
1.1       misho    4355:        following sequences:
                   4356: 
                   4357:          (*BSR_ANYCRLF)   CR, LF, or CRLF only
                   4358:          (*BSR_UNICODE)   any Unicode newline sequence
                   4359: 
1.1.1.2   misho    4360:        These override the default and the options given to the compiling func-
                   4361:        tion,  but  they  can  themselves  be  overridden by options given to a
                   4362:        matching function. Note that these  special  settings,  which  are  not
                   4363:        Perl-compatible,  are  recognized  only at the very start of a pattern,
                   4364:        and that they must be in upper case.  If  more  than  one  of  them  is
                   4365:        present,  the  last  one is used. They can be combined with a change of
1.1       misho    4366:        newline convention; for example, a pattern can start with:
                   4367: 
                   4368:          (*ANY)(*BSR_ANYCRLF)
                   4369: 
1.1.1.2   misho    4370:        They can also be combined with the (*UTF8), (*UTF16), or (*UCP) special
                   4371:        sequences.  Inside  a character class, \R is treated as an unrecognized
                   4372:        escape sequence, and so matches the letter "R" by default,  but  causes
                   4373:        an error if PCRE_EXTRA is set.
1.1       misho    4374: 
                   4375:    Unicode character properties
                   4376: 
                   4377:        When PCRE is built with Unicode character property support, three addi-
1.1.1.2   misho    4378:        tional escape sequences that match characters with specific  properties
                   4379:        are  available.   When  in 8-bit non-UTF-8 mode, these sequences are of
                   4380:        course limited to testing characters whose  codepoints  are  less  than
                   4381:        256, but they do work in this mode.  The extra escape sequences are:
1.1       misho    4382: 
                   4383:          \p{xx}   a character with the xx property
                   4384:          \P{xx}   a character without the xx property
                   4385:          \X       an extended Unicode sequence
                   4386: 
1.1.1.2   misho    4387:        The  property  names represented by xx above are limited to the Unicode
1.1       misho    4388:        script names, the general category properties, "Any", which matches any
1.1.1.2   misho    4389:        character   (including  newline),  and  some  special  PCRE  properties
                   4390:        (described in the next section).  Other Perl properties such as  "InMu-
                   4391:        sicalSymbols"  are  not  currently supported by PCRE. Note that \P{Any}
1.1       misho    4392:        does not match any characters, so always causes a match failure.
                   4393: 
                   4394:        Sets of Unicode characters are defined as belonging to certain scripts.
1.1.1.2   misho    4395:        A  character from one of these sets can be matched using a script name.
1.1       misho    4396:        For example:
                   4397: 
                   4398:          \p{Greek}
                   4399:          \P{Han}
                   4400: 
1.1.1.2   misho    4401:        Those that are not part of an identified script are lumped together  as
1.1       misho    4402:        "Common". The current list of scripts is:
                   4403: 
1.1.1.3 ! misho    4404:        Arabic,  Armenian,  Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,
        !          4405:        Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,  Chakma,
        !          4406:        Cham,  Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
        !          4407:        Devanagari,  Egyptian_Hieroglyphs,  Ethiopic,   Georgian,   Glagolitic,
        !          4408:        Gothic,  Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
        !          4409:        gana,  Imperial_Aramaic,  Inherited,  Inscriptional_Pahlavi,   Inscrip-
        !          4410:        tional_Parthian,   Javanese,   Kaithi,   Kannada,  Katakana,  Kayah_Li,
        !          4411:        Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B,  Lisu,  Lycian,
        !          4412:        Lydian,    Malayalam,    Mandaic,    Meetei_Mayek,    Meroitic_Cursive,
        !          4413:        Meroitic_Hieroglyphs,  Miao,  Mongolian,  Myanmar,  New_Tai_Lue,   Nko,
        !          4414:        Ogham,    Old_Italic,   Old_Persian,   Old_South_Arabian,   Old_Turkic,
        !          4415:        Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic,  Samari-
        !          4416:        tan,  Saurashtra,  Sharada,  Shavian, Sinhala, Sora_Sompeng, Sundanese,
        !          4417:        Syloti_Nagri, Syriac, Tagalog, Tagbanwa,  Tai_Le,  Tai_Tham,  Tai_Viet,
        !          4418:        Takri,  Tamil,  Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai,
        !          4419:        Yi.
1.1       misho    4420: 
                   4421:        Each character has exactly one Unicode general category property, spec-
1.1.1.2   misho    4422:        ified  by a two-letter abbreviation. For compatibility with Perl, nega-
                   4423:        tion can be specified by including a  circumflex  between  the  opening
                   4424:        brace  and  the  property  name.  For  example,  \p{^Lu} is the same as
1.1       misho    4425:        \P{Lu}.
                   4426: 
                   4427:        If only one letter is specified with \p or \P, it includes all the gen-
1.1.1.2   misho    4428:        eral  category properties that start with that letter. In this case, in
                   4429:        the absence of negation, the curly brackets in the escape sequence  are
1.1       misho    4430:        optional; these two examples have the same effect:
                   4431: 
                   4432:          \p{L}
                   4433:          \pL
                   4434: 
                   4435:        The following general category property codes are supported:
                   4436: 
                   4437:          C     Other
                   4438:          Cc    Control
                   4439:          Cf    Format
                   4440:          Cn    Unassigned
                   4441:          Co    Private use
                   4442:          Cs    Surrogate
                   4443: 
                   4444:          L     Letter
                   4445:          Ll    Lower case letter
                   4446:          Lm    Modifier letter
                   4447:          Lo    Other letter
                   4448:          Lt    Title case letter
                   4449:          Lu    Upper case letter
                   4450: 
                   4451:          M     Mark
                   4452:          Mc    Spacing mark
                   4453:          Me    Enclosing mark
                   4454:          Mn    Non-spacing mark
                   4455: 
                   4456:          N     Number
                   4457:          Nd    Decimal number
                   4458:          Nl    Letter number
                   4459:          No    Other number
                   4460: 
                   4461:          P     Punctuation
                   4462:          Pc    Connector punctuation
                   4463:          Pd    Dash punctuation
                   4464:          Pe    Close punctuation
                   4465:          Pf    Final punctuation
                   4466:          Pi    Initial punctuation
                   4467:          Po    Other punctuation
                   4468:          Ps    Open punctuation
                   4469: 
                   4470:          S     Symbol
                   4471:          Sc    Currency symbol
                   4472:          Sk    Modifier symbol
                   4473:          Sm    Mathematical symbol
                   4474:          So    Other symbol
                   4475: 
                   4476:          Z     Separator
                   4477:          Zl    Line separator
                   4478:          Zp    Paragraph separator
                   4479:          Zs    Space separator
                   4480: 
1.1.1.2   misho    4481:        The  special property L& is also supported: it matches a character that
                   4482:        has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
1.1       misho    4483:        classified as a modifier or "other".
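
The rules for general category names (one-letter prefixes, two-letter
codes, circumflex negation, "Any" and L&) can be illustrated with a short
model. It is a sketch in Python that relies on the standard unicodedata
module, whose Unicode version may differ from the one PCRE was built with;
has_property is an invented helper, and script names are deliberately not
handled.

```python
import unicodedata

def has_property(ch, name):
    """Model of \\p{name} for general category names only."""
    negate = name.startswith("^")        # \p{^Lu} is the same as \P{Lu}
    if negate:
        name = name[1:]
    category = unicodedata.category(ch)  # two-letter code such as "Lu"
    if name == "Any":
        result = True
    elif name == "L&":
        result = category in ("Lu", "Ll", "Lt")
    elif len(name) == 1:
        # A single letter covers every category starting with that letter.
        result = category.startswith(name)
    else:
        result = category == name
    return result != negate
```

With this model, has_property("a", "L") is true while
has_property("a", "Lu") is false, and has_property(ch, "^Any") is false
for every character.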
                   4484: 
1.1.1.2   misho    4485:        The  Cs  (Surrogate)  property  applies only to characters in the range
                   4486:        U+D800 to U+DFFF. Such characters are not valid in Unicode strings  and
                   4487:        so  cannot  be  tested  by  PCRE, unless UTF validity checking has been
                   4488:        turned   off   (see   the   discussion   of   PCRE_NO_UTF8_CHECK    and
                   4489:        PCRE_NO_UTF16_CHECK  in the pcreapi page). Perl does not support the Cs
                   4490:        property.
1.1       misho    4491: 
                   4492:        The long synonyms for  property  names  that  Perl  supports  (such  as
                   4493:        \p{Letter})  are  not  supported by PCRE, nor is it permitted to prefix
                   4494:        any of these properties with "Is".
                   4495: 
                   4496:        No character that is in the Unicode table has the Cn (unassigned) prop-
                   4497:        erty.  Instead, this property is assumed for any code point that is not
                   4498:        in the Unicode table.
                   4499: 
                   4500:        Specifying caseless matching does not affect  these  escape  sequences.
                   4501:        For example, \p{Lu} always matches only upper case letters.
                   4502: 
                   4503:        The  \X  escape  matches  any number of Unicode characters that form an
                   4504:        extended Unicode sequence. \X is equivalent to
                   4505: 
                   4506:          (?>\PM\pM*)
                   4507: 
                   4508:        That is, it matches a character without the "mark"  property,  followed
                   4509:        by  zero  or  more  characters with the "mark" property, and treats the
                   4510:        sequence as an atomic group (see below).  Characters  with  the  "mark"
                   4511:        property  are  typically  accents  that affect the preceding character.
1.1.1.2   misho    4512:        None of them have codepoints less than 256, so in 8-bit non-UTF-8  mode
                   4513:        \X matches any one character.
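
The equivalence with (?>\PM\pM*) can be sketched directly. The code below
is a Python model based on the standard unicodedata module; match_X and
is_mark are names chosen for the example, not PCRE functions.

```python
import unicodedata

def is_mark(ch):
    # True for the categories Mc, Me and Mn, that is, \pM.
    return unicodedata.category(ch).startswith("M")

def match_X(subject, pos):
    """Return the position after one \\PM\\pM* sequence at pos, else None."""
    if pos >= len(subject) or is_mark(subject[pos]):
        return None                      # \PM needs one non-mark character
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1                         # \pM* takes every following mark
    return end
```

Applied to "e" followed by U+0301 (combining acute accent) and then "a",
the first call consumes two characters and the second consumes one.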
1.1       misho    4514: 
                   4515:        Note that recent versions of Perl have changed \X to match what Unicode
                   4516:        calls an "extended grapheme cluster", which has a more complicated def-
                   4517:        inition.
                   4518: 
                   4519:        Matching  characters  by Unicode property is not fast, because PCRE has
                   4520:        to search a structure that contains  data  for  over  fifteen  thousand
                   4521:        characters. That is why the traditional escape sequences such as \d and
                   4522:        \w do not use Unicode properties in PCRE by  default,  though  you  can
1.1.1.2   misho    4523:        make  them do so by setting the PCRE_UCP option or by starting the pat-
                   4524:        tern with (*UCP).
1.1       misho    4525: 
                   4526:    PCRE's additional properties
                   4527: 
                   4528:        As well as the standard Unicode properties described  in  the  previous
                   4529:        section,  PCRE supports four more that make it possible to convert tra-
                   4530:        ditional escape sequences such as \w and \s and POSIX character classes
                   4531:        to use Unicode properties. PCRE uses these non-standard, non-Perl prop-
                   4532:        erties internally when PCRE_UCP is set. They are:
                   4533: 
                   4534:          Xan   Any alphanumeric character
                   4535:          Xps   Any POSIX space character
                   4536:          Xsp   Any Perl space character
                   4537:          Xwd   Any Perl "word" character
                   4538: 
                   4539:        Xan matches characters that have either the L (letter) or the  N  (num-
                   4540:        ber)  property. Xps matches the characters tab, linefeed, vertical tab,
1.1.1.3 ! misho    4541:        form feed, or carriage return, and any other character that has  the  Z
1.1       misho    4542:        (separator) property.  Xsp is the same as Xps, except that vertical tab
                   4543:        is excluded. Xwd matches the same characters as Xan, plus underscore.
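
The four definitions translate almost word for word into predicates. This
is a sketch in Python that uses the standard unicodedata module in place
of PCRE's tables; the function names simply mirror the property names and
are not a real interface.

```python
import unicodedata

def _major(ch):
    # First letter of the general category, for example "L" for "Lu".
    return unicodedata.category(ch)[0]

def xan(ch):
    return _major(ch) in "LN"

def xps(ch):
    # Tab, linefeed, vertical tab, form feed, carriage return, or \p{Z}.
    return ch in "\t\n\x0b\x0c\r" or _major(ch) == "Z"

def xsp(ch):
    # The same as Xps, except that vertical tab is excluded.
    return ch != "\x0b" and xps(ch)

def xwd(ch):
    return ch == "_" or xan(ch)
```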
                   4544: 
                   4545:    Resetting the match start
                   4546: 
                   4547:        The escape sequence \K causes any previously matched characters not  to
                   4548:        be included in the final matched sequence. For example, the pattern:
                   4549: 
                   4550:          foo\Kbar
                   4551: 
                   4552:        matches  "foobar",  but reports that it has matched "bar". This feature
                   4553:        is similar to a lookbehind assertion (described  below).   However,  in
                   4554:        this  case, the part of the subject before the real match does not have
                   4555:        to be of fixed length, as lookbehind assertions do. The use of \K  does
                   4556:        not  interfere  with  the setting of captured substrings.  For example,
                   4557:        when the pattern
                   4558: 
                   4559:          (foo)\Kbar
                   4560: 
                   4561:        matches "foobar", the first substring is still set to "foo".
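
Python's re module has no \K, so the effect can only be imitated there.
The sketch below is such an imitation, not PCRE behaviour itself: the
helper search_with_reset and the group name "kept" are invented for the
example. It matches the part before \K and the part after it together,
then reports only the span of the second part.

```python
import re

def search_with_reset(before, after, subject):
    """Imitate before\\Kafter: report only the span matched by 'after'."""
    pattern = "(?:%s)(?P<kept>%s)" % (before, after)
    m = re.search(pattern, subject)
    return m.span("kept") if m else None
```

Unlike a lookbehind, the part before the reported match may have variable
length: search_with_reset(r"fo+", r"bar", "xfooobar") reports only the
final three characters.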
                   4562: 
                   4563:        Perl documents that the use  of  \K  within  assertions  is  "not  well
                   4564:        defined".  In  PCRE,  \K  is  acted upon when it occurs inside positive
                   4565:        assertions, but is ignored in negative assertions.
                   4566: 
                   4567:    Simple assertions
                   4568: 
                   4569:        The final use of backslash is for certain simple assertions. An  asser-
                   4570:        tion  specifies a condition that has to be met at a particular point in
                   4571:        a match, without consuming any characters from the subject string.  The
                   4572:        use  of subpatterns for more complicated assertions is described below.
                   4573:        The backslashed assertions are:
                   4574: 
                   4575:          \b     matches at a word boundary
                   4576:          \B     matches when not at a word boundary
                   4577:          \A     matches at the start of the subject
                   4578:          \Z     matches at the end of the subject
                   4579:                  also matches before a newline at the end of the subject
                   4580:          \z     matches only at the end of the subject
                   4581:          \G     matches at the first matching position in the subject
                   4582: 
                   4583:        Inside a character class, \b has a different meaning;  it  matches  the
                   4584:        backspace  character.  If  any  other  of these assertions appears in a
                   4585:        character class, by default it matches the corresponding literal  char-
                   4586:        acter  (for  example,  \B  matches  the  letter  B).  However,  if  the
                   4587:        PCRE_EXTRA option is set, an "invalid escape sequence" error is  gener-
                   4588:        ated instead.
                   4589: 
                   4590:        A  word  boundary is a position in the subject string where the current
                   4591:        character and the previous character do not both match \w or  \W  (i.e.
                   4592:        one  matches  \w  and the other matches \W), or the start or end of the
1.1.1.2   misho    4593:        string if the first or last character matches \w,  respectively.  In  a
                   4594:        UTF  mode,  the  meanings  of  \w  and \W can be changed by setting the
1.1       misho    4595:        PCRE_UCP option. When this is done, it also affects \b and \B.  Neither
                   4596:        PCRE  nor  Perl has a separate "start of word" or "end of word" metase-
                   4597:        quence. However, whatever follows \b normally determines which  it  is.
                   4598:        For example, the fragment \ba matches "a" at the start of a word.
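
The definition of a word boundary can be checked position by position.
The following Python sketch assumes the default, non-UCP meaning of \w
with ASCII letters, digits and underscore; is_word_boundary is a
hypothetical helper, not a PCRE function.

```python
import re

WORD = re.compile(r"[A-Za-z0-9_]")       # assumed default meaning of \w

def is_word_boundary(subject, pos):
    """True if \\b would match at offset pos in subject."""
    # Outside the string counts as "not a word character", which gives
    # the start-of-string and end-of-string cases described above.
    before = pos > 0 and WORD.match(subject[pos - 1]) is not None
    after = pos < len(subject) and WORD.match(subject[pos]) is not None
    return before != after
```

\B corresponds to the negation of this test.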
                   4599: 
                   4600:        The  \A,  \Z,  and \z assertions differ from the traditional circumflex
                   4601:        and dollar (described in the next section) in that they only ever match
                   4602:        at  the  very start and end of the subject string, whatever options are
                   4603:        set. Thus, they are independent of multiline mode. These  three  asser-
                   4604:        tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
                   4605:        affect only the behaviour of the circumflex and dollar  metacharacters.
                   4606:        However,  if the startoffset argument of pcre_exec() is non-zero, indi-
                   4607:        cating that matching is to start at a point other than the beginning of
                   4608:        the  subject,  \A  can never match. The difference between \Z and \z is
                   4609:        that \Z matches before a newline at the end of the string as well as at
                   4610:        the very end, whereas \z matches only at the end.
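
The difference between \Z and \z comes down to one extra permitted
position. The sketch below models both in Python, assuming that a newline
is the single character LF; the function names are made up for the
example. Plain functions are used because Python's own \Z matches only at
the very end of the string, which is the behaviour of \z here.

```python
def matches_big_z(subject, pos, newline="\n"):
    """Model of \\Z: the very end, or just before a final newline."""
    if pos == len(subject):
        return True
    return subject.endswith(newline) and pos == len(subject) - len(newline)

def matches_small_z(subject, pos):
    """Model of \\z: only the very end of the subject."""
    return pos == len(subject)
```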
                   4611: 
                   4612:        The  \G assertion is true only when the current matching position is at
                   4613:        the start point of the match, as specified by the startoffset  argument
                   4614:        of  pcre_exec().  It  differs  from \A when the value of startoffset is
                   4615:        non-zero. By calling pcre_exec() multiple times with appropriate  argu-
                   4616:        ments, you can mimic Perl's /g option, and it is in this kind of imple-
                   4617:        mentation that \G can be useful.
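
The kind of loop meant here can be sketched without PCRE. In Python's re
module, a compiled pattern's match(string, pos) method succeeds only if
the match begins exactly at pos, which plays the role of a pattern
starting with \G combined with a non-zero startoffset. The helper
match_all_from is an invented name, and stepping forward by one position
after an empty match is a simplification rather than Perl's exact rule.

```python
import re

def match_all_from(pattern, subject, start=0):
    """Collect consecutive matches, each anchored where the last ended."""
    compiled = re.compile(pattern)
    found = []
    pos = start
    while pos <= len(subject):
        m = compiled.match(subject, pos)  # anchored at pos, like \G
        if m is None:
            break                         # stop at the first gap
        found.append(m.group())
        # Move past the match; step by one if the match was empty.
        pos = m.end() if m.end() > pos else pos + 1
    return found
```

With the pattern \d+,? and the subject "12,34,x56", the loop collects the
first two items and stops at "x", because the third attempt cannot match
at its starting offset.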
                   4618: 
                   4619:        Note, however, that PCRE's interpretation of \G, as the  start  of  the
                   4620:        current match, is subtly different from Perl's, which defines it as the
                   4621:        end of the previous match. In Perl, these can  be  different  when  the
                   4622:        previously  matched  string was empty. Because PCRE does just one match
                   4623:        at a time, it cannot reproduce this behaviour.
                   4624: 
                   4625:        If all the alternatives of a pattern begin with \G, the  expression  is
                   4626:        anchored to the starting match position, and the "anchored" flag is set
                   4627:        in the compiled regular expression.
                   4628: 
                   4629: 
                   4630: CIRCUMFLEX AND DOLLAR
                   4631: 
                   4632:        Outside a character class, in the default matching mode, the circumflex
                   4633:        character  is  an  assertion  that is true only if the current matching
                   4634:        point is at the start of the subject string. If the  startoffset  argu-
                   4635:        ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
                   4636:        PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
                   4637:        has an entirely different meaning (see below).
                   4638: 
                   4639:        Circumflex  need  not be the first character of the pattern if a number
                   4640:        of alternatives are involved, but it should be the first thing in  each
                   4641:        alternative  in  which  it appears if the pattern is ever to match that
                   4642:        branch. If all possible alternatives start with a circumflex, that  is,
                   4643:        if  the  pattern  is constrained to match only at the start of the sub-
                   4644:        ject, it is said to be an "anchored" pattern.  (There  are  also  other
                   4645:        constructs that can cause a pattern to be anchored.)
                   4646: 
                   4647:        A  dollar  character  is  an assertion that is true only if the current
                   4648:        matching point is at the end of  the  subject  string,  or  immediately
                   4649:        before a newline at the end of the string (by default). Dollar need not
                   4650:        be the last character of the pattern if a number  of  alternatives  are
                   4651:        involved,  but  it  should  be  the last item in any branch in which it
                   4652:        appears. Dollar has no special meaning in a character class.
                   4653: 
                   4654:        The meaning of dollar can be changed so that it  matches  only  at  the
                   4655:        very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at
                   4656:        compile time. This does not affect the \Z assertion.
                   4657: 
                   4658:        The meanings of the circumflex and dollar characters are changed if the
                   4659:        PCRE_MULTILINE  option  is  set.  When  this  is the case, a circumflex
                   4660:        matches immediately after internal newlines as well as at the start  of
                   4661:        the  subject  string.  It  does not match after a newline that ends the
                   4662:        string. A dollar matches before any newlines in the string, as well  as
                   4663:        at  the very end, when PCRE_MULTILINE is set. When newline is specified
                   4664:        as the two-character sequence CRLF, isolated CR and  LF  characters  do
                   4665:        not indicate newlines.
                   4666: 
                   4667:        For  example, the pattern /^abc$/ matches the subject string "def\nabc"
                   4668:        (where \n represents a newline) in multiline mode, but  not  otherwise.
                   4669:        Consequently,  patterns  that  are anchored in single line mode because
                   4670:        all branches start with ^ are not anchored in  multiline  mode,  and  a
                   4671:        match  for  circumflex  is  possible  when  the startoffset argument of
                   4672:        pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is  ignored  if
                   4673:        PCRE_MULTILINE is set.
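
The /^abc$/ example can be tried in any engine with a comparable
multiline flag. The sketch below uses Python's re module as a stand-in,
with re.MULTILINE in the role of PCRE_MULTILINE; Python differs from PCRE
in other respects, so only the cases shown are claimed to agree with the
description above.

```python
import re

subject = "def\nabc"

# Default mode: circumflex matches only at the start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: circumflex also matches after the internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# Default mode: dollar also matches before a newline that ends the string.
at_end = re.search(r"abc$", "abc\n")
```

Here single is None, while multi and at_end are match objects.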
                   4674: 
                   4675:        Note  that  the sequences \A, \Z, and \z can be used to match the start
                   4676:        and end of the subject in both modes, and if all branches of a  pattern
                   4677:        start  with  \A it is always anchored, whether or not PCRE_MULTILINE is
                   4678:        set.
                   4679: 
                   4680: 
                   4681: FULL STOP (PERIOD, DOT) AND \N
                   4682: 
                   4683:        Outside a character class, a dot in the pattern matches any one charac-
                   4684:        ter  in  the subject string except (by default) a character that signi-
1.1.1.2   misho    4685:        fies the end of a line.
1.1       misho    4686: 
1.1.1.2   misho    4687:        When a line ending is defined as a single character, dot never  matches
                   4688:        that  character; when the two-character sequence CRLF is used, dot does
                   4689:        not match CR if it is immediately followed  by  LF,  but  otherwise  it
                   4690:        matches  all characters (including isolated CRs and LFs). When any Uni-
                   4691:        code line endings are being recognized, dot does not match CR or LF  or
1.1       misho    4692:        any of the other line ending characters.
                   4693: 
1.1.1.2   misho    4694:        The  behaviour  of  dot  with regard to newlines can be changed. If the
                   4695:        PCRE_DOTALL option is set, a dot matches  any  one  character,  without
1.1       misho    4696:        exception. If the two-character sequence CRLF is present in the subject
                   4697:        string, it takes two dots to match it.
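
The effect of PCRE_DOTALL, and the point about CRLF needing two dots, can
be illustrated with Python's re module, whose re.DOTALL flag is used here
as a stand-in. Python always treats LF alone as the line ending, so this
corresponds only to the case where the newline convention is LF.

```python
import re

# Without DOTALL, dot does not match the linefeed.
no_dotall = re.fullmatch(r"a.b", "a\nb")

# With DOTALL, dot matches any one character, including the linefeed.
dotall = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

# CRLF is two characters, so one dot is not enough even with DOTALL.
one_dot_crlf = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
two_dots_crlf = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)
```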
                   4698: 
1.1.1.2   misho    4699:        The handling of dot is entirely independent of the handling of  circum-
                   4700:        flex  and  dollar,  the  only relationship being that they both involve
1.1       misho    4701:        newlines. Dot has no special meaning in a character class.
                   4702: 
1.1.1.2   misho    4703:        The escape sequence \N behaves like  a  dot,  except  that  it  is  not
                   4704:        affected  by  the  PCRE_DOTALL  option.  In other words, it matches any
                   4705:        character except one that signifies the end of a line. Perl  also  uses
1.1       misho    4706:        \N to match characters by name; PCRE does not support this.
                   4707: 
                   4708: 
1.1.1.2   misho    4709: MATCHING A SINGLE DATA UNIT
1.1       misho    4710: 
1.1.1.2   misho    4711:        Outside  a character class, the escape sequence \C matches any one data
                   4712:        unit, whether or not a UTF mode is set. In the 8-bit library, one  data
                   4713:        unit  is  one byte; in the 16-bit library it is a 16-bit unit. Unlike a
                   4714:        dot, \C always matches line-ending characters. The feature is  provided
                   4715:        in  Perl  in  order  to match individual bytes in UTF-8 mode, but it is
                   4716:        unclear how it can usefully be used. Because \C  breaks  up  characters
                   4717:        into  individual  data  units,  matching one unit with \C in a UTF mode
                   4718:        means that the rest of the string may start with a malformed UTF  char-
                   4719:        acter.  This  has  undefined  results,  because PCRE assumes that it is
                   4720:        dealing with valid UTF strings (and by default it checks  this  at  the
1.1.1.3 ! misho    4721:        start     of    processing    unless    the    PCRE_NO_UTF8_CHECK    or
        !          4722:        PCRE_NO_UTF16_CHECK option is used).
1.1       misho    4723: 
1.1.1.3 ! misho    4724:        PCRE does not allow \C to appear in  lookbehind  assertions  (described
        !          4725:        below)  in  a UTF mode, because this would make it impossible to calcu-
1.1       misho    4726:        late the length of the lookbehind.
                   4727: 
1.1.1.2   misho    4728:        In general, the \C escape sequence is best avoided. However, one way of
1.1.1.3 ! misho    4729:        using  it that avoids the problem of malformed UTF characters is to use
        !          4730:        a lookahead to check the length of the next character, as in this  pat-
        !          4731:        tern,  which  could be used with a UTF-8 string (ignore white space and
1.1.1.2   misho    4732:        line breaks):
1.1       misho    4733: 
                   4734:          (?| (?=[\x00-\x7f])(\C) |
                   4735:              (?=[\x80-\x{7ff}])(\C)(\C) |
                   4736:              (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
                   4737:              (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
                   4738: 
1.1.1.3 ! misho    4739:        A group that starts with (?| resets the capturing  parentheses  numbers
        !          4740:        in  each  alternative  (see  "Duplicate Subpattern Numbers" below). The
        !          4741:        assertions at the start of each branch check the next  UTF-8  character
        !          4742:        for  values  whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
        !          4743:        character's individual bytes are then captured by the appropriate  num-
1.1       misho    4744:        ber of groups.
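
                                The byte counts tested by the four lookaheads follow from the UTF-8
                                encoding rules. As a rough illustration only (this is Python, not
                                PCRE, and the function name utf8_length is invented for this sketch),
                                the same code point ranges can be written as:

```python
def utf8_length(code_point: int) -> int:
    """Return the number of bytes UTF-8 uses for code_point.

    The boundaries mirror the four lookahead ranges in the pattern above.
    Unicode itself stops at 0x10FFFF; the upper bound 0x1FFFFF is simply
    the range the pattern tests for.
    """
    if code_point < 0 or code_point > 0x1FFFFF:
        raise ValueError("code point outside the ranges the pattern covers")
    if code_point <= 0x7F:
        return 1  # first branch: one \C
    if code_point <= 0x7FF:
        return 2  # second branch: two \C
    if code_point <= 0xFFFF:
        return 3  # third branch: three \C
    return 4      # fourth branch: four \C


# Cross-check against Python's own encoder for one character per range.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]
lengths = [utf8_length(cp) for cp in samples]
encoded = [len(chr(cp).encode("utf-8")) for cp in samples]
```

                                Both lists come out the same, which is why each branch of the pattern
                                can safely consume a fixed number of data units.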
                   4745: 
                   4746: 
                   4747: SQUARE BRACKETS AND CHARACTER CLASSES
                   4748: 
                   4749:        An opening square bracket introduces a character class, terminated by a
                   4750:        closing square bracket. A closing square bracket on its own is not spe-
                   4751:        cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
                   4752:        a lone closing square bracket causes a compile-time error. If a closing
1.1.1.3 ! misho    4753:        square  bracket  is required as a member of the class, it should be the
        !          4754:        first data character in the class  (after  an  initial  circumflex,  if
1.1       misho    4755:        present) or escaped with a backslash.
                   4756: 
1.1.1.3 ! misho    4757:        A  character  class matches a single character in the subject. In a UTF
        !          4758:        mode, the character may be more than one  data  unit  long.  A  matched
1.1.1.2   misho    4759:        character must be in the set of characters defined by the class, unless
1.1.1.3 ! misho    4760:        the first character in the class definition is a circumflex,  in  which
1.1.1.2   misho    4761:        case the subject character must not be in the set defined by the class.
1.1.1.3 ! misho    4762:        If a circumflex is actually required as a member of the  class,  ensure
1.1.1.2   misho    4763:        it is not the first character, or escape it with a backslash.
1.1       misho    4764: 
1.1.1.3 ! misho    4765:        For  example, the character class [aeiou] matches any lower case vowel,
        !          4766:        while [^aeiou] matches any character that is not a  lower  case  vowel.
1.1       misho    4767:        Note that a circumflex is just a convenient notation for specifying the
1.1.1.3 ! misho    4768:        characters that are in the class by enumerating those that are  not.  A
        !          4769:        class  that starts with a circumflex is not an assertion; it still con-
        !          4770:        sumes a character from the subject string, and therefore  it  fails  if
1.1       misho    4771:        the current pointer is at the end of the string.
                   4772: 
1.1.1.3 ! misho    4773:        In  UTF-8  (UTF-16)  mode,  characters  with  values  greater  than 255
        !          4774:        (0xffff) can be included in a class as a literal string of data  units,
1.1.1.2   misho    4775:        or by using the \x{ escaping mechanism.
                   4776: 
1.1.1.3 ! misho    4777:        When  caseless  matching  is set, any letters in a class represent both
        !          4778:        their upper case and lower case versions, so for  example,  a  caseless
        !          4779:        [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
        !          4780:        match "A", whereas a caseful version would. In a UTF mode, PCRE  always
        !          4781:        understands  the  concept  of case for characters whose values are less
        !          4782:        than 128, so caseless matching is always possible. For characters  with
        !          4783:        higher  values,  the  concept  of case is supported if PCRE is compiled
        !          4784:        with Unicode property support, but not otherwise.  If you want  to  use
        !          4785:        caseless  matching in a UTF mode for characters 128 and above, you must
        !          4786:        ensure that PCRE is compiled with Unicode property support as  well  as
1.1.1.2   misho    4787:        with UTF support.
                   4788: 
1.1.1.3 ! misho    4789:        Characters  that  might  indicate  line breaks are never treated in any
        !          4790:        special way  when  matching  character  classes,  whatever  line-ending
        !          4791:        sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and
1.1       misho    4792:        PCRE_MULTILINE options is used. A class such as [^a] always matches one
                   4793:        of these characters.
                   4794: 
1.1.1.3 ! misho    4795:        The  minus (hyphen) character can be used to specify a range of charac-
        !          4796:        ters in a character  class.  For  example,  [d-m]  matches  any  letter
        !          4797:        between  d  and  m,  inclusive.  If  a minus character is required in a
        !          4798:        class, it must be escaped with a backslash  or  appear  in  a  position
        !          4799:        where  it cannot be interpreted as indicating a range, typically as the
1.1       misho    4800:        first or last character in the class.
                   4801: 
                   4802:        It is not possible to have the literal character "]" as the end charac-
1.1.1.3 ! misho    4803:        ter  of a range. A pattern such as [W-]46] is interpreted as a class of
        !          4804:        two characters ("W" and "-") followed by a literal string "46]", so  it
        !          4805:        would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
        !          4806:        backslash it is interpreted as the end of range, so [W-\]46] is  inter-
        !          4807:        preted  as a class containing a range followed by two other characters.
        !          4808:        The octal or hexadecimal representation of "]" can also be used to  end
1.1       misho    4809:        a range.
                   4810: 
1.1.1.3 ! misho    4811:        Ranges  operate in the collating sequence of character values. They can
        !          4812:        also  be  used  for  characters  specified  numerically,  for   example
        !          4813:        [\000-\037].  Ranges  can include any characters that are valid for the
1.1.1.2   misho    4814:        current mode.
1.1       misho    4815: 
                   4816:        If a range that includes letters is used when caseless matching is set,
                   4817:        it matches the letters in either case. For example, [W-c] is equivalent
1.1.1.3 ! misho    4818:        to [][\\^_`wxyzabc], matched caselessly, and  in  a  non-UTF  mode,  if
        !          4819:        character  tables  for  a French locale are in use, [\xc8-\xcb] matches
        !          4820:        accented E characters in both cases. In UTF modes,  PCRE  supports  the
        !          4821:        concept  of  case for characters with values greater than 127 only when
1.1       misho    4822:        it is compiled with Unicode property support.
                   4823: 
1.1.1.3 ! misho    4824:        The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,  \V,
1.1       misho    4825:        \w, and \W may appear in a character class, and add the characters that
1.1.1.3 ! misho    4826:        they match to the class. For example, [\dABCDEF] matches any  hexadeci-
        !          4827:        mal  digit.  In  UTF modes, the PCRE_UCP option affects the meanings of
        !          4828:        \d, \s, \w and their upper case partners, just as  it  does  when  they
        !          4829:        appear  outside a character class, as described in the section entitled
1.1       misho    4830:        "Generic character types" above. The escape sequence \b has a different
1.1.1.3 ! misho    4831:        meaning  inside  a character class; it matches the backspace character.
        !          4832:        The sequences \B, \N, \R, and \X are not  special  inside  a  character
        !          4833:        class.  Like  any other unrecognized escape sequences, they are treated
        !          4834:        as the literal characters "B", "N", "R", and "X" by default, but  cause
1.1       misho    4835:        an error if the PCRE_EXTRA option is set.
                   4836: 
1.1.1.3 ! misho    4837:        A  circumflex  can  conveniently  be used with the upper case character
        !          4838:        types to specify a more restricted set of characters than the  matching
        !          4839:        lower  case  type.  For example, the class [^\W_] matches any letter or
1.1       misho    4840:        digit, but not underscore, whereas [\w] includes underscore. A positive
                   4841:        character class should be read as "something OR something OR ..." and a
                   4842:        negative class as "NOT something AND NOT something AND NOT ...".
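
                                Some of the class behaviour described in this section can be tried
                                out with Python's re module, which treats these particular classes
                                the same way. This is a sketch using an analogous engine, not PCRE;
                                the helper name matches is invented, and re.ASCII is used so that
                                \W stays limited to ASCII characters.

```python
import re

ALNUM = re.compile(r"[^\W_]", re.ASCII)            # letter or digit, no underscore
NOT_VOWEL = re.compile(r"[^aeiou]")                # caseful negated class
VOWEL_NOCASE = re.compile(r"[aeiou]", re.IGNORECASE)
NOT_VOWEL_NOCASE = re.compile(r"[^aeiou]", re.IGNORECASE)


def matches(pattern, text):
    """Return True if the whole of text is matched by pattern."""
    return pattern.fullmatch(text) is not None


checks = {
    "letter": matches(ALNUM, "a"),
    "digit": matches(ALNUM, "7"),
    "underscore": matches(ALNUM, "_"),
    # A negated class is not an assertion: it needs a character to consume.
    "negated_on_empty": matches(NOT_VOWEL, ""),
    # A newline is just another character as far as a class is concerned.
    "negated_on_newline": matches(NOT_VOWEL, "\n"),
    "caseful_negated_A": matches(NOT_VOWEL, "A"),
    "caseless_A": matches(VOWEL_NOCASE, "A"),
    "caseless_negated_A": matches(NOT_VOWEL_NOCASE, "A"),
}
```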
                   4843: 
1.1.1.3 ! misho    4844:        The only metacharacters that are recognized in  character  classes  are
        !          4845:        backslash,  hyphen  (only  where  it can be interpreted as specifying a
        !          4846:        range), circumflex (only at the start), opening  square  bracket  (only
        !          4847:        when  it can be interpreted as introducing a POSIX class name - see the
        !          4848:        next section), and the terminating  closing  square  bracket.  However,
1.1       misho    4849:        escaping other non-alphanumeric characters does no harm.
                   4850: 
                   4851: 
                   4852: POSIX CHARACTER CLASSES
                   4853: 
                   4854:        Perl supports the POSIX notation for character classes. This uses names
1.1.1.3 ! misho    4855:        enclosed by [: and :] within the enclosing square brackets.  PCRE  also
1.1       misho    4856:        supports this notation. For example,
                   4857: 
                   4858:          [01[:alpha:]%]
                   4859: 
                   4860:        matches "0", "1", any alphabetic character, or "%". The supported class
                   4861:        names are:
                   4862: 
                   4863:          alnum    letters and digits
                   4864:          alpha    letters
                   4865:          ascii    character codes 0 - 127
                   4866:          blank    space or tab only
                   4867:          cntrl    control characters
                   4868:          digit    decimal digits (same as \d)
                   4869:          graph    printing characters, excluding space
                   4870:          lower    lower case letters
                   4871:          print    printing characters, including space
                   4872:          punct    printing characters, excluding letters and digits and space
                   4873:          space    white space (not quite the same as \s)
                   4874:          upper    upper case letters
                   4875:          word     "word" characters (same as \w)
                   4876:          xdigit   hexadecimal digits
                   4877: 
1.1.1.3 ! misho    4878:        The "space" characters are HT (9), LF (10), VT (11), FF (12), CR  (13),
        !          4879:        and  space  (32). Notice that this list includes the VT character (code
1.1       misho    4880:        11). This makes "space" different to \s, which does not include VT (for
                   4881:        Perl compatibility).
                   4882: 
1.1.1.3 ! misho    4883:        The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
        !          4884:        from Perl 5.8. Another Perl extension is negation, which  is  indicated
1.1       misho    4885:        by a ^ character after the colon. For example,
                   4886: 
                   4887:          [12[:^digit:]]
                   4888: 
1.1.1.3 ! misho    4889:        matches  "1", "2", or any non-digit. PCRE (and Perl) also recognize the
1.1       misho    4890:        POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
                   4891:        these are not supported, and an error is given if they are encountered.
                   4892: 
1.1.1.3 ! misho    4893:        By  default,  in  UTF modes, characters with values greater than 128 do
        !          4894:        not match any of the POSIX character classes. However, if the  PCRE_UCP
        !          4895:        option  is passed to pcre_compile(), some of the classes are changed so
1.1       misho    4896:        that Unicode character properties are used. This is achieved by replac-
                   4897:        ing the POSIX classes by other sequences, as follows:
                   4898: 
                   4899:          [:alnum:]  becomes  \p{Xan}
                   4900:          [:alpha:]  becomes  \p{L}
                   4901:          [:blank:]  becomes  \h
                   4902:          [:digit:]  becomes  \p{Nd}
                   4903:          [:lower:]  becomes  \p{Ll}
                   4904:          [:space:]  becomes  \p{Xps}
                   4905:          [:upper:]  becomes  \p{Lu}
                   4906:          [:word:]   becomes  \p{Xwd}
                   4907: 
1.1.1.3 ! misho    4908:        Negated  versions,  such  as [:^alpha:] use \P instead of \p. The other
1.1       misho    4909:        POSIX classes are unchanged, and match only characters with code points
                   4910:        less than 128.
                   4911: 
                   4912: 
                   4913: VERTICAL BAR
                   4914: 
1.1.1.3 ! misho    4915:        Vertical  bar characters are used to separate alternative patterns. For
1.1       misho    4916:        example, the pattern
                   4917: 
                   4918:          gilbert|sullivan
                   4919: 
1.1.1.3 ! misho    4920:        matches either "gilbert" or "sullivan". Any number of alternatives  may
        !          4921:        appear,  and  an  empty  alternative  is  permitted (matching the empty
1.1       misho    4922:        string). The matching process tries each alternative in turn, from left
1.1.1.3 ! misho    4923:        to  right, and the first one that succeeds is used. If the alternatives
        !          4924:        are within a subpattern (defined below), "succeeds" means matching  the
1.1       misho    4925:        rest of the main pattern as well as the alternative in the subpattern.
                   4926: 
                   4927: 
                   4928: INTERNAL OPTION SETTING
                   4929: 
1.1.1.3 ! misho    4930:        The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
        !          4931:        PCRE_EXTENDED options (which are Perl-compatible) can be  changed  from
        !          4932:        within  the  pattern  by  a  sequence  of  Perl option letters enclosed
1.1       misho    4933:        between "(?" and ")".  The option letters are
                   4934: 
                   4935:          i  for PCRE_CASELESS
                   4936:          m  for PCRE_MULTILINE
                   4937:          s  for PCRE_DOTALL
                   4938:          x  for PCRE_EXTENDED
                   4939: 
                   4940:        For example, (?im) sets caseless, multiline matching. It is also possi-
                   4941:        ble to unset these options by preceding the letter with a hyphen, and a
1.1.1.3 ! misho    4942:        combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-
        !          4943:        LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
        !          4944:        is also permitted. If a  letter  appears  both  before  and  after  the
1.1       misho    4945:        hyphen, the option is unset.
                   4946: 
1.1.1.3 ! misho    4947:        The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
        !          4948:        can be changed in the same way as the Perl-compatible options by  using
1.1       misho    4949:        the characters J, U and X respectively.
                   4950: 
1.1.1.3 ! misho    4951:        When  one  of  these  option  changes occurs at top level (that is, not
        !          4952:        inside subpattern parentheses), the change applies to the remainder  of
1.1       misho    4953:        the pattern that follows. If the change is placed right at the start of
                   4954:        a pattern, PCRE extracts it into the global options (and it will there-
                   4955:        fore show up in data extracted by the pcre_fullinfo() function).
                   4956: 
1.1.1.3 ! misho    4957:        An  option  change  within a subpattern (see below for a description of
        !          4958:        subpatterns) affects only that part of the subpattern that follows  it,
1.1       misho    4959:        so
                   4960: 
                   4961:          (a(?i)b)c
                   4962: 
                   4963:        matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
1.1.1.3 ! misho    4964:        used).  By this means, options can be made to have  different  settings
        !          4965:        in  different parts of the pattern. Any changes made in one alternative
        !          4966:        do carry on into subsequent branches within the  same  subpattern.  For
1.1       misho    4967:        example,
                   4968: 
                   4969:          (a(?i)b|c)
                   4970: 
1.1.1.3 ! misho    4971:        matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the
        !          4972:        first branch is abandoned before the option setting.  This  is  because
        !          4973:        the  effects  of option settings happen at compile time. There would be
1.1       misho    4974:        some very weird behaviour otherwise.
                   4975: 
1.1.1.3 ! misho    4976:        Note: There are other PCRE-specific options that  can  be  set  by  the
        !          4977:        application  when  the  compiling  or matching functions are called. In
        !          4978:        some cases the pattern can contain special leading  sequences  such  as
        !          4979:        (*CRLF)  to  override  what  the  application  has set or what has been
        !          4980:        defaulted.  Details  are  given  in  the  section   entitled   "Newline
        !          4981:        sequences"  above.  There  are  also  the (*UTF8), (*UTF16), and (*UCP)
        !          4982:        leading sequences that can be used to  set  UTF  and  Unicode  property
        !          4983:        modes;  they  are  equivalent to setting the PCRE_UTF8, PCRE_UTF16, and
1.1.1.2   misho    4984:        the PCRE_UCP options, respectively.
1.1       misho    4985: 
                   4986: 
                   4987: SUBPATTERNS
                   4988: 
                   4989:        Subpatterns are delimited by parentheses (round brackets), which can be
                   4990:        nested.  Turning part of a pattern into a subpattern does two things:
                   4991: 
                   4992:        1. It localizes a set of alternatives. For example, the pattern
                   4993: 
                   4994:          cat(aract|erpillar|)
                   4995: 
1.1.1.3 ! misho    4996:        matches  "cataract",  "caterpillar", or "cat". Without the parentheses,
1.1       misho    4997:        it would match "cataract", "erpillar" or an empty string.
                   4998: 
1.1.1.3 ! misho    4999:        2. It sets up the subpattern as  a  capturing  subpattern.  This  means
        !          5000:        that,  when  the  whole  pattern  matches,  that portion of the subject
1.1       misho    5001:        string that matched the subpattern is passed back to the caller via the
1.1.1.3 ! misho    5002:        ovector  argument  of  the matching function. (This applies only to the
        !          5003:        traditional matching functions; the DFA matching functions do not  sup-
1.1.1.2   misho    5004:        port capturing.)
                   5005: 
                   5006:        Opening parentheses are counted from left to right (starting from 1) to
1.1.1.3 ! misho    5007:        obtain numbers for the  capturing  subpatterns.  For  example,  if  the
1.1.1.2   misho    5008:        string "the red king" is matched against the pattern
1.1       misho    5009: 
                   5010:          the ((red|white) (king|queen))
                   5011: 
                   5012:        the captured substrings are "red king", "red", and "king", and are num-
                   5013:        bered 1, 2, and 3, respectively.
                   5014: 
1.1.1.3 ! misho    5015:        The fact that plain parentheses fulfil  two  functions  is  not  always
        !          5016:        helpful.   There are often times when a grouping subpattern is required
        !          5017:        without a capturing requirement. If an opening parenthesis is  followed
        !          5018:        by  a question mark and a colon, the subpattern does not do any captur-
        !          5019:        ing, and is not counted when computing the  number  of  any  subsequent
        !          5020:        capturing  subpatterns. For example, if the string "the white queen" is
1.1       misho    5021:        matched against the pattern
                   5022: 
                   5023:          the ((?:red|white) (king|queen))
                   5024: 
                   5025:        the captured substrings are "white queen" and "queen", and are numbered
                   5026:        1 and 2. The maximum number of capturing subpatterns is 65535.
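
                                The numbering in the two examples above can be reproduced with any
                                engine that counts opening parentheses the same way. As a sketch,
                                here are the same patterns and subjects run through Python's re
                                module (an analogue, not PCRE itself):

```python
import re

# Three capturing groups, numbered by the position of their opening parenthesis.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
numbered = capturing.groups()

# (?: ... ) groups without capturing, so it is skipped when numbering.
mixed = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
renumbered = mixed.groups()
```

                                Here numbered holds the substrings for groups 1, 2, and 3, while
                                renumbered holds only two entries because the colour group no longer
                                captures.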
                   5027: 
1.1.1.3 ! misho    5028:        As  a  convenient shorthand, if any option settings are required at the
        !          5029:        start of a non-capturing subpattern,  the  option  letters  may  appear
1.1       misho    5030:        between the "?" and the ":". Thus the two patterns
                   5031: 
                   5032:          (?i:saturday|sunday)
                   5033:          (?:(?i)saturday|sunday)
                   5034: 
                   5035:        match exactly the same set of strings. Because alternative branches are
1.1.1.3 ! misho    5036:        tried from left to right, and options are not reset until  the  end  of
        !          5037:        the  subpattern is reached, an option setting in one branch does affect
        !          5038:        subsequent branches, so the above patterns match "SUNDAY"  as  well  as
1.1       misho    5039:        "Saturday".
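
                                The first of the two forms is also accepted by Python's re module,
                                which makes for a quick check that the option covers both branches.
                                This is only a sketch with an analogous engine: recent Python
                                versions reject an option setting such as (?i) in the middle of a
                                pattern, so the second form is not shown.

```python
import re

DAY = re.compile(r"(?i:saturday|sunday)")

# The caseless option set at the start of the group applies to every branch.
results = {
    word: DAY.fullmatch(word) is not None
    for word in ("Saturday", "SUNDAY", "monday")
}
```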
                   5040: 
                   5041: 
                   5042: DUPLICATE SUBPATTERN NUMBERS
                   5043: 
                   5044:        Perl 5.10 introduced a feature whereby each alternative in a subpattern
1.1.1.3 ! misho    5045:        uses the same numbers for its capturing parentheses. Such a  subpattern
        !          5046:        starts  with (?| and is itself a non-capturing subpattern. For example,
1.1       misho    5047:        consider this pattern:
                   5048: 
                   5049:          (?|(Sat)ur|(Sun))day
                   5050: 
1.1.1.3 ! misho    5051:        Because the two alternatives are inside a (?| group, both sets of  cap-
        !          5052:        turing  parentheses  are  numbered one. Thus, when the pattern matches,
        !          5053:        you can look at captured substring number  one,  whichever  alternative
        !          5054:        matched.  This  construct  is useful when you want to capture part, but
1.1       misho    5055:        not all, of one of a number of alternatives. Inside a (?| group, paren-
1.1.1.3 ! misho    5056:        theses  are  numbered as usual, but the number is reset at the start of
        !          5057:        each branch. The numbers of any capturing parentheses that  follow  the
        !          5058:        subpattern  start after the highest number used in any branch. The fol-
1.1       misho    5059:        lowing example is taken from the Perl documentation. The numbers under-
                   5060:        neath show in which buffer the captured content will be stored.
                   5061: 
                   5062:          # before  ---------------branch-reset----------- after
                   5063:          / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
                   5064:          # 1            2         2  3        2     3     4
                   5065: 
1.1.1.3 ! misho    5066:        A  back  reference  to a numbered subpattern uses the most recent value
        !          5067:        that is set for that number by any subpattern.  The  following  pattern
1.1       misho    5068:        matches "abcabc" or "defdef":
                   5069: 
                   5070:          /(?|(abc)|(def))\1/
                   5071: 
1.1.1.3 ! misho    5072:        In  contrast,  a subroutine call to a numbered subpattern always refers
        !          5073:        to the first one in the pattern with the given  number.  The  following
1.1       misho    5074:        pattern matches "abcabc" or "defabc":
                   5075: 
                   5076:          /(?|(abc)|(def))(?1)/
                   5077: 
1.1.1.3 ! misho    5078:        If  a condition test for a subpattern's having matched refers to a non-
        !          5079:        unique number, the test is true if any of the subpatterns of that  num-
1.1       misho    5080:        ber have matched.
                   5081: 
1.1.1.3 ! misho    5082:        An  alternative approach to using this "branch reset" feature is to use
1.1       misho    5083:        duplicate named subpatterns, as described in the next section.
                   5084: 
                   5085: 
                   5086: NAMED SUBPATTERNS
                   5087: 
1.1.1.3 ! misho    5088:        Identifying capturing parentheses by number is simple, but  it  can  be
        !          5089:        very  hard  to keep track of the numbers in complicated regular expres-
        !          5090:        sions. Furthermore, if an  expression  is  modified,  the  numbers  may
        !          5091:        change.  To help with this difficulty, PCRE supports the naming of sub-
1.1       misho    5092:        patterns. This feature was not added to Perl until release 5.10. Python
1.1.1.3 ! misho    5093:        had  the  feature earlier, and PCRE introduced it at release 4.0, using
        !          5094:        the Python syntax. PCRE now supports both the Perl and the Python  syn-
        !          5095:        tax.  Perl  allows  identically  numbered subpatterns to have different
1.1       misho    5096:        names, but PCRE does not.
                   5097: 
1.1.1.3 ! misho    5098:        In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
        !          5099:        or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
        !          5100:        to capturing parentheses from other parts of the pattern, such as  back
        !          5101:        references,  recursion,  and conditions, can be made by name as well as
1.1       misho    5102:        by number.
                   5103: 
1.1.1.3 ! misho    5104:        Names consist of up to  32  alphanumeric  characters  and  underscores.
        !          5105:        Named  capturing  parentheses  are  still  allocated numbers as well as
        !          5106:        names, exactly as if the names were not present. The PCRE API  provides
1.1       misho    5107:        function calls for extracting the name-to-number translation table from
                   5108:        a compiled pattern. There is also a convenience function for extracting
                   5109:        a captured substring by name.
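
                                The Python syntax (?P<name>...) mentioned above comes from Python's
                                re module, so that module gives a convenient way to see names and
                                numbers side by side. The sketch below is Python, not the PCRE API,
                                and the pattern and the names year and month are invented for the
                                illustration.

```python
import re

DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
match = DATE.fullmatch("2012-05")

# A named group can be fetched by name or by its ordinary number.
by_name = match.group("year")
by_number = match.group(1)

# groupindex is Python's counterpart of a name-to-number translation table.
name_to_number = dict(DATE.groupindex)
```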
                   5110: 
1.1.1.3 ! misho    5111:        By  default, a name must be unique within a pattern, but it is possible
1.1       misho    5112:        to relax this constraint by setting the PCRE_DUPNAMES option at compile
1.1.1.3 ! misho    5113:        time.  (Duplicate  names are also always permitted for subpatterns with
        !          5114:        the same number, set up as described in the previous  section.)  Dupli-
        !          5115:        cate  names  can  be useful for patterns where only one instance of the
        !          5116:        named parentheses can match. Suppose you want to match the  name  of  a
        !          5117:        weekday,  either as a 3-letter abbreviation or as the full name, and in
1.1       misho    5118:        both cases you want to extract the abbreviation. This pattern (ignoring
                   5119:        the line breaks) does the job:
                   5120: 
                   5121:          (?<DN>Mon|Fri|Sun)(?:day)?|
                   5122:          (?<DN>Tue)(?:sday)?|
                   5123:          (?<DN>Wed)(?:nesday)?|
                   5124:          (?<DN>Thu)(?:rsday)?|
                   5125:          (?<DN>Sat)(?:urday)?
                   5126: 
1.1.1.3 ! misho    5127:        There  are  five capturing substrings, but only one is ever set after a
1.1       misho    5128:        match.  (An alternative way of solving this problem is to use a "branch
                   5129:        reset" subpattern, as described in the previous section.)
                   5130: 
1.1.1.3 ! misho    5131:        The  convenience  function  for extracting the data by name returns the
        !          5132:        substring for the first (and in this example, the only)  subpattern  of
        !          5133:        that  name  that  matched.  This saves searching to find which numbered
1.1       misho    5134:        subpattern it was.
                   5135: 
1.1.1.3 ! misho    5136:        If you make a back reference to  a  non-unique  named  subpattern  from
        !          5137:        elsewhere  in the pattern, the one that corresponds to the first occur-
1.1       misho    5138:        rence of the name is used. In the absence of duplicate numbers (see the
1.1.1.3 ! misho    5139:        previous  section) this is the one with the lowest number. If you use a
        !          5140:        named reference in a condition test (see the section  about  conditions
        !          5141:        below),  either  to check whether a subpattern has matched, or to check
        !          5142:        for recursion, all subpatterns with the same name are  tested.  If  the
        !          5143:        condition  is  true for any one of them, the overall condition is true.
1.1       misho    5144:        This is the same behaviour as testing by number. For further details of
                   5145:        the interfaces for handling named subpatterns, see the pcreapi documen-
                   5146:        tation.
                   5147: 
                   5148:        Warning: You cannot use different names to distinguish between two sub-
1.1.1.3 ! misho    5149:        patterns  with  the same number because PCRE uses only the numbers when
1.1       misho    5150:        matching. For this reason, an error is given at compile time if differ-
1.1.1.3 ! misho    5151:        ent  names  are given to subpatterns with the same number. However, you
        !          5152:        can give the same name to subpatterns with the same number,  even  when
1.1       misho    5153:        PCRE_DUPNAMES is not set.
                   5154: 
                   5155: 
                   5156: REPETITION
                   5157: 
1.1.1.3 ! misho    5158:        Repetition  is  specified  by  quantifiers, which can follow any of the
1.1       misho    5159:        following items:
                   5160: 
                   5161:          a literal data character
                   5162:          the dot metacharacter
                   5163:          the \C escape sequence
1.1.1.2   misho    5164:          the \X escape sequence
1.1       misho    5165:          the \R escape sequence
                   5166:          an escape such as \d or \pL that matches a single character
                   5167:          a character class
                   5168:          a back reference (see next section)
                   5169:          a parenthesized subpattern (including assertions)
                   5170:          a subroutine call to a subpattern (recursive or otherwise)
                   5171: 
1.1.1.3 ! misho    5172:        The general repetition quantifier specifies a minimum and maximum  num-
        !          5173:        ber  of  permitted matches, by giving the two numbers in curly brackets
        !          5174:        (braces), separated by a comma. The numbers must be  less  than  65536,
1.1       misho    5175:        and the first must be less than or equal to the second. For example:
                   5176: 
                   5177:          z{2,4}
                   5178: 
1.1.1.3 ! misho    5179:        matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
        !          5180:        special character. If the second number is omitted, but  the  comma  is
        !          5181:        present,  there  is  no upper limit; if the second number and the comma
        !          5182:        are both omitted, the quantifier specifies an exact number of  required
1.1       misho    5183:        matches. Thus
                   5184: 
                   5185:          [aeiou]{3,}
                   5186: 
                   5187:        matches at least 3 successive vowels, but may match many more, while
                   5188: 
                   5189:          \d{8}
                   5190: 
1.1.1.3 ! misho    5191:        matches  exactly  8  digits. An opening curly bracket that appears in a
        !          5192:        position where a quantifier is not allowed, or one that does not  match
        !          5193:        the  syntax of a quantifier, is taken as a literal character. For exam-
1.1       misho    5194:        ple, {,6} is not a quantifier, but a literal string of four characters.
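
       The counted forms above can be tried out with a short script. This is
       a sketch only: it uses Python's re module as a stand-in for PCRE (an
       assumption, not part of this documentation), and it leaves out the
       {,6} case because Python's re reads that form as a quantifier.

```python
import re

# Python's re is used as a stand-in for PCRE here (an assumption).
# z{2,4}: between two and four repeats of "z".
counted = [bool(re.fullmatch(r"z{2,4}", "z" * n)) for n in range(1, 6)]

# [aeiou]{3,}: at least three vowels, as many as are available.
vowels = re.search(r"[aeiou]{3,}", "queueing")

# \d{8}: exactly eight digits, no more and no fewer.
digits = re.fullmatch(r"\d{8}", "20240131")
too_short = re.fullmatch(r"\d{8}", "2024013")

print(counted)            # one flag per subject "z", "zz", ... "zzzzz"
print(vowels.group())
print(digits is not None, too_short is not None)
```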
                   5195: 
1.1.1.2   misho    5196:        In UTF modes, quantifiers apply to characters rather than to individual
1.1.1.3 ! misho    5197:        data  units. Thus, for example, \x{100}{2} matches two characters, each
1.1.1.2   misho    5198:        of which is represented by a two-byte sequence in a UTF-8 string. Simi-
1.1.1.3 ! misho    5199:        larly,  \X{3}  matches  three Unicode extended sequences, each of which
1.1.1.2   misho    5200:        may be several data units long (and they may be of different lengths).
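
       The point about characters versus data units can be illustrated as
       follows. The sketch assumes Python's re module, whose string patterns
       always work on whole characters, as an analogue of a UTF mode; it is
       not a PCRE call.

```python
import re

# Two copies of U+0100; each one needs two bytes in UTF-8.
subject = "\u0100\u0100"

# The {2} applies to the character U+0100, not to its individual bytes.
m = re.fullmatch(r"\u0100{2}", subject)

encoded_len = len(subject.encode("utf-8"))
print(m is not None, len(subject), encoded_len)
```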
1.1       misho    5201: 
                   5202:        The quantifier {0} is permitted, causing the expression to behave as if
                   5203:        the previous item and the quantifier were not present. This may be use-
1.1.1.3 ! misho    5204:        ful for subpatterns that are referenced as subroutines  from  elsewhere
1.1       misho    5205:        in the pattern (but see also the section entitled "Defining subpatterns
1.1.1.3 ! misho    5206:        for use by reference only" below). Items other  than  subpatterns  that
1.1       misho    5207:        have a {0} quantifier are omitted from the compiled pattern.
                   5208: 
1.1.1.3 ! misho    5209:        For  convenience, the three most common quantifiers have single-charac-
1.1       misho    5210:        ter abbreviations:
                   5211: 
                   5212:          *    is equivalent to {0,}
                   5213:          +    is equivalent to {1,}
                   5214:          ?    is equivalent to {0,1}
                   5215: 
1.1.1.3 ! misho    5216:        It is possible to construct infinite loops by  following  a  subpattern
1.1       misho    5217:        that can match no characters with a quantifier that has no upper limit,
                   5218:        for example:
                   5219: 
                   5220:          (a?)*
                   5221: 
                   5222:        Earlier versions of Perl and PCRE used to give an error at compile time
1.1.1.3 ! misho    5223:        for  such  patterns. However, because there are cases where this can be
        !          5224:        useful, such patterns are now accepted, but if any  repetition  of  the
        !          5225:        subpattern  does in fact match no characters, the loop is forcibly bro-
1.1       misho    5226:        ken.
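
       A quick check that such a pattern terminates, sketched with Python's
       re module as a stand-in (an assumption; that engine also breaks
       empty iterations rather than looping forever):

```python
import re

# (a?)* could loop forever on an empty iteration; the engine stops instead.
whole = re.fullmatch(r"(a?)*", "aaa")
empty = re.fullmatch(r"(a?)*", "")
prefix = re.match(r"(a?)*", "b")      # matches the empty string before "b"

print(whole.group(0), repr(empty.group(0)), repr(prefix.group(0)))
```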
                   5227: 
1.1.1.3 ! misho    5228:        By default, the quantifiers are "greedy", that is, they match  as  much
        !          5229:        as  possible  (up  to  the  maximum number of permitted times), without
        !          5230:        causing the rest of the pattern to fail. The classic example  of  where
1.1       misho    5231:        this gives problems is in trying to match comments in C programs. These
1.1.1.3 ! misho    5232:        appear between /* and */ and within the comment,  individual  *  and  /
        !          5233:        characters  may  appear. An attempt to match C comments by applying the
1.1       misho    5234:        pattern
                   5235: 
                   5236:          /\*.*\*/
                   5237: 
                   5238:        to the string
                   5239: 
                   5240:          /* first comment */  not comment  /* second comment */
                   5241: 
1.1.1.3 ! misho    5242:        fails, because it matches the entire string owing to the greediness  of
1.1       misho    5243:        the .*  item.
                   5244: 
1.1.1.3 ! misho    5245:        However,  if  a quantifier is followed by a question mark, it ceases to
1.1       misho    5246:        be greedy, and instead matches the minimum number of times possible, so
                   5247:        the pattern
                   5248: 
                   5249:          /\*.*?\*/
                   5250: 
1.1.1.3 ! misho    5251:        does  the  right  thing with the C comments. The meaning of the various
        !          5252:        quantifiers is not otherwise changed,  just  the  preferred  number  of
        !          5253:        matches.   Do  not  confuse this use of question mark with its use as a
        !          5254:        quantifier in its own right. Because it has two uses, it can  sometimes
1.1       misho    5255:        appear doubled, as in
                   5256: 
                   5257:          \d??\d
                   5258: 
                   5259:        which matches one digit by preference, but can match two if that is the
                   5260:        only way the rest of the pattern matches.
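
       The difference between the greedy and minimizing forms can be seen
       by running both comment patterns over the subject above. This is a
       minimal sketch that assumes Python's re module in place of PCRE; the
       two engines agree on these particular constructs.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing .*? stops at the first "*/" it can reach.
lazy = re.search(r"/\*.*?\*/", subject).group()
all_lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest demands it.
one = re.match(r"\d??\d", "12").group()
two = re.fullmatch(r"\d??\d", "12").group()

print(greedy == subject)
print(lazy)
print(all_lazy)
print(one, two)
```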
                   5261: 
1.1.1.3 ! misho    5262:        If the PCRE_UNGREEDY option is set (an option that is not available  in
        !          5263:        Perl),  the  quantifiers are not greedy by default, but individual ones
        !          5264:        can be made greedy by following them with a  question  mark.  In  other
1.1       misho    5265:        words, it inverts the default behaviour.
                   5266: 
1.1.1.3 ! misho    5267:        When  a  parenthesized  subpattern  is quantified with a minimum repeat
        !          5268:        count that is greater than 1 or with a limited maximum, more memory  is
        !          5269:        required  for  the  compiled  pattern, in proportion to the size of the
1.1       misho    5270:        minimum or maximum.
                   5271: 
                   5272:        If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
1.1.1.3 ! misho    5273:        alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,
        !          5274:        the pattern is implicitly anchored, because whatever  follows  will  be
        !          5275:        tried  against every character position in the subject string, so there
        !          5276:        is no point in retrying the overall match at  any  position  after  the
        !          5277:        first.  PCRE  normally treats such a pattern as though it were preceded
1.1       misho    5278:        by \A.
                   5279: 
1.1.1.3 ! misho    5280:        In cases where it is known that the subject  string  contains  no  new-
        !          5281:        lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti-
1.1       misho    5282:        mization, or alternatively using ^ to indicate anchoring explicitly.
                   5283: 
1.1.1.3 ! misho    5284:        However, there is one situation where the optimization cannot be  used.
1.1       misho    5285:        When .*  is inside capturing parentheses that are the subject of a back
                   5286:        reference elsewhere in the pattern, a match at the start may fail where
                   5287:        a later one succeeds. Consider, for example:
                   5288: 
                   5289:          (.*)abc\1
                   5290: 
1.1.1.3 ! misho    5291:        If  the subject is "xyz123abc123" the match point is the fourth charac-
1.1       misho    5292:        ter. For this reason, such a pattern is not implicitly anchored.
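
       The example can be reproduced as below. The sketch assumes Python's
       re module rather than PCRE; offsets are counted from zero, so the
       fourth character is offset 3.

```python
import re

subject = "xyz123abc123"

# No match is possible at offsets 0, 1 or 2, because \1 would have to
# repeat "xyz123", "yz123" or "z123" after "abc".
m = re.search(r"(.*)abc\1", subject, re.DOTALL)

print(m.start(), m.group(1), m.group(0))
```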
                   5293: 
                   5294:        When a capturing subpattern is repeated, the value captured is the sub-
                   5295:        string that matched the final iteration. For example, after
                   5296: 
                   5297:          (tweedle[dume]{3}\s*)+
                   5298: 
                   5299:        has matched "tweedledum tweedledee" the value of the captured substring
1.1.1.3 ! misho    5300:        is "tweedledee". However, if there are  nested  capturing  subpatterns,
        !          5301:        the  corresponding captured values may have been set in previous itera-
1.1       misho    5302:        tions. For example, after
                   5303: 
                   5304:          /(a|(b))+/
                   5305: 
                   5306:        matches "aba" the value of the second captured substring is "b".
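
       Both capture examples can be checked with a few lines. This is a
       sketch under the assumption that Python's re module behaves like
       PCRE for repeated groups, which it does in these two cases.

```python
import re

# The repeated group keeps only the text of its final iteration.
m1 = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# The nested group (b) was set in the second iteration and is not
# cleared when the third iteration takes the "a" branch.
m2 = re.match(r"(a|(b))+", "aba")
outer, inner = m2.group(1), m2.group(2)

print(last_iteration)
print(outer, inner)
```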
                   5307: 
                   5308: 
                   5309: ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
                   5310: 
1.1.1.3 ! misho    5311:        With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
        !          5312:        repetition,  failure  of what follows normally causes the repeated item
        !          5313:        to be re-evaluated to see if a different number of repeats  allows  the
        !          5314:        rest  of  the pattern to match. Sometimes it is useful to prevent this,
        !          5315:        either to change the nature of the match, or to cause it to fail earlier
        !          5316:        than  it otherwise might, when the author of the pattern knows there is
1.1       misho    5317:        no point in carrying on.
                   5318: 
1.1.1.3 ! misho    5319:        Consider, for example, the pattern \d+foo when applied to  the  subject
1.1       misho    5320:        line
                   5321: 
                   5322:          123456bar
                   5323: 
                   5324:        After matching all 6 digits and then failing to match "foo", the normal
1.1.1.3 ! misho    5325:        action of the matcher is to try again with only 5 digits  matching  the
        !          5326:        \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
        !          5327:        "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
        !          5328:        the  means for specifying that once a subpattern has matched, it is not
1.1       misho    5329:        to be re-evaluated in this way.
                   5330: 
1.1.1.3 ! misho    5331:        If we use atomic grouping for the previous example, the  matcher  gives
        !          5332:        up  immediately  on failing to match "foo" the first time. The notation
1.1       misho    5333:        is a kind of special parenthesis, starting with (?> as in this example:
                   5334: 
                   5335:          (?>\d+)foo
                   5336: 
1.1.1.3 ! misho    5337:        This kind of parenthesis "locks up" the  part of the  pattern  it  con-
        !          5338:        tains  once  it  has matched, and a failure further into the pattern is
        !          5339:        prevented from backtracking into it. Backtracking past it  to  previous
1.1       misho    5340:        items, however, works as normal.
                   5341: 
1.1.1.3 ! misho    5342:        An  alternative  description  is that a subpattern of this type matches
        !          5343:        the string of characters that an  identical  standalone  pattern  would
1.1       misho    5344:        match, if anchored at the current point in the subject string.
                   5345: 
                   5346:        Atomic grouping subpatterns are not capturing subpatterns. Simple cases
                   5347:        such as the above example can be thought of as a maximizing repeat that
1.1.1.3 ! misho    5348:        must  swallow  everything  it can. So, while both \d+ and \d+? are pre-
        !          5349:        pared to adjust the number of digits they match in order  to  make  the
1.1       misho    5350:        rest of the pattern match, (?>\d+) can only match an entire sequence of
                   5351:        digits.
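
       The "swallow everything" behaviour can be sketched without the (?>
       syntax. The example assumes Python's re module, where (?> is only
       available in recent releases, so it swaps in a different technique:
       a lookahead that captures the digits, followed by a back reference.
       A lookahead is not re-entered once it has succeeded, so the pair
       behaves like an atomic group for this purpose.

```python
import re

# Ordinary greedy repeat: \d+ gives back one digit so that \d can match.
plain = re.search(r"\d+\d", "1234")

# Emulated atomic group: (?=(\d+))\1 takes all the digits and cannot
# give any back, so the trailing \d has nothing left to match.
atomic_like = re.search(r"(?=(\d+))\1\d", "1234")

# When the rest of the pattern does follow, the emulation matches normally.
ok = re.search(r"(?=(\d+))\1foo", "123456foo")

print(plain.group(), atomic_like, ok.group())
```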
                   5352: 
1.1.1.3 ! misho    5353:        Atomic groups in general can of course contain arbitrarily  complicated
        !          5354:        subpatterns,  and  can  be  nested. However, when the subpattern for an
1.1       misho    5355:        atomic group is just a single repeated item, as in the example above, a
1.1.1.3 ! misho    5356:        simpler notation, called a "possessive quantifier", can be used. This
        !          5357:        consists of an additional + character  following  a  quantifier.  Using
1.1       misho    5358:        this notation, the previous example can be rewritten as
                   5359: 
                   5360:          \d++foo
                   5361: 
                   5362:        Note that a possessive quantifier can be used with an entire group, for
                   5363:        example:
                   5364: 
                   5365:          (abc|xyz){2,3}+
                   5366: 
1.1.1.3 ! misho    5367:        Possessive  quantifiers  are  always  greedy;  the   setting   of   the
1.1       misho    5368:        PCRE_UNGREEDY option is ignored. They are a convenient notation for the
1.1.1.3 ! misho    5369:        simpler forms of atomic group. However, there is no difference  in  the
        !          5370:        meaning  of  a  possessive  quantifier and the equivalent atomic group,
        !          5371:        though there may be a performance  difference;  possessive  quantifiers
1.1       misho    5372:        should be slightly faster.
                   5373: 
1.1.1.3 ! misho    5374:        The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
        !          5375:        tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
1.1       misho    5376:        edition of his book. Mike McCloskey liked it, so implemented it when he
1.1.1.3 ! misho    5377:        built Sun's Java package, and PCRE copied it from there. It  ultimately
1.1       misho    5378:        found its way into Perl at release 5.10.
                   5379: 
                   5380:        PCRE has an optimization that automatically "possessifies" certain sim-
1.1.1.3 ! misho    5381:        ple pattern constructs. For example, the sequence  A+B  is  treated  as
        !          5382:        A++B  because  there is no point in backtracking into a sequence of A's
1.1       misho    5383:        when B must follow.
                   5384: 
1.1.1.3 ! misho    5385:        When a pattern contains an unlimited repeat inside  a  subpattern  that
        !          5386:        can  itself  be  repeated  an  unlimited number of times, the use of an
        !          5387:        atomic group is the only way to avoid some  failing  matches  taking  a
1.1       misho    5388:        very long time indeed. The pattern
                   5389: 
                   5390:          (\D+|<\d+>)*[!?]
                   5391: 
1.1.1.3 ! misho    5392:        matches  an  unlimited number of substrings that either consist of non-
        !          5393:        digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
1.1       misho    5394:        matches, it runs quickly. However, if it is applied to
                   5395: 
                   5396:          aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
                   5397: 
1.1.1.3 ! misho    5398:        it  takes  a  long  time  before reporting failure. This is because the
        !          5399:        string can be divided between the internal \D+ repeat and the  external
        !          5400:        *  repeat  in  a  large  number of ways, and all have to be tried. (The
        !          5401:        example uses [!?] rather than a single character at  the  end,  because
        !          5402:        both  PCRE  and  Perl have an optimization that allows for fast failure
        !          5403:        when a single character is used. They remember the last single  charac-
        !          5404:        ter  that  is required for a match, and fail early if it is not present
        !          5405:        in the string.) If the pattern is changed so that  it  uses  an  atomic
1.1       misho    5406:        group, like this:
                   5407: 
                   5408:          ((?>\D+)|<\d+>)*[!?]
                   5409: 
                   5410:        sequences of non-digits cannot be broken, and failure happens quickly.
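
       The cost difference can be observed with a rough timing sketch. It
       assumes Python's re module instead of PCRE, uses a shorter subject
       (18 characters, chosen only to keep the slow case bearable), and
       emulates the atomic group with a capturing lookahead plus a back
       reference, so the group numbers differ from the pattern above.

```python
import re
import time

subject = "a" * 18   # shorter than the subject above; an arbitrary choice

slow = re.compile(r"(\D+|<\d+>)*[!?]")
# (?=(\D+))\2 stands in for (?>\D+); group 2 is the lookahead capture.
fast = re.compile(r"((?=(\D+))\2|<\d+>)*[!?]")

def timed_search(pattern, text):
    """Return (match or None, elapsed seconds) for one search."""
    start = time.perf_counter()
    result = pattern.search(text)
    return result, time.perf_counter() - start

slow_result, slow_time = timed_search(slow, subject)
fast_result, fast_time = timed_search(fast, subject)

# Both searches fail; only the time taken to find that out differs.
print(slow_result, fast_result)
print(f"plain: {slow_time:.4f}s  atomic-like: {fast_time:.4f}s")
```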
                   5411: 
                   5412: 
                   5413: BACK REFERENCES
                   5414: 
                   5415:        Outside a character class, a backslash followed by a digit greater than
                   5416:        0 (and possibly further digits) is a back reference to a capturing sub-
1.1.1.3 ! misho    5417:        pattern  earlier  (that is, to its left) in the pattern, provided there
1.1       misho    5418:        have been that many previous capturing left parentheses.
                   5419: 
                   5420:        However, if the decimal number following the backslash is less than 10,
1.1.1.3 ! misho    5421:        it  is  always  taken  as a back reference, and causes an error only if
        !          5422:        there are not that many capturing left parentheses in the  entire  pat-
        !          5423:        tern.  In  other words, the parentheses that are referenced need not be
        !          5424:        to the left of the reference for numbers less than 10. A "forward  back
        !          5425:        reference"  of  this  type can make sense when a repetition is involved
        !          5426:        and the subpattern to the right has participated in an  earlier  itera-
1.1       misho    5427:        tion.
                   5428: 
1.1.1.3 ! misho    5429:        It  is  not  possible to have a numerical "forward back reference" to a
        !          5430:        subpattern whose number is 10 or  more  using  this  syntax  because  a
        !          5431:        sequence  such  as  \50 is interpreted as a character defined in octal.
1.1       misho    5432:        See the subsection entitled "Non-printing characters" above for further
1.1.1.3 ! misho    5433:        details  of  the  handling of digits following a backslash. There is no
        !          5434:        such problem when named parentheses are used. A back reference  to  any
1.1       misho    5435:        subpattern is possible using named parentheses (see below).
                   5436: 
1.1.1.3 ! misho    5437:        Another  way  of  avoiding  the ambiguity inherent in the use of digits
        !          5438:        following a backslash is to use the \g  escape  sequence.  This  escape
1.1       misho    5439:        must be followed by an unsigned number or a negative number, optionally
                   5440:        enclosed in braces. These examples are all identical:
                   5441: 
                   5442:          (ring), \1
                   5443:          (ring), \g1
                   5444:          (ring), \g{1}
                   5445: 
1.1.1.3 ! misho    5446:        An unsigned number specifies an absolute reference without the  ambigu-
1.1       misho    5447:        ity that is present in the older syntax. It is also useful when literal
                   5448:        digits follow the reference. A negative number is a relative reference.
                   5449:        Consider this example:
                   5450: 
                   5451:          (abc(def)ghi)\g{-1}
                   5452: 
                   5453:        The sequence \g{-1} is a reference to the most recently started
                   5454:        capturing subpattern before \g, that is, it is equivalent to \2 in this
1.1.1.3 ! misho    5455:        example. Similarly, \g{-2} would be equivalent to \1. The use of relative
        !          5456:        references can be helpful in long patterns, and also in  patterns  that
        !          5457:        are  created  by  joining  together  fragments  that contain references
1.1       misho    5458:        within themselves.
                   5459: 
1.1.1.3 ! misho    5460:        A back reference matches whatever actually matched the  capturing  sub-
        !          5461:        pattern  in  the  current subject string, rather than anything matching
1.1       misho    5462:        the subpattern itself (see "Subpatterns as subroutines" below for a way
                   5463:        of doing that). So the pattern
                   5464: 
                   5465:          (sens|respons)e and \1ibility
                   5466: 
1.1.1.3 ! misho    5467:        matches  "sense and sensibility" and "response and responsibility", but
        !          5468:        not "sense and responsibility". If caseful matching is in force at  the
        !          5469:        time  of the back reference, the case of letters is relevant. For exam-
1.1       misho    5470:        ple,
                   5471: 
                   5472:          ((?i)rah)\s+\1
                   5473: 
1.1.1.3 ! misho    5474:        matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
1.1       misho    5475:        original capturing subpattern is matched caselessly.
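
       Both back reference examples can be exercised as follows. The sketch
       assumes Python's re module in place of PCRE. It writes the caseless
       part as (?i:rah), because Python applies an unscoped (?i) to the
       whole pattern and newer versions reject it anywhere but the start.

```python
import re

# \1 must repeat the text that group 1 actually matched.
pat = re.compile(r"(sens|respons)e and \1ibility")
same_1 = bool(pat.fullmatch("sense and sensibility"))
same_2 = bool(pat.fullmatch("response and responsibility"))
mixed = bool(pat.fullmatch("sense and responsibility"))

# The group matches caselessly, but the reference itself is caseful.
rah = re.compile(r"((?i:rah))\s+\1")
lower = bool(rah.fullmatch("rah rah"))
upper = bool(rah.fullmatch("RAH RAH"))
differing = bool(rah.fullmatch("RAH rah"))

print(same_1, same_2, mixed)
print(lower, upper, differing)
```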
                   5476: 
1.1.1.3 ! misho    5477:        There  are  several  different ways of writing back references to named
        !          5478:        subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
        !          5479:        \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
1.1       misho    5480:        unified back reference syntax, in which \g can be used for both numeric
1.1.1.3 ! misho    5481:        and  named  references,  is  also supported. We could rewrite the above
1.1       misho    5482:        example in any of the following ways:
                   5483: 
                   5484:          (?<p1>(?i)rah)\s+\k<p1>
                   5485:          (?'p1'(?i)rah)\s+\k{p1}
                   5486:          (?P<p1>(?i)rah)\s+(?P=p1)
                   5487:          (?<p1>(?i)rah)\s+\g{p1}
                   5488: 
1.1.1.3 ! misho    5489:        A subpattern that is referenced by  name  may  appear  in  the  pattern
1.1       misho    5490:        before or after the reference.
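
       The Python-style form of the named reference can be run directly in
       Python's own re module, which is assumed here as a stand-in for
       PCRE. That module does not accept the other three spellings in a
       pattern, and it only allows a reference to a group that has already
       been defined, so the sketch covers the (?P=name) form alone.

```python
import re

# (?P<p1>...) names the group; (?P=p1) refers back to what it matched.
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

m = named.fullmatch("Rah Rah")
captured = m.group("p1")
differing = named.fullmatch("RAH rah")   # the reference is caseful

print(captured, differing)
```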
                   5491: 
1.1.1.3 ! misho    5492:        There  may be more than one back reference to the same subpattern. If a
        !          5493:        subpattern has not actually been used in a particular match,  any  back
1.1       misho    5494:        references to it always fail by default. For example, the pattern
                   5495: 
                   5496:          (a|(bc))\2
                   5497: 
1.1.1.3 ! misho    5498:        always  fails  if  it starts to match "a" rather than "bc". However, if
1.1       misho    5499:        the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
                   5500:        ence to an unset value matches an empty string.
                   5501: 
1.1.1.3 ! misho    5502:        Because  there may be many capturing parentheses in a pattern, all dig-
        !          5503:        its following a backslash are taken as part of a potential back  refer-
        !          5504:        ence  number.   If  the  pattern continues with a digit character, some
        !          5505:        delimiter must  be  used  to  terminate  the  back  reference.  If  the
        !          5506:        PCRE_EXTENDED  option  is  set, this can be white space. Otherwise, the
        !          5507:        \g{ syntax or an empty comment (see "Comments" below) can be used.
1.1       misho    5508: 
                   5509:    Recursive back references
                   5510: 
1.1.1.3 ! misho    5511:        A back reference that occurs inside the parentheses to which it  refers
        !          5512:        fails  when  the subpattern is first used, so, for example, (a\1) never
        !          5513:        matches.  However, such references can be useful inside  repeated  sub-
1.1       misho    5514:        patterns. For example, the pattern
                   5515: 
                   5516:          (a|b\1)+
                   5517: 
                   5518:        matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
1.1.1.3 ! misho    5519:        ation of the subpattern,  the  back  reference  matches  the  character
        !          5520:        string  corresponding  to  the previous iteration. In order for this to
        !          5521:        work, the pattern must be such that the first iteration does  not  need
        !          5522:        to  match the back reference. This can be done using alternation, as in
1.1       misho    5523:        the example above, or by a quantifier with a minimum of zero.
                   5524: 
1.1.1.3 ! misho    5525:        Back references of this type cause the group that they reference to  be
        !          5526:        treated  as  an atomic group.  Once the whole group has been matched, a
        !          5527:        subsequent matching failure cannot cause backtracking into  the  middle
1.1       misho    5528:        of the group.
                   5529: 
                   5530: 
                   5531: ASSERTIONS
                   5532: 
1.1.1.3 ! misho    5533:        An  assertion  is  a  test on the characters following or preceding the
        !          5534:        current matching point that does not actually consume  any  characters.
        !          5535:        The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
1.1       misho    5536:        described above.
                   5537: 
1.1.1.3 ! misho    5538:        More complicated assertions are coded as  subpatterns.  There  are  two
        !          5539:        kinds:  those  that  look  ahead of the current position in the subject
        !          5540:        string, and those that look  behind  it.  An  assertion  subpattern  is
        !          5541:        matched  in  the  normal way, except that it does not cause the current
1.1       misho    5542:        matching position to be changed.
                   5543: 
1.1.1.3 ! misho    5544:        Assertion subpatterns are not capturing subpatterns. If such an  asser-
        !          5545:        tion  contains  capturing  subpatterns within it, these are counted for
        !          5546:        the purposes of numbering the capturing subpatterns in the  whole  pat-
        !          5547:        tern.  However,  substring  capturing  is carried out only for positive
1.1       misho    5548:        assertions, because it does not make sense for negative assertions.
                   5549: 
1.1.1.3 ! misho    5550:        For compatibility with Perl, assertion  subpatterns  may  be  repeated;
        !          5551:        though  it  makes  no sense to assert the same thing several times, the
        !          5552:        side effect of capturing parentheses may  occasionally  be  useful.  In
1.1       misho    5553:        practice, there are only three cases:
                   5554: 
1.1.1.3 ! misho    5555:        (1)  If  the  quantifier  is  {0}, the assertion is never obeyed during
        !          5556:        matching.  However, it may  contain  internal  capturing  parenthesized
1.1       misho    5557:        groups that are called from elsewhere via the subroutine mechanism.
                   5558: 
1.1.1.3 ! misho    5559:        (2) If the quantifier is {0,n} where n is greater than zero, it is treated
        !          5560:        as if it were {0,1}. At run time, the rest  of  the  pattern  match  is
1.1       misho    5561:        tried with and without the assertion, the order depending on the greed-
                   5562:        iness of the quantifier.
                   5563: 
1.1.1.3 ! misho    5564:        (3) If the minimum repetition is greater than zero, the  quantifier  is
        !          5565:        ignored.   The  assertion  is  obeyed just once when encountered during
1.1       misho    5566:        matching.
                   5567: 
                   5568:    Lookahead assertions
                   5569: 
                   5570:        Lookahead assertions start with (?= for positive assertions and (?! for
                   5571:        negative assertions. For example,
                   5572: 
                   5573:          \w+(?=;)
                   5574: 
1.1.1.3 ! misho    5575:        matches  a word followed by a semicolon, but does not include the semi-
1.1       misho    5576:        colon in the match, and
                   5577: 
                   5578:          foo(?!bar)
                   5579: 
1.1.1.3 ! misho    5580:        matches any occurrence of "foo" that is not  followed  by  "bar".  Note
1.1       misho    5581:        that the apparently similar pattern
                   5582: 
                   5583:          (?!foo)bar
                   5584: 
1.1.1.3 ! misho    5585:        does  not  find  an  occurrence  of "bar" that is preceded by something
        !          5586:        other than "foo"; it finds any occurrence of "bar" whatsoever,  because
1.1       misho    5587:        the assertion (?!foo) is always true when the next three characters are
                   5588:        "bar". A lookbehind assertion is needed to achieve the other effect.
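
The three lookahead examples above can be tried out with a short script.
This is a minimal sketch that uses Python's standard re module as a
stand-in for PCRE (an assumption made for illustration; these particular
lookahead forms are written the same way in both). The subject strings and
variable names are invented for the example.

```python
import re

# \w+(?=;) matches a word followed by a semicolon; the semicolon itself
# is not part of the match.
word_text = re.search(r"\w+(?=;)", "alpha; beta").group()

# foo(?!bar) matches only a "foo" that is not followed by "bar".
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds "bar" after "foo": the assertion looks at the text
# that starts at "bar", and that text can never be "foo".
bar_start = re.search(r"(?!foo)bar", "foobar").start()

print(word_text, foo_starts, bar_start)
```

Only the second "foo" (at offset 7) is reported by the middle pattern, and
the last pattern matches "bar" at offset 3 even though "foo" precedes it.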
                   5589: 
                   5590:        If you want to force a matching failure at some point in a pattern, the
1.1.1.3 ! misho    5591:        most  convenient  way  to  do  it  is with (?!) because an empty string
        !          5592:        always matches, so an assertion that requires there not to be an  empty
1.1       misho    5593:        string must always fail.  The backtracking control verb (*FAIL) or (*F)
                   5594:        is a synonym for (?!).
                   5595: 
                   5596:    Lookbehind assertions
                   5597: 
1.1.1.3 ! misho    5598:        Lookbehind assertions start with (?<= for positive assertions and  (?<!
1.1       misho    5599:        for negative assertions. For example,
                   5600: 
                   5601:          (?<!foo)bar
                   5602: 
1.1.1.3 ! misho    5603:        does  find  an  occurrence  of "bar" that is not preceded by "foo". The
        !          5604:        contents of a lookbehind assertion are restricted  such  that  all  the
1.1       misho    5605:        strings it matches must have a fixed length. However, if there are sev-
1.1.1.3 ! misho    5606:        eral top-level alternatives, they do not all  have  to  have  the  same
1.1       misho    5607:        fixed length. Thus
                   5608: 
                   5609:          (?<=bullock|donkey)
                   5610: 
                   5611:        is permitted, but
                   5612: 
                   5613:          (?<!dogs?|cats?)
                   5614: 
1.1.1.3 ! misho    5615:        causes  an  error at compile time. Branches that match different length
        !          5616:        strings are permitted only at the top level of a lookbehind  assertion.
1.1       misho    5617:        This is an extension compared with Perl, which requires all branches to
                   5618:        match the same length of string. An assertion such as
                   5619: 
                   5620:          (?<=ab(c|de))
                   5621: 
1.1.1.3 ! misho    5622:        is not permitted, because its single top-level  branch  can  match  two
1.1       misho    5623:        different lengths, but it is acceptable to PCRE if rewritten to use two
                   5624:        top-level branches:
                   5625: 
                   5626:          (?<=abc|abde)
                   5627: 
1.1.1.3 ! misho    5628:        In some cases, the escape sequence \K (see above) can be  used  instead
1.1       misho    5629:        of a lookbehind assertion to get round the fixed-length restriction.
                   5630: 
1.1.1.3 ! misho    5631:        The  implementation  of lookbehind assertions is, for each alternative,
        !          5632:        to temporarily move the current position back by the fixed  length  and
1.1       misho    5633:        then try to match. If there are insufficient characters before the cur-
                   5634:        rent position, the assertion fails.
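
The behaviour described above can be sketched with Python's standard re
module, used here only as an approximation of PCRE. Note one difference:
Python is stricter than PCRE and requires every alternative in a lookbehind
to have the same length, so it also rejects (?<=bullock|donkey), which PCRE
accepts. The subject strings and variable names are invented for the
example.

```python
import re

# (?<!foo)bar finds "bar" only when it is not preceded by "foo".
rejected = re.search(r"(?<!foo)bar", "foobar")
accepted = re.search(r"(?<!foo)bar", "bazbar")

# At the start of the subject there are not enough characters to step back
# over, so the inner test for "foo" fails and the negative assertion holds.
at_start = re.search(r"(?<!foo)bar", "bar")

# A branch that can match strings of different lengths is refused when the
# pattern is compiled.
try:
    re.compile(r"(?<!dogs?|cats?)x")
    variable_length_ok = True
except re.error:
    variable_length_ok = False

print(rejected, accepted.start(), at_start.start(), variable_length_ok)
```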
                   5635: 
1.1.1.3 ! misho    5636:        In a UTF mode, PCRE does not allow the \C escape (which matches a  sin-
        !          5637:        gle  data  unit even in a UTF mode) to appear in lookbehind assertions,
        !          5638:        because it makes it impossible to calculate the length of  the  lookbe-
        !          5639:        hind.  The \X and \R escapes, which can match different numbers of data
1.1.1.2   misho    5640:        units, are also not permitted.
1.1       misho    5641: 
1.1.1.3 ! misho    5642:        "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
        !          5643:        lookbehinds,  as  long as the subpattern matches a fixed-length string.
1.1       misho    5644:        Recursion, however, is not supported.
                   5645: 
1.1.1.3 ! misho    5646:        Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
1.1       misho    5647:        assertions to specify efficient matching of fixed-length strings at the
                   5648:        end of subject strings. Consider a simple pattern such as
                   5649: 
                   5650:          abcd$
                   5651: 
1.1.1.3 ! misho    5652:        when applied to a long string that does  not  match.  Because  matching
1.1       misho    5653:        proceeds from left to right, PCRE will look for each "a" in the subject
1.1.1.3 ! misho    5654:        and then see if what follows matches the rest of the  pattern.  If  the
1.1       misho    5655:        pattern is specified as
                   5656: 
                   5657:          ^.*abcd$
                   5658: 
1.1.1.3 ! misho    5659:        the  initial .* matches the entire string at first, but when this fails
1.1       misho    5660:        (because there is no following "a"), it backtracks to match all but the
1.1.1.3 ! misho    5661:        last  character,  then all but the last two characters, and so on. Once
        !          5662:        again the search for "a" covers the entire string, from right to  left,
1.1       misho    5663:        so we are no better off. However, if the pattern is written as
                   5664: 
                   5665:          ^.*+(?<=abcd)
                   5666: 
1.1.1.3 ! misho    5667:        there  can  be  no backtracking for the .*+ item; it can match only the
        !          5668:        entire string. The subsequent lookbehind assertion does a  single  test
        !          5669:        on  the last four characters. If it fails, the match fails immediately.
        !          5670:        For long strings, this approach makes a significant difference  to  the
1.1       misho    5671:        processing time.
                   5672: 
                   5673:    Using multiple assertions
                   5674: 
                   5675:        Several assertions (of any sort) may occur in succession. For example,
                   5676: 
                   5677:          (?<=\d{3})(?<!999)foo
                   5678: 
1.1.1.3 ! misho    5679:        matches  "foo" preceded by three digits that are not "999". Notice that
        !          5680:        each of the assertions is applied independently at the  same  point  in
        !          5681:        the  subject  string.  First  there  is a check that the previous three
        !          5682:        characters are all digits, and then there is  a  check  that  the  same
1.1       misho    5683:        three characters are not "999".  This pattern does not match "foo" pre-
1.1.1.3 ! misho    5684:        ceded by six characters, the first three of which are digits and the last
        !          5685:        three  of  which  are not "999". For example, it doesn't match "123abc-
1.1       misho    5686:        foo". A pattern to do that is
                   5687: 
                   5688:          (?<=\d{3}...)(?<!999)foo
                   5689: 
1.1.1.3 ! misho    5690:        This time the first assertion looks at the  preceding  six  characters,
1.1       misho    5691:        checking that the first three are digits, and then the second assertion
                   5692:        checks that the preceding three characters are not "999".
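
The difference between the two patterns is easiest to see by running both
against the same subjects. The following is a minimal sketch using Python's
standard re module in place of PCRE (an assumption for illustration; both
lookbehinds here are fixed-length, so the syntax carries over unchanged).
The subject strings and names are invented.

```python
import re

# Both assertions examine the same three characters before "foo".
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion examines six characters; the second examines the
# last three of those six.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]
results = {
    s: (same_three.search(s) is not None, six_back.search(s) is not None)
    for s in subjects
}

for subject, (first, second) in results.items():
    print(subject, first, second)
```

Only "123foo" satisfies the first pattern and only "123abcfoo" satisfies
the second; "999foo" and "123999foo" are rejected by both because the three
characters immediately before "foo" are "999".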
                   5693: 
                   5694:        Assertions can be nested in any combination. For example,
                   5695: 
                   5696:          (?<=(?<!foo)bar)baz
                   5697: 
1.1.1.3 ! misho    5698:        matches an occurrence of "baz" that is preceded by "bar" which in  turn
1.1       misho    5699:        is not preceded by "foo", while
                   5700: 
                   5701:          (?<=\d{3}(?!999)...)foo
                   5702: 
1.1.1.3 ! misho    5703:        is  another pattern that matches "foo" preceded by three digits and any
1.1       misho    5704:        three characters that are not "999".
                   5705: 
                   5706: 
                   5707: CONDITIONAL SUBPATTERNS
                   5708: 
1.1.1.3 ! misho    5709:        It is possible to cause the matching process to obey a subpattern  con-
        !          5710:        ditionally  or to choose between two alternative subpatterns, depending
        !          5711:        on the result of an assertion, or whether a specific capturing  subpat-
        !          5712:        tern  has  already  been matched. The two possible forms of conditional
1.1       misho    5713:        subpattern are:
                   5714: 
                   5715:          (?(condition)yes-pattern)
                   5716:          (?(condition)yes-pattern|no-pattern)
                   5717: 
1.1.1.3 ! misho    5718:        If the condition is satisfied, the yes-pattern is used;  otherwise  the
        !          5719:        no-pattern  (if  present)  is used. If there are more than two alterna-
        !          5720:        tives in the subpattern, a compile-time error occurs. Each of  the  two
1.1       misho    5721:        alternatives may itself contain nested subpatterns of any form, includ-
                   5722:        ing  conditional  subpatterns;  the  restriction  to  two  alternatives
                   5723:        applies only at the level of the condition. This pattern fragment is an
                   5724:        example where the alternatives are complex:
                   5725: 
                   5726:          (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
                   5727: 
                   5728: 
1.1.1.3 ! misho    5729:        There are four kinds of condition: references  to  subpatterns,  refer-
1.1       misho    5730:        ences to recursion, a pseudo-condition called DEFINE, and assertions.
                   5731: 
                   5732:    Checking for a used subpattern by number
                   5733: 
1.1.1.3 ! misho    5734:        If  the  text between the parentheses consists of a sequence of digits,
1.1       misho    5735:        the condition is true if a capturing subpattern of that number has pre-
1.1.1.3 ! misho    5736:        viously  matched.  If  there is more than one capturing subpattern with
        !          5737:        the same number (see the earlier  section  about  duplicate  subpattern
        !          5738:        numbers),  the condition is true if any of them have matched. An alter-
        !          5739:        native notation is to precede the digits with a plus or minus sign.  In
        !          5740:        this  case, the subpattern number is relative rather than absolute. The
        !          5741:        most recently opened parentheses can be referenced by (?(-1), the  next
        !          5742:        most  recent  by (?(-2), and so on. Inside loops it can also make sense
1.1       misho    5743:        to refer to subsequent groups. The next parentheses to be opened can be
1.1.1.3 ! misho    5744:        referenced  as (?(+1), and so on. (The value zero in any of these forms
1.1       misho    5745:        is not used; it provokes a compile-time error.)
                   5746: 
1.1.1.3 ! misho    5747:        Consider the following pattern, which  contains  non-significant  white
1.1       misho    5748:        space to make it more readable (assume the PCRE_EXTENDED option) and to
                   5749:        divide it into three parts for ease of discussion:
                   5750: 
                   5751:          ( \( )?    [^()]+    (?(1) \) )
                   5752: 
1.1.1.3 ! misho    5753:        The first part matches an optional opening  parenthesis,  and  if  that
1.1       misho    5754:        character is present, sets it as the first captured substring. The sec-
1.1.1.3 ! misho    5755:        ond part matches one or more characters that are not  parentheses.  The
        !          5756:        third  part  is  a conditional subpattern that tests whether or not the
        !          5757:        first set of parentheses matched. If they did, that is, if the subject
        !          5758:        started  with an opening parenthesis, the condition is true, and so the
        !          5759:        yes-pattern is executed and a closing parenthesis is  required.  Other-
        !          5760:        wise,  since no-pattern is not present, the subpattern matches nothing.
        !          5761:        In other words, this pattern matches  a  sequence  of  non-parentheses,
1.1       misho    5762:        optionally enclosed in parentheses.
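
The three-part pattern above can be exercised as follows. This is a sketch
that uses Python's standard re module, which accepts the same (?(1)...)
syntax; re.VERBOSE plays the role of PCRE_EXTENDED. The helper name accepts
and the test subjects are invented for the example.

```python
import re

# Optional "(", then one or more non-parentheses, then ")" only if group 1
# took part in the match.
optional_parens = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

def accepts(subject):
    """Return True if the whole subject matches the pattern."""
    return optional_parens.fullmatch(subject) is not None

for subject in ["(abc)", "abc", "(abc", "abc)"]:
    print(subject, accepts(subject))
```

The first two subjects are accepted. "(abc" is rejected because the
condition is true and the closing parenthesis is missing; "abc)" is
rejected because the condition is false, so nothing in the pattern can
consume the trailing ")".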
                   5763: 
1.1.1.3 ! misho    5764:        If  you  were  embedding  this pattern in a larger one, you could use a
1.1       misho    5765:        relative reference:
                   5766: 
                   5767:          ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
                   5768: 
1.1.1.3 ! misho    5769:        This makes the fragment independent of the parentheses  in  the  larger
1.1       misho    5770:        pattern.
                   5771: 
                   5772:    Checking for a used subpattern by name
                   5773: 
1.1.1.3 ! misho    5774:        Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
        !          5775:        used subpattern by name. For compatibility  with  earlier  versions  of
        !          5776:        PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
        !          5777:        also recognized. However, there is a possible ambiguity with this  syn-
        !          5778:        tax,  because  subpattern  names  may  consist entirely of digits. PCRE
        !          5779:        looks first for a named subpattern; if it cannot find one and the  name
        !          5780:        consists  entirely  of digits, PCRE looks for a subpattern of that num-
        !          5781:        ber, which must be greater than zero. Using subpattern names that  con-
1.1       misho    5782:        sist entirely of digits is not recommended.
                   5783: 
                   5784:        Rewriting the above example to use a named subpattern gives this:
                   5785: 
                   5786:          (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
                   5787: 
1.1.1.3 ! misho    5788:        If  the  name used in a condition of this kind is a duplicate, the test
        !          5789:        is applied to all subpatterns of the same name, and is true if any  one
1.1       misho    5790:        of them has matched.
                   5791: 
                   5792:    Checking for pattern recursion
                   5793: 
                   5794:        If the condition is the string (R), and there is no subpattern with the
1.1.1.3 ! misho    5795:        name R, the condition is true if a recursive call to the whole  pattern
1.1       misho    5796:        or any subpattern has been made. If digits or a name preceded by amper-
                   5797:        sand follow the letter R, for example:
                   5798: 
                   5799:          (?(R3)...) or (?(R&name)...)
                   5800: 
                   5801:        the condition is true if the most recent recursion is into a subpattern
                   5802:        whose number or name is given. This condition does not check the entire
1.1.1.3 ! misho    5803:        recursion stack. If the name used in a condition  of  this  kind  is  a
1.1       misho    5804:        duplicate, the test is applied to all subpatterns of the same name, and
                   5805:        is true if any one of them is the most recent recursion.
                   5806: 
1.1.1.3 ! misho    5807:        At "top level", all these recursion test  conditions  are  false.   The
1.1       misho    5808:        syntax for recursive patterns is described below.
                   5809: 
                   5810:    Defining subpatterns for use by reference only
                   5811: 
1.1.1.3 ! misho    5812:        If  the  condition  is  the string (DEFINE), and there is no subpattern
        !          5813:        with the name DEFINE, the condition is  always  false.  In  this  case,
        !          5814:        there  may  be  only  one  alternative  in the subpattern. It is always
        !          5815:        skipped if control reaches this point  in  the  pattern;  the  idea  of
        !          5816:        DEFINE  is that it can be used to define subroutines that can be refer-
        !          5817:        enced from elsewhere. (The use of subroutines is described below.)  For
        !          5818:        example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
        !          5819:        could be written like this (ignore white space and line breaks):
1.1       misho    5820: 
                   5821:          (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
                   5822:          \b (?&byte) (\.(?&byte)){3} \b
                   5823: 
1.1.1.3 ! misho    5824:        The first part of the pattern is a DEFINE group inside which another
        !          5825:        group  named "byte" is defined. This matches an individual component of
        !          5826:        an IPv4 address (a number less than 256). When  matching  takes  place,
        !          5827:        this  part  of  the pattern is skipped because DEFINE acts like a false
        !          5828:        condition. The rest of the pattern uses references to the  named  group
        !          5829:        to  match the four dot-separated components of an IPv4 address, insist-
1.1       misho    5830:        ing on a word boundary at each end.
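
For comparison, here is a sketch of the same IPv4 pattern in Python. Python's
standard re module has neither (?(DEFINE)...) nor (?&name), so this example
swaps in a different technique: the "byte" fragment is kept in an ordinary
string and repeated by string concatenation when the pattern is built. The
names BYTE, IPV4 and is_ipv4 are invented for the example.

```python
import re

# One component of an IPv4 address (0 to 255), using the same alternatives
# as the "byte" group in the PCRE pattern.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# The fragment is pasted in twice; PCRE would instead call the named group
# as a subroutine.
IPV4 = re.compile(r"\b" + BYTE + r"(?:\." + BYTE + r"){3}\b")

def is_ipv4(text):
    """Return True if the whole text is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None

for text in ["192.168.23.245", "0.0.0.0", "256.1.1.1", "1.2.3"]:
    print(text, is_ipv4(text))
```

The DEFINE form has the advantage that the whole pattern stays in a single
string that can be passed to the library, with no host-language code needed
to assemble it.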
                   5831: 
                   5832:    Assertion conditions
                   5833: 
1.1.1.3 ! misho    5834:        If the condition is not in any of the above  formats,  it  must  be  an
        !          5835:        assertion.   This may be a positive or negative lookahead or lookbehind
        !          5836:        assertion. Consider  this  pattern,  again  containing  non-significant
1.1       misho    5837:        white space, and with the two alternatives on the second line:
                   5838: 
                   5839:          (?(?=[^a-z]*[a-z])
                   5840:          \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
                   5841: 
1.1.1.3 ! misho    5842:        The  condition  is  a  positive  lookahead  assertion  that  matches an
        !          5843:        optional sequence of non-letters followed by a letter. In other  words,
        !          5844:        it  tests  for the presence of at least one letter in the subject. If a
        !          5845:        letter is found, the subject is matched against the first  alternative;
        !          5846:        otherwise  it  is  matched  against  the  second.  This pattern matches
        !          5847:        strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
1.1       misho    5848:        letters and dd are digits.
                   5849: 
                   5850: 
                   5851: COMMENTS
                   5852: 
                   5853:        There are two ways of including comments in patterns that are processed
                   5854:        by PCRE. In both cases, the start of the comment must not be in a char-
                   5855:        acter class, nor in the middle of any other sequence of related charac-
1.1.1.3 ! misho    5856:        ters such as (?: or a subpattern name or number.  The  characters  that
1.1       misho    5857:        make up a comment play no part in the pattern matching.
                   5858: 
1.1.1.3 ! misho    5859:        The  sequence (?# marks the start of a comment that continues up to the
        !          5860:        next closing parenthesis. Nested parentheses are not permitted. If  the
1.1       misho    5861:        PCRE_EXTENDED option is set, an unescaped # character also introduces a
1.1.1.3 ! misho    5862:        comment, which in this case continues to  immediately  after  the  next
        !          5863:        newline  character  or character sequence in the pattern. Which charac-
1.1       misho    5864:        ters are interpreted as newlines is controlled by the options passed to
1.1.1.3 ! misho    5865:        a  compiling function or by a special sequence at the start of the pat-
1.1.1.2   misho    5866:        tern, as described in the section entitled "Newline conventions" above.
                   5867:        Note that the end of this type of comment is a literal newline sequence
1.1.1.3 ! misho    5868:        in the pattern; escape sequences that happen to represent a newline  do
        !          5869:        not  count.  For  example,  consider this pattern when PCRE_EXTENDED is
1.1.1.2   misho    5870:        set, and the default newline convention is in force:
1.1       misho    5871: 
                   5872:          abc #comment \n still comment
                   5873: 
1.1.1.3 ! misho    5874:        On encountering the # character, pcre_compile()  skips  along,  looking
        !          5875:        for  a newline in the pattern. The sequence \n is still literal at this
        !          5876:        stage, so it does not terminate the comment. Only an  actual  character
1.1       misho    5877:        with the code value 0x0a (the default newline) does so.
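
The distinction between an escape sequence and an actual newline can be
demonstrated with Python's standard re module, whose re.VERBOSE flag treats
# comments in the same way for this case. This is a sketch under that
assumption, not a statement about pcre_compile() itself; the variable names
are invented.

```python
import re

# Raw string: the pattern contains a backslash followed by "n", not a
# newline, so the comment runs to the end and only "abc" is compiled.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real newline (0x0a) inside the pattern, so the
# comment stops there and "def" becomes part of the pattern.
literal = re.compile("abc #comment \n def", re.VERBOSE)

# The (?#...) form needs no option and ends at the next ")".
inline = re.compile(r"abc(?#comment)def")

print(bool(escaped.fullmatch("abc")),
      bool(literal.fullmatch("abcdef")),
      bool(inline.fullmatch("abcdef")))
```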
                   5878: 
                   5879: 
                   5880: RECURSIVE PATTERNS
                   5881: 
1.1.1.3 ! misho    5882:        Consider  the problem of matching a string in parentheses, allowing for
        !          5883:        unlimited nested parentheses. Without the use of  recursion,  the  best
        !          5884:        that  can  be  done  is  to use a pattern that matches up to some fixed
        !          5885:        depth of nesting. It is not possible to  handle  an  arbitrary  nesting
1.1       misho    5886:        depth.
                   5887: 
                   5888:        For some time, Perl has provided a facility that allows regular expres-
1.1.1.3 ! misho    5889:        sions to recurse (amongst other things). It does this by  interpolating
        !          5890:        Perl  code in the expression at run time, and the code can refer to the
1.1       misho    5891:        expression itself. A Perl pattern using code interpolation to solve the
                   5892:        parentheses problem can be created like this:
                   5893: 
                   5894:          $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
                   5895: 
                   5896:        The (?p{...}) item interpolates Perl code at run time, and in this case
                   5897:        refers recursively to the pattern in which it appears.
                   5898: 
                   5899:        Obviously, PCRE cannot support the interpolation of Perl code. Instead,
1.1.1.3 ! misho    5900:        it  supports  special  syntax  for recursion of the entire pattern, and
        !          5901:        also for individual subpattern recursion.  After  its  introduction  in
        !          5902:        PCRE  and  Python,  this  kind of recursion was subsequently introduced
1.1       misho    5903:        into Perl at release 5.10.
                   5904: 
1.1.1.3 ! misho    5905:        A special item that consists of (? followed by a  number  greater  than
        !          5906:        zero  and  a  closing parenthesis is a recursive subroutine call of the
        !          5907:        subpattern of the given number, provided that  it  occurs  inside  that
        !          5908:        subpattern.  (If  not,  it is a non-recursive subroutine call, which is
        !          5909:        described in the next section.) The special item  (?R)  or  (?0)  is  a
1.1       misho    5910:        recursive call of the entire regular expression.
                   5911: 
1.1.1.3 ! misho    5912:        This  PCRE  pattern  solves  the nested parentheses problem (assume the
1.1       misho    5913:        PCRE_EXTENDED option is set so that white space is ignored):
                   5914: 
                   5915:          \( ( [^()]++ | (?R) )* \)
                   5916: 
1.1.1.3 ! misho    5917:        First it matches an opening parenthesis. Then it matches any number  of
        !          5918:        substrings  which  can  either  be  a sequence of non-parentheses, or a
        !          5919:        recursive match of the pattern itself (that is, a  correctly  parenthe-
1.1       misho    5920:        sized substring).  Finally there is a closing parenthesis. Note the use
                   5921:        of a possessive quantifier to avoid backtracking into sequences of non-
                   5922:        parentheses.
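
To make the recursive structure concrete, here is a hand-written Python
function that accepts the same strings as the pattern above when it is
anchored at a given position. It is only an illustration of the idea: it is
not how PCRE implements recursion, and Python's standard re module has no
(?R) item. The function name match_balanced is invented for the example.

```python
def match_balanced(subject, pos=0):
    """Return the offset just past a balanced group starting at pos,
    or -1 if no such group starts there."""
    # \(  : the group must begin with an opening parenthesis.
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    i = pos + 1
    while i < len(subject):
        ch = subject[i]
        if ch == "(":
            # (?R) : a nested group is matched by calling ourselves.
            end = match_balanced(subject, i)
            if end == -1:
                return -1
            i = end
        elif ch == ")":
            # \)  : the closing parenthesis ends this level.
            return i + 1
        else:
            # [^()]++ : skip non-parentheses without ever revisiting them.
            i += 1
    return -1

print(match_balanced("(ab(cd)ef)"))   # offset just past the final ")"
print(match_balanced("(a(b)"))        # unbalanced
```

Because the loop never goes back over characters it has already consumed,
it behaves like the possessive [^()]++ in the pattern, and an unbalanced
subject is rejected in a single pass.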
                   5923: 
1.1.1.3 ! misho    5924:        If  this  were  part of a larger pattern, you would not want to recurse
1.1       misho    5925:        the entire pattern, so instead you could use this:
                   5926: 
                   5927:          ( \( ( [^()]++ | (?1) )* \) )
                   5928: 
1.1.1.3 ! misho    5929:        We have put the pattern into parentheses, and caused the  recursion  to
1.1       misho    5930:        refer to them instead of the whole pattern.
                   5931: 
1.1.1.3 ! misho    5932:        In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
        !          5933:        tricky. This is made easier by the use of relative references.  Instead
1.1       misho    5934:        of (?1) in the pattern above you can write (?-2) to refer to the second
1.1.1.3 ! misho    5935:        most recently opened parentheses  preceding  the  recursion.  In  other
        !          5936:        words,  a  negative  number counts capturing parentheses leftwards from
1.1       misho    5937:        the point at which it is encountered.
                   5938: 
1.1.1.3 ! misho    5939:        It is also possible to refer to  subsequently  opened  parentheses,  by
        !          5940:        writing  references  such  as (?+2). However, these cannot be recursive
        !          5941:        because the reference is not inside the  parentheses  that  are  refer-
        !          5942:        enced.  They are always non-recursive subroutine calls, as described in
1.1       misho    5943:        the next section.
                   5944: 
1.1.1.3 ! misho    5945:        An alternative approach is to use named parentheses instead.  The  Perl
        !          5946:        syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
1.1       misho    5947:        supported. We could rewrite the above example as follows:
                   5948: 
                   5949:          (?<pn> \( ( [^()]++ | (?&pn) )* \) )
                   5950: 
1.1.1.3 ! misho    5951:        If there is more than one subpattern with the same name,  the  earliest
1.1       misho    5952:        one is used.
                   5953: 
1.1.1.3 ! misho    5954:        This  particular  example pattern that we have been looking at contains
1.1       misho    5955:        nested unlimited repeats, and so the use of a possessive quantifier for
                   5956:        matching strings of non-parentheses is important when applying the pat-
1.1.1.3 ! misho    5957:        tern to strings that do not match. For example, when  this  pattern  is
1.1       misho    5958:        applied to
                   5959: 
                   5960:          (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
                   5961: 
1.1.1.3 ! misho    5962:        it  yields  "no  match" quickly. However, if a possessive quantifier is
        !          5963:        not used, the match runs for a very long time indeed because there  are
        !          5964:        so  many  different  ways the + and * repeats can carve up the subject,
1.1       misho    5965:        and all have to be tested before failure can be reported.
                   5966: 
1.1.1.3 ! misho    5967:        At the end of a match, the values of capturing  parentheses  are  those
        !          5968:        from  the outermost level. If you want to obtain intermediate values, a
        !          5969:        callout function can be used (see below and the pcrecallout  documenta-
1.1       misho    5970:        tion). If the pattern above is matched against
                   5971: 
                   5972:          (ab(cd)ef)
                   5973: 
1.1.1.3 ! misho    5974:        the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
        !          5975:        which is the last value taken on at the top level. If a capturing  sub-
        !          5976:        pattern  is  not  matched at the top level, its final captured value is
        !          5977:        unset, even if it was (temporarily) set at a deeper  level  during  the
1.1       misho    5978:        matching process.
                   5979: 
1.1.1.3 ! misho    5980:        If  there are more than 15 capturing parentheses in a pattern, PCRE has
        !          5981:        to obtain extra memory to store data during a recursion, which it  does
1.1       misho    5982:        by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
                   5983:        can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
                   5984: 
1.1.1.3 ! misho    5985:        Do not confuse the (?R) item with the condition (R),  which  tests  for
        !          5986:        recursion.   Consider  this pattern, which matches text in angle brack-
        !          5987:        ets, allowing for arbitrary nesting. Only digits are allowed in  nested
        !          5988:        brackets  (that is, when recursing), whereas any characters are permit-
1.1       misho    5989:        ted at the outer level.
                   5990: 
                   5991:          < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
                   5992: 
1.1.1.3 ! misho    5993:        In this pattern, (?(R) is the start of a conditional  subpattern,  with
        !          5994:        two  different  alternatives for the recursive and non-recursive cases.
1.1       misho    5995:        The (?R) item is the actual recursive call.
                   5996: 
                   5997:    Differences in recursion processing between PCRE and Perl
                   5998: 
1.1.1.3 ! misho    5999:        Recursion processing in PCRE differs from Perl in two  important  ways.
        !          6000:        In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
1.1       misho    6001:        always treated as an atomic group. That is, once it has matched some of
                   6002:        the subject string, it is never re-entered, even if it contains untried
1.1.1.3 ! misho    6003:        alternatives and there is a subsequent matching failure.  This  can  be
        !          6004:        illustrated  by the following pattern, which purports to match a palin-
        !          6005:        dromic string that contains an odd number of characters  (for  example,
1.1       misho    6006:        "a", "aba", "abcba", "abcdcba"):
                   6007: 
                   6008:          ^(.|(.)(?1)\2)$
                   6009: 
                   6010:        The idea is that it either matches a single character, or two identical
1.1.1.3 ! misho    6011:        characters surrounding a sub-palindrome. In Perl, this  pattern  works;
        !          6012:        in PCRE it does not if the subject is longer than three characters.
1.1       misho    6013:        Consider the subject string "abcba":
                   6014: 
1.1.1.3 ! misho    6015:        At the top level, the first character is matched, but as it is  not  at
1.1       misho    6016:        the end of the string, the first alternative fails; the second alterna-
                   6017:        tive is taken and the recursion kicks in. The recursive call to subpat-
1.1.1.3 ! misho    6018:        tern  1  successfully  matches the next character ("b"). (Note that the
1.1       misho    6019:        beginning and end of line tests are not part of the recursion.)
                   6020: 
1.1.1.3 ! misho    6021:        Back at the top level, the next character ("c") is compared  with  what
        !          6022:        subpattern  2 matched, which was "a". This fails. Because the recursion
        !          6023:        is treated as an atomic group, there are now  no  backtracking  points,
        !          6024:        and  so  the  entire  match fails. (Perl is able, at this point, to re-
        !          6025:        enter the recursion and try the second alternative.)  However,  if  the
1.1       misho    6026:        pattern is written with the alternatives in the other order, things are
                   6027:        different:
                   6028: 
                   6029:          ^((.)(?1)\2|.)$
                   6030: 
1.1.1.3 ! misho    6031:        This time, the recursing alternative is tried first, and  continues  to
        !          6032:        recurse  until  it runs out of characters, at which point the recursion
        !          6033:        fails. But this time we do have  another  alternative  to  try  at  the
        !          6034:        higher  level.  That  is  the  big difference: in the previous case the
1.1       misho    6035:        remaining alternative is at a deeper recursion level, which PCRE cannot
                   6036:        use.
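
                                The effect of atomic recursion on these two patterns can be
                                modelled outside PCRE. The Python sketch below is a hand-written
                                model, not PCRE code: the names group1 and matches, and the flags
                                recurse_first and atomic, are assumptions made for illustration.
                                Group 1 is a generator that yields every end offset at which it
                                could match; treating the recursive call as atomic means taking
                                only the first offset it yields.

```python
from itertools import islice


def group1(s, i, recurse_first, atomic):
    """Yield each end offset at which group 1 can match s starting at i."""

    def single():
        # Alternative "."
        if i < len(s):
            yield i + 1

    def wrapped():
        # Alternative "(.)(?1)\2"
        if i < len(s):
            inner = group1(s, i + 1, recurse_first, atomic)
            if atomic:
                # PCRE-style: the call is never re-entered after it matches.
                inner = islice(inner, 1)
            for j in inner:
                if j < len(s) and s[j] == s[i]:
                    yield j + 1

    order = (wrapped, single) if recurse_first else (single, wrapped)
    for alternative in order:
        yield from alternative()


def matches(s, recurse_first, atomic):
    """Model of ^( ... )$ around group 1."""
    return any(end == len(s) for end in group1(s, 0, recurse_first, atomic))


for recurse_first in (False, True):
    for atomic in (True, False):
        print(recurse_first, atomic, matches("abcba", recurse_first, atomic))
```

                                In this model only the combination of atomic recursion with the
                                single-character alternative first rejects "abcba".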
                   6037: 
1.1.1.3 ! misho    6038:        To  change  the pattern so that it matches all palindromic strings, not
        !          6039:        just those with an odd number of characters, it is tempting  to  change
1.1       misho    6040:        the pattern to this:
                   6041: 
                   6042:          ^((.)(?1)\2|.?)$
                   6043: 
1.1.1.3 ! misho    6044:        Again,  this  works  in Perl, but not in PCRE, and for the same reason.
        !          6045:        When a deeper recursion has matched a single character,  it  cannot  be
        !          6046:        entered  again  in  order  to match an empty string. The solution is to
        !          6047:        separate the two cases, and write out the odd and even cases as  alter-
1.1       misho    6048:        natives at the higher level:
                   6049: 
                   6050:          ^(?:((.)(?1)\2|)|((.)(?3)\4|.))$
                   6051: 
1.1.1.3 ! misho    6052:        If  you  want  to match typical palindromic phrases, the pattern has to
1.1       misho    6053:        ignore all non-word characters, which can be done like this:
                   6054: 
                   6055:          ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
                   6056: 
                   6057:        If run with the PCRE_CASELESS option, this pattern matches phrases such
                   6058:        as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
1.1.1.3 ! misho    6059:        Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-
        !          6060:        ing  into  sequences of non-word characters. Without this, PCRE takes a
        !          6061:        great deal longer (ten times or more) to  match  typical  phrases,  and
1.1       misho    6062:        Perl takes so long that you think it has gone into a loop.
                   6063: 
1.1.1.3 ! misho    6064:        WARNING:  The  palindrome-matching patterns above work only if the sub-
        !          6065:        ject string does not start with a palindrome that is shorter  than  the
        !          6066:        entire  string.  For example, although "abcba" is correctly matched, if
        !          6067:        the subject is "ababa", PCRE finds the palindrome "aba" at  the  start,
        !          6068:        then  fails at top level because the end of the string does not follow.
        !          6069:        Once again, it cannot jump back into the recursion to try other  alter-
1.1       misho    6070:        natives, so the entire match fails.
                   6071: 
1.1.1.3 ! misho    6072:        The  second  way  in which PCRE and Perl differ in their recursion pro-
        !          6073:        cessing is in the handling of captured values. In Perl, when a  subpat-
        !          6074:        tern  is  called recursively or as a subroutine (see the next section),
        !          6075:        it has no access to any values that were captured  outside  the  recur-
        !          6076:        sion,  whereas  in  PCRE  these values can be referenced. Consider this
1.1       misho    6077:        pattern:
                   6078: 
                   6079:          ^(.)(\1|a(?2))
                   6080: 
1.1.1.3 ! misho    6081:        In PCRE, this pattern matches "bab". The  first  capturing  parentheses
        !          6082:        match  "b",  then in the second group, when the back reference \1 fails
        !          6083:        to match "a", the second alternative matches "a" and then recurses.  In
        !          6084:        the  recursion,  \1 does now match "b" and so the whole match succeeds.
        !          6085:        In Perl, the pattern fails to match because inside the  recursive  call
1.1       misho    6086:        \1 cannot access the externally set value.
                   6087: 
                   6088: 
                   6089: SUBPATTERNS AS SUBROUTINES
                   6090: 
1.1.1.3 ! misho    6091:        If  the  syntax for a recursive subpattern call (either by number or by
        !          6092:        name) is used outside the parentheses to which it refers,  it  operates
        !          6093:        like  a subroutine in a programming language. The called subpattern may
        !          6094:        be defined before or after the reference. A numbered reference  can  be
1.1       misho    6095:        absolute or relative, as in these examples:
                   6096: 
                   6097:          (...(absolute)...)...(?2)...
                   6098:          (...(relative)...)...(?-1)...
                   6099:          (...(?+1)...(relative)...
                   6100: 
                   6101:        An earlier example pointed out that the pattern
                   6102: 
                   6103:          (sens|respons)e and \1ibility
                   6104: 
1.1.1.3 ! misho    6105:        matches  "sense and sensibility" and "response and responsibility", but
1.1       misho    6106:        not "sense and responsibility". If instead the pattern
                   6107: 
                   6108:          (sens|respons)e and (?1)ibility
                   6109: 
1.1.1.3 ! misho    6110:        is used, it does match "sense and responsibility" as well as the  other
        !          6111:        two  strings.  Another  example  is  given  in the discussion of DEFINE
1.1       misho    6112:        above.
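
                                The contrast between a back reference and a subroutine call can
                                be tried in any regex engine. The sketch below uses Python's
                                standard re module, which has no (?1) syntax; as an assumption
                                made for this example, the call is emulated by repeating the text
                                of the subpattern, which is what a non-recursive call amounts to
                                here.

```python
import re

# Back reference: \1 must repeat the exact text that group 1 matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Emulated subroutine call: the subpattern itself is applied again, so it
# is free to match a different alternative the second time.
subroutine = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
for subject in subjects:
    print(subject,
          bool(backref.fullmatch(subject)),
          bool(subroutine.fullmatch(subject)))
```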
                   6113: 
1.1.1.3 ! misho    6114:        All subroutine calls, whether recursive or not, are always  treated  as
        !          6115:        atomic  groups. That is, once a subroutine has matched some of the sub-
1.1       misho    6116:        ject string, it is never re-entered, even if it contains untried alter-
1.1.1.3 ! misho    6117:        natives  and  there  is  a  subsequent  matching failure. Any capturing
        !          6118:        parentheses that are set during the subroutine  call  revert  to  their
1.1       misho    6119:        previous values afterwards.
                   6120: 
1.1.1.3 ! misho    6121:        Processing  options  such as case-independence are fixed when a subpat-
        !          6122:        tern is defined, so if it is used as a subroutine, such options  cannot
1.1       misho    6123:        be changed for different calls. For example, consider this pattern:
                   6124: 
                   6125:          (abc)(?i:(?-1))
                   6126: 
1.1.1.3 ! misho    6127:        It  matches  "abcabc". It does not match "abcABC" because the change of
1.1       misho    6128:        processing option does not affect the called subpattern.
                   6129: 
                   6130: 
                   6131: ONIGURUMA SUBROUTINE SYNTAX
                   6132: 
1.1.1.3 ! misho    6133:        For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
1.1       misho    6134:        name or a number enclosed either in angle brackets or single quotes, is
1.1.1.3 ! misho    6135:        an alternative syntax for referencing a  subpattern  as  a  subroutine,
        !          6136:        possibly  recursively. Here are two of the examples used above, rewrit-
1.1       misho    6137:        ten using this syntax:
                   6138: 
                   6139:          (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
                   6140:          (sens|respons)e and \g'1'ibility
                   6141: 
1.1.1.3 ! misho    6142:        PCRE supports an extension to Oniguruma: if a number is preceded  by  a
1.1       misho    6143:        plus or a minus sign it is taken as a relative reference. For example:
                   6144: 
                   6145:          (abc)(?i:\g<-1>)
                   6146: 
1.1.1.3 ! misho    6147:        Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
        !          6148:        synonymous. The former is a back reference; the latter is a  subroutine
1.1       misho    6149:        call.
                   6150: 
                   6151: 
                   6152: CALLOUTS
                   6153: 
                   6154:        Perl has a feature whereby using the sequence (?{...}) causes arbitrary
1.1.1.3 ! misho    6155:        Perl code to be obeyed in the middle of matching a regular  expression.
1.1       misho    6156:        This makes it possible, amongst other things, to extract different sub-
                   6157:        strings that match the same pair of parentheses when there is a repeti-
                   6158:        tion.
                   6159: 
                   6160:        PCRE provides a similar feature, but of course it cannot obey arbitrary
                   6161:        Perl code. The feature is called "callout". The caller of PCRE provides
1.1.1.3 ! misho    6162:        an  external function by putting its entry point in the global variable
        !          6163:        pcre_callout (8-bit library) or  pcre16_callout  (16-bit  library).  By
1.1.1.2   misho    6164:        default, this variable contains NULL, which disables all calling out.
1.1       misho    6165: 
1.1.1.3 ! misho    6166:        Within  a  regular  expression,  (?C) indicates the points at which the
        !          6167:        external function is to be called. If you want  to  identify  different
        !          6168:        callout  points, you can put a number less than 256 after the letter C.
        !          6169:        The default value is zero.  For example, this pattern has  two  callout
1.1       misho    6170:        points:
                   6171: 
                   6172:          (?C1)abc(?C2)def
                   6173: 
1.1.1.3 ! misho    6174:        If  the PCRE_AUTO_CALLOUT flag is passed to a compiling function, call-
        !          6175:        outs are automatically installed before each item in the pattern.  They
1.1.1.2   misho    6176:        are all numbered 255.
                   6177: 
1.1.1.3 ! misho    6178:        During  matching, when PCRE reaches a callout point, the external func-
        !          6179:        tion is called. It is provided with the  number  of  the  callout,  the
        !          6180:        position  in  the pattern, and, optionally, one item of data originally
        !          6181:        supplied by the caller of the matching function. The  callout  function
        !          6182:        may  cause  matching to proceed, to backtrack, or to fail altogether. A
        !          6183:        complete description of the interface to the callout function is  given
1.1.1.2   misho    6184:        in the pcrecallout documentation.
1.1       misho    6185: 
                   6186: 
                   6187: BACKTRACKING CONTROL
                   6188: 
1.1.1.3 ! misho    6189:        Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
1.1       misho    6190:        which are described in the Perl documentation as "experimental and sub-
1.1.1.3 ! misho    6191:        ject  to  change or removal in a future version of Perl". It goes on to
        !          6192:        say: "Their usage in production code should be noted to avoid  problems
1.1       misho    6193:        during upgrades." The same remarks apply to the PCRE features described
                   6194:        in this section.
                   6195: 
1.1.1.3 ! misho    6196:        Since these verbs are specifically related  to  backtracking,  most  of
        !          6197:        them  can  be  used only when the pattern is to be matched using one of
1.1.1.2   misho    6198:        the traditional matching functions, which use a backtracking algorithm.
1.1.1.3 ! misho    6199:        With  the  exception  of (*FAIL), which behaves like a failing negative
        !          6200:        assertion, they cause an error if encountered by a DFA  matching  func-
1.1.1.2   misho    6201:        tion.
1.1       misho    6202: 
1.1.1.3 ! misho    6203:        If  any of these verbs are used in an assertion or in a subpattern that
1.1       misho    6204:        is called as a subroutine (whether or not recursively), their effect is
                   6205:        confined to that subpattern; it does not extend to the surrounding pat-
                   6206:        tern, with one exception: the name from a (*MARK), (*PRUNE), or (*THEN)
1.1.1.3 ! misho    6207:        that  is  encountered in a successful positive assertion is passed back
        !          6208:        when a match succeeds (compare capturing  parentheses  in  assertions).
1.1       misho    6209:        Note that such subpatterns are processed as anchored at the point where
1.1.1.3 ! misho    6210:        they are tested. Note also that Perl's  treatment  of  subroutines  and
        !          6211:        assertions is different in some cases.
1.1       misho    6212: 
1.1.1.3 ! misho    6213:        The  new verbs make use of what was previously invalid syntax: an open-
1.1       misho    6214:        ing parenthesis followed by an asterisk. They are generally of the form
1.1.1.3 ! misho    6215:        (*VERB)  or (*VERB:NAME). Some may take either form, with differing be-
        !          6216:        haviour, depending on whether or not an argument is present. A name  is
1.1       misho    6217:        any sequence of characters that does not include a closing parenthesis.
1.1.1.3 ! misho    6218:        The maximum length of name is 255 in the 8-bit library and 65535 in the
        !          6219:        16-bit library. If the name is empty, that is, if the closing parenthe-
        !          6220:        sis immediately follows the colon, the effect is as if the  colon  were
        !          6221:        not there. Any number of these verbs may occur in a pattern.
        !          6222: 
        !          6223:    Optimizations that affect backtracking verbs
1.1       misho    6224: 
1.1.1.2   misho    6225:        PCRE  contains some optimizations that are used to speed up matching by
1.1       misho    6226:        running some checks at the start of each match attempt. For example, it
1.1.1.2   misho    6227:        may  know  the minimum length of matching subject, or that a particular
                   6228:        character must be present. When one of these  optimizations  suppresses
                   6229:        the  running  of  a match, any included backtracking verbs will not, of
1.1       misho    6230:        course, be processed. You can suppress the start-of-match optimizations
1.1.1.2   misho    6231:        by  setting  the  PCRE_NO_START_OPTIMIZE  option when calling pcre_com-
1.1       misho    6232:        pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
1.1.1.3 ! misho    6233:        There is more discussion of this option in the section entitled "Option
        !          6234:        bits for pcre_exec()" in the pcreapi documentation.
1.1       misho    6235: 
1.1.1.2   misho    6236:        Experiments with Perl suggest that it too  has  similar  optimizations,
1.1       misho    6237:        sometimes leading to anomalous results.
                   6238: 
                   6239:    Verbs that act immediately
                   6240: 
1.1.1.2   misho    6241:        The  following  verbs act as soon as they are encountered. They may not
1.1       misho    6242:        be followed by a name.
                   6243: 
                   6244:           (*ACCEPT)
                   6245: 
1.1.1.2   misho    6246:        This verb causes the match to end successfully, skipping the  remainder
                   6247:        of  the pattern. However, when it is inside a subpattern that is called
                   6248:        as a subroutine, only that subpattern is ended  successfully.  Matching
                   6249:        then  continues  at  the  outer level. If (*ACCEPT) is inside capturing
1.1       misho    6250:        parentheses, the data so far is captured. For example:
                   6251: 
                   6252:          A((?:A|B(*ACCEPT)|C)D)
                   6253: 
1.1.1.2   misho    6254:        This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
1.1       misho    6255:        tured by the outer parentheses.
                   6256: 
                   6257:          (*FAIL) or (*F)
                   6258: 
1.1.1.2   misho    6259:        This  verb causes a matching failure, forcing backtracking to occur. It
                   6260:        is equivalent to (?!) but easier to read. The Perl documentation  notes
                   6261:        that  it  is  probably  useful only when combined with (?{}) or (??{}).
                   6262:        Those are, of course, Perl features that are not present in  PCRE.  The
                   6263:        nearest  equivalent is the callout feature, as for example in this pat-
1.1       misho    6264:        tern:
                   6265: 
                   6266:          a+(?C)(*FAIL)
                   6267: 
1.1.1.2   misho    6268:        A match with the string "aaaa" always fails, but the callout  is  taken
1.1       misho    6269:        before each backtrack happens (in this example, 10 times).
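
                                The figure of 10 follows from counting backtracks at each
                                starting position. The Python sketch below is a hand-written
                                model of that count, not PCRE code; the name count_callouts is an
                                assumption, and the model ignores any start-of-match
                                optimizations.

```python
def count_callouts(subject):
    """Count how often (?C) is reached when a+(?C)(*FAIL) is run on subject."""
    total = 0
    for start in range(len(subject) + 1):  # each "bumpalong" starting point
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then gives back one "a" per
        # backtrack until one "a" is left: one callout per length run..1.
        total += run
    return total


print(count_callouts("aaaa"))  # 4 + 3 + 2 + 1 = 10
```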
                   6270: 
                   6271:    Recording which path was taken
                   6272: 
1.1.1.2   misho    6273:        There  is  one  verb  whose  main  purpose  is to track how a match was
                   6274:        arrived at, though it also has a  secondary  use  in  conjunction  with
1.1       misho    6275:        advancing the match starting point (see (*SKIP) below).
                   6276: 
                   6277:          (*MARK:NAME) or (*:NAME)
                   6278: 
1.1.1.2   misho    6279:        A  name  is  always  required  with  this  verb.  There  may be as many
                   6280:        instances of (*MARK) as you like in a pattern, and their names  do  not
1.1       misho    6281:        have to be unique.
                   6282: 
1.1.1.2   misho    6283:        When  a match succeeds, the name of the last-encountered (*MARK) on the
                   6284:        matching path is passed back to the caller as described in the  section
                   6285:        entitled  "Extra  data  for  pcre_exec()" in the pcreapi documentation.
                   6286:        Here is an example of pcretest output, where the /K  modifier  requests
                   6287:        the retrieval and outputting of (*MARK) data:
1.1       misho    6288: 
                   6289:            re> /X(*MARK:A)Y|X(*MARK:B)Z/K
                   6290:          data> XY
                   6291:           0: XY
                   6292:          MK: A
                   6293:          XZ
                   6294:           0: XZ
                   6295:          MK: B
                   6296: 
                   6297:        The (*MARK) name is tagged with "MK:" in this output, and in this exam-
1.1.1.2   misho    6298:        ple it indicates which of the two alternatives matched. This is a  more
                   6299:        efficient  way of obtaining this information than putting each alterna-
1.1       misho    6300:        tive in its own capturing parentheses.
                   6301: 
                   6302:        If (*MARK) is encountered in a positive assertion, its name is recorded
                   6303:        and passed back if it is the last-encountered. This does not happen for
                   6304:        negative assertions.
                   6305: 
1.1.1.2   misho    6306:        After a partial match or a failed match, the name of the  last  encoun-
1.1       misho    6307:        tered (*MARK) in the entire match process is returned. For example:
                   6308: 
                   6309:            re> /X(*MARK:A)Y|X(*MARK:B)Z/K
                   6310:          data> XP
                   6311:          No match, mark = B
                   6312: 
1.1.1.2   misho    6313:        Note  that  in  this  unanchored  example the mark is retained from the
1.1.1.3 ! misho    6314:        match attempt that started at the letter "X" in the subject. Subsequent
        !          6315:        match attempts starting at "P" and then with an empty string do not get
        !          6316:        as far as the (*MARK) item, but nevertheless do not reset it.
        !          6317: 
        !          6318:        If you are interested in  (*MARK)  values  after  failed  matches,  you
        !          6319:        should  probably  set  the PCRE_NO_START_OPTIMIZE option (see above) to
        !          6320:        ensure that the match is always attempted.
1.1       misho    6321: 
                   6322:    Verbs that act after backtracking
                   6323: 
                   6324:        The following verbs do nothing when they are encountered. Matching con-
1.1.1.2   misho    6325:        tinues  with what follows, but if there is no subsequent match, causing
                   6326:        a backtrack to the verb, a failure is  forced.  That  is,  backtracking
                   6327:        cannot  pass  to the left of the verb. However, when one of these verbs
                   6328:        appears inside an atomic group, its effect is confined to  that  group,
                   6329:        because  once the group has been matched, there is never any backtrack-
                   6330:        ing into it. In this situation, backtracking can  "jump  back"  to  the
                   6331:        left  of the entire atomic group. (Remember also, as stated above, that
1.1       misho    6332:        this localization also applies in subroutine calls and assertions.)
                   6333: 
1.1.1.2   misho    6334:        These verbs differ in exactly what kind of failure  occurs  when  back-
1.1       misho    6335:        tracking reaches them.
                   6336: 
                   6337:          (*COMMIT)
                   6338: 
1.1.1.2   misho    6339:        This  verb, which may not be followed by a name, causes the whole match
1.1       misho    6340:        to fail outright if the rest of the pattern does not match. Even if the
                   6341:        pattern is unanchored, no further attempts to find a match by advancing
                   6342:        the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,
1.1.1.2   misho    6343:        pcre_exec()  is  committed  to  finding a match at the current starting
1.1       misho    6344:        point, or not at all. For example:
                   6345: 
                   6346:          a+(*COMMIT)b
                   6347: 
1.1.1.2   misho    6348:        This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
1.1       misho    6349:        of dynamic anchor, or "I've started, so I must finish." The name of the
1.1.1.2   misho    6350:        most recently passed (*MARK) in the path is passed back when  (*COMMIT)
1.1       misho    6351:        forces a match failure.
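
                                The two subjects above can be traced with a small model. The
                                Python sketch below is a hand-written illustration of
                                a+(*COMMIT)b, not PCRE code; the name match_commit is an
                                assumption, and start-of-match optimizations are ignored.

```python
def match_commit(subject):
    """Model of a+(*COMMIT)b: return (start, end) of the match, or None."""
    for start in range(len(subject) + 1):  # normal "bumpalong"
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end == start:
            continue  # a+ failed, so (*COMMIT) was never reached
        if subject[end:end + 1] == "b":
            return (start, end + 1)
        # Backtracking onto (*COMMIT): fail outright, no further starts.
        return None
    return None


print(match_commit("xxaab"))   # (2, 5)
print(match_commit("aacaab"))  # None
```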
                   6352: 
1.1.1.2   misho    6353:        Note  that  (*COMMIT)  at  the start of a pattern is not the same as an
                   6354:        anchor, unless PCRE's start-of-match optimizations are turned  off,  as
1.1       misho    6355:        shown in this pcretest example:
                   6356: 
                   6357:            re> /(*COMMIT)abc/
                   6358:          data> xyzabc
                   6359:           0: abc
                   6360:          xyzabc\Y
                   6361:          No match
                   6362: 
1.1.1.2   misho    6363:        PCRE  knows  that  any  match  must start with "a", so the optimization
                   6364:        skips along the subject to "a" before running the first match  attempt,
                   6365:        which  succeeds.  When the optimization is disabled by the \Y escape in
1.1       misho    6366:        the second subject, the match starts at "x" and so the (*COMMIT) causes
                   6367:        it to fail without trying any other starting points.
                   6368: 
                   6369:          (*PRUNE) or (*PRUNE:NAME)
                   6370: 
1.1.1.2   misho    6371:        This  verb causes the match to fail at the current starting position in
                   6372:        the subject if the rest of the pattern does not match. If  the  pattern
                   6373:        is  unanchored,  the  normal  "bumpalong"  advance to the next starting
                   6374:        character then happens. Backtracking can occur as usual to the left  of
                   6375:        (*PRUNE),  before  it  is  reached,  or  when  matching to the right of
                   6376:        (*PRUNE), but if there is no match to the  right,  backtracking  cannot
                   6377:        cross  (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter-
                   6378:        native to an atomic group or possessive quantifier, but there are  some
1.1       misho    6379:        uses of (*PRUNE) that cannot be expressed in any other way.  The behav-
1.1.1.2   misho    6380:        iour of (*PRUNE:NAME)  is  the  same  as  (*MARK:NAME)(*PRUNE).  In  an
1.1       misho    6381:        anchored pattern (*PRUNE) has the same effect as (*COMMIT).
                   6382: 
                   6383:          (*SKIP)
                   6384: 
1.1.1.2   misho    6385:        This  verb, when given without a name, is like (*PRUNE), except that if
                   6386:        the pattern is unanchored, the "bumpalong" advance is not to  the  next
1.1       misho    6387:        character, but to the position in the subject where (*SKIP) was encoun-
1.1.1.2   misho    6388:        tered. (*SKIP) signifies that whatever text was matched leading  up  to
1.1       misho    6389:        it cannot be part of a successful match. Consider:
                   6390: 
                   6391:          a+(*SKIP)b
                   6392: 
1.1.1.2   misho    6393:        If  the  subject  is  "aaaac...",  after  the first match attempt fails
                   6394:        (starting at the first character in the  string),  the  starting  point
1.1       misho    6395:        skips on to start the next attempt at "c". Note that a possessive quan-
1.1.1.2   misho    6396:        tifier does not have the same effect as this example; although it would
                   6397:        suppress  backtracking  during  the  first  match  attempt,  the second
                   6398:        attempt would start at the second character instead of skipping  on  to
1.1       misho    6399:        "c".
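
                                The difference in starting points can be listed explicitly. The
                                Python sketch below is a hand-written model, not PCRE code: the
                                name start_positions and the use_skip flag are assumptions.
                                With use_skip set it models a+(*SKIP)b; otherwise it models the
                                possessive form a++b. Start-of-match optimizations are ignored.

```python
def start_positions(subject, use_skip):
    """Return (list of starting offsets tried, match span or None)."""
    starts = []
    pos = 0
    while pos <= len(subject):
        starts.append(pos)
        end = pos
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end > pos and subject[end:end + 1] == "b":
            return starts, (pos, end + 1)
        if use_skip and end > pos:
            # (*SKIP) was passed after a+ matched: resume where it was met.
            pos = end
        else:
            # Ordinary "bumpalong" to the next character.
            pos += 1
    return starts, None


print(start_positions("aaaac", use_skip=True))   # ([0, 4, 5], None)
print(start_positions("aaaac", use_skip=False))  # ([0, 1, 2, 3, 4, 5], None)
```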
                   6400: 
                   6401:          (*SKIP:NAME)
                   6402: 
1.1.1.2   misho    6403:        When  (*SKIP) has an associated name, its behaviour is modified. If the
1.1       misho    6404:        following pattern fails to match, the previous path through the pattern
1.1.1.2   misho    6405:        is  searched for the most recent (*MARK) that has the same name. If one
                   6406:        is found, the "bumpalong" advance is to the subject position that  cor-
                   6407:        responds  to  that (*MARK) instead of to where (*SKIP) was encountered.
1.1       misho    6408:        If no (*MARK) with a matching name is found, the (*SKIP) is ignored.
                   6409: 
                   6410:          (*THEN) or (*THEN:NAME)
                   6411: 
1.1.1.2   misho    6412:        This verb causes a skip to the next innermost alternative if  the  rest
                   6413:        of  the  pattern does not match. That is, it cancels pending backtrack-
                   6414:        ing, but only within the current alternative. Its name comes  from  the
1.1       misho    6415:        observation that it can be used for a pattern-based if-then-else block:
                   6416: 
                   6417:          ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
                   6418: 
1.1.1.2   misho    6419:        If  the COND1 pattern matches, FOO is tried (and possibly further items
                   6420:        after the end of the group if FOO succeeds); on  failure,  the  matcher
                   6421:        skips  to  the second alternative and tries COND2, without backtracking
                   6422:        into COND1. The behaviour  of  (*THEN:NAME)  is  exactly  the  same  as
                   6423:        (*MARK:NAME)(*THEN).   If (*THEN) is not inside an alternation, it acts
1.1       misho    6424:        like (*PRUNE).
                   6425: 
1.1.1.2   misho    6426:        Note that a subpattern that does not contain a | character  is  just  a
                   6427:        part  of the enclosing alternative; it is not a nested alternation with
                   6428:        only one alternative. The effect of (*THEN) extends beyond such a  sub-
                   6429:        pattern  to  the enclosing alternative. Consider this pattern, where A,
1.1       misho    6430:        B, etc. are complex pattern fragments that do not contain any | charac-
                   6431:        ters at this level:
                   6432: 
                   6433:          A (B(*THEN)C) | D
                   6434: 
1.1.1.2   misho    6435:        If  A and B are matched, but there is a failure in C, matching does not
1.1       misho    6436:        backtrack into A; instead it moves to the next alternative, that is, D.
1.1.1.2   misho    6437:        However,  if the subpattern containing (*THEN) is given an alternative,
1.1       misho    6438:        it behaves differently:
                   6439: 
                   6440:          A (B(*THEN)C | (*FAIL)) | D
                   6441: 
1.1.1.2   misho    6442:        The effect of (*THEN) is now confined to the inner subpattern. After  a
1.1       misho    6443:        failure in C, matching moves to (*FAIL), which causes the whole subpat-
1.1.1.2   misho    6444:        tern to fail because there are no more alternatives  to  try.  In  this
1.1       misho    6445:        case, matching does now backtrack into A.
                   6446: 
                   6447:        Note also that a conditional subpattern is not considered as having two
1.1.1.2   misho    6448:        alternatives, because only one is ever used.  In  other  words,  the  |
1.1       misho    6449:        character in a conditional subpattern has a different meaning. Ignoring
                   6450:        white space, consider:
                   6451: 
                   6452:          ^.*? (?(?=a) a | b(*THEN)c )
                   6453: 
1.1.1.2   misho    6454:        If the subject is "ba", this pattern does not  match.  Because  .*?  is
                   6455:        ungreedy,  it  initially  matches  zero characters. The condition (?=a)
                   6456:        then fails, the character "b" is matched,  but  "c"  is  not.  At  this
                   6457:        point,  matching does not backtrack to .*? as might perhaps be expected
                   6458:        from the presence of the | character.  The  conditional  subpattern  is
1.1       misho    6459:        part of the single alternative that comprises the whole pattern, and so
1.1.1.2   misho    6460:        the match fails. (If there were a backtrack into .*?,  allowing  it  to
1.1       misho    6461:        match "b", the match would succeed.)
                   6462: 
1.1.1.2   misho    6463:        The  verbs just described provide four different "strengths" of control
1.1       misho    6464:        when subsequent matching fails. (*THEN) is the weakest, carrying on the
1.1.1.2   misho    6465:        match  at  the next alternative. (*PRUNE) comes next, failing the match
                   6466:        at the current starting position, but allowing an advance to  the  next
                   6467:        character  (for an unanchored pattern). (*SKIP) is similar, except that
1.1       misho    6468:        the advance may be more than one character. (*COMMIT) is the strongest,
                   6469:        causing the entire match to fail.
                   6470: 
                   6471:        If more than one such verb is present in a pattern, the "strongest" one
                   6472:        wins.  For example, consider this pattern, where A, B, etc. are complex
                   6473:        pattern fragments:
                   6474: 
                   6475:          (A(*COMMIT)B(*THEN)C|D)
                   6476: 
1.1.1.2   misho    6477:        Once  A  has  matched,  PCRE is committed to this match, at the current
                   6478:        starting position. If subsequently B matches, but C does not, the  nor-
1.1       misho    6479:        mal (*THEN) action of trying the next alternative (that is, D) does not
                   6480:        happen because (*COMMIT) overrides.
                   6481: 
                   6482: 
                   6483: SEE ALSO
                   6484: 
1.1.1.2   misho    6485:        pcreapi(3), pcrecallout(3),  pcrematching(3),  pcresyntax(3),  pcre(3),
                   6486:        pcre16(3).
1.1       misho    6487: 
                   6488: 
                   6489: AUTHOR
                   6490: 
                   6491:        Philip Hazel
                   6492:        University Computing Service
                   6493:        Cambridge CB2 3QH, England.
                   6494: 
                   6495: 
                   6496: REVISION
                   6497: 
1.1.1.3 ! misho    6498:        Last updated: 17 June 2012
1.1.1.2   misho    6499:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    6500: ------------------------------------------------------------------------------
                   6501: 
                   6502: 
                   6503: PCRESYNTAX(3)                                                    PCRESYNTAX(3)
                   6504: 
                   6505: 
                   6506: NAME
                   6507:        PCRE - Perl-compatible regular expressions
                   6508: 
                   6509: 
                   6510: PCRE REGULAR EXPRESSION SYNTAX SUMMARY
                   6511: 
                   6512:        The  full syntax and semantics of the regular expressions that are sup-
                   6513:        ported by PCRE are described in  the  pcrepattern  documentation.  This
1.1.1.2   misho    6514:        document contains a quick-reference summary of the syntax.
1.1       misho    6515: 
                   6516: 
                   6517: QUOTING
                   6518: 
                   6519:          \x         where x is non-alphanumeric is a literal x
                   6520:          \Q...\E    treat enclosed characters as literal
                   6521: 
                   6522: 
                   6523: CHARACTERS
                   6524: 
                   6525:          \a         alarm, that is, the BEL character (hex 07)
                   6526:          \cx        "control-x", where x is any ASCII character
                   6527:          \e         escape (hex 1B)
1.1.1.3 ! misho    6528:          \f         form feed (hex 0C)
1.1       misho    6529:          \n         newline (hex 0A)
                   6530:          \r         carriage return (hex 0D)
                   6531:          \t         tab (hex 09)
                   6532:          \ddd       character with octal code ddd, or backreference
                   6533:          \xhh       character with hex code hh
                   6534:          \x{hhh..}  character with hex code hhh..
                   6535: 
                   6536: 
                   6537: CHARACTER TYPES
                   6538: 
                   6539:          .          any character except newline;
                   6540:                       in dotall mode, any character whatsoever
1.1.1.2   misho    6541:          \C         one data unit, even in UTF mode (best avoided)
1.1       misho    6542:          \d         a decimal digit
                   6543:          \D         a character that is not a decimal digit
1.1.1.3 ! misho    6544:          \h         a horizontal white space character
        !          6545:          \H         a character that is not a horizontal white space character
1.1       misho    6546:          \N         a character that is not a newline
                   6547:          \p{xx}     a character with the xx property
                   6548:          \P{xx}     a character without the xx property
                   6549:          \R         a newline sequence
1.1.1.3 ! misho    6550:          \s         a white space character
        !          6551:          \S         a character that is not a white space character
        !          6552:          \v         a vertical white space character
        !          6553:          \V         a character that is not a vertical white space character
1.1       misho    6554:          \w         a "word" character
                   6555:          \W         a "non-word" character
                   6556:          \X         an extended Unicode sequence
                   6557: 
                   6558:        In  PCRE,  by  default, \d, \D, \s, \S, \w, and \W recognize only ASCII
1.1.1.2   misho    6559:        characters, even in a UTF mode. However, this can be changed by setting
1.1       misho    6560:        the PCRE_UCP option.
                   6561: 
                   6562: 
                   6563: GENERAL CATEGORY PROPERTIES FOR \p and \P
                   6564: 
                   6565:          C          Other
                   6566:          Cc         Control
                   6567:          Cf         Format
                   6568:          Cn         Unassigned
                   6569:          Co         Private use
                   6570:          Cs         Surrogate
                   6571: 
                   6572:          L          Letter
                   6573:          Ll         Lower case letter
                   6574:          Lm         Modifier letter
                   6575:          Lo         Other letter
                   6576:          Lt         Title case letter
                   6577:          Lu         Upper case letter
                   6578:          L&         Ll, Lu, or Lt
                   6579: 
                   6580:          M          Mark
                   6581:          Mc         Spacing mark
                   6582:          Me         Enclosing mark
                   6583:          Mn         Non-spacing mark
                   6584: 
                   6585:          N          Number
                   6586:          Nd         Decimal number
                   6587:          Nl         Letter number
                   6588:          No         Other number
                   6589: 
                   6590:          P          Punctuation
                   6591:          Pc         Connector punctuation
                   6592:          Pd         Dash punctuation
                   6593:          Pe         Close punctuation
                   6594:          Pf         Final punctuation
                   6595:          Pi         Initial punctuation
                   6596:          Po         Other punctuation
                   6597:          Ps         Open punctuation
                   6598: 
                   6599:          S          Symbol
                   6600:          Sc         Currency symbol
                   6601:          Sk         Modifier symbol
                   6602:          Sm         Mathematical symbol
                   6603:          So         Other symbol
                   6604: 
                   6605:          Z          Separator
                   6606:          Zl         Line separator
                   6607:          Zp         Paragraph separator
                   6608:          Zs         Space separator
                   6609: 
                   6610: 
                   6611: PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P
                   6612: 
                   6613:          Xan        Alphanumeric: union of properties L and N
                   6614:          Xps        POSIX space: property Z or tab, NL, VT, FF, CR
                   6615:          Xsp        Perl space: property Z or tab, NL, FF, CR
                   6616:          Xwd        Perl word: property Xan or underscore
                   6617: 
                   6618: 
                   6619: SCRIPT NAMES FOR \p AND \P
                   6620: 
1.1.1.3 ! misho    6621:        Arabic,  Armenian,  Avestan, Balinese, Bamum, Batak, Bengali, Bopomofo,
        !          6622:        Brahmi, Braille, Buginese, Buhid, Canadian_Aboriginal, Carian,  Chakma,
        !          6623:        Cham,  Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
        !          6624:        Devanagari,  Egyptian_Hieroglyphs,  Ethiopic,   Georgian,   Glagolitic,
        !          6625:        Gothic,  Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
        !          6626:        gana,  Imperial_Aramaic,  Inherited,  Inscriptional_Pahlavi,   Inscrip-
        !          6627:        tional_Parthian,   Javanese,   Kaithi,   Kannada,  Katakana,  Kayah_Li,
        !          6628:        Kharoshthi, Khmer, Lao, Latin, Lepcha, Limbu, Linear_B,  Lisu,  Lycian,
        !          6629:        Lydian,    Malayalam,    Mandaic,    Meetei_Mayek,    Meroitic_Cursive,
        !          6630:        Meroitic_Hieroglyphs,  Miao,  Mongolian,  Myanmar,  New_Tai_Lue,   Nko,
        !          6631:        Ogham,    Old_Italic,   Old_Persian,   Old_South_Arabian,   Old_Turkic,
        !          6632:        Ol_Chiki, Oriya, Osmanya, Phags_Pa, Phoenician, Rejang, Runic,  Samari-
        !          6633:        tan,  Saurashtra,  Sharada,  Shavian, Sinhala, Sora_Sompeng, Sundanese,
        !          6634:        Syloti_Nagri, Syriac, Tagalog, Tagbanwa,  Tai_Le,  Tai_Tham,  Tai_Viet,
        !          6635:        Takri,  Tamil,  Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Vai,
        !          6636:        Yi.
1.1       misho    6637: 
                   6638: 
                   6639: CHARACTER CLASSES
                   6640: 
                   6641:          [...]       positive character class
                   6642:          [^...]      negative character class
                   6643:          [x-y]       range (can be used for hex characters)
                   6644:          [[:xxx:]]   positive POSIX named set
                   6645:          [[:^xxx:]]  negative POSIX named set
                   6646: 
                   6647:          alnum       alphanumeric
                   6648:          alpha       alphabetic
                   6649:          ascii       0-127
                   6650:          blank       space or tab
                   6651:          cntrl       control character
                   6652:          digit       decimal digit
                   6653:          graph       printing, excluding space
                   6654:          lower       lower case letter
                   6655:          print       printing, including space
                   6656:          punct       printing, excluding alphanumeric
1.1.1.3 ! misho    6657:          space       white space
1.1       misho    6658:          upper       upper case letter
                   6659:          word        same as \w
                   6660:          xdigit      hexadecimal digit
                   6661: 
                   6662:        In PCRE, POSIX character set names recognize only ASCII  characters  by
                   6663:        default,  but  some  of them use Unicode properties if PCRE_UCP is set.
                   6664:        You can use \Q...\E inside a character class.
                   6665: 
                   6666: 
                   6667: QUANTIFIERS
                   6668: 
                   6669:          ?           0 or 1, greedy
                   6670:          ?+          0 or 1, possessive
                   6671:          ??          0 or 1, lazy
                   6672:          *           0 or more, greedy
                   6673:          *+          0 or more, possessive
                   6674:          *?          0 or more, lazy
                   6675:          +           1 or more, greedy
                   6676:          ++          1 or more, possessive
                   6677:          +?          1 or more, lazy
                   6678:          {n}         exactly n
                   6679:          {n,m}       at least n, no more than m, greedy
                   6680:          {n,m}+      at least n, no more than m, possessive
                   6681:          {n,m}?      at least n, no more than m, lazy
                   6682:          {n,}        n or more, greedy
                   6683:          {n,}+       n or more, possessive
                   6684:          {n,}?       n or more, lazy
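
As an illustration of the difference between greedy and lazy quantifiers, the following sketch uses Python's re module rather than PCRE itself. It is used here only on the assumption that these particular quantifier forms behave the same way in both; the subject strings and patterns are made up for the example.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then gives back only what is needed.
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can, extending only when forced to.
lazy = re.match(r"<.+?>", subject).group()

# {n,m} is greedy; {n,m}? is lazy.
bounded_greedy = re.match(r"a{2,4}", "aaaaa").group()
bounded_lazy = re.match(r"a{2,4}?", "aaaaa").group()

print(greedy, lazy, bounded_greedy, bounded_lazy)
```

The possessive forms are left out of the sketch because older versions of Python's re module do not accept them.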
                   6685: 
                   6686: 
                   6687: ANCHORS AND SIMPLE ASSERTIONS
                   6688: 
                   6689:          \b          word boundary
                   6690:          \B          not a word boundary
                   6691:          ^           start of subject
                   6692:                       also after internal newline in multiline mode
                   6693:          \A          start of subject
                   6694:          $           end of subject
                   6695:                       also before newline at end of subject
                   6696:                       also before internal newline in multiline mode
                   6697:          \Z          end of subject
                   6698:                       also before newline at end of subject
                   6699:          \z          end of subject
                   6700:          \G          first matching position in subject
                   6701: 
                   6702: 
                   6703: MATCH POINT RESET
                   6704: 
                   6705:          \K          reset start of match
                   6706: 
                   6707: 
                   6708: ALTERNATION
                   6709: 
                   6710:          expr|expr|expr...
                   6711: 
                   6712: 
                   6713: CAPTURING
                   6714: 
                   6715:          (...)           capturing group
                   6716:          (?<name>...)    named capturing group (Perl)
                   6717:          (?'name'...)    named capturing group (Perl)
                   6718:          (?P<name>...)   named capturing group (Python)
                   6719:          (?:...)         non-capturing group
                   6720:          (?|...)         non-capturing group; reset group numbers for
                   6721:                           capturing groups in each alternative
                   6722: 
                   6723: 
                   6724: ATOMIC GROUPS
                   6725: 
                   6726:          (?>...)         atomic, non-capturing group
                   6727: 
                   6728: 
                   6729: COMMENT
                   6730: 
                   6731:          (?#....)        comment (not nestable)
                   6732: 
                   6733: 
                   6734: OPTION SETTING
                   6735: 
                   6736:          (?i)            caseless
                   6737:          (?J)            allow duplicate names
                   6738:          (?m)            multiline
                   6739:          (?s)            single line (dotall)
                   6740:          (?U)            default ungreedy (lazy)
                   6741:          (?x)            extended (ignore white space)
                   6742:          (?-...)         unset option(s)
                   6743: 
                   6744:        The following are recognized only at the start of a  pattern  or  after
                   6745:        one of the newline-setting options with similar syntax:
                   6746: 
                   6747:          (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
1.1.1.2   misho    6748:          (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
                   6749:          (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
1.1       misho    6750:          (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
                   6751: 
                   6752: 
                   6753: LOOKAHEAD AND LOOKBEHIND ASSERTIONS
                   6754: 
                   6755:          (?=...)         positive look ahead
                   6756:          (?!...)         negative look ahead
                   6757:          (?<=...)        positive look behind
                   6758:          (?<!...)        negative look behind
                   6759: 
                   6760:        Each top-level branch of a look behind must be of a fixed length.
                   6761: 
                   6762: 
                   6763: BACKREFERENCES
                   6764: 
                   6765:          \n              reference by number (can be ambiguous)
                   6766:          \gn             reference by number
                   6767:          \g{n}           reference by number
                   6768:          \g{-n}          relative reference by number
                   6769:          \k<name>        reference by name (Perl)
                   6770:          \k'name'        reference by name (Perl)
                   6771:          \g{name}        reference by name (Perl)
                   6772:          \k{name}        reference by name (.NET)
                   6773:          (?P=name)       reference by name (Python)
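
A short sketch of backreferences by name and by number follows. It uses Python's re module as a stand-in for PCRE, relying only on the Python-style (?P<name>...) and (?P=name) forms and the plain \n form, which appear in the summary above; the patterns and subject strings are hypothetical.

```python
import re

# Named group "q" captures the opening quote; (?P=q) must match the same text.
quoted = re.compile(r"(?P<q>['\"]).*?(?P=q)")

m1 = quoted.search("say 'hi' now")            # closing quote matches opener
m2 = quoted.search("mixed 'quote\" here")     # no matching closing quote

# \1 is the numbered form of the same idea: here it finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was the the end")

print(m1.group(), m2, doubled.group(1))
```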
                   6774: 
                   6775: 
                   6776: SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
                   6777: 
                   6778:          (?R)            recurse whole pattern
                   6779:          (?n)            call subpattern by absolute number
                   6780:          (?+n)           call subpattern by relative number
                   6781:          (?-n)           call subpattern by relative number
                   6782:          (?&name)        call subpattern by name (Perl)
                   6783:          (?P>name)       call subpattern by name (Python)
                   6784:          \g<name>        call subpattern by name (Oniguruma)
                   6785:          \g'name'        call subpattern by name (Oniguruma)
                   6786:          \g<n>           call subpattern by absolute number (Oniguruma)
                   6787:          \g'n'           call subpattern by absolute number (Oniguruma)
                   6788:          \g<+n>          call subpattern by relative number (PCRE extension)
                   6789:          \g'+n'          call subpattern by relative number (PCRE extension)
                   6790:          \g<-n>          call subpattern by relative number (PCRE extension)
                   6791:          \g'-n'          call subpattern by relative number (PCRE extension)
                   6792: 
                   6793: 
                   6794: CONDITIONAL PATTERNS
                   6795: 
                   6796:          (?(condition)yes-pattern)
                   6797:          (?(condition)yes-pattern|no-pattern)
                   6798: 
                   6799:          (?(n)...        absolute reference condition
                   6800:          (?(+n)...       relative reference condition
                   6801:          (?(-n)...       relative reference condition
                   6802:          (?(<name>)...   named reference condition (Perl)
                   6803:          (?('name')...   named reference condition (Perl)
                   6804:          (?(name)...     named reference condition (PCRE)
                   6805:          (?(R)...        overall recursion condition
                   6806:          (?(Rn)...       specific group recursion condition
                   6807:          (?(R&name)...   specific recursion condition
                   6808:          (?(DEFINE)...   define subpattern for reference
                   6809:          (?(assert)...   assertion condition
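
The absolute reference condition can be sketched as follows. Python's re module is assumed here as a stand-in, since it accepts the (?(n)yes-pattern|no-pattern) form; the other condition types listed above are not covered, and the patterns are made up for the example.

```python
import re

# Group 1 optionally matches "<"; the condition (?(1)...) then requires ">"
# only if group 1 took part in the match. With no "|", the no-pattern is empty.
tagged = re.compile(r"(<)?\w+(?(1)>)")

# Two-branch form: after "<" require ">", otherwise require "!".
either = re.compile(r"(<)?\w+(?(1)>|!)")

for text in ("<abc>", "abc", "<abc", "abc>"):
    print(text, bool(tagged.fullmatch(text)))
```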
                   6810: 
                   6811: 
                   6812: BACKTRACKING CONTROL
                   6813: 
                   6814:        The following act as soon as they are reached:
                   6815: 
                   6816:          (*ACCEPT)       force successful match
                   6817:          (*FAIL)         force backtrack; synonym (*F)
1.1.1.2   misho    6818:          (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
1.1       misho    6819: 
                   6820:        The  following  act only when a subsequent match failure causes a back-
                   6821:        track to reach them. They all force a match failure, but they differ in
                   6822:        what happens afterwards. Those that advance the start-of-match point do
                   6823:        so only if the pattern is not anchored.
                   6824: 
                   6825:          (*COMMIT)       overall failure, no advance of starting point
                   6826:          (*PRUNE)        advance to next starting character
1.1.1.2   misho    6827:          (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
                   6828:          (*SKIP)         advance to current matching position
                   6829:          (*SKIP:NAME)    advance to position corresponding to an earlier
                   6830:                          (*MARK:NAME); if not found, the (*SKIP) is ignored
1.1       misho    6831:          (*THEN)         local failure, backtrack to next alternation
1.1.1.2   misho    6832:          (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
1.1       misho    6833: 
                   6834: 
                   6835: NEWLINE CONVENTIONS
                   6836: 
                   6837:        These are recognized only at the very start of the pattern or  after  a
1.1.1.2   misho    6838:        (*BSR_...), (*UTF8), (*UTF16) or (*UCP) option.
1.1       misho    6839: 
                   6840:          (*CR)           carriage return only
                   6841:          (*LF)           linefeed only
                   6842:          (*CRLF)         carriage return followed by linefeed
                   6843:          (*ANYCRLF)      all three of the above
                   6844:          (*ANY)          any Unicode newline sequence
                   6845: 
                   6846: 
                   6847: WHAT \R MATCHES
                   6848: 
                   6849:        These  are  recognized only at the very start of the pattern or after a
1.1.1.2   misho    6850:        (*...) option that sets the newline convention or a UTF or UCP mode.
1.1       misho    6851: 
                   6852:          (*BSR_ANYCRLF)  CR, LF, or CRLF
                   6853:          (*BSR_UNICODE)  any Unicode newline sequence
                   6854: 
                   6855: 
                   6856: CALLOUTS
                   6857: 
                   6858:          (?C)      callout
                   6859:          (?Cn)     callout with data n
                   6860: 
                   6861: 
                   6862: SEE ALSO
                   6863: 
                   6864:        pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
                   6865: 
                   6866: 
                   6867: AUTHOR
                   6868: 
                   6869:        Philip Hazel
                   6870:        University Computing Service
                   6871:        Cambridge CB2 3QH, England.
                   6872: 
                   6873: 
                   6874: REVISION
                   6875: 
1.1.1.2   misho    6876:        Last updated: 10 January 2012
                   6877:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    6878: ------------------------------------------------------------------------------
                   6879: 
                   6880: 
                   6881: PCREUNICODE(3)                                                  PCREUNICODE(3)
                   6882: 
                   6883: 
                   6884: NAME
                   6885:        PCRE - Perl-compatible regular expressions
                   6886: 
                   6887: 
1.1.1.2   misho    6888: UTF-8, UTF-16, AND UNICODE PROPERTY SUPPORT
                   6889: 
                   6890:        From Release 8.30, in addition to its previous UTF-8 support, PCRE also
                   6891:        supports UTF-16 by means of a separate  16-bit  library.  This  can  be
                   6892:        built as well as, or instead of, the 8-bit library.
                   6893: 
                   6894: 
                   6895: UTF-8 SUPPORT
1.1       misho    6896: 
1.1.1.2   misho    6897:        In order to process UTF-8 strings, you must build PCRE's 8-bit  library
                   6898:        with UTF support, and, in addition, you must call  pcre_compile()  with
                   6899:        the  PCRE_UTF8 option flag, or the pattern must start with the sequence
                   6900:        (*UTF8). When either of these is the case, both  the  pattern  and  any
                   6901:        subject  strings  that  are  matched  against  it  are treated as UTF-8
                   6902:        strings instead of strings of 1-byte characters.
1.1       misho    6903: 
1.1.1.2   misho    6904: 
                   6905: UTF-16 SUPPORT
                   6906: 
                   6907:        In order to process UTF-16 strings, you must build PCRE's 16-bit library
                   6908:        with UTF support, and, in addition, you must call pcre16_compile() with
                   6909:        the PCRE_UTF16 option flag, or the pattern must start with the sequence
                   6910:        (*UTF16).  When  either  of these is the case, both the pattern and any
                   6911:        subject strings that are matched  against  it  are  treated  as  UTF-16
                   6912:        strings instead of strings of 16-bit characters.
                   6913: 
                   6914: 
                   6915: UTF SUPPORT OVERHEAD
                   6916: 
                   6917:        If  you  compile  PCRE with UTF support, but do not use it at run time,
1.1       misho    6918:        the library will be a bit bigger, but the additional run time  overhead
1.1.1.2   misho    6919:        is limited to testing the PCRE_UTF8/16 flag occasionally, so should not
                   6920:        be very big.
                   6921: 
                   6922: 
                   6923: UNICODE PROPERTY SUPPORT
1.1       misho    6924: 
                   6925:        If PCRE is built with Unicode character property support (which implies
1.1.1.2   misho    6926:        UTF  support), the escape sequences \p{..}, \P{..}, and \X can be used.
                   6927:        The available properties that can be tested are limited to the  general
                   6928:        category  properties  such  as  Lu for an upper case letter or Nd for a
                   6929:        decimal number, the Unicode script names such as Arabic or Han, and the
                   6930:        derived  properties Any and L&. A full list is given in the pcrepattern
                   6931:        documentation. Only the short names for properties are  supported.  For
                   6932:        example,  \p{L}  matches a letter. Its Perl synonym, \p{Letter}, is not
                   6933:        supported.  Furthermore, in Perl, many  properties  may  optionally  be
                   6934:        prefixed  by  "Is", for compatibility with Perl 5.6. PCRE does not sup-
                   6935:        port this.
1.1       misho    6936: 
                   6937:    Validity of UTF-8 strings
                   6938: 
1.1.1.2   misho    6939:        When you set the PCRE_UTF8 flag, the byte strings  passed  as  patterns
                   6940:        and subjects are (by default) checked for validity on entry to the
1.1.1.3 ! misho    6941:        relevant functions. The entire string is checked before any other
        !          6942:        processing takes place. From release 7.3 of PCRE, the check follows the
1.1.1.2   misho    6943:        rules of RFC 3629, which are themselves derived from the Unicode speci-
1.1.1.3 ! misho    6944:        fication.  Earlier  releases  of  PCRE  followed the rules of RFC 2279,
        !          6945:        which allows the full range of 31-bit values  (0  to  0x7FFFFFFF).  The
        !          6946:        current  check allows only values in the range U+0 to U+10FFFF, exclud-
1.1.1.2   misho    6947:        ing U+D800 to U+DFFF.
                   6948: 
1.1.1.3 ! misho    6949:        The excluded code points are the "Surrogate Area" of Unicode. They  are
        !          6950:        reserved  for  use  by  UTF-16,  where they are used in pairs to encode
        !          6951:        codepoints with values greater than 0xFFFF. The code  points  that  are
1.1.1.2   misho    6952:        encoded by UTF-16 pairs are available independently in the UTF-8 encod-
1.1.1.3 ! misho    6953:        ing. (In other words, the whole surrogate thing is a fudge  for  UTF-16
1.1.1.2   misho    6954:        which unfortunately messes up UTF-8.)
1.1       misho    6955: 
                   6956:        If an invalid UTF-8 string is passed to PCRE, an error return is given.
1.1.1.3 ! misho    6957:        At compile time, the only additional information is the offset  to  the
        !          6958:        first byte of the failing character. The run-time functions pcre_exec()
        !          6959:        and pcre_dfa_exec() also pass back this information, as well as a  more
        !          6960:        detailed  reason  code if the caller has provided memory in which to do
1.1       misho    6961:        this.
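
       The rules above can be illustrated by a stand-alone check. This is only
       a sketch of an RFC 3629 validity test, not PCRE's own code; the function
       name utf8_first_invalid() is hypothetical. It returns the offset of the
       first byte of the first failing character, or -1 if the string is valid.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API. Returns -1 when
   s[0..len) is valid UTF-8 under RFC 3629 (U+0 to U+10FFFF, excluding
   U+D800 to U+DFFF); otherwise returns the byte offset of the first
   byte of the first failing character. */
long utf8_first_invalid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp, min;
        size_t extra, k;

        if (c < 0x80) { i++; continue; }            /* one-byte character */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return (long)i;  /* stray continuation byte, or a 5/6-byte
                                 lead byte that only RFC 2279 allowed */

        if (len - i <= extra) return (long)i;       /* truncated character */
        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return (long)i;
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }
        if (cp < min) return (long)i;               /* overlong encoding */
        if (cp > 0x10FFFF) return (long)i;          /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return (long)i;  /* surrogate */
        i += extra + 1;
    }
    return -1;
}
```

       A caller that has run a check of this kind once over a long subject
       could reasonably set PCRE_NO_UTF8_CHECK for later matches against the
       same subject.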
                   6962: 
1.1.1.3 ! misho    6963:        In some situations, you may already know that your strings  are  valid,
        !          6964:        and  therefore  want  to  skip these checks in order to improve perfor-
        !          6965:        mance, for example in the case of a long subject string that  is  being
        !          6966:        scanned   repeatedly   with   different   patterns.   If  you  set  the
        !          6967:        PCRE_NO_UTF8_CHECK flag at compile time or at run  time,  PCRE  assumes
        !          6968:        that  the  pattern  or subject it is given (respectively) contains only
        !          6969:        valid UTF-8 codes. In this case, it does not diagnose an invalid  UTF-8
        !          6970:        string.
1.1       misho    6971: 
1.1.1.3 ! misho    6972:        If  you  pass  an  invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
        !          6973:        what happens depends on why the string is invalid. If the  string  con-
1.1       misho    6974:        forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
1.1.1.3 ! misho    6975:        string of characters in the range 0 to  0x7FFFFFFF  by  pcre_dfa_exec()
        !          6976:        and  the interpreted version of pcre_exec(). In other words, apart from
        !          6977:        the initial validity test, these functions (when in UTF-8 mode)  handle
        !          6978:        strings  according  to the more liberal rules of RFC 2279. However, the
1.1       misho    6979:        just-in-time (JIT) optimization for pcre_exec() supports only RFC 3629.
1.1.1.3 ! misho    6980:        If  you are using JIT optimization, or if the string does not even con-
1.1       misho    6981:        form to RFC 2279, the result is undefined. Your program may crash.
                   6982: 
1.1.1.3 ! misho    6983:        If you want to process strings  of  values  in  the  full  range  0  to
        !          6984:        0x7FFFFFFF,  encoded in a UTF-8-like manner as per the old RFC, you can
1.1       misho    6985:        set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
1.1.1.3 ! misho    6986:        this  situation,  you  will  have to apply your own validity check, and
1.1       misho    6987:        avoid the use of JIT optimization.
                   6988: 
1.1.1.2   misho    6989:    Validity of UTF-16 strings
1.1       misho    6990: 
1.1.1.2   misho    6991:        When you set the PCRE_UTF16 flag, the strings of 16-bit data units that
                   6992:        are passed as patterns and subjects are (by default) checked for valid-
1.1.1.3 ! misho    6993:        ity on entry to the relevant functions. Values other than those in  the
1.1.1.2   misho    6994:        surrogate range U+D800 to U+DFFF are independent code points. Values in
                   6995:        the surrogate range must be used in pairs in the correct manner.
                   6996: 
1.1.1.3 ! misho    6997:        If an invalid UTF-16 string is passed  to  PCRE,  an  error  return  is
        !          6998:        given.  At  compile time, the only additional information is the offset
        !          6999:        to the first data unit of the failing character. The run-time functions
1.1.1.2   misho    7000:        pcre16_exec() and pcre16_dfa_exec() also pass back this information, as
1.1.1.3 ! misho    7001:        well as a more detailed reason code if the caller has  provided  memory
1.1.1.2   misho    7002:        in which to do this.
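
       The surrogate pairing rule can be sketched in the same way. Again this
       is an illustration rather than PCRE's own code, and the function name
       utf16_first_invalid() is hypothetical. It returns the offset (in 16-bit
       data units) of the first unit of the first failing character, or -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the PCRE API. A value in the range
   0xD800-0xDBFF (high surrogate) must be followed immediately by a
   value in the range 0xDC00-0xDFFF (low surrogate); a low surrogate
   must never appear on its own. All other values stand alone. */
long utf16_first_invalid(const uint16_t *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned int u = s[i];

        if (u >= 0xD800 && u <= 0xDBFF) {
            if (i + 1 >= len) return (long)i;        /* missing low surrogate */
            if (s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return (long)i;                      /* bad second unit */
            i += 2;
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return (long)i;                          /* isolated low surrogate */
        } else {
            i++;                                     /* independent code point */
        }
    }
    return -1;
}
```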
                   7003: 
1.1.1.3 ! misho    7004:        In  some  situations, you may already know that your strings are valid,
        !          7005:        and therefore want to skip these checks in  order  to  improve  perfor-
        !          7006:        mance.  If  you  set the PCRE_NO_UTF16_CHECK flag at compile time or at
1.1.1.2   misho    7007:        run time, PCRE assumes that the pattern or subject it is given (respec-
                   7008:        tively) contains only valid UTF-16 sequences. In this case, it does not
                   7009:        diagnose an invalid UTF-16 string.
                   7010: 
                   7011:    General comments about UTF modes
                   7012: 
1.1.1.3 ! misho    7013:        1. Codepoints less than 256  can  be  specified  by  either  braced  or
        !          7014:        unbraced  hexadecimal  escape  sequences (for example, \x{b3} or \xb3).
1.1.1.2   misho    7015:        Larger values have to use braced sequences.
                   7016: 
1.1.1.3 ! misho    7017:        2. Octal numbers up to \777 are recognized, and  in  UTF-8  mode,  they
1.1.1.2   misho    7018:        match two-byte characters for values greater than \177.
                   7019: 
                   7020:        3. Repeat quantifiers apply to complete UTF characters, not to individ-
                   7021:        ual data units, for example: \x{100}{3}.
                   7022: 
1.1.1.3 ! misho    7023:        4. The dot metacharacter matches one UTF character instead of a  single
1.1.1.2   misho    7024:        data unit.
                   7025: 
1.1.1.3 ! misho    7026:        5.  The  escape sequence \C can be used to match a single byte in UTF-8
1.1.1.2   misho    7027:        mode, or a single 16-bit data unit in UTF-16 mode, but its use can lead
                   7028:        to some strange effects because it breaks up multi-unit characters (see
1.1.1.3 ! misho    7029:        the description of \C in the pcrepattern documentation). The use of  \C
        !          7030:        is    not    supported    in    the   alternative   matching   function
        !          7031:        pcre[16]_dfa_exec(), nor is it supported in UTF mode by the  JIT  opti-
1.1.1.2   misho    7032:        mization of pcre[16]_exec(). If JIT optimization is requested for a UTF
                   7033:        pattern that contains \C, it will not succeed, and so the matching will
                   7034:        be carried out by the normal interpretive function.
1.1       misho    7035: 
1.1.1.3 ! misho    7036:        6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
1.1       misho    7037:        test characters of any code value, but, by default, the characters that
1.1.1.3 ! misho    7038:        PCRE  recognizes  as digits, spaces, or word characters remain the same
        !          7039:        set as in non-UTF mode, all with values less  than  256.  This  remains
        !          7040:        true  even  when  PCRE  is  built  to include Unicode property support,
1.1.1.2   misho    7041:        because to do otherwise would slow down PCRE in many common cases. Note
1.1.1.3 ! misho    7042:        in  particular that this applies to \b and \B, because they are defined
1.1.1.2   misho    7043:        in terms of \w and \W. If you really want to test for a wider sense of,
1.1.1.3 ! misho    7044:        say,  "digit",  you  can  use  explicit  Unicode property tests such as
1.1.1.2   misho    7045:        \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the
1.1.1.3 ! misho    7046:        character  escapes  work is changed so that Unicode properties are used
1.1.1.2   misho    7047:        to determine which characters match. There are more details in the sec-
                   7048:        tion on generic character types in the pcrepattern documentation.
1.1       misho    7049: 
1.1.1.3 ! misho    7050:        7.  Similarly,  characters that match the POSIX named character classes
1.1       misho    7051:        are all low-valued characters, unless the PCRE_UCP option is set.
                   7052: 
1.1.1.3 ! misho    7053:        8. However, the horizontal and vertical white  space  matching  escapes
        !          7054:        (\h,  \H,  \v, and \V) do match all the appropriate Unicode characters,
1.1       misho    7055:        whether or not PCRE_UCP is set.
                   7056: 
1.1.1.3 ! misho    7057:        9. Case-insensitive matching applies only to  characters  whose  values
        !          7058:        are  less than 128, unless PCRE is built with Unicode property support.
        !          7059:        Even when Unicode property support is available, PCRE  still  uses  its
        !          7060:        own  character  tables when checking the case of low-valued characters,
        !          7061:        so as not to degrade performance.  The Unicode property information  is
1.1       misho    7062:        used only for characters with higher values. Furthermore, PCRE supports
1.1.1.3 ! misho    7063:        case-insensitive matching only  when  there  is  a  one-to-one  mapping
        !          7064:        between  a letter's cases. There are a small number of many-to-one map-
1.1       misho    7065:        pings in Unicode; these are not supported by PCRE.
                   7066: 
                   7067: 
                   7068: AUTHOR
                   7069: 
                   7070:        Philip Hazel
                   7071:        University Computing Service
                   7072:        Cambridge CB2 3QH, England.
                   7073: 
                   7074: 
                   7075: REVISION
                   7076: 
1.1.1.3 ! misho    7077:        Last updated: 14 April 2012
1.1.1.2   misho    7078:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    7079: ------------------------------------------------------------------------------
                   7080: 
                   7081: 
                   7082: PCREJIT(3)                                                          PCREJIT(3)
                   7083: 
                   7084: 
                   7085: NAME
                   7086:        PCRE - Perl-compatible regular expressions
                   7087: 
                   7088: 
                   7089: PCRE JUST-IN-TIME COMPILER SUPPORT
                   7090: 
                   7091:        Just-in-time  compiling  is a heavyweight optimization that can greatly
                   7092:        speed up pattern matching. However, it comes at the cost of extra  pro-
                   7093:        cessing before the match is performed. Therefore, it is of most benefit
                   7094:        when the same pattern is going to be matched many times. This does  not
1.1.1.2   misho    7095:        necessarily  mean  many calls of a matching function; if the pattern is
                   7096:        not anchored, matching attempts may take place many  times  at  various
                   7097:        positions  in  the  subject, even for a single call.  Therefore, if the
1.1       misho    7098:        subject string is very long, it may still pay to use  JIT  for  one-off
                   7099:        matches.
                   7100: 
1.1.1.2   misho    7101:        JIT  support  applies  only to the traditional Perl-compatible matching
                   7102:        function.  It does not apply when the DFA matching  function  is  being
                   7103:        used. The code for this support was written by Zoltan Herczeg.
                   7104: 
                   7105: 
                   7106: 8-BIT and 16-BIT SUPPORT
                   7107: 
                   7108:        JIT  support is available for both the 8-bit and 16-bit PCRE libraries.
                   7109:        To  keep  this  documentation  simple,  only  the  8-bit  interface  is
                   7110:        described in what follows. If you are using the 16-bit library, substi-
                   7111:        tute  the  16-bit  functions  and  16-bit  structures   (for   example,
                   7112:        pcre16_jit_stack instead of pcre_jit_stack).
1.1       misho    7113: 
                   7114: 
                   7115: AVAILABILITY OF JIT SUPPORT
                   7116: 
                   7117:        JIT  support  is  an  optional  feature of PCRE. The "configure" option
                   7118:        --enable-jit (or equivalent CMake option) must  be  set  when  PCRE  is
                   7119:        built  if  you want to use JIT. The support is limited to the following
                   7120:        hardware platforms:
                   7121: 
                   7122:          ARM v5, v7, and Thumb2
                   7123:          Intel x86 32-bit and 64-bit
                   7124:          MIPS 32-bit
1.1.1.2   misho    7125:          Power PC 32-bit and 64-bit
1.1       misho    7126: 
1.1.1.3 ! misho    7127:        If --enable-jit is set on an unsupported platform, compilation fails.
1.1       misho    7128: 
                   7129:        A program that is linked with PCRE 8.20 or later can tell if  JIT  sup-
                   7130:        port  is  available  by  calling pcre_config() with the PCRE_CONFIG_JIT
                   7131:        option. The result is 1 when JIT is available, and  0  otherwise.  How-
                   7132:        ever, a simple program does not need to check this in order to use JIT.
1.1.1.3 ! misho    7133:        The API is implemented in a way that falls  back  to  the  interpretive
1.1       misho    7134:        code if JIT is not available.
                   7135: 
                   7136:        If  your program may sometimes be linked with versions of PCRE that are
                   7137:        older than 8.20, but you want to use JIT when it is available, you  can
                   7138:        test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT
                   7139:        macro such as PCRE_CONFIG_JIT, for compile-time control of your code.
                   7140: 
                   7141: 
                   7142: SIMPLE USE OF JIT
                   7143: 
                   7144:        You have to do two things to make use of the JIT support  in  the  sim-
                   7145:        plest way:
                   7146: 
                   7147:          (1) Call pcre_study() with the PCRE_STUDY_JIT_COMPILE option for
                   7148:              each compiled pattern, and pass the resulting pcre_extra block to
                   7149:              pcre_exec().
                   7150: 
                   7151:          (2) Use pcre_free_study() to free the pcre_extra block when it is
1.1.1.3 ! misho    7152:              no longer needed, instead of just freeing it yourself. This
1.1       misho    7153:              ensures that any JIT data is also freed.
                   7154: 
                   7155:        For  a  program  that may be linked with pre-8.20 versions of PCRE, you
                   7156:        can insert
                   7157: 
                   7158:          #ifndef PCRE_STUDY_JIT_COMPILE
                   7159:          #define PCRE_STUDY_JIT_COMPILE 0
                   7160:          #endif
                   7161: 
                   7162:        so that no option is passed to pcre_study(),  and  then  use  something
                   7163:        like this to free the study data:
                   7164: 
                   7165:          #ifdef PCRE_CONFIG_JIT
                   7166:              pcre_free_study(study_ptr);
                   7167:          #else
                   7168:              pcre_free(study_ptr);
                   7169:          #endif
                   7170: 
1.1.1.3 ! misho    7171:        PCRE_STUDY_JIT_COMPILE  requests  the JIT compiler to generate code for
        !          7172:        complete matches.  If  you  want  to  run  partial  matches  using  the
        !          7173:        PCRE_PARTIAL_HARD  or  PCRE_PARTIAL_SOFT  options  of  pcre_exec(), you
        !          7174:        should set one or both of the following  options  in  addition  to,  or
        !          7175:        instead of, PCRE_STUDY_JIT_COMPILE when you call pcre_study():
        !          7176: 
        !          7177:          PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
        !          7178:          PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
        !          7179: 
        !          7180:        The  JIT  compiler  generates  different optimized code for each of the
        !          7181:        three modes (normal, soft partial, hard partial). When  pcre_exec()  is
        !          7182:        called,  the appropriate code is run if it is available. Otherwise, the
        !          7183:        pattern is matched using interpretive code.
        !          7184: 
        !          7185:        In some circumstances you may need to call additional functions.  These
        !          7186:        are  described  in  the  section  entitled  "Controlling the JIT stack"
1.1       misho    7187:        below.
                   7188: 
1.1.1.3 ! misho    7189:        If JIT  support  is  not  available,  PCRE_STUDY_JIT_COMPILE  etc.  are
        !          7190:        ignored, and no JIT data is created. Otherwise, the compiled pattern is
        !          7191:        passed to the JIT compiler, which turns it into machine code that  exe-
        !          7192:        cutes  much  faster than the normal interpretive code. When pcre_exec()
        !          7193:        is passed a pcre_extra block containing a pointer to JIT  code  of  the
        !          7194:        appropriate  mode  (normal  or  hard/soft  partial), it obeys that code
        !          7195:        instead of running the interpreter. The result is  identical,  but  the
        !          7196:        compiled JIT code runs much faster.
1.1       misho    7197: 
                   7198:        There  are some pcre_exec() options that are not supported for JIT exe-
                   7199:        cution. There are also some  pattern  items  that  JIT  cannot  handle.
                   7200:        Details  are  given below. In both cases, execution automatically falls
1.1.1.3 ! misho    7201:        back to the interpretive code. If you want  to  know  whether  JIT  was
        !          7202:        actually  used  for  a  particular  match, you should arrange for a JIT
        !          7203:        callback function to be set up as described  in  the  section  entitled
        !          7204:        "Controlling  the JIT stack" below, even if you do not need to supply a
        !          7205:        non-default JIT stack. Such a callback function is called whenever  JIT
        !          7206:        code  is about to be obeyed. If the execution options are not right for
        !          7207:        JIT execution, the callback function is not obeyed.
1.1       misho    7208: 
                   7209:        If the JIT compiler finds an unsupported item, no JIT  data  is  gener-
                   7210:        ated.  You  can find out if JIT execution is available after studying a
                   7211:        pattern by calling pcre_fullinfo() with  the  PCRE_INFO_JIT  option.  A
                   7212:        result  of  1  means that JIT compilation was successful. A result of 0
                   7213:        means that JIT support is not available, or the pattern was not studied
1.1.1.3 ! misho    7214:        with  PCRE_STUDY_JIT_COMPILE  etc., or the JIT compiler was not able to
        !          7215:        handle the pattern.
1.1       misho    7216: 
                   7217:        Once a pattern has been studied, with or without JIT, it can be used as
                   7218:        many times as you like for matching different subject strings.
                   7219: 
                   7220: 
                   7221: UNSUPPORTED OPTIONS AND PATTERN ITEMS
                   7222: 
                   7223:        The  only  pcre_exec() options that are supported for JIT execution are
1.1.1.3 ! misho    7224:        PCRE_NO_UTF8_CHECK,  PCRE_NO_UTF16_CHECK,   PCRE_NOTBOL,   PCRE_NOTEOL,
        !          7225:        PCRE_NOTEMPTY,  PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PAR-
        !          7226:        TIAL_SOFT.
1.1       misho    7227: 
                   7228:        The unsupported pattern items are:
                   7229: 
                   7230:          \C             match a single byte; not supported in UTF-8 mode
                   7231:          (?Cn)          callouts
1.1.1.3 ! misho    7232:          (*PRUNE)       )
        !          7233:          (*SKIP)        ) backtracking control verbs
1.1       misho    7234:          (*THEN)        )
                   7235: 
                   7236:        Support for some of these may be added in future.
                   7237: 
                   7238: 
                   7239: RETURN VALUES FROM JIT EXECUTION
                   7240: 
                   7241:        When a pattern is matched using JIT execution, the  return  values  are
                   7242:        the  same as those given by the interpretive pcre_exec() code, with the
                   7243:        addition of one new error code: PCRE_ERROR_JIT_STACKLIMIT.  This  means
                   7244:        that  the memory used for the JIT stack was insufficient. See "Control-
                   7245:        ling the JIT stack" below for a discussion of JIT stack usage. For com-
                   7246:        patibility  with  the  interpretive pcre_exec() code, no more than two-
                   7247:        thirds of the ovector argument is used for passing back  captured  sub-
                   7248:        strings.
                   7249: 
                   7250:        The  error  code  PCRE_ERROR_MATCHLIMIT  is returned by the JIT code if
                   7251:        searching a very large pattern tree goes on for too long, as it  is  in
                   7252:        the  same circumstance when JIT is not used, but the details of exactly
                   7253:        what is counted are not the same. The  PCRE_ERROR_RECURSIONLIMIT  error
                   7254:        code is never returned by JIT execution.
                   7255: 
                   7256: 
                   7257: SAVING AND RESTORING COMPILED PATTERNS
                   7258: 
                   7259:        The  code  that  is  generated by the JIT compiler is architecture-spe-
                   7260:        cific, and is also position dependent. For those reasons it  cannot  be
                   7261:        saved  (in a file or database) and restored later like the bytecode and
                   7262:        other data of a compiled pattern. Saving and  restoring  compiled  pat-
                   7263:        terns  is not something many people do. More detail about this facility
                   7264:        is given in the pcreprecompile documentation. It should be possible  to
                   7265:        run  pcre_study() on a saved and restored pattern, and thereby recreate
                   7266:        the JIT data, but because JIT compilation uses  significant  resources,
                   7267:        it  is  probably  not worth doing this; you might as well recompile the
                   7268:        original pattern.
                   7269: 
                   7270: 
                   7271: CONTROLLING THE JIT STACK
                   7272: 
                   7273:        When the compiled JIT code runs, it needs a block of memory to use as a
                   7274:        stack.   By  default,  it  uses 32K on the machine stack. However, some
                   7275:        large  or  complicated  patterns  need  more  than  this.   The   error
                   7276:        PCRE_ERROR_JIT_STACKLIMIT  is  given  when  there  is not enough stack.
                   7277:        Three functions are provided for managing blocks of memory for  use  as
                   7278:        JIT  stacks. There is further discussion about the use of JIT stacks in
                   7279:        the section entitled "JIT stack FAQ" below.
                   7280: 
                   7281:        The pcre_jit_stack_alloc() function creates a JIT stack. Its  arguments
                   7282:        are  a starting size and a maximum size, and it returns a pointer to an
                   7283:        opaque structure of type pcre_jit_stack, or NULL if there is an  error.
                   7284:        The  pcre_jit_stack_free() function can be used to free a stack that is
                   7285:        no longer needed. (For the technically minded:  the  address  space  is
                   7286:        allocated by mmap or VirtualAlloc.)
                   7287: 
                   7288:        JIT  uses far less memory for recursion than the interpretive code, and
                   7289:        a maximum stack size of 512K to 1M should be more than enough  for  any
                   7290:        pattern.
                   7291: 
                   7292:        The  pcre_assign_jit_stack()  function  specifies  which stack JIT code
                   7293:        should use. Its arguments are as follows:
                   7294: 
                   7295:          pcre_extra         *extra
                   7296:          pcre_jit_callback  callback
                   7297:          void               *data
                   7298: 
                   7299:        The extra argument must be  the  result  of  studying  a  pattern  with
1.1.1.3 ! misho    7300:        PCRE_STUDY_JIT_COMPILE etc. There are three cases for the values of the
1.1       misho    7301:        other two arguments:
                   7302: 
                   7303:          (1) If callback is NULL and data is NULL, an internal 32K block
                   7304:              on the machine stack is used.
                   7305: 
                   7306:          (2) If callback is NULL and data is not NULL, data must be
                   7307:              a valid JIT stack, the result of calling pcre_jit_stack_alloc().
                   7308: 
1.1.1.3 ! misho    7309:          (3) If callback is not NULL, it must point to a function that is
        !          7310:              called with data as an argument at the start of matching, in
        !          7311:              order to set up a JIT stack. If the return from the callback
        !          7312:              function is NULL, the internal 32K stack is used; otherwise the
        !          7313:              return value must be a valid JIT stack, the result of calling
        !          7314:              pcre_jit_stack_alloc().
        !          7315: 
        !          7316:        A callback function is obeyed whenever JIT code is about to be run;  it
        !          7317:        is  not  obeyed when pcre_exec() is called with options that are incom-
        !          7318:        patible with JIT execution. A callback function can therefore be used to
        !          7319:        determine  whether  a  match  operation  was  executed by JIT or by the
        !          7320:        interpreter.
        !          7321: 
        !          7322:        You may safely use the same JIT stack for more than one pattern (either
        !          7323:        by  assigning directly or by callback), as long as the patterns are all
        !          7324:        matched sequentially in the same thread. In a multithread  application,
        !          7325:        if  you  do not specify a JIT stack, or if you assign or pass back NULL
        !          7326:        from a callback, that is thread-safe, because each thread has  its  own
        !          7327:        machine  stack.  However,  if  you  assign  or pass back a non-NULL JIT
        !          7328:        stack, this must be a different stack  for  each  thread  so  that  the
        !          7329:        application is thread-safe.
        !          7330: 
        !          7331:        Strictly  speaking,  even more is allowed. You can assign the same non-
        !          7332:        NULL stack to any number of patterns as long as they are not  used  for
        !          7333:        matching  by  multiple  threads  at the same time. For example, you can
        !          7334:        assign the same stack to all compiled patterns, and use a global  mutex
        !          7335:        in  the callback to wait until the stack is available for use. However,
        !          7336:        this is an inefficient solution, and not recommended.
1.1       misho    7337: 
1.1.1.3 ! misho    7338:        This is a suggestion for how a multithreaded program that needs to  set
        !          7339:        up non-default JIT stacks might operate:
1.1       misho    7340: 
                   7341:          During thread initialization
                   7342:            thread_local_var = pcre_jit_stack_alloc(...)
                   7343: 
                   7344:          During thread exit
                   7345:            pcre_jit_stack_free(thread_local_var)
                   7346: 
                   7347:          Use a one-line callback function
                   7348:            return thread_local_var
                   7349: 
1.1.1.3 ! misho    7350:        All  the  functions  described in this section do nothing if JIT is not
        !          7351:        available, and pcre_assign_jit_stack() does nothing  unless  the  extra
        !          7352:        argument  is  non-NULL  and  points  to  a pcre_extra block that is the
        !          7353:        result of a successful study with PCRE_STUDY_JIT_COMPILE etc.
1.1       misho    7354: 
                   7355: 
                   7356: JIT STACK FAQ
                   7357: 
                   7358:        (1) Why do we need JIT stacks?
                   7359: 
1.1.1.3 ! misho    7360:        PCRE (and JIT) is a recursive, depth-first engine, so it needs a  stack
        !          7361:        where  the local data of the current node is pushed before checking its
1.1       misho    7362:        child nodes.  Allocating real machine stack on some platforms is diffi-
                   7363:        cult. For example, on PowerPC the stack chain has to be updated every
1.1.1.3 ! misho    7364:        time the stack is extended. Although this is possible, the time taken
1.1       misho    7365:        by the updating decreases performance. So we do the recursion in memory.
                   7366: 
                   7367:        (2) Why don't we simply allocate blocks of memory with malloc()?
                   7368: 
1.1.1.3 ! misho    7369:        Modern  operating  systems  have  a  nice  feature: they can reserve an
1.1       misho    7370:        address space instead of allocating memory. We can safely allocate mem-
1.1.1.3 ! misho    7371:        ory  pages  inside  this address space, so the stack could grow without
1.1       misho    7372:        moving memory data (this is important because of pointers). Thus we can
1.1.1.3 ! misho    7373:        allocate  1M  address space, and use only a single memory page (usually
        !          7374:        4K) if that is enough. However, we can still grow up to 1M  anytime  if
1.1       misho    7375:        needed.
                   7376: 
                   7377:        (3) Who "owns" a JIT stack?
                   7378: 
                   7379:        The owner of the stack is the user program, not the JIT studied pattern
1.1.1.3 ! misho    7380:        or anything else. The user program must ensure that if a stack is  used
        !          7381:        by  pcre_exec(), (that is, it is assigned to the pattern currently run-
1.1       misho    7382:        ning), that stack must not be used by any other threads (to avoid over-
                   7383:        writing the same memory area). The best practice for multithreaded pro-
1.1.1.3 ! misho    7384:        grams is to allocate a stack for each thread,  and  return  this  stack
1.1       misho    7385:        through the JIT callback function.
                   7386: 
                   7387:        (4) When should a JIT stack be freed?
                   7388: 
                   7389:        You can free a JIT stack at any time, as long as it will not be used by
1.1.1.3 ! misho    7390:        pcre_exec() again. When you assign the  stack  to  a  pattern,  only  a
        !          7391:        pointer  is set. There is no reference counting or any other magic. You
        !          7392:        can free the patterns and stacks in any order,  anytime.  Just  do  not
        !          7393:        call  pcre_exec() with a pattern pointing to an already freed stack, as
        !          7394:        that will cause a SEGFAULT. (Also, do not free a stack currently used by
        !          7395:        pcre_exec()  in  another  thread). You can also replace the stack for a
        !          7396:        pattern at any time. You  can  even  free  the  previous  stack  before
1.1       misho    7397:        assigning a replacement.
                   7398: 
1.1.1.3 ! misho    7399:        (5)  Should  I  allocate/free  a  stack every time before/after calling
1.1       misho    7400:        pcre_exec()?
                   7401: 
1.1.1.3 ! misho    7402:        No, because this is too costly in  terms  of  resources.  However,  you
        !          7403:        could  implement some clever idea which releases the stack if it is not
1.1       misho    7404:        used in let's say two minutes. The JIT callback can help to achieve this
                   7405:        without keeping a list of the currently JIT studied patterns.
                   7406: 
1.1.1.3 ! misho    7407:        (6)  OK, the stack is for long term memory allocation. But what happens
        !          7408:        if a pattern causes stack overflow with a stack of 1M? Is that 1M  kept
1.1       misho    7409:        until the stack is freed?
                   7410: 
1.1.1.3 ! misho    7411:        Especially on embedded systems, it might be a good idea to release
        !          7412:        memory sometimes without freeing the stack. There is no API for this at
        !          7413:        the  moment.  Probably a function call which returns with the currently
        !          7414:        allocated memory for any stack and another which allows releasing  mem-
1.1       misho    7415:        ory (shrinking the stack) would be a good idea if someone needs this.
                   7416: 
                   7417:        (7) This is too much of a headache. Isn't there any better solution for
                   7418:        JIT stack handling?
                   7419: 
1.1.1.3 ! misho    7420:        No, thanks to Windows. If POSIX threads were used everywhere, we  could
1.1       misho    7421:        throw out this complicated API.
                   7422: 
                   7423: 
                   7424: EXAMPLE CODE
                   7425: 
1.1.1.3 ! misho    7426:        This  is  a  single-threaded example that specifies a JIT stack without
1.1       misho    7427:        using a callback.
                   7428: 
                   7429:          int rc;
                   7430:          int ovector[30];
                   7431:          pcre *re;
                   7432:          pcre_extra *extra;
                   7433:          pcre_jit_stack *jit_stack;
                   7434: 
                   7435:          re = pcre_compile(pattern, 0, &error, &erroffset, NULL);
                   7436:          /* Check for errors */
                   7437:          extra = pcre_study(re, PCRE_STUDY_JIT_COMPILE, &error);
                   7438:          jit_stack = pcre_jit_stack_alloc(32*1024, 512*1024);
                   7439:          /* Check for error (NULL) */
                   7440:          pcre_assign_jit_stack(extra, NULL, jit_stack);
                   7441:          rc = pcre_exec(re, extra, subject, length, 0, 0, ovector, 30);
                   7442:          /* Check results */
                   7443:          pcre_free(re);
                   7444:          pcre_free_study(extra);
                   7445:          pcre_jit_stack_free(jit_stack);
                   7446: 
                   7447: 
                   7448: SEE ALSO
                   7449: 
                   7450:        pcreapi(3)
                   7451: 
                   7452: 
                   7453: AUTHOR
                   7454: 
                   7455:        Philip Hazel (FAQ by Zoltan Herczeg)
                   7456:        University Computing Service
                   7457:        Cambridge CB2 3QH, England.
                   7458: 
                   7459: 
                   7460: REVISION
                   7461: 
1.1.1.3 ! misho    7462:        Last updated: 04 May 2012
1.1.1.2   misho    7463:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    7464: ------------------------------------------------------------------------------
                   7465: 
                   7466: 
                   7467: PCREPARTIAL(3)                                                  PCREPARTIAL(3)
                   7468: 
                   7469: 
                   7470: NAME
                   7471:        PCRE - Perl-compatible regular expressions
                   7472: 
                   7473: 
                   7474: PARTIAL MATCHING IN PCRE
                   7475: 
1.1.1.2   misho    7476:        In normal use of PCRE, if the subject string that is passed to a match-
                   7477:        ing function matches as far as it goes, but is too short to  match  the
                   7478:        entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances
                   7479:        where it might be helpful to distinguish this case from other cases  in
                   7480:        which there is no match.
1.1       misho    7481: 
                   7482:        Consider, for example, an application where a human is required to type
                   7483:        in data for a field with specific formatting requirements.  An  example
                   7484:        might be a date in the form ddmmmyy, defined by this pattern:
                   7485: 
                   7486:          ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
                   7487: 
                   7488:        If the application sees the user's keystrokes one by one, and can check
                   7489:        that what has been typed so far is potentially valid,  it  is  able  to
                   7490:        raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
                   7491:        reflecting the character that has been typed, for example. This immedi-
                   7492:        ate  feedback is likely to be a better user interface than a check that
                   7493:        is delayed until the entire string has been entered.  Partial  matching
                   7494:        can  also be useful when the subject string is very long and is not all
                   7495:        available at once.
                   7496: 
                   7497:        PCRE supports partial matching by means of  the  PCRE_PARTIAL_SOFT  and
1.1.1.2   misho    7498:        PCRE_PARTIAL_HARD  options,  which  can  be set when calling any of the
                   7499:        matching functions. For backwards compatibility, PCRE_PARTIAL is a syn-
                   7500:        onym  for  PCRE_PARTIAL_SOFT.  The essential difference between the two
                   7501:        options is whether or not a partial match is preferred to  an  alterna-
                   7502:        tive complete match, though the details differ between the two types of
                   7503:        matching function. If both options  are  set,  PCRE_PARTIAL_HARD  takes
                   7504:        precedence.
                   7505: 
1.1.1.3 ! misho    7506:        If  you  want to use partial matching with just-in-time optimized code,
        !          7507:        you must call pcre_study() or pcre16_study() with one or both of  these
        !          7508:        options:
        !          7509: 
        !          7510:          PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
        !          7511:          PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
        !          7512: 
        !          7513:        PCRE_STUDY_JIT_COMPILE  should also be set if you are going to run non-
        !          7514:        partial matches on the same pattern. If the appropriate JIT study  mode
        !          7515:        has not been set for a match, the interpretive matching code is used.
        !          7516: 
        !          7517:        Setting a partial matching option disables two of PCRE's standard opti-
        !          7518:        mizations. PCRE remembers the last literal data unit in a pattern,  and
        !          7519:        abandons  matching  immediately  if  it  is  not present in the subject
1.1.1.2   misho    7520:        string. This optimization cannot be used  for  a  subject  string  that
                   7521:        might  match only partially. If the pattern was studied, PCRE knows the
                   7522:        minimum length of a matching string, and does not  bother  to  run  the
                   7523:        matching  function  on  shorter strings. This optimization is also dis-
1.1       misho    7524:        abled for partial matching.
                   7525: 
                   7526: 
1.1.1.2   misho    7527: PARTIAL MATCHING USING pcre_exec() OR pcre16_exec()
1.1       misho    7528: 
1.1.1.2   misho    7529:        A partial match occurs during a call to  pcre_exec()  or  pcre16_exec()
                   7530:        when  the end of the subject string is reached successfully, but match-
                   7531:        ing cannot continue because more characters  are  needed.  However,  at
                   7532:        least one character in the subject must have been inspected. This char-
                   7533:        acter need not form part of the final matched string; lookbehind asser-
                   7534:        tions  and the \K escape sequence provide ways of inspecting characters
                   7535:        before the start of a matched substring. The requirement for inspecting
                   7536:        at  least  one  character  exists because an empty string can always be
                   7537:        matched; without such a restriction there would  always  be  a  partial
                   7538:        match of an empty string at the end of the subject.
                   7539: 
                   7540:        If  there  are  at least two slots in the offsets vector when a partial
                   7541:        match is returned, the first slot is set to the offset of the  earliest
                   7542:        character that was inspected. For convenience, the second offset points
                   7543:        to the end of the subject so that a substring can easily be identified.
1.1       misho    7544: 
                   7545:        For the majority of patterns, the first offset identifies the start  of
                   7546:        the  partially matched string. However, for patterns that contain look-
                   7547:        behind assertions, or \K, or begin with \b or  \B,  earlier  characters
                   7548:        have been inspected while carrying out the match. For example:
                   7549: 
                   7550:          /(?<=abc)123/
                   7551: 
                   7552:        This pattern matches "123", but only if it is preceded by "abc". If the
                   7553:        subject string is "xyzabc12", the offsets after a partial match are for
                   7554:        the  substring  "abc12",  because  all  these  characters are needed if
                   7555:        another match is tried with extra characters added to the subject.
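
As an illustration of how the two offsets can be used, the helper below copies the text they identify out of the subject. It is a sketch only: copy_partial() is a hypothetical name, and the code assumes the offsets vector was filled in by a call that returned PCRE_ERROR_PARTIAL.

```c
#include <stddef.h>
#include <string.h>

/* Copy the text identified by ovector[0]..ovector[1] into out and
   terminate it with NUL. Returns the number of bytes copied, or -1 if
   the offsets are inconsistent or out is too small. */
int copy_partial(const char *subject, const int *ovector,
                 char *out, size_t outsize)
{
    int start = ovector[0];
    int end = ovector[1];
    size_t n;

    if (start < 0 || end < start)
        return -1;
    n = (size_t)(end - start);
    if (n + 1 > outsize)
        return -1;
    memcpy(out, subject + start, n);
    out[n] = '\0';
    return (int)n;
}
```

For the subject "xyzabc12" and the pattern above, the substring "abc12" corresponds to offsets 3 and 8, so the helper would hand back "abc12", which is the text that has to be kept when more characters are appended.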
                   7556: 
                   7557:        What happens when a partial match is identified depends on which of the
                   7558:        two partial matching options is set.
                   7559: 
1.1.1.2   misho    7560:    PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre16_exec()
1.1       misho    7561: 
1.1.1.2   misho    7562:        If  PCRE_PARTIAL_SOFT  is set when pcre_exec() or pcre16_exec() identi-
                   7563:        fies a partial match, the partial match  is  remembered,  but  matching
                   7564:        continues  as  normal, and other alternatives in the pattern are tried.
                   7565:        If no complete match  can  be  found,  PCRE_ERROR_PARTIAL  is  returned
                   7566:        instead of PCRE_ERROR_NOMATCH.
1.1       misho    7567: 
                   7568:        This  option  is "soft" because it prefers a complete match over a par-
                   7569:        tial match.  All the various matching items in a pattern behave  as  if
                   7570:        the  subject string is potentially complete. For example, \z, \Z, and $
                   7571:        match at the end of the subject, as normal, and for \b and \B  the  end
                   7572:        of the subject is treated as a non-alphanumeric.
                   7573: 
                   7574:        If  there  is more than one partial match, the first one that was found
                   7575:        provides the data that is returned. Consider this pattern:
                   7576: 
                   7577:          /123\w+X|dogY/
                   7578: 
                   7579:        If this is matched against the subject string "abc123dog", both  alter-
                   7580:        natives  fail  to  match,  but the end of the subject is reached during
                   7581:        matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set  to  3
                   7582:        and  9, identifying "123dog" as the first partial match that was found.
                   7583:        (In this example, there are two partial matches, because "dog"  on  its
                   7584:        own partially matches the second alternative.)
                   7585: 
1.1.1.2   misho    7586:    PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre16_exec()
1.1       misho    7587: 
1.1.1.2   misho    7588:        If   PCRE_PARTIAL_HARD   is   set  for  pcre_exec()  or  pcre16_exec(),
                   7589:        PCRE_ERROR_PARTIAL is returned as soon as a  partial  match  is  found,
                   7590:        without continuing to search for possible complete matches. This option
                   7591:        is "hard" because it prefers an earlier partial match over a later com-
                   7592:        plete  match.  For  this reason, the assumption is made that the end of
                   7593:        the supplied subject string may not be the true end  of  the  available
                   7594:        data, and so, if \z, \Z, \b, \B, or $ are encountered at the end of the
                   7595:        subject, the result is PCRE_ERROR_PARTIAL, provided that at  least  one
                   7596:        character in the subject has been inspected.
                   7597: 
                   7598:        Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16 subject
                   7599:        strings are checked for validity. Normally, an invalid sequence  causes
                   7600:        the  error  PCRE_ERROR_BADUTF8  or PCRE_ERROR_BADUTF16. However, in the
                   7601:        special case of a truncated  character  at  the  end  of  the  subject,
                   7602:        PCRE_ERROR_SHORTUTF8   or   PCRE_ERROR_SHORTUTF16   is   returned  when
                   7603:        PCRE_PARTIAL_HARD is set.
1.1       misho    7604: 
                   7605:    Comparing hard and soft partial matching
                   7606: 
                   7607:        The difference between the two partial matching options can  be  illus-
                   7608:        trated by a pattern such as:
                   7609: 
                   7610:          /dog(sbody)?/
                   7611: 
                   7612:        This  matches either "dog" or "dogsbody", greedily (that is, it prefers
                   7613:        the longer string if possible). If it is  matched  against  the  string
                   7614:        "dog"  with  PCRE_PARTIAL_SOFT,  it  yields a complete match for "dog".
                   7615:        However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL.
                   7616:        On  the  other hand, if the pattern is made ungreedy the result is dif-
                   7617:        ferent:
                   7618: 
                   7619:          /dog(sbody)??/
                   7620: 
1.1.1.2   misho    7621:        In this case the result is always a  complete  match  because  that  is
                   7622:        found  first,  and  matching  never  continues after finding a complete
                   7623:        match. It might be easier to follow this explanation by thinking of the
                   7624:        two patterns like this:
1.1       misho    7625: 
                   7626:          /dog(sbody)?/    is the same as  /dogsbody|dog/
                   7627:          /dog(sbody)??/   is the same as  /dog|dogsbody/
                   7628: 
1.1.1.2   misho    7629:        The  second pattern will never match "dogsbody", because it will always
                   7630:        find the shorter match first.
1.1       misho    7631: 
                   7632: 
1.1.1.2   misho    7633: PARTIAL MATCHING USING pcre_dfa_exec() OR pcre16_dfa_exec()
1.1       misho    7634: 
1.1.1.2   misho    7635:        The DFA functions move along the subject string character by character,
                   7636:        without  backtracking,  searching  for  all possible matches simultane-
                   7637:        ously. If the end of the subject is reached before the end of the  pat-
                   7638:        tern,  there is the possibility of a partial match, again provided that
                   7639:        at least one character has been inspected.
1.1       misho    7640: 
                   7641:        When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned  only  if
                   7642:        there  have  been  no complete matches. Otherwise, the complete matches
                   7643:        are returned.  However, if PCRE_PARTIAL_HARD is set,  a  partial  match
                   7644:        takes  precedence  over any complete matches. The portion of the string
                   7645:        that was inspected when the longest partial match was found is  set  as
                   7646:        the first matching string, provided there are at least two slots in the
                   7647:        offsets vector.
                   7648: 
1.1.1.2   misho    7649:        Because the DFA functions always search for all possible  matches,  and
                   7650:        there  is  no  difference between greedy and ungreedy repetition, their
                   7651:        behaviour is different  from  the  standard  functions  when  PCRE_PAR-
                   7652:        TIAL_HARD  is  set.  Consider  the  string  "dog"  matched  against the
                   7653:        ungreedy pattern shown above:
1.1       misho    7654: 
                   7655:          /dog(sbody)??/
                   7656: 
1.1.1.2   misho    7657:        Whereas the standard functions stop as soon as they find  the  complete
                   7658:        match  for  "dog",  the  DFA  functions also find the partial match for
                   7659:        "dogsbody", and so return that when PCRE_PARTIAL_HARD is set.
1.1       misho    7660: 
                   7661: 
                   7662: PARTIAL MATCHING AND WORD BOUNDARIES
                   7663: 
                   7664:        If a pattern ends with one of the sequences \b or \B, which test for word
                   7665:        boundaries,  partial  matching with PCRE_PARTIAL_SOFT can give counter-
                   7666:        intuitive results. Consider this pattern:
                   7667: 
                   7668:          /\bcat\b/
                   7669: 
                   7670:        This matches "cat", provided there is a word boundary at either end. If
                   7671:        the subject string is "the cat", the comparison of the final "t" with a
                   7672:        following character cannot take place, so a  partial  match  is  found.
1.1.1.2   misho    7673:        However,  normal  matching carries on, and \b matches at the end of the
                   7674:        subject when the last character is a letter, so  a  complete  match  is
                   7675:        found.   The   result,  therefore,  is  not  PCRE_ERROR_PARTIAL.  Using
                   7676:        PCRE_PARTIAL_HARD in this case does yield  PCRE_ERROR_PARTIAL,  because
                   7677:        then the partial match takes precedence.
1.1       misho    7678: 
                   7679: 
                   7680: FORMERLY RESTRICTED PATTERNS
                   7681: 
                   7682:        For releases of PCRE prior to 8.00, because of the way certain internal
1.1.1.2   misho    7683:        optimizations  were  implemented  in  the  pcre_exec()  function,   the
                   7684:        PCRE_PARTIAL  option  (predecessor  of  PCRE_PARTIAL_SOFT) could not be
                   7685:        used with all patterns. From release 8.00 onwards, the restrictions  no
                   7686:        longer  apply,  and  partial  matching  can  be  requested   for   any
                   7687:        pattern.
1.1       misho    7688: 
                   7689:        Items that were formerly restricted were repeated single characters and
1.1.1.2   misho    7690:        repeated  metasequences. If PCRE_PARTIAL was set for a pattern that did
                   7691:        not conform to the restrictions, pcre_exec() returned  the  error  code
                   7692:        PCRE_ERROR_BADPARTIAL  (-13).  This error code is no longer in use. The
                   7693:        PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to find out if  a  compiled
1.1       misho    7694:        pattern can be used for partial matching now always returns 1.
                   7695: 
                   7696: 
                   7697: EXAMPLE OF PARTIAL MATCHING USING PCRETEST
                   7698: 
1.1.1.2   misho    7699:        If  the  escape  sequence  \P  is  present in a pcretest data line, the
                   7700:        PCRE_PARTIAL_SOFT option is used for  the  match.  Here  is  a  run  of
1.1       misho    7701:        pcretest that uses the date example quoted above:
                   7702: 
                   7703:            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
                   7704:          data> 25jun04\P
                   7705:           0: 25jun04
                   7706:           1: jun
                   7707:          data> 25dec3\P
                   7708:          Partial match: 25dec3
                   7709:          data> 3ju\P
                   7710:          Partial match: 3ju
                   7711:          data> 3juj\P
                   7712:          No match
                   7713:          data> j\P
                   7714:          No match
                   7715: 
1.1.1.2   misho    7716:        The  first  data  string  is  matched completely, so pcretest shows the
                   7717:        matched substrings. The remaining four strings do not  match  the  com-
1.1       misho    7718:        plete pattern, but the first two are partial matches. Similar output is
1.1.1.2   misho    7719:        obtained if DFA matching is used.
1.1       misho    7720: 
1.1.1.2   misho    7721:        If the escape sequence \P is present more than once in a pcretest  data
1.1       misho    7722:        line, the PCRE_PARTIAL_HARD option is set for the match.
                   7723: 
                   7724: 
1.1.1.2   misho    7725: MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre16_dfa_exec()
1.1       misho    7726: 
1.1.1.2   misho    7727:        When  a  partial match has been found using a DFA matching function, it
                   7728:        is possible to continue the match by providing additional subject  data
                   7729:        and  calling  the function again with the same compiled regular expres-
                   7730:        sion, this time setting the PCRE_DFA_RESTART option. You must pass  the
1.1       misho    7731:        same working space as before, because this is where details of the pre-
1.1.1.2   misho    7732:        vious partial match are stored. Here  is  an  example  using  pcretest,
                   7733:        using  the  \R  escape  sequence to set the PCRE_DFA_RESTART option (\D
                   7734:        specifies the use of the DFA matching function):
1.1       misho    7735: 
                   7736:            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
                   7737:          data> 23ja\P\D
                   7738:          Partial match: 23ja
                   7739:          data> n05\R\D
                   7740:           0: n05
                   7741: 
1.1.1.2   misho    7742:        The first call has "23ja" as the subject, and requests  partial  match-
                   7743:        ing;  the  second  call  has  "n05"  as  the  subject for the continued
                   7744:        (restarted) match.  Notice that when the match is  complete,  only  the
                   7745:        last  part  is  shown;  PCRE  does not retain the previously partially-
                   7746:        matched string. It is up to the calling program to do that if it  needs
1.1       misho    7747:        to.
                   7748: 
1.1.1.2   misho    7749:        You  can  set  the  PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
                   7750:        PCRE_DFA_RESTART to continue partial matching over  multiple  segments.
                   7751:        This  facility can be used to pass very long subject strings to the DFA
                   7752:        matching functions.
                   7753: 
                   7754: 
                   7755: MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre16_exec()
                   7756: 
                   7757:        From release 8.00, the standard matching functions can also be used  to
                   7758:        do multi-segment matching. Unlike the DFA functions, it is not possible
                   7759:        to restart the previous match with a new segment of data. Instead,  new
                   7760:        data must be added to the previous subject string, and the entire match
                   7761:        re-run, starting from the point where the partial match occurred.  Ear-
                   7762:        lier data can be discarded.
                   7763: 
                   7764:        It  is best to use PCRE_PARTIAL_HARD in this situation, because it does
                   7765:        not treat the end of a segment as the end of the subject when  matching
                   7766:        \z,  \Z,  \b,  \B,  and  $. Consider an unanchored pattern that matches
                   7767:        dates:
1.1       misho    7768: 
                   7769:            re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
                   7770:          data> The date is 23ja\P\P
                   7771:          Partial match: 23ja
                   7772: 
1.1.1.2   misho    7773:        At this stage, an application could discard the text preceding  "23ja",
                   7774:        add  on  text  from  the  next  segment, and call the matching function
1.1.1.3 ! misho    7775:        again. Unlike the DFA matching functions, the  entire  matching  string
1.1.1.2   misho    7776:        must  always be available, and the complete matching process occurs for
                   7777:        each call, so more memory and more processing time are needed.
                   7778: 
                   7779:        Note: If the pattern contains lookbehind assertions, or \K,  or  starts
                   7780:        with \b or \B, the string that is returned for a partial match includes
                   7781:        characters that precede the partially matched  string  itself,  because
                   7782:        these  must be retained when adding on more characters for a subsequent
1.1.1.3 ! misho    7783:        matching attempt.  However, in some cases you may need to  retain  even
        !          7784:        earlier characters, as discussed in the next section.
1.1       misho    7785: 
                   7786: 
                   7787: ISSUES WITH MULTI-SEGMENT MATCHING
                   7788: 
                   7789:        Certain types of pattern may give problems with multi-segment matching,
                   7790:        whichever matching function is used.
                   7791: 
                   7792:        1. If the pattern contains a test for the beginning of a line, you need
1.1.1.3 ! misho    7793:        to  pass  the  PCRE_NOTBOL  option when the subject string for any call
        !          7794:        does not start at the beginning of a line. There is also a  PCRE_NOTEOL
1.1       misho    7795:        option, but in practice when doing multi-segment matching you should be
                   7796:        using PCRE_PARTIAL_HARD, which includes the effect of PCRE_NOTEOL.
                   7797: 
1.1.1.3 ! misho    7798:        2. Lookbehind assertions that have already been obeyed are catered  for
        !          7799:        in the offsets that are returned for a partial match. However a lookbe-
        !          7800:        hind assertion later in the pattern could require even earlier  charac-
        !          7801:        ters   to  be  inspected.  You  can  handle  this  case  by  using  the
        !          7802:        PCRE_INFO_MAXLOOKBEHIND    option    of    the    pcre_fullinfo()    or
        !          7803:        pcre16_fullinfo() functions to obtain the length of the largest lookbe-
        !          7804:        hind in the pattern. This length is given in characters, not bytes.  If
        !          7805:        you  always  retain  at least that many characters before the partially
        !          7806:        matched string, all should be well. (Of course, near the start  of  the
        !          7807:        subject,  fewer  characters may be present; in that case all characters
        !          7808:        should be retained.)
        !          7809: 
        !          7810:        3. Because a partial match must always contain at least one  character,
        !          7811:        what  might  be  considered a partial match of an empty string actually
        !          7812:        gives a "no match" result. For example:
        !          7813: 
        !          7814:            re> /c(?<=abc)x/
        !          7815:          data> ab\P
        !          7816:          No match
        !          7817: 
        !          7818:        If the next segment begins "cx", a match should be found, but this will
        !          7819:        only  happen  if characters from the previous segment are retained. For
        !          7820:        this reason, a "no match" result  should  be  interpreted  as  "partial
        !          7821:        match of an empty string" when the pattern contains lookbehinds.
1.1       misho    7822: 
1.1.1.3 ! misho    7823:        4.  Matching  a subject string that is split into multiple segments may
1.1.1.2   misho    7824:        not always produce exactly the same result as matching over one  single
                   7825:        long  string,  especially  when  PCRE_PARTIAL_SOFT is used. The section
                   7826:        "Partial Matching and Word Boundaries" above describes  an  issue  that
                   7827:        arises  if  the  pattern ends with \b or \B. Another kind of difference
                   7828:        may occur when there are multiple matching possibilities, because  (for
                   7829:        PCRE_PARTIAL_SOFT)  a partial match result is given only when there are
1.1       misho    7830:        no completed matches. This means that as soon as the shortest match has
1.1.1.2   misho    7831:        been  found,  continuation to a new subject segment is no longer possi-
1.1       misho    7832:        ble. Consider again this pcretest example:
                   7833: 
                   7834:            re> /dog(sbody)?/
                   7835:          data> dogsb\P
                   7836:           0: dog
                   7837:          data> do\P\D
                   7838:          Partial match: do
                   7839:          data> gsb\R\P\D
                   7840:           0: g
                   7841:          data> dogsbody\D
                   7842:           0: dogsbody
                   7843:           1: dog
                   7844: 
1.1.1.2   misho    7845:        The first data line passes the string "dogsb" to  a  standard  matching
                   7846:        function,  setting the PCRE_PARTIAL_SOFT option. Although the string is
                   7847:        a partial match for "dogsbody", the result is  not  PCRE_ERROR_PARTIAL,
                   7848:        because  the  shorter string "dog" is a complete match. Similarly, when
                   7849:        the subject is presented to a DFA matching function  in  several  parts
                   7850:        ("do"  and  "gsb"  being  the first two) the match stops when "dog" has
                   7851:        been found, and it is not possible to continue.  On the other hand,  if
                   7852:        "dogsbody"  is  presented  as  a single string, a DFA matching function
                   7853:        finds both matches.
1.1       misho    7854: 
                   7855:        Because of these problems, it is best  to  use  PCRE_PARTIAL_HARD  when
                   7856:        matching  multi-segment  data.  The  example above then behaves differ-
                   7857:        ently:
                   7858: 
                   7859:            re> /dog(sbody)?/
                   7860:          data> dogsb\P\P
                   7861:          Partial match: dogsb
                   7862:          data> do\P\D
                   7863:          Partial match: do
                   7864:          data> gsb\R\P\P\D
                   7865:          Partial match: gsb
                   7866: 
1.1.1.3 ! misho    7867:        5. Patterns that contain alternatives at the top level which do not all
1.1       misho    7868:        start  with  the  same  pattern  item  may  not  work  as expected when
1.1.1.2   misho    7869:        PCRE_DFA_RESTART is used. For example, consider this pattern:
1.1       misho    7870: 
                   7871:          1234|3789
                   7872: 
1.1.1.2   misho    7873:        If the first part of the subject is "ABC123", a partial  match  of  the
                   7874:        first  alternative  is found at offset 3. There is no partial match for
1.1       misho    7875:        the second alternative, because such a match does not start at the same
1.1.1.2   misho    7876:        point  in  the  subject  string. Attempting to continue with the string
                   7877:        "7890" does not yield a match  because  only  those  alternatives  that
                   7878:        match  at  one  point in the subject are remembered. The problem arises
                   7879:        because the start of the second alternative matches  within  the  first
                   7880:        alternative.  There  is  no  problem with anchored patterns or patterns
1.1       misho    7881:        such as:
                   7882: 
                   7883:          1234|ABCD
                   7884: 
1.1.1.2   misho    7885:        where no string can be a partial match for both alternatives.  This  is
                   7886:        not  a  problem  if  a  standard matching function is used, because the
                   7887:        entire match has to be rerun each time:
1.1       misho    7888: 
                   7889:            re> /1234|3789/
                   7890:          data> ABC123\P\P
                   7891:          Partial match: 123
                   7892:          data> 1237890
                   7893:           0: 3789
                   7894: 
                   7895:        Of course, instead of using PCRE_DFA_RESTART, the same technique of re-
1.1.1.2   misho    7896:        running  the  entire match can also be used with the DFA matching func-
                   7897:        tions. Another possibility is to work with two buffers.  If  a  partial
                   7898:        match  at  offset  n in the first buffer is followed by "no match" when
                   7899:        PCRE_DFA_RESTART is used on the second buffer, you can then try  a  new
                   7900:        match starting at offset n+1 in the first buffer.
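
The retention rule from item 2 can be sketched in plain C. This is a minimal sketch, not part of PCRE: the helper name keep_for_next_segment() is hypothetical, and it assumes the 8-bit library without UTF-8, so that one character is one byte. The max_lookbehind argument stands for the value obtained with PCRE_INFO_MAXLOOKBEHIND.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a PCRE function): after a partial match that
 * starts at partial_start in buf[0..len), discard everything except the
 * partially matched text and up to max_lookbehind characters before it.
 * Assumes one byte per character (8-bit library, no UTF-8).
 * Returns the number of bytes kept; they are moved to the front of buf.
 */
size_t keep_for_next_segment(char *buf, size_t len,
                             size_t partial_start, size_t max_lookbehind)
{
    size_t keep_from;

    assert(partial_start <= len);
    /* Near the start of the subject fewer characters exist: keep them all. */
    keep_from = (partial_start > max_lookbehind)
              ? partial_start - max_lookbehind : 0;
    memmove(buf, buf + keep_from, len - keep_from);
    return len - keep_from;
}
```

For the "no match" case described in item 3, one option under the same assumptions is to pass partial_start equal to len, so that only the lookbehind characters are kept before the next segment is appended.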
1.1       misho    7901: 
                   7902: 
                   7903: AUTHOR
                   7904: 
                   7905:        Philip Hazel
                   7906:        University Computing Service
                   7907:        Cambridge CB2 3QH, England.
                   7908: 
                   7909: 
                   7910: REVISION
                   7911: 
1.1.1.3 ! misho    7912:        Last updated: 24 February 2012
1.1.1.2   misho    7913:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    7914: ------------------------------------------------------------------------------
                   7915: 
                   7916: 
                   7917: PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)
                   7918: 
                   7919: 
                   7920: NAME
                   7921:        PCRE - Perl-compatible regular expressions
                   7922: 
                   7923: 
                   7924: SAVING AND RE-USING PRECOMPILED PCRE PATTERNS
                   7925: 
                   7926:        If  you  are running an application that uses a large number of regular
                   7927:        expression patterns, it may be useful to store them  in  a  precompiled
                   7928:        form  instead  of  having to compile them every time the application is
                   7929:        run.  If you are not  using  any  private  character  tables  (see  the
                   7930:        pcre_maketables()  documentation),  this is relatively straightforward.
                   7931:        If you are using private tables, it is a little bit  more  complicated.
1.1.1.2   misho    7932:        However,  if you are using the just-in-time optimization feature, it is
                   7933:        not possible to save and reload the JIT data.
1.1       misho    7934: 
                   7935:        If you save compiled patterns to a file, you can copy them to a differ-
1.1.1.2   misho    7936:        ent host and run them there. If the two hosts have different endianness
                   7937:        (byte order), you should run the  pcre[16]_pattern_to_host_byte_order()
                   7938:        function on the new host before trying to match the pattern. The match-
                   7939:        ing functions return PCRE_ERROR_BADENDIANNESS if they detect a  pattern
                   7940:        with the wrong endianness.
                   7941: 
                   7942:        Compiling  regular  expressions with one version of PCRE for use with a
                   7943:        different version is not guaranteed to work and may cause crashes,  and
                   7944:        saving  and  restoring  a  compiled  pattern loses any JIT optimization
                   7945:        data.
1.1       misho    7946: 
                   7947: 
                   7948: SAVING A COMPILED PATTERN
                   7949: 
1.1.1.2   misho    7950:        The value returned by pcre[16]_compile() points to a  single  block  of
                   7951:        memory  that  holds  the  compiled pattern and associated data. You can
                   7952:        find the length of this block in bytes by  calling  pcre[16]_fullinfo()
                   7953:        with  an  argument of PCRE_INFO_SIZE. You can then save the data in any
                   7954:        appropriate manner. Here is sample code for the 8-bit library that com-
                   7955:        piles  a  pattern and writes it to a file. It assumes that the variable
                   7956:        fd refers to a file that is open for output:
1.1       misho    7957: 
                   7958:          int erroroffset, rc;
                   7959:          size_t size;        /* PCRE_INFO_SIZE yields a size_t */
                   7960:          const char *error;
                   7961:          pcre *re;
                   7962: 
                   7963:          re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
                   7964:          if (re == NULL) { ... handle errors ... }
                   7965:          rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
                   7966:          if (rc < 0) { ... handle errors ... }
                   7967:          if (fwrite(re, 1, size, fd) != size) { ... handle errors ... }
                   7968: 
1.1.1.2   misho    7969:        In this example, the bytes  that  comprise  the  compiled  pattern  are
                   7970:        copied  exactly.  Note that this is binary data that may contain any of
                   7971:        the 256 possible byte  values.  On  systems  that  make  a  distinction
1.1       misho    7972:        between binary and non-binary data, be sure that the file is opened for
                   7973:        binary output.
                   7974: 
1.1.1.2   misho    7975:        If you want to write more than one pattern to a file, you will have  to
                   7976:        devise  a  way of separating them. For binary data, preceding each pat-
                   7977:        tern with its length is probably  the  most  straightforward  approach.
                   7978:        Another  possibility is to write out the data in hexadecimal instead of
1.1       misho    7979:        binary, one pattern to a line.
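
The length-prefix approach can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the length is written as a native size_t (so the file is tied to hosts with the same size_t width and byte order), and the demo uses stand-in byte arrays where a real program would pass the block returned by pcre_compile() and the size reported for PCRE_INFO_SIZE.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Write one record: a size_t length followed by that many raw bytes.
   Returns 0 on success, -1 on a short write. */
int write_record(FILE *f, const void *data, size_t size)
{
    assert(f != NULL);
    if (fwrite(&size, sizeof size, 1, f) != 1) return -1;
    if (size > 0 && fwrite(data, 1, size, f) != size) return -1;
    return 0;
}

/* Read one record into freshly malloc'ed memory; the caller frees it.
   Returns NULL at end of file or on error; *size receives the length. */
void *read_record(FILE *f, size_t *size)
{
    void *data;

    if (fread(size, sizeof *size, 1, f) != 1) return NULL;
    data = malloc(*size ? *size : 1);
    if (data == NULL) return NULL;
    if (*size > 0 && fread(data, 1, *size, f) != *size) {
        free(data);
        return NULL;
    }
    return data;
}

/* Round-trip two stand-in blobs through a temporary file.
   Returns 1 if both come back intact, 0 otherwise. */
int record_roundtrip_ok(void)
{
    static const unsigned char first[] = { 0x00, 0xff, 0x10 };
    static const unsigned char second[] = { 'P', 'C', 'R', 'E', 0x00 };
    size_t n1 = 0, n2 = 0;
    void *r1, *r2;
    int ok;
    FILE *f = tmpfile();            /* binary update mode */

    if (f == NULL) return 0;
    if (write_record(f, first, sizeof first) != 0 ||
        write_record(f, second, sizeof second) != 0) {
        fclose(f);
        return 0;
    }
    rewind(f);
    r1 = read_record(f, &n1);
    r2 = read_record(f, &n2);
    ok = r1 != NULL && r2 != NULL &&
         n1 == sizeof first && memcmp(r1, first, n1) == 0 &&
         n2 == sizeof second && memcmp(r2, second, n2) == 0;
    free(r1);
    free(r2);
    fclose(f);
    return ok;
}
```

Because each record carries its own length, the reader never has to scan the binary data for a separator, which matters when any of the 256 byte values may occur.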
                   7980: 
1.1.1.2   misho    7981:        Saving compiled patterns in a file is only one possible way of  storing
                   7982:        them  for later use. They could equally well be saved in a database, or
                   7983:        in the memory of some daemon process that passes them  via  sockets  to
1.1       misho    7984:        the processes that want them.
                   7985: 
                   7986:        If the pattern has been studied, it is also possible to save the normal
                   7987:        study data in a similar way to the compiled pattern itself. However, if
                   7988:        the PCRE_STUDY_JIT_COMPILE option was used, the just-in-time data  that
1.1.1.2   misho    7989:        is created cannot be saved because it is too dependent on  the  current
                   7990:        environment.   When   studying    generates    additional  information,
                   7991:        pcre[16]_study() returns a pointer to a pcre[16]_extra data block.  Its
                   7992:        format  is  defined in the section on matching a pattern in the pcreapi
                   7993:        documentation. The study_data field points to the  binary  study  data,
                   7994:        and  this  is what you must save (not the pcre[16]_extra block itself).
                   7995:        The  length  of  the  study   data   can   be   obtained   by   calling
                   7996:        pcre[16]_fullinfo()  with  an argument of PCRE_INFO_STUDYSIZE. Remember
                   7997:        to check that pcre[16]_study() did return a non-NULL value before  try-
                   7998:        ing to save the study data.
1.1       misho    7999: 
                   8000: 
                   8001: RE-USING A PRECOMPILED PATTERN
                   8002: 
                   8003:        Re-using  a  precompiled pattern is straightforward. Having reloaded it
1.1.1.2   misho    8004:        into main memory, called pcre[16]_pattern_to_host_byte_order() if  nec-
                   8005:        essary,  you pass its pointer to pcre[16]_exec() or pcre[16]_dfa_exec()
                   8006:        in the usual way.
                   8007: 
                   8008:        However, if you passed a pointer to custom character  tables  when  the
                   8009:        pattern was compiled (the tableptr argument of pcre[16]_compile()), you
                   8010:        must   now   pass   a   similar   pointer   to    pcre[16]_exec()    or
                   8011:        pcre[16]_dfa_exec(),  because the value saved with the compiled pattern
                   8012:        will  obviously be nonsense. A field in a pcre[16]_extra block is  used
                   8013:        to pass this data, as described in the section on matching a pattern in
                   8014:        the pcreapi documentation.
                   8015: 
                   8016:        If you did not provide custom character tables  when  the  pattern  was
                   8017:        compiled, the pointer in the compiled pattern is NULL, which causes the
                   8018:        matching functions to use PCRE's internal tables. Thus, you do not need
                   8019:        to take any special action at run time in this case.
                   8020: 
                   8021:        If  you  saved study data with the compiled pattern, you need to create
                   8022:        your own pcre[16]_extra data block and  set  the  study_data  field  to
                   8023:        point   to   the   reloaded   study   data.   You  must  also  set  the
                   8024:        PCRE_EXTRA_STUDY_DATA bit in the flags field  to  indicate  that  study
                   8025:        data  is  present.  Then  pass the pcre[16]_extra block to the matching
                   8026:        function in the usual way. If the pattern was studied for  just-in-time
                   8027:        optimization,  that  data  cannot  be  saved,  and  so  is  lost  by  a
                   8028:        save/restore cycle.
1.1       misho    8029: 
                   8030: 
                   8031: COMPATIBILITY WITH DIFFERENT PCRE RELEASES
                   8032: 
                   8033:        In general, it is safest to  recompile  all  saved  patterns  when  you
                   8034:        update  to  a new PCRE release, though not all updates actually require
                   8035:        this.
                   8036: 
                   8037: 
                   8038: AUTHOR
                   8039: 
                   8040:        Philip Hazel
                   8041:        University Computing Service
                   8042:        Cambridge CB2 3QH, England.
                   8043: 
                   8044: 
                   8045: REVISION
                   8046: 
1.1.1.2   misho    8047:        Last updated: 10 January 2012
                   8048:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    8049: ------------------------------------------------------------------------------
                   8050: 
                   8051: 
                   8052: PCREPERFORM(3)                                                  PCREPERFORM(3)
                   8053: 
                   8054: 
                   8055: NAME
                   8056:        PCRE - Perl-compatible regular expressions
                   8057: 
                   8058: 
                   8059: PCRE PERFORMANCE
                   8060: 
                   8061:        Two  aspects  of performance are discussed below: memory usage and pro-
                   8062:        cessing time. The way you express your pattern as a regular  expression
                   8063:        can affect both of them.
                   8064: 
                   8065: 
                   8066: COMPILED PATTERN MEMORY USAGE
                   8067: 
1.1.1.2   misho    8068:        Patterns  are compiled by PCRE into a reasonably efficient interpretive
                   8069:        code, so that most simple patterns do not  use  much  memory.  However,
                   8070:        there  is  one case where the memory usage of a compiled pattern can be
                   8071:        unexpectedly large. If a parenthesized subpattern has a quantifier with
                   8072:        a minimum greater than 1 and/or a limited maximum, the whole subpattern
                   8073:        is repeated in the compiled code. For example, the pattern
1.1       misho    8074: 
                   8075:          (abc|def){2,4}
                   8076: 
                   8077:        is compiled as if it were
                   8078: 
                   8079:          (abc|def)(abc|def)((abc|def)(abc|def)?)?
                   8080: 
                   8081:        (Technical aside: It is done this way so that backtrack  points  within
                   8082:        each of the repetitions can be independently maintained.)
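
To make the growth visible, the expansion above can be reproduced on pattern text. This is an illustrative sketch only: PCRE performs the replication on its compiled code, not on the pattern string, and expand_repeat() is a hypothetical name introduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative only: build the textual expansion of group{min,max}
 * (1 <= min <= max) in the style shown above.
 * Returns the length written, or 0 if out is too small.
 */
size_t expand_repeat(const char *group, int min, int max,
                     char *out, size_t outsize)
{
    size_t glen = strlen(group);
    int optional = max - min;
    size_t need, pos = 0;
    int i;

    assert(min >= 1 && max >= min);
    /* min mandatory copies; each optional copy adds group + "?" and,
       except for the innermost one, a pair of wrapping parentheses. */
    need = (size_t)min * glen
         + (size_t)optional * (glen + 1)
         + (optional > 1 ? 2 * (size_t)(optional - 1) : 0);
    if (need + 1 > outsize) return 0;

    for (i = 0; i < min; i++) {
        memcpy(out + pos, group, glen);
        pos += glen;
    }
    for (i = 0; i < optional; i++) {
        if (i < optional - 1) out[pos++] = '(';  /* nest later copies */
        memcpy(out + pos, group, glen);
        pos += glen;
    }
    for (i = 0; i < optional; i++) {
        out[pos++] = '?';
        if (i < optional - 1) out[pos++] = ')';
    }
    out[pos] = '\0';
    return pos;
}
```

Calling expand_repeat("(abc|def)", 2, 4, buf, sizeof buf) produces the expanded form shown above; the output length grows linearly with the maximum, and nesting such repeats multiplies the copies.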
                   8083: 
                   8084:        For  regular expressions whose quantifiers use only small numbers, this
                   8085:        is not usually a problem. However, if the numbers are large,  and  par-
                   8086:        ticularly  if  such repetitions are nested, the memory usage can become
                   8087:        an embarrassment. For example, the very simple pattern
                   8088: 
                   8089:          ((ab){1,1000}c){1,3}
                   8090: 
1.1.1.2   misho    8091:        uses 51K bytes when compiled using the 8-bit library. When PCRE is com-
                   8092:        piled  with  its  default  internal pointer size of two bytes, the size
                   8093:        limit on a compiled pattern is 64K data units, and this is reached with
                   8094:        the  above  pattern  if  the outer repetition is increased from 3 to 4.
                   8095:        PCRE can be compiled to use larger internal pointers  and  thus  handle
                   8096:        larger  compiled patterns, but it is better to try to rewrite your pat-
                   8097:        tern to use less memory if you can.
1.1       misho    8098: 
1.1.1.2   misho    8099:        One way of reducing the memory usage for such patterns is to  make  use
1.1       misho    8100:        of PCRE's "subroutine" facility. Re-writing the above pattern as
                   8101: 
                   8102:          ((ab)(?2){0,999}c)(?1){0,2}
                   8103: 
                   8104:        reduces the memory requirements to 18K, and indeed it remains under 20K
1.1.1.2   misho    8105:        even with the outer repetition increased to 100. However, this  pattern
                   8106:        is  not  exactly equivalent, because the "subroutine" calls are treated
                   8107:        as atomic groups into which there can be no backtracking if there is  a
                   8108:        subsequent  matching  failure.  Therefore,  PCRE cannot do this kind of
                   8109:        rewriting automatically.  Furthermore, there is a  noticeable  loss  of
                   8110:        speed  when executing the modified pattern. Nevertheless, if the atomic
                   8111:        grouping is not a problem and the loss of  speed  is  acceptable,  this
                   8112:        kind  of  rewriting will allow you to process patterns that PCRE cannot
1.1       misho    8113:        otherwise handle.
                   8114: 
                   8115: 
                   8116: STACK USAGE AT RUN TIME
                   8117: 
1.1.1.2   misho    8118:        When pcre_exec() or pcre16_exec() is used for matching,  certain  kinds
                   8119:        of  pattern  can cause it to use large amounts of the process stack. In
                   8120:        some environments the default process stack is quite small, and  if  it
                   8121:        runs  out  the result is often SIGSEGV. This issue is probably the most
                   8122:        frequently raised problem with PCRE. Rewriting your pattern  can  often
                   8123:        help. The pcrestack documentation discusses this issue in detail.
1.1       misho    8124: 
                   8125: 
                   8126: PROCESSING TIME
                   8127: 
1.1.1.2   misho    8128:        Certain  items  in regular expression patterns are processed more effi-
1.1       misho    8129:        ciently than others. It is more efficient to use a character class like
1.1.1.2   misho    8130:        [aeiou]   than   a   set   of  single-character  alternatives  such  as
                   8131:        (a|e|i|o|u). In general, the simplest construction  that  provides  the
1.1       misho    8132:        required behaviour is usually the most efficient. Jeffrey Friedl's book
1.1.1.2   misho    8133:        contains a lot of useful general discussion  about  optimizing  regular
                   8134:        expressions  for  efficient  performance.  This document contains a few
1.1       misho    8135:        observations about PCRE.
                   8136: 
1.1.1.2   misho    8137:        Using Unicode character properties (the \p,  \P,  and  \X  escapes)  is
                   8138:        slow,  because PCRE has to scan a structure that contains data for over
                   8139:        fifteen thousand characters whenever it needs a  character's  property.
                   8140:        If  you  can  find  an  alternative pattern that does not use character
1.1       misho    8141:        properties, it will probably be faster.
                   8142: 
1.1.1.2   misho    8143:        By default, the escape sequences \b, \d, \s,  and  \w,  and  the  POSIX
                   8144:        character  classes  such  as  [:alpha:]  do not use Unicode properties,
1.1       misho    8145:        partly for backwards compatibility, and partly for performance reasons.
1.1.1.2   misho    8146:        However,  you can set PCRE_UCP if you want Unicode character properties
                   8147:        to be used. This can double the matching time for  items  such  as  \d,
                   8148:        when matched with a traditional matching function; the performance loss
                   8149:        is less with a DFA matching function, and in both cases  there  is  not
                   8150:        much difference for \b.
1.1       misho    8151: 
                   8152:        When  a  pattern  begins  with .* not in parentheses, or in parentheses
                   8153:        that are not the subject of a backreference, and the PCRE_DOTALL option
                   8154:        is  set, the pattern is implicitly anchored by PCRE, since it can match
                   8155:        only at the start of a subject string. However, if PCRE_DOTALL  is  not
                   8156:        set,  PCRE  cannot  make this optimization, because the . metacharacter
                   8157:        does not then match a newline, and if the subject string contains  new-
                   8158:        lines,  the  pattern may match from the character immediately following
                   8159:        one of them instead of from the very start. For example, the pattern
                   8160: 
                   8161:          .*second
                   8162: 
                   8163:        matches the subject "first\nand second" (where \n stands for a  newline
                   8164:        character),  with the match starting at the seventh character. In order
                   8165:        to do this, PCRE has to retry the match starting after every newline in
                   8166:        the subject.
                   8167: 
                   8168:        If  you  are using such a pattern with subject strings that do not con-
                   8169:        tain newlines, the best performance is obtained by setting PCRE_DOTALL,
                   8170:        or  starting  the pattern with ^.* or ^.*? to indicate explicit anchor-
                   8171:        ing. That saves PCRE from having to scan along the subject looking  for
                   8172:        a newline to restart at.
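
The restart points mentioned above can be listed with a few lines of C. This is a sketch of the idea rather than of PCRE's internals: dotstar_start_points() is a hypothetical name, and the code assumes that a single "\n" is the newline convention.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative only: list the offsets at which a pattern starting with .*
 * (PCRE_DOTALL not set) can begin a match: offset 0 and the offset just
 * after each newline. Assumes "\n" is the newline convention.
 * Stores up to max offsets in starts[] and returns the total count found.
 */
size_t dotstar_start_points(const char *subject, size_t *starts, size_t max)
{
    size_t count = 0, i, len;

    assert(subject != NULL);
    len = strlen(subject);
    if (count < max) starts[count] = 0;
    count++;
    for (i = 0; i < len; i++) {
        if (subject[i] == '\n') {
            if (count < max) starts[count] = i + 1;
            count++;
        }
    }
    return count;
}
```

For the subject "first\nand second" this yields offsets 0 and 6, the second being the seventh character where the match in the example starts. A subject without newlines has only offset 0, which is why explicit anchoring loses nothing in that case.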
                   8173: 
                   8174:        Beware  of  patterns  that contain nested indefinite repeats. These can
                   8175:        take a long time to run when applied to a string that does  not  match.
                   8176:        Consider the pattern fragment
                   8177: 
                   8178:          ^(a+)*
                   8179: 
                   8180:        This  can  match "aaaa" in 16 different ways, and this number increases
                   8181:        very rapidly as the string gets longer. (The * repeat can match  0,  1,
                   8182:        2,  3, or 4 times, and for each of those cases other than 0 or 4, the +
                   8183:        repeats can match different numbers of times.) When  the  remainder  of
                   8184:        the pattern is such that the entire match is going to fail, PCRE has in
                   8185:        principle to try  every  possible  variation,  and  this  can  take  an
                   8186:        extremely long time, even for relatively short strings.
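
The figure of 16 can be checked by counting. The sketch below is not PCRE code; count_ways() is a hypothetical helper that counts how many distinct ways ^(a+)* can match at the start of a run of n "a" characters, including the match of the empty string.

```c
#include <assert.h>

/*
 * Count the distinct ways ^(a+)* can match at the start of a run of n "a"
 * characters: the * repeat either stops here (one way), or the next (a+)
 * iteration consumes i characters (1 <= i <= n) and the remaining n - i
 * characters are handled in the same way.
 */
unsigned long count_ways(unsigned int n)
{
    unsigned long ways = 1;          /* the * repeat stops here */
    unsigned int i;

    assert(n <= 24);                 /* keep the exponential recursion quick */
    for (i = 1; i <= n; i++)         /* next (a+) iteration takes i characters */
        ways += count_ways(n - i);
    return ways;
}
```

count_ways(4) returns 16, and each extra character doubles the result, so the number of variations that may have to be tried on a failing match grows exponentially with the length of the run.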
                   8187: 
                   8188:        An optimization catches some of the more simple cases such as
                   8189: 
                   8190:          (a+)*b
                   8191: 
                   8192:        where  a  literal  character  follows. Before embarking on the standard
                   8193:        matching procedure, PCRE checks that there is a "b" later in  the  sub-
                   8194:        ject  string, and if there is not, it fails the match immediately. How-
                   8195:        ever, when there is no following literal this  optimization  cannot  be
                   8196:        used. You can see the difference by comparing the behaviour of
                   8197: 
                   8198:          (a+)*\d
                   8199: 
                   8200:        with the pattern above. The pattern ending in "b" fails almost instantly
                   8201:        when applied to a whole line of "a" characters, whereas the \d   pattern
                   8202:        takes an appreciable time with strings longer than about 20 characters.
                   8203: 
                   8204:        In many cases, the solution to this kind of performance issue is to use
                   8205:        an atomic group or a possessive quantifier.
                   8206: 
                   8207: 
                   8208: AUTHOR
                   8209: 
                   8210:        Philip Hazel
                   8211:        University Computing Service
                   8212:        Cambridge CB2 3QH, England.
                   8213: 
                   8214: 
                   8215: REVISION
                   8216: 
1.1.1.2   misho    8217:        Last updated: 09 January 2012
                   8218:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    8219: ------------------------------------------------------------------------------
                   8220: 
                   8221: 
                   8222: PCREPOSIX(3)                                                      PCREPOSIX(3)
                   8223: 
                   8224: 
                   8225: NAME
                   8226:        PCRE - Perl-compatible regular expressions.
                   8227: 
                   8228: 
                   8229: SYNOPSIS OF POSIX API
                   8230: 
                   8231:        #include <pcreposix.h>
                   8232: 
                   8233:        int regcomp(regex_t *preg, const char *pattern,
                   8234:             int cflags);
                   8235: 
                   8236:        int regexec(regex_t *preg, const char *string,
                   8237:             size_t nmatch, regmatch_t pmatch[], int eflags);
                   8238: 
                   8239:        size_t regerror(int errcode, const regex_t *preg,
                   8240:             char *errbuf, size_t errbuf_size);
                   8241: 
                   8242:        void regfree(regex_t *preg);
                   8243: 
                   8244: 
                   8245: DESCRIPTION
                   8246: 
1.1.1.2   misho    8247:        This  set  of functions provides a POSIX-style API for the PCRE regular
                   8248:        expression 8-bit library. See the pcreapi documentation for a  descrip-
                   8249:        tion  of  PCRE's native API, which contains much additional functional-
                   8250:        ity. There is no POSIX-style wrapper for PCRE's 16-bit library.
1.1       misho    8251: 
                   8252:        The functions described here are just wrapper functions that ultimately
                   8253:        call  the  PCRE  native  API.  Their  prototypes  are  defined  in  the
1.1.1.2   misho    8254:        pcreposix.h header file, and on Unix  systems  the  library  itself  is
                   8255:        called  pcreposix.a,  so  can  be accessed by adding -lpcreposix to the
                   8256:        command for linking an application that uses them.  Because  the  POSIX
1.1       misho    8257:        functions call the native ones, it is also necessary to add -lpcre.
                   8258: 
1.1.1.2   misho    8259:        I  have implemented only those POSIX option bits that can be reasonably
                   8260:        mapped to PCRE native options. In addition, the option REG_EXTENDED  is
                   8261:        defined  with  the  value  zero. This has no effect, but since programs
                   8262:        that are written to the POSIX interface often use  it,  this  makes  it
                   8263:        easier  to  slot  in PCRE as a replacement library. Other POSIX options
1.1       misho    8264:        are not even defined.
                   8265: 
1.1.1.2   misho    8266:        There are also some other options that are not defined by POSIX.  These
1.1       misho    8267:        have been added at the request of users who want to make use of certain
                   8268:        PCRE-specific features via the POSIX calling interface.
                   8269: 
1.1.1.2   misho    8270:        When PCRE is called via these functions, it is only  the  API  that  is
                   8271:        POSIX-like  in  style.  The syntax and semantics of the regular expres-
                   8272:        sions themselves are still those of Perl, subject  to  the  setting  of
                   8273:        various  PCRE  options, as described below. "POSIX-like in style" means
                   8274:        that the API approximates to the POSIX  definition;  it  is  not  fully
                   8275:        POSIX-compatible,  and  in  multi-byte  encoding domains it is probably
1.1       misho    8276:        even less compatible.
                   8277: 
1.1.1.2   misho    8278:        The header for these functions is supplied as pcreposix.h to avoid  any
                   8279:        potential  clash  with  other  POSIX  libraries.  It can, of course, be
1.1       misho    8280:        renamed or aliased as regex.h, which is the "correct" name. It provides
1.1.1.2   misho    8281:        two  structure  types,  regex_t  for  compiled internal forms, and reg-
                   8282:        match_t for returning captured substrings. It also  defines  some  con-
                   8283:        stants  whose  names  start  with  "REG_";  these  are used for setting
1.1       misho    8284:        options and identifying error codes.
                   8285: 
                   8286: 
                   8287: COMPILING A PATTERN
                   8288: 
1.1.1.2   misho    8289:        The function regcomp() is called to compile a pattern into an  internal
                   8290:        form.  The  pattern  is  a C string terminated by a binary zero, and is
                   8291:        passed in the argument pattern. The preg argument is  a  pointer  to  a
                   8292:        regex_t  structure that is used as a base for storing information about
1.1       misho    8293:        the compiled regular expression.
                   8294: 
                   8295:        The argument cflags is either zero, or contains one or more of the bits
                   8296:        defined by the following macros:
                   8297: 
                   8298:          REG_DOTALL
                   8299: 
                   8300:        The PCRE_DOTALL option is set when the regular expression is passed for
                   8301:        compilation to the native function. Note that REG_DOTALL is not part of
                   8302:        the POSIX standard.
                   8303: 
                   8304:          REG_ICASE
                   8305: 
1.1.1.2   misho    8306:        The  PCRE_CASELESS  option is set when the regular expression is passed
1.1       misho    8307:        for compilation to the native function.
                   8308: 
                   8309:          REG_NEWLINE
                   8310: 
1.1.1.2   misho    8311:        The PCRE_MULTILINE option is set when the regular expression is  passed
                   8312:        for  compilation  to the native function. Note that this does not mimic
                   8313:        the defined POSIX behaviour for REG_NEWLINE  (see  the  following  sec-
1.1       misho    8314:        tion).
                   8315: 
                   8316:          REG_NOSUB
                   8317: 
1.1.1.2   misho    8318:        The  PCRE_NO_AUTO_CAPTURE  option is set when the regular expression is
1.1       misho    8319:        passed for compilation to the native function. In addition, when a pat-
1.1.1.2   misho    8320:        tern  that is compiled with this flag is passed to regexec() for match-
                   8321:        ing, the nmatch and pmatch  arguments  are  ignored,  and  no  captured
1.1       misho    8322:        strings are returned.
                   8323: 
                   8324:          REG_UCP
                   8325: 
1.1.1.2   misho    8326:        The  PCRE_UCP  option  is set when the regular expression is passed for
                   8327:        compilation to the native function. This causes  PCRE  to  use  Unicode
                   8328:        properties  when  matching  \d,  \w,  etc., instead of just recognizing
1.1       misho    8329:        ASCII values. Note that REG_UCP is not part of the POSIX standard.
                   8330: 
                   8331:          REG_UNGREEDY
                   8332: 
1.1.1.2   misho    8333:        The PCRE_UNGREEDY option is set when the regular expression  is  passed
                   8334:        for  compilation  to the native function. Note that REG_UNGREEDY is not
1.1       misho    8335:        part of the POSIX standard.
                   8336: 
                   8337:          REG_UTF8
                   8338: 
1.1.1.2   misho    8339:        The PCRE_UTF8 option is set when the regular expression is  passed  for
                   8340:        compilation  to the native function. This causes the pattern itself and
                   8341:        all data strings used for matching it to be treated as  UTF-8  strings.
1.1       misho    8342:        Note that REG_UTF8 is not part of the POSIX standard.
                   8343: 
1.1.1.2   misho    8344:        In  the  absence  of  these  flags, no options are passed to the native
                   8345:        function.  This means that the regex  is  compiled  with  PCRE  default
                   8346:        semantics.  In particular, the way it handles newline characters in the
                   8347:        subject string is the Perl way, not the POSIX way.  Note  that  setting
                   8348:        PCRE_MULTILINE  has only some of the effects specified for REG_NEWLINE.
                   8349:        It does not affect the way newlines are matched by . (they are not)  or
1.1       misho    8350:        by a negative class such as [^a] (they are).
                   8351: 
1.1.1.2   misho    8352:        The  yield of regcomp() is zero on success, and non-zero otherwise. The
1.1       misho    8353:        preg structure is filled in on success, and one member of the structure
1.1.1.2   misho    8354:        is  public: re_nsub contains the number of capturing subpatterns in the
1.1       misho    8355:        regular expression. Various error codes are defined in the header file.
                   8356: 
1.1.1.2   misho    8357:        NOTE: If the yield of regcomp() is non-zero, you must  not  attempt  to
1.1       misho    8358:        use the contents of the preg structure. If, for example, you pass it to
                   8359:        regexec(), the result is undefined and your program is likely to crash.
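
The compile step and its failure handling can be sketched as follows. This is
a minimal illustration, not part of the PCRE distribution: compile_or_report()
is a hypothetical helper name, and the sketch includes the system <regex.h> so
that it builds without PCRE installed. When building against PCRE, the
assumption is that <pcreposix.h> is included instead and the program is linked
with -lpcreposix -lpcre. REG_EXTENDED is passed because the system library
needs it for this pattern syntax; in pcreposix.h it is defined as zero.

```c
#include <assert.h>
#include <stdio.h>
#include <regex.h>   /* assumption: <pcreposix.h> when building against PCRE */

/* Hypothetical helper: compile `pattern` into `re` and report any error
   on stderr. Returns 0 on success. On a non-zero return the contents of
   `re` must not be used for matching. */
static int compile_or_report(regex_t *re, const char *pattern, int cflags)
{
    int rc;

    assert(re != NULL && pattern != NULL);

    rc = regcomp(re, pattern, cflags | REG_EXTENDED);
    if (rc != 0) {
        char msg[256];
        /* regerror() turns the error code into a printable message. */
        regerror(rc, re, msg, sizeof msg);
        fprintf(stderr, "regcomp failed for \"%s\": %s\n", pattern, msg);
    }
    return rc;
}
```

After a successful call, re_nsub holds the number of capturing subpatterns,
and the caller is expected to release the compiled form with regfree().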
                   8360: 
                   8361: 
                   8362: MATCHING NEWLINE CHARACTERS
                   8363: 
                   8364:        This area is not simple, because POSIX and Perl take different views of
1.1.1.2   misho    8365:        things.   It  is  not possible to get PCRE to obey POSIX semantics, but
                   8366:        then PCRE was never intended to be a POSIX engine. The following  table
                   8367:        lists  the  different  possibilities for matching newline characters in
1.1       misho    8368:        PCRE:
                   8369: 
                   8370:                                  Default   Change with
                   8371: 
                   8372:          . matches newline          no     PCRE_DOTALL
                   8373:          newline matches [^a]       yes    not changeable
                   8374:          $ matches \n at end        yes    PCRE_DOLLAR_ENDONLY
                   8375:          $ matches \n in middle     no     PCRE_MULTILINE
                   8376:          ^ matches \n in middle     no     PCRE_MULTILINE
                   8377: 
                   8378:        This is the equivalent table for POSIX:
                   8379: 
                   8380:                                  Default   Change with
                   8381: 
                   8382:          . matches newline          yes    REG_NEWLINE
                   8383:          newline matches [^a]       yes    REG_NEWLINE
                   8384:          $ matches \n at end        no     REG_NEWLINE
                   8385:          $ matches \n in middle     no     REG_NEWLINE
                   8386:          ^ matches \n in middle     no     REG_NEWLINE
                   8387: 
                   8388:        PCRE's behaviour is the same as Perl's, except that there is no equiva-
1.1.1.2   misho    8389:        lent  for  PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is
1.1       misho    8390:        no way to stop newline from matching [^a].
                   8391: 
1.1.1.2   misho    8392:        The  default  POSIX  newline  handling  can  be  obtained  by   setting
                   8393:        PCRE_DOTALL  and  PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE
1.1       misho    8394:        behave exactly as for the REG_NEWLINE action.
                   8395: 
                   8396: 
                   8397: MATCHING A PATTERN
                   8398: 
1.1.1.2   misho    8399:        The function regexec() is called  to  match  a  compiled  pattern  preg
                   8400:        against  a  given string, which is by default terminated by a zero byte
                   8401:        (but see REG_STARTEND below), subject to the options in  eflags.  These
1.1       misho    8402:        can be:
                   8403: 
                   8404:          REG_NOTBOL
                   8405: 
                   8406:        The PCRE_NOTBOL option is set when calling the underlying PCRE matching
                   8407:        function.
                   8408: 
                   8409:          REG_NOTEMPTY
                   8410: 
                   8411:        The PCRE_NOTEMPTY option is set when calling the underlying PCRE match-
                   8412:        ing function. Note that REG_NOTEMPTY is not part of the POSIX standard.
                   8413:        However, setting this option can give more POSIX-like behaviour in some
                   8414:        situations.
                   8415: 
                   8416:          REG_NOTEOL
                   8417: 
                   8418:        The PCRE_NOTEOL option is set when calling the underlying PCRE matching
                   8419:        function.
                   8420: 
                   8421:          REG_STARTEND
                   8422: 
1.1.1.2   misho    8423:        The string is considered to start at string +  pmatch[0].rm_so  and  to
                   8424:        have  a terminating NUL located at string + pmatch[0].rm_eo (there need
                   8425:        not actually be a NUL at that location), regardless  of  the  value  of
                   8426:        nmatch.  This  is a BSD extension, compatible with but not specified by
                   8427:        IEEE Standard 1003.2 (POSIX.2), and should  be  used  with  caution  in
1.1       misho    8428:        software intended to be portable to other systems. Note that a non-zero
                   8429:        rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
                   8430:        of the string, not how it is matched.
                   8431: 
1.1.1.2   misho    8432:        If  the pattern was compiled with the REG_NOSUB flag, no data about any
                   8433:        matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of
1.1       misho    8434:        regexec() are ignored.
                   8435: 
                   8436:        If the value of nmatch is zero, or if the value of pmatch is NULL, no
                   8437:        data about any matched strings is returned.
                   8438: 
                   8439:        Otherwise, the portion of the string that was matched, and also any
                   8440:        captured substrings, are returned via the pmatch argument, which points to
1.1.1.2   misho    8441:        an array of nmatch structures of type regmatch_t, containing  the  mem-
                   8442:        bers  rm_so  and rm_eo. These contain the offset to the first character
                   8443:        of each substring and the offset to the first character after  the  end
                   8444:        of  each substring, respectively. The 0th element of the vector relates
                   8445:        to the entire portion of string that was matched;  subsequent  elements
                   8446:        relate  to  the capturing subpatterns of the regular expression. Unused
1.1       misho    8447:        entries in the array have both structure members set to -1.
                   8448: 
1.1.1.2   misho    8449:        A successful match yields  a  zero  return;  various  error  codes  are
                   8450:        defined  in  the  header  file,  of which REG_NOMATCH is the "expected"
1.1       misho    8451:        failure code.
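
The way rm_so and rm_eo delimit a captured substring can be sketched as
follows. first_capture() and MAX_GROUPS are hypothetical names chosen for this
illustration, and the system <regex.h> stands in for <pcreposix.h> so that the
sketch builds without PCRE (an assumption, as is the choice to truncate
silently when the output buffer is too small).

```c
#include <assert.h>
#include <string.h>
#include <regex.h>   /* assumption: <pcreposix.h> when building against PCRE */

/* Hypothetical helper: copy capture group `idx` of the first match of
   `pattern` in `subject` into `out` (capacity `outsz`). Group 0 is the
   whole match. Returns 0 on success, REG_NOMATCH if there is no match or
   the group is unset, or another non-zero code from regcomp()/regexec(). */
static int first_capture(const char *pattern, const char *subject,
                         size_t idx, char *out, size_t outsz)
{
    enum { MAX_GROUPS = 10 };
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int rc;

    assert(idx < MAX_GROUPS && out != NULL && outsz > 0);

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;                  /* re is not usable after a failure */

    rc = regexec(&re, subject, MAX_GROUPS, pm, 0);
    if (rc == 0) {
        if (pm[idx].rm_so == -1) {
            rc = REG_NOMATCH;       /* unused entry: both members are -1 */
        } else {
            /* rm_eo is the offset of the first character after the end. */
            size_t len = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
            if (len >= outsz)
                len = outsz - 1;    /* truncate to fit the buffer */
            memcpy(out, subject + pm[idx].rm_so, len);
            out[len] = '\0';
        }
    }
    regfree(&re);
    return rc;
}
```

Compiling on every call keeps the sketch short; code that matches the same
pattern repeatedly would normally compile once and reuse the regex_t.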
                   8452: 
                   8453: 
                   8454: ERROR MESSAGES
                   8455: 
                   8456:        The regerror() function maps a non-zero errorcode from either regcomp()
1.1.1.2   misho    8457:        or  regexec()  to  a  printable message. If preg is not NULL, the error
1.1       misho    8458:        should have arisen from the use of that structure. A message terminated
1.1.1.2   misho    8459:        by  a  binary  zero  is  placed  in  errbuf. The length of the message,
                   8460:        including the zero, is limited to errbuf_size. The yield of  the  func-
1.1       misho    8461:        tion is the size of buffer needed to hold the whole message.
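
Because the yield is the buffer size needed for the whole message, regerror()
can be called twice: once to learn the size and once to fetch the text. The
sketch below assumes this two-call pattern is acceptable for the caller;
error_message() is a hypothetical helper name, and the system <regex.h> is
used in place of <pcreposix.h> so that the sketch builds without PCRE.

```c
#include <assert.h>
#include <stdlib.h>
#include <regex.h>   /* assumption: <pcreposix.h> when building against PCRE */

/* Hypothetical helper: return a malloc()ed copy of the complete message
   for a non-zero `errcode`. `re` may be NULL. The caller frees the
   result; NULL is returned if allocation fails. */
static char *error_message(int errcode, const regex_t *re)
{
    char probe[1];
    size_t need;
    char *msg;

    assert(errcode != 0);

    /* First call: the one-byte buffer truncates the text, but the
       return value is the size needed for the whole message. */
    need = regerror(errcode, re, probe, sizeof probe);

    msg = malloc(need);
    if (msg != NULL)
        regerror(errcode, re, msg, need);   /* second call: full text */
    return msg;
}
```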
                   8462: 
                   8463: 
                   8464: MEMORY USAGE
                   8465: 
1.1.1.2   misho    8466:        Compiling  a regular expression causes memory to be allocated and asso-
                   8467:        ciated with the preg structure. The function regfree() frees  all  such
                   8468:        memory,  after  which  preg may no longer be used as a compiled expres-
1.1       misho    8469:        sion.
                   8470: 
                   8471: 
                   8472: AUTHOR
                   8473: 
                   8474:        Philip Hazel
                   8475:        University Computing Service
                   8476:        Cambridge CB2 3QH, England.
                   8477: 
                   8478: 
                   8479: REVISION
                   8480: 
1.1.1.2   misho    8481:        Last updated: 09 January 2012
                   8482:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    8483: ------------------------------------------------------------------------------
                   8484: 
                   8485: 
                   8486: PCRECPP(3)                                                          PCRECPP(3)
                   8487: 
                   8488: 
                   8489: NAME
                   8490:        PCRE - Perl-compatible regular expressions.
                   8491: 
                   8492: 
                   8493: SYNOPSIS OF C++ WRAPPER
                   8494: 
                   8495:        #include <pcrecpp.h>
                   8496: 
                   8497: 
                   8498: DESCRIPTION
                   8499: 
                   8500:        The  C++  wrapper  for PCRE was provided by Google Inc. Some additional
                   8501:        functionality was added by Giuseppe Maxia. This brief man page was con-
                   8502:        structed  from  the  notes  in the pcrecpp.h file, which should be con-
1.1.1.2   misho    8503:        sulted for further details. Note that the C++ wrapper supports only the
                   8504:        original 8-bit PCRE library. There is no 16-bit support at present.
1.1       misho    8505: 
                   8506: 
                   8507: MATCHING INTERFACE
                   8508: 
1.1.1.2   misho    8509:        The  "FullMatch" operation checks that supplied text matches a supplied
                   8510:        pattern exactly. If pointer arguments are supplied, it  copies  matched
1.1       misho    8511:        sub-strings that match sub-patterns into them.
                   8512: 
                   8513:          Example: successful match
                   8514:             pcrecpp::RE re("h.*o");
                   8515:             re.FullMatch("hello");
                   8516: 
                   8517:          Example: unsuccessful match (requires full match):
                   8518:             pcrecpp::RE re("e");
                   8519:             !re.FullMatch("hello");
                   8520: 
                   8521:          Example: creating a temporary RE object:
                   8522:             pcrecpp::RE("h.*o").FullMatch("hello");
                   8523: 
1.1.1.2   misho    8524:        You  can pass in a "const char*" or a "string" for "text". The examples
                   8525:        below tend to use a const char*. You can, as in the different  examples
                   8526:        above,  store the RE object explicitly in a variable or use a temporary
                   8527:        RE object. The examples below use one mode or  the  other  arbitrarily.
1.1       misho    8528:        Either could correctly be used for any of these examples.
                   8529: 
                   8530:        You must supply extra pointer arguments to extract matched subpieces.
                   8531: 
                   8532:          Example: extracts "ruby" into "s" and 1234 into "i"
                   8533:             int i;
                   8534:             string s;
                   8535:             pcrecpp::RE re("(\\w+):(\\d+)");
                   8536:             re.FullMatch("ruby:1234", &s, &i);
                   8537: 
                   8538:          Example: does not try to extract any extra sub-patterns
                   8539:             re.FullMatch("ruby:1234", &s);
                   8540: 
                   8541:          Example: does not try to extract into NULL
                   8542:             re.FullMatch("ruby:1234", NULL, &i);
                   8543: 
                   8544:          Example: integer overflow causes failure
                   8545:             !re.FullMatch("ruby:1234567891234", NULL, &i);
                   8546: 
                   8547:          Example: fails because there aren't enough sub-patterns:
                   8548:             !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
                   8549: 
                   8550:          Example: fails because string cannot be stored in integer
                   8551:             !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
                   8552: 
1.1.1.2   misho    8553:        The  provided  pointer  arguments can be pointers to any scalar numeric
1.1       misho    8554:        type, or one of:
                   8555: 
                   8556:           string        (matched piece is copied to string)
                   8557:           StringPiece   (StringPiece is mutated to point to matched piece)
                   8558:           T             (where "bool T::ParseFrom(const char*, int)" exists)
                   8559:           NULL          (the corresponding matched sub-pattern is not copied)
                   8560: 
1.1.1.2   misho    8561:        The function returns true iff all of the following conditions are  sat-
1.1       misho    8562:        isfied:
                   8563: 
                   8564:          a. "text" matches "pattern" exactly;
                   8565: 
                   8566:          b. The number of matched sub-patterns is >= number of supplied
                   8567:             pointers;
                   8568: 
                   8569:          c. The "i"th argument has a suitable type for holding the
                   8570:             string captured as the "i"th sub-pattern. If you pass in
                   8571:             void * NULL for the "i"th argument, or a non-void * NULL
                   8572:             of the correct type, or pass fewer arguments than the
                   8573:             number of sub-patterns, "i"th captured sub-pattern is
                   8574:             ignored.
                   8575: 
1.1.1.2   misho    8576:        CAVEAT:  An  optional  sub-pattern  that  does not exist in the matched
                   8577:        string is assigned the empty  string.  Therefore,  the  following  will
1.1       misho    8578:        return false (because the empty string is not a valid number):
                   8579: 
                   8580:           int number;
                   8581:           pcrecpp::RE::FullMatch("abc", "[a-z]+(\\d+)?", &number);
                   8582: 
1.1.1.2   misho    8583:        The  matching interface supports at most 16 arguments per call.  If you
                   8584:        need   more,   consider    using    the    more    general    interface
1.1       misho    8585:        pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.
                   8586: 
1.1.1.2   misho    8587:        NOTE:  Do not use no_arg, which is used internally to mark the end of a
                   8588:        list of optional arguments, as a placeholder for missing arguments,  as
1.1       misho    8589:        this can lead to segfaults.
                   8590: 
                   8591: 
                   8592: QUOTING METACHARACTERS
                   8593: 
1.1.1.2   misho    8594:        You  can use the "QuoteMeta" operation to insert backslashes before all
                   8595:        potentially meaningful characters in a  string.  The  returned  string,
1.1       misho    8596:        used as a regular expression, will exactly match the original string.
                   8597: 
                   8598:          Example:
                   8599:             string quoted = RE::QuoteMeta(unquoted);
                   8600: 
1.1.1.2   misho    8601:        Note  that  it's  legal to escape a character even if it has no special
                   8602:        meaning in a regular expression -- so this function  does  that.  (This
                   8603:        also  makes  it  identical  to  the perl function of the same name; see
                   8604:        "perldoc   -f   quotemeta".)    For   example,    "1.5-2.0?"    becomes
1.1       misho    8605:        "1\.5\-2\.0\?".
                   8606: 
                   8607: 
                   8608: PARTIAL MATCHES
                   8609: 
1.1.1.2   misho    8610:        You  can  use the "PartialMatch" operation when you want the pattern to
1.1       misho    8611:        match any substring of the text.
                   8612: 
                   8613:          Example: simple search for a string:
                   8614:             pcrecpp::RE("ell").PartialMatch("hello");
                   8615: 
                   8616:          Example: find first number in a string:
                   8617:             int number;
                   8618:             pcrecpp::RE re("(\\d+)");
                   8619:             re.PartialMatch("x*100 + 20", &number);
                   8620:             assert(number == 100);
                   8621: 
                   8622: 
                   8623: UTF-8 AND THE MATCHING INTERFACE
                   8624: 
1.1.1.2   misho    8625:        By default, pattern and text are plain text, one  byte  per  character.
                   8626:        The  UTF8  flag,  passed  to  the  constructor, causes both pattern and
1.1       misho    8627:        string to be treated as UTF-8 text, still a byte stream but potentially
1.1.1.2   misho    8628:        multiple  bytes  per character. In practice, the text is likelier to be
                   8629:        UTF-8 than the pattern, but the match returned may depend on  the  UTF8
                   8630:        flag,  so  always use it when matching UTF8 text. For example, "." will
                   8631:        match one byte normally but with UTF8 set may match up to  four  bytes
1.1       misho    8632:        of a multi-byte character.
                   8633: 
                   8634:          Example:
                   8635:             pcrecpp::RE_Options options;
                   8636:             options.set_utf8();
                   8637:             pcrecpp::RE re(utf8_pattern, options);
                   8638:             re.FullMatch(utf8_string);
                   8639: 
                   8640:          Example: using the convenience function UTF8():
                   8641:             pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
                   8642:             re.FullMatch(utf8_string);
                   8643: 
                   8644:        NOTE: The UTF8 flag is ignored if pcre was not configured with the
                   8645:              --enable-utf8 flag.
                   8646: 
                   8647: 
                   8648: PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE
                   8649: 
1.1.1.2   misho    8650:        PCRE  defines  some  modifiers  to  change  the behavior of the regular
                   8651:        expression  engine.  The  C++  wrapper  defines  an  auxiliary   class,
                   8652:        RE_Options,  as  a  vehicle  to pass such modifiers to a RE class. Cur-
1.1       misho    8653:        rently, the following modifiers are supported:
                   8654: 
                   8655:           modifier              description               Perl corresponding
                   8656: 
                   8657:           PCRE_CASELESS         case insensitive match      /i
                   8658:           PCRE_MULTILINE        multiple lines match        /m
                   8659:           PCRE_DOTALL           dot matches newlines        /s
                   8660:           PCRE_DOLLAR_ENDONLY   $ matches only at end       N/A
                   8661:           PCRE_EXTRA            strict escape parsing       N/A
1.1.1.3 ! misho    8662:           PCRE_EXTENDED         ignore white spaces         /x
1.1       misho    8663:           PCRE_UTF8             handles UTF8 chars          built-in
                   8664:           PCRE_UNGREEDY         reverses * and *?           N/A
                   8665:           PCRE_NO_AUTO_CAPTURE  disables capturing parens   N/A (*)
                   8666: 
1.1.1.2   misho    8667:        (*) Both Perl and PCRE allow non capturing parentheses by means of  the
                   8668:        "?:"  modifier  within the pattern itself. e.g. (?:ab|cd) does not cap-
1.1       misho    8669:        ture, while (ab|cd) does.
                   8670: 
1.1.1.2   misho    8671:        For a full account on how each modifier works, please  check  the  PCRE
1.1       misho    8672:        API reference page.
                   8673: 
1.1.1.2   misho    8674:        For  each  modifier,  there are two member functions whose name is made
                   8675:        out of the modifier in  lowercase,  without  the  "PCRE_"  prefix.  For
1.1       misho    8676:        instance, PCRE_CASELESS is handled by
                   8677: 
                   8678:          bool caseless()
                   8679: 
                   8680:        which returns true if the modifier is set, and
                   8681: 
                   8682:          RE_Options & set_caseless(bool)
                   8683: 
                   8684:        which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
1.1.1.2   misho    8685:        be accessed through  the  set_match_limit()  and  match_limit()  member
                   8686:        functions.  Setting match_limit to a non-zero value will limit the exe-
                   8687:        cution of pcre to keep it from doing bad things like blowing the  stack
                   8688:        or  taking  an  eternity  to  return  a result. A value of 5000 is good
                   8689:        enough to stop stack blowup in a 2MB thread stack. Setting  match_limit
                   8690:        to   zero   disables   match  limiting.  Alternatively,  you  can  call
                   8691:        match_limit_recursion() which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION  to
                   8692:        limit  how  much  PCRE  recurses.  match_limit()  limits  the number of
1.1       misho    8693:        matches PCRE does; match_limit_recursion() limits the depth of internal
                   8694:        recursion, and therefore the amount of stack that is used.
                   8695: 
1.1.1.2   misho    8696:        Normally,  to  pass  one or more modifiers to a RE class, you declare a
1.1       misho    8697:        RE_Options object, set the appropriate options, and pass this object to
                   8698:        a RE constructor. Example:
                   8699: 
                   8700:           RE_Options opt;
                   8701:           opt.set_caseless(true);
                   8702:           if (RE("HELLO", opt).PartialMatch("hello world")) ...
                   8703: 
                   8704:        RE_Options has two constructors. The default constructor takes no
1.1.1.2   misho    8705:        arguments and creates a set of flags that are off by default. The other
                   8706:        constructor takes an option_flags parameter to facilitate transfer of
1.1       misho    8707:        legacy code from C programs. This lets you do
                   8708: 
                   8709:           RE(pattern,
                   8710:             RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);
                   8711: 
                   8712:        However, new code is better off doing
                   8713: 
                   8714:           RE(pattern,
                   8715:             RE_Options().set_caseless(true).set_multiline(true))
                   8716:               .PartialMatch(str);
                   8717: 
                   8718:        If you are going to pass one of the most used modifiers, there are some
                   8719:        convenience functions that return a RE_Options object with the appropri-
1.1.1.2   misho    8720:        ate modifier already set: CASELESS(),  UTF8(),  MULTILINE(),  DOTALL(),
1.1       misho    8721:        and EXTENDED().
                   8722: 
1.1.1.2   misho    8723:        If  you  need  to set several options at once, and you don't want to go
                   8724:        through the pains of declaring a RE_Options object and setting  several
                   8725:        options,  there is a parallel method that gives you that ability on the
                   8726:        fly. You can concatenate several set_xxxxx()  member  functions,  since
                   8727:        each  of  them returns a reference to its class object. For example, to
                   8728:        pass PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with  one
1.1       misho    8729:        statement, you may write:
                   8730: 
                   8731:           RE(" ^ xyz \\s+ .* blah$",
                   8732:             RE_Options()
                   8733:               .set_caseless(true)
                   8734:               .set_extended(true)
                   8735:               .set_multiline(true)).PartialMatch(sometext);
                   8736: 
                   8737: 
                   8738: SCANNING TEXT INCREMENTALLY
                   8739: 
1.1.1.2   misho    8740:        The  "Consume"  operation may be useful if you want to repeatedly match
1.1       misho    8741:        regular expressions at the front of a string and skip over them as they
1.1.1.2   misho    8742:        match.  This requires use of the "StringPiece" type, which represents a
                   8743:        sub-range of a real string. Like RE,  StringPiece  is  defined  in  the
1.1       misho    8744:        pcrecpp namespace.
                   8745: 
                   8746:          Example: read lines of the form "var = value" from a string.
                   8747:             string contents = ...;                 // Fill string somehow
                   8748:             pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece
                   8749: 
                   8750:             string var;
                   8751:             int value;
                   8752:             pcrecpp::RE re("(\\w+) = (\\d+)\n");
                   8753:             while (re.Consume(&input, &var, &value)) {
                   8754:               ...;
                   8755:             }
                   8756: 
1.1.1.2   misho    8757:        Each successful call to "Consume" will set "var" and "value",  and  also
1.1       misho    8758:        advance "input" so it points past the matched text.
                   8759: 
1.1.1.2   misho    8760:        The "FindAndConsume" operation is similar to  "Consume"  but  does  not
                   8761:        anchor  your  match  at  the  beginning of the string. For example, you
1.1       misho    8762:        could extract all words from a string by repeatedly calling
                   8763: 
                   8764:          pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
                   8765: 
                   8766: 
                   8767: PARSING HEX/OCTAL/C-RADIX NUMBERS
                   8768: 
                   8769:        By default, if you pass a pointer to a numeric value, the corresponding
1.1.1.2   misho    8770:        text  is  interpreted  as  a  base-10  number. You can instead wrap the
1.1       misho    8771:        pointer with a call to one of the operators Hex(), Octal(), or CRadix()
1.1.1.2   misho    8772:        to  interpret  the text in another base. The CRadix operator interprets
                   8773:        C-style "0" (base-8) and  "0x"  (base-16)  prefixes,  but  defaults  to
1.1       misho    8774:        base-10.
                   8775: 
                   8776:          Example:
                   8777:            int a, b, c, d;
                   8778:            pcrecpp::RE re("(.*) (.*) (.*) (.*)");
                   8779:            re.FullMatch("100 40 0100 0x40",
                   8780:                         pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                   8781:                         pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
                   8782: 
                   8783:        will leave 64 in a, b, c, and d.
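
       As a rough analogy only, the three radix rules correspond to the base
       argument of the standard C strtol() function. The sketch below is not
       the wrapper's implementation, and the parse_* helper names are
       hypothetical; it only illustrates why all four fields above yield 64.

```c
#include <stdlib.h>

/* Hypothetical helpers (not part of PCRE) that mirror the radix rules. */

/* Like Octal(): the text is always read as base 8. */
long parse_octal(const char *text)
{
  return strtol(text, NULL, 8);
}

/* Like Hex(): the text is always read as base 16. */
long parse_hex(const char *text)
{
  return strtol(text, NULL, 16);
}

/* Like CRadix(): base 0 makes strtol() honor a leading "0" (base 8)
   or "0x" (base 16), and fall back to base 10 otherwise. */
long parse_cradix(const char *text)
{
  return strtol(text, NULL, 0);
}
```

       With these helpers, parse_octal("100"), parse_hex("40"),
       parse_cradix("0100"), and parse_cradix("0x40") all return 64.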
                   8784: 
                   8785: 
                   8786: REPLACING PARTS OF STRINGS
                   8787: 
1.1.1.2   misho    8788:        You  can  replace the first match of "pattern" in "str" with "rewrite".
                   8789:        Within "rewrite", backslash-escaped digits (\1 to \9) can  be  used  to
                   8790:        insert text matching the corresponding parenthesized group from the pat-
1.1       misho    8791:        tern. \0 in "rewrite" refers to the entire matching text. For example:
                   8792: 
                   8793:          string s = "yabba dabba doo";
                   8794:          pcrecpp::RE("b+").Replace("d", &s);
                   8795: 
1.1.1.2   misho    8796:        will leave "s" containing "yada dabba doo". The result is true  if  the
1.1       misho    8797:        pattern matches and a replacement occurs, false otherwise.
                   8798: 
1.1.1.2   misho    8799:        GlobalReplace  is  like Replace except that it replaces all occurrences
                   8800:        of the pattern in the string with the  rewrite.  Replacements  are  not
1.1       misho    8801:        subject to re-matching. For example:
                   8802: 
                   8803:          string s = "yabba dabba doo";
                   8804:          pcrecpp::RE("b+").GlobalReplace("d", &s);
                   8805: 
1.1.1.2   misho    8806:        will  leave  "s"  containing  "yada dada doo". It returns the number of
1.1       misho    8807:        replacements made.
                   8808: 
1.1.1.2   misho    8809:        Extract is like Replace, except that if the pattern matches,  "rewrite"
                   8810:        is  copied into "out" (an additional argument) with substitutions.  The
                   8811:        non-matching portions of "text" are ignored. Returns true iff  a  match
1.1       misho    8812:        occurred and the extraction happened successfully;  if no match occurs,
                   8813:        the string is left unaffected.
                   8814: 
                   8815: 
                   8816: AUTHOR
                   8817: 
                   8818:        The C++ wrapper was contributed by Google Inc.
                   8819:        Copyright (c) 2007 Google Inc.
                   8820: 
                   8821: 
                   8822: REVISION
                   8823: 
1.1.1.2   misho    8824:        Last updated: 08 January 2012
1.1       misho    8825: ------------------------------------------------------------------------------
                   8826: 
                   8827: 
                   8828: PCRESAMPLE(3)                                                    PCRESAMPLE(3)
                   8829: 
                   8830: 
                   8831: NAME
                   8832:        PCRE - Perl-compatible regular expressions
                   8833: 
                   8834: 
                   8835: PCRE SAMPLE PROGRAM
                   8836: 
                   8837:        A simple, complete demonstration program, to get you started with using
                   8838:        PCRE, is supplied in the file pcredemo.c in the  PCRE  distribution.  A
                   8839:        listing  of this program is given in the pcredemo documentation. If you
                   8840:        do not have a copy of the PCRE distribution, you can save this  listing
                   8841:        to re-create pcredemo.c.
                   8842: 
1.1.1.2   misho    8843:        The  demonstration program, which uses the original PCRE 8-bit library,
                   8844:        compiles the regular expression that is its first argument, and matches
                   8845:        it  against  the subject string in its second argument. No PCRE options
                   8846:        are set, and default character tables are used. If  matching  succeeds,
                   8847:        the  program  outputs the portion of the subject that matched, together
                   8848:        with the contents of any captured substrings.
1.1       misho    8849: 
                   8850:        If the -g option is given on the command line, the program then goes on
                   8851:        to check for further matches of the same regular expression in the same
1.1.1.2   misho    8852:        subject string. The logic is a little bit tricky because of the  possi-
                   8853:        bility  of  matching an empty string. Comments in the code explain what
1.1       misho    8854:        is going on.
                   8855: 
1.1.1.2   misho    8856:        If PCRE is installed in the standard include  and  library  directories
1.1       misho    8857:        for your operating system, you should be able to compile the demonstra-
                   8858:        tion program using this command:
                   8859: 
                   8860:          gcc -o pcredemo pcredemo.c -lpcre
                   8861: 
1.1.1.2   misho    8862:        If PCRE is installed elsewhere, you may need to add additional  options
                   8863:        to  the  command line. For example, on a Unix-like system that has PCRE
                   8864:        installed in /usr/local, you  can  compile  the  demonstration  program
1.1       misho    8865:        using a command like this:
                   8866: 
                   8867:          gcc -o pcredemo -I/usr/local/include pcredemo.c \
                   8868:              -L/usr/local/lib -lpcre
                   8869: 
1.1.1.2   misho    8870:        In  a  Windows  environment, if you want to statically link the program
1.1       misho    8871:        against a non-dll pcre.a file, you must uncomment the line that defines
1.1.1.2   misho    8872:        PCRE_STATIC  before  including  pcre.h, because otherwise the pcre_mal-
1.1       misho    8873:        loc()   and   pcre_free()   exported   functions   will   be   declared
                   8874:        __declspec(dllimport), with unwanted results.
                   8875: 
1.1.1.2   misho    8876:        Once  you  have  compiled and linked the demonstration program, you can
1.1       misho    8877:        run simple tests like this:
                   8878: 
                   8879:          ./pcredemo 'cat|dog' 'the cat sat on the mat'
                   8880:          ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
                   8881: 
1.1.1.2   misho    8882:        Note that there is a  much  more  comprehensive  test  program,  called
                   8883:        pcretest,  which  supports  many  more  facilities  for testing regular
                   8884:        expressions and both PCRE libraries. The pcredemo program  is  provided
                   8885:        as a simple coding example.
1.1       misho    8886: 
1.1.1.2   misho    8887:        If  you  try to run pcredemo when PCRE is not installed in the standard
                   8888:        library directory, you may get an error like  this  on  some  operating
1.1       misho    8889:        systems (e.g. Solaris):
                   8890: 
1.1.1.2   misho    8891:          ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or
1.1       misho    8892:          directory
                   8893: 
1.1.1.2   misho    8894:        This is caused by the way shared library support works  on  those  sys-
1.1       misho    8895:        tems. You need to add
                   8896: 
                   8897:          -R/usr/local/lib
                   8898: 
                   8899:        (for example) to the compile command to get round this problem.
                   8900: 
                   8901: 
                   8902: AUTHOR
                   8903: 
                   8904:        Philip Hazel
                   8905:        University Computing Service
                   8906:        Cambridge CB2 3QH, England.
                   8907: 
                   8908: 
                   8909: REVISION
                   8910: 
1.1.1.2   misho    8911:        Last updated: 10 January 2012
                   8912:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    8913: ------------------------------------------------------------------------------
                   8914: PCRELIMITS(3)                                                    PCRELIMITS(3)
                   8915: 
                   8916: 
                   8917: NAME
                   8918:        PCRE - Perl-compatible regular expressions
                   8919: 
                   8920: 
                   8921: SIZE AND OTHER LIMITATIONS
                   8922: 
                   8923:        There  are some size limitations in PCRE but it is hoped that they will
                   8924:        never in practice be relevant.
                   8925: 
1.1.1.2   misho    8926:        The maximum length of a compiled  pattern  is  approximately  64K  data
                   8927:        units  (bytes  for  the  8-bit  library,  16-bit  units  for the 16-bit
                   8928:        library) if PCRE is compiled with the default internal linkage size  of
                   8929:        2  bytes.  If  you  want  to process regular expressions that are truly
                   8930:        enormous, you can compile PCRE with an internal linkage size of 3 or  4
                   8931:        (when  building  the  16-bit  library,  3  is rounded up to 4). See the
                   8932:        README file in the source distribution and the pcrebuild  documentation
                   8933:        for  details.  In  these cases the limit is substantially larger.  How-
                   8934:        ever, the speed of execution is slower.
1.1       misho    8935: 
                   8936:        All values in repeating quantifiers must be less than 65536.
                   8937: 
                   8938:        There is no limit to the number of parenthesized subpatterns, but there
                   8939:        can be no more than 65535 capturing subpatterns.
                   8940: 
                   8941:        There is a limit to the number of forward references to subsequent sub-
                   8942:        patterns of around 200,000.  Repeated  forward  references  with  fixed
                   8943:        upper  limits,  for example, (?2){0,100} when subpattern number 2 is to
                   8944:        the right, are included in the count. There is no limit to  the  number
                   8945:        of backward references.
                   8946: 
                   8947:        The maximum length of a name for a named subpattern is 32 characters, and
                   8948:        the maximum number of named subpatterns is 10000.
                   8949: 
1.1.1.3 ! misho    8950:        The maximum length of a  name  in  a  (*MARK),  (*PRUNE),  (*SKIP),  or
        !          8951:        (*THEN)  verb  is  255  for  the 8-bit library and 65535 for the 16-bit
        !          8952:        library.
        !          8953: 
1.1       misho    8954:        The maximum length of a subject string is the largest  positive  number
                   8955:        that  an integer variable can hold. However, when using the traditional
                   8956:        matching function, PCRE uses recursion to handle subpatterns and indef-
                   8957:        inite  repetition.  This means that the available stack space may limit
                   8958:        the size of a subject string that can be processed by certain patterns.
                   8959:        For a discussion of stack issues, see the pcrestack documentation.
                   8960: 
                   8961: 
                   8962: AUTHOR
                   8963: 
                   8964:        Philip Hazel
                   8965:        University Computing Service
                   8966:        Cambridge CB2 3QH, England.
                   8967: 
                   8968: 
                   8969: REVISION
                   8970: 
1.1.1.3 ! misho    8971:        Last updated: 04 May 2012
1.1.1.2   misho    8972:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    8973: ------------------------------------------------------------------------------
                   8974: 
                   8975: 
                   8976: PCRESTACK(3)                                                      PCRESTACK(3)
                   8977: 
                   8978: 
                   8979: NAME
                   8980:        PCRE - Perl-compatible regular expressions
                   8981: 
                   8982: 
                   8983: PCRE DISCUSSION OF STACK USAGE
                   8984: 
1.1.1.2   misho    8985:        When  you  call  pcre[16]_exec(),  it makes use of an internal function
                   8986:        called match(). This calls itself recursively at branch points  in  the
                   8987:        pattern,  in  order  to  remember the state of the match so that it can
                   8988:        back up and try a different alternative if  the  first  one  fails.  As
                   8989:        matching proceeds deeper and deeper into the tree of possibilities, the
                   8990:        recursion depth increases. The match() function is also called in other
                   8991:        circumstances,  for  example,  whenever  a parenthesized sub-pattern is
                   8992:        entered, and in certain cases of repetition.
1.1       misho    8993: 
                   8994:        Not all calls of match() increase the recursion depth; for an item such
                   8995:        as  a* it may be called several times at the same level, after matching
                   8996:        different numbers of a's. Furthermore, in a number of cases  where  the
                   8997:        result  of  the  recursive call would immediately be passed back as the
                   8998:        result of the current call (a "tail recursion"), the function  is  just
                   8999:        restarted instead.
                   9000: 
1.1.1.2   misho    9001:        The  above  comments  apply  when  pcre[16]_exec() is run in its normal
                   9002:        interpretive  manner.   If   the   pattern   was   studied   with   the
                   9003:        PCRE_STUDY_JIT_COMPILE  option, and just-in-time compiling was success-
                   9004:        ful, and the options passed to pcre[16]_exec() were  not  incompatible,
                   9005:        the  matching process uses the JIT-compiled code instead of the match()
                   9006:        function. In this case, the memory requirements  are  handled  entirely
                   9007:        differently. See the pcrejit documentation for details.
1.1       misho    9008: 
1.1.1.2   misho    9009:        The pcre[16]_dfa_exec() function operates in an entirely different way,
                   9010:        and uses recursion only when there is a regular expression recursion or
1.1       misho    9011:        subroutine  call in the pattern. This includes the processing of asser-
                   9012:        tion and "once-only" subpatterns, which  are  handled  like  subroutine
                   9013:        calls.  Normally,  these are never very deep, and the limit on the com-
1.1.1.2   misho    9014:        plexity of pcre[16]_dfa_exec() is controlled by the amount of workspace
                   9015:        it  is  given.   However, it is possible to write patterns with runaway
                   9016:        infinite recursions; such patterns will  cause  pcre[16]_dfa_exec()  to
                   9017:        run out of stack. At present, there is no protection against this.
1.1       misho    9018: 
1.1.1.2   misho    9019:        The  comments that follow do NOT apply to pcre[16]_dfa_exec(); they are
                   9020:        relevant only for pcre[16]_exec() without the JIT optimization.
1.1       misho    9021: 
1.1.1.2   misho    9022:    Reducing pcre[16]_exec()'s stack usage
1.1       misho    9023: 
                   9024:        Each time that match() is actually called recursively, it  uses  memory
                   9025:        from  the  process  stack.  For certain kinds of pattern and data, very
                   9026:        large amounts of stack may be needed, despite the recognition of  "tail
                   9027:        recursion".   You  can often reduce the amount of recursion, and there-
                   9028:        fore the amount of stack used, by modifying the pattern that  is  being
                   9029:        matched. Consider, for example, this pattern:
                   9030: 
                   9031:          ([^<]|<(?!inet))+
                   9032: 
                   9033:        It  matches  from wherever it starts until it encounters "<inet" or the
                   9034:        end of the data, and is the kind of pattern that  might  be  used  when
                   9035:        processing an XML file. Each iteration of the outer parentheses matches
                   9036:        either one character that is not "<" or a "<" that is not  followed  by
                   9037:        "inet".  However,  each  time  a  parenthesis is processed, a recursion
                   9038:        occurs, so this formulation uses a stack frame for each matched charac-
                   9039:        ter.  For  a long string, a lot of stack is required. Consider now this
                   9040:        rewritten pattern, which matches exactly the same strings:
                   9041: 
                   9042:          ([^<]++|<(?!inet))+
                   9043: 
                   9044:        This uses very much less stack, because runs of characters that do  not
                   9045:        contain  "<" are "swallowed" in one item inside the parentheses. Recur-
                   9046:        sion happens only when a "<" character that is not followed  by  "inet"
                   9047:        is  encountered  (and  we assume this is relatively rare). A possessive
                   9048:        quantifier is used to stop any backtracking into the  runs  of  non-"<"
                   9049:        characters, but that is not related to stack usage.
                   9050: 
                   9051:        This  example shows that one way of avoiding stack problems when match-
                   9052:        ing long subject strings is to write repeated parenthesized subpatterns
                   9053:        to match more than one character whenever possible.
                   9054: 
1.1.1.2   misho    9055:    Compiling PCRE to use heap instead of stack for pcre[16]_exec()
1.1       misho    9056: 
                   9057:        In  environments  where  stack memory is constrained, you might want to
                   9058:        compile PCRE to use heap memory instead of stack for remembering  back-
1.1.1.2   misho    9059:        up points when pcre[16]_exec() is running. This makes it run a lot more
1.1       misho    9060:        slowly, however.  Details of how to do this are given in the  pcrebuild
                   9061:        documentation. When built in this way, instead of using the stack, PCRE
                   9062:        obtains and frees memory by calling the functions that are  pointed  to
1.1.1.2   misho    9063:        by  the  pcre[16]_stack_malloc  and  pcre[16]_stack_free  variables. By
                   9064:        default, these point to malloc() and free(), but you  can  replace  the
                   9065:        pointers to cause PCRE to use your own functions. Since the block sizes
                   9066:        are always the same, and are always freed in reverse order, it  may  be
                   9067:        possible  to  implement  customized memory handlers that are more effi-
                   9068:        cient than the standard functions.
1.1       misho    9069: 
1.1.1.2   misho    9070:    Limiting pcre[16]_exec()'s stack usage
1.1       misho    9071: 
                   9072:        You can set limits on the number of times that match() is called,  both
1.1.1.2   misho    9073:        in  total  and  recursively.  If  a  limit is exceeded, pcre[16]_exec()
                   9074:        returns an error code. Setting suitable limits should prevent  it  from
                   9075:        running  out of stack. The default values of the limits are very large,
                   9076:        and unlikely ever to operate. They can be changed when PCRE  is  built,
                   9077:        and they can also be set when pcre[16]_exec() is called. For details of
                   9078:        these interfaces, see the pcrebuild documentation and  the  section  on
                   9079:        extra data for pcre[16]_exec() in the pcreapi documentation.
1.1       misho    9080: 
                   9081:        As a very rough rule of thumb, you should reckon on about 500 bytes per
                   9082:        recursion. Thus, if you want to limit your  stack  usage  to  8Mb,  you
                   9083:        should  set  the  limit at 16000 recursions. A 64Mb stack, on the other
                   9084:        hand, can support around 128000 recursions.
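
       The arithmetic behind these figures is a single division. The sketch
       below uses a hypothetical helper, recursion_limit_for(), which is not
       a PCRE function; it assumes decimal megabytes, as the figures above
       do, and it makes no allowance for stack used by the rest of the pro-
       gram.

```c
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): the deepest recursion
   that fits in stack_bytes if each match() frame needs frame_bytes.
   Returns 0 when frame_bytes is 0, to avoid dividing by zero. */
unsigned long recursion_limit_for(size_t stack_bytes, size_t frame_bytes)
{
  if (frame_bytes == 0) return 0;
  return (unsigned long)(stack_bytes / frame_bytes);
}
```

       For example, recursion_limit_for(8000000, 500) gives 16000. A value
       computed this way could be stored in the match_limit_recursion field
       of a pcre_extra block, with PCRE_EXTRA_MATCH_LIMIT_RECURSION set in
       its flags field, as described in the pcreapi documentation.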
                   9085: 
                   9086:        In Unix-like environments, the pcretest test program has a command line
                   9087:        option (-S) that can be used to increase the size of its stack. As long
                   9088:        as the stack is large enough, another option (-M) can be used  to  find
                   9089:        the  smallest  limits  that allow a particular pattern to match a given
1.1.1.2   misho    9090:        subject string. This is done by calling pcre[16]_exec() repeatedly with
1.1       misho    9091:        different limits.
                   9092: 
1.1.1.2   misho    9093:    Obtaining an estimate of stack usage
                   9094: 
                   9095:        The  actual  amount  of  stack used per recursion can vary quite a lot,
                   9096:        depending on the compiler that was used to build PCRE and the optimiza-
                   9097:        tion or debugging options that were set for it. The rule of thumb value
                   9098:        of 500 bytes mentioned above may be larger  or  smaller  than  what  is
                   9099:        actually needed. A better approximation can be obtained by running this
                   9100:        command:
                   9101: 
                   9102:          pcretest -m -C
                   9103: 
                   9104:        The -C option causes pcretest to output information about  the  options
                   9105:        with which PCRE was compiled. When -m is also given (before -C), infor-
                   9106:        mation about stack use is given in a line like this:
                   9107: 
                   9108:          Match recursion uses stack: approximate frame size = 640 bytes
                   9109: 
                   9110:        The value is approximate because some recursions need a bit more (up to
                   9111:        perhaps 16 more bytes).
                   9112: 
                   9113:        If  the  above  command  is given when PCRE is compiled to use the heap
                   9114:        instead of the stack for recursion, the value that  is  output  is  the
                   9115:        size of each block that is obtained from the heap.
                   9116: 
1.1       misho    9117:    Changing stack size in Unix-like systems
                   9118: 
                   9119:        In Unix-like environments, there is rarely a problem  with  the  stack
                   9120:        unless very long strings are involved,  though  the  default  limit  on
                   9121:        stack  size  varies  from system to system. Values from 8Mb to 64Mb are
                   9122:        common. You can find your default limit by running the command:
                   9123: 
                   9124:          ulimit -s
                   9125: 
                   9126:        Unfortunately, the effect of running out of  stack  is  often  SIGSEGV,
                   9127:        though  sometimes  a more explicit error message is given. You can nor-
                   9128:        mally increase the limit on stack size by code such as this:
                   9129: 
                   9130:          struct rlimit rlim;
                   9131:          getrlimit(RLIMIT_STACK, &rlim);
                   9132:          rlim.rlim_cur = 100*1024*1024;
                   9133:          setrlimit(RLIMIT_STACK, &rlim);
                   9134: 
                   9135:        This reads the current limits (soft and hard) using  getrlimit(),  then
                   9136:        attempts  to  increase  the  soft limit to 100Mb using setrlimit(). You
1.1.1.2   misho    9137:        must do this before calling pcre[16]_exec().
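
       The fragment above ignores errors and may ask for more than the hard
       limit allows. A slightly more defensive sketch follows. The helper
       name raise_stack_soft_limit() is hypothetical, and the comparisons
       assume, as on most current systems, that RLIM_INFINITY is the largest
       rlim_t value.

```c
#include <sys/resource.h>

/* Hypothetical helper: try to make the soft stack limit at least
   wanted_bytes, or as large as the hard limit permits. Returns 0 on
   success and -1 if getrlimit() or setrlimit() fails (errno is then
   set by the failing call). */
int raise_stack_soft_limit(rlim_t wanted_bytes)
{
  struct rlimit rlim;

  if (getrlimit(RLIMIT_STACK, &rlim) != 0) return -1;

  /* Never ask for more than the hard limit allows. */
  if (rlim.rlim_max != RLIM_INFINITY && wanted_bytes > rlim.rlim_max)
    wanted_bytes = rlim.rlim_max;

  /* Leave the limit alone if it is already large enough. */
  if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted_bytes)
    return 0;

  rlim.rlim_cur = wanted_bytes;
  return setrlimit(RLIMIT_STACK, &rlim);
}
```

       A caller might use raise_stack_soft_limit((rlim_t)100 * 1024 * 1024)
       early in main() and treat a non-zero result as a warning rather than
       a fatal error.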
1.1       misho    9138: 
                   9139:    Changing stack size in Mac OS X
                   9140: 
                   9141:        Using setrlimit(), as described above, should also work on Mac OS X. It
                   9142:        is also possible to set a stack size when linking a program. There is a
                   9143:        discussion  about  stack  sizes  in  Mac  OS  X  at  this   web   site:
                   9144:        http://developer.apple.com/qa/qa2005/qa1419.html.
                   9145: 
                   9146: 
                   9147: AUTHOR
                   9148: 
                   9149:        Philip Hazel
                   9150:        University Computing Service
                   9151:        Cambridge CB2 3QH, England.
                   9152: 
                   9153: 
                   9154: REVISION
                   9155: 
1.1.1.2   misho    9156:        Last updated: 21 January 2012
                   9157:        Copyright (c) 1997-2012 University of Cambridge.
1.1       misho    9158: ------------------------------------------------------------------------------
                   9159: 
                   9160: 
