File:  [ELWIX - Embedded LightWeight unIX -] / embedaddon / sqlite3 / ext / icu / README.txt
Revision 1.1.1.1 (vendor branch): download - view: text, annotated - select for diffs - revision graph
Tue Feb 21 17:04:17 2012 UTC (13 years, 1 month ago) by misho
Branches: sqlite3, MAIN
CVS tags: v3_7_10, HEAD
sqlite3

    1: 
    2: This directory contains source code for the SQLite "ICU" extension, an
    3: integration of the "International Components for Unicode" library with
    4: SQLite. Documentation follows.
    5: 
    6:     1. Features
    7:     
    8:         1.1  SQL Scalars upper() and lower()
    9:         1.2  Unicode Aware LIKE Operator
   10:         1.3  ICU Collation Sequences
   11:         1.4  SQL REGEXP Operator
   12:     
   13:     2. Compilation and Usage
   14:     
   15:     3. Bugs, Problems and Security Issues
   16:     
   17:         3.1  The "case_sensitive_like" Pragma
   18:         3.2  The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro
   19:         3.3  Collation Sequence Security Issue
   20: 
   21: 
   22: 1. FEATURES
   23: 
   24:   1.1  SQL Scalars upper() and lower()
   25: 
   26:     SQLite's built-in implementations of these two functions only 
   27:     provide case mapping for the 26 letters used in the English
   28:     language. The ICU based functions provided by this extension
   29:     provide case mapping, where defined, for the full range of 
   30:     unicode characters.
   31: 
   32:     ICU provides two types of case mapping, "general" case mapping and
   33:     "language specific". Refer to ICU documentation for the differences
   34:     between the two. Specifically:
   35: 
   36:        http://www.icu-project.org/userguide/caseMappings.html
   37:        http://www.icu-project.org/userguide/posix.html#case_mappings
   38: 
   39:     To utilise "general" case mapping, the upper() or lower() scalar 
   40:     functions are invoked with one argument:
   41: 
   42:         upper('ABC') -> 'abc'
   43:         lower('abc') -> 'ABC'
   44: 
   45:     To access ICU "language specific" case mapping, upper() or lower()
   46:     should be invoked with two arguments. The second argument is the name
   47:     of the locale to use. Passing an empty string ("") or SQL NULL value
   48:     as the second argument is the same as invoking the 1 argument version
   49:     of upper() or lower():
   50: 
   51:         lower('I', 'en_us') -> 'i'
   52:         lower('I', 'tr_tr') -> 'ı' (small dotless i)
   53: 
   54:   1.2  Unicode Aware LIKE Operator
   55: 
   56:     Similarly to the upper() and lower() functions, the built-in SQLite LIKE
   57:     operator understands case equivalence for the 26 letters of the English
   58:     language alphabet. The implementation of LIKE included in this
   59:     extension uses the ICU function u_foldCase() to provide case
   60:     independent comparisons for the full range of unicode characters.  
   61: 
   62:     The U_FOLD_CASE_DEFAULT flag is passed to u_foldCase(), meaning the
   63:     dotless 'I' character used in the Turkish language is considered
   64:     to be in the same equivalence class as the dotted 'I' character
   65:     used by many languages (including English).
   66: 
   67:   1.3  ICU Collation Sequences
   68: 
   69:     A special SQL scalar function, icu_load_collation() is provided that 
   70:     may be used to register ICU collation sequences with SQLite. It
   71:     is always called with exactly two arguments, the ICU locale 
   72:     identifying the collation sequence to ICU, and the name of the
   73:     SQLite collation sequence to create. For example, to create an
   74:     SQLite collation sequence named "turkish" using Turkish language
   75:     sorting rules, the SQL statement:
   76: 
   77:         SELECT icu_load_collation('tr_TR', 'turkish');
   78: 
   79:     Or, for Australian English:
   80: 
   81:         SELECT icu_load_collation('en_AU', 'australian');
   82: 
   83:     The identifiers "turkish" and "australian" may then be used
   84:     as collation sequence identifiers in SQL statements:
   85: 
   86:         CREATE TABLE aust_turkish_penpals(
   87:           australian_penpal_name TEXT COLLATE australian,
   88:           turkish_penpal_name    TEXT COLLATE turkish
   89:         );
   90:   
   91:   1.4 SQL REGEXP Operator
   92: 
   93:     This extension provides an implementation of the SQL binary
   94:     comparision operator "REGEXP", based on the regular expression functions
   95:     provided by the ICU library. The syntax of the operator is as described
   96:     in SQLite documentation:
   97: 
   98:         <string> REGEXP <re-pattern>
   99: 
  100:     This extension uses the ICU defaults for regular expression matching
  101:     behaviour. Specifically, this means that:
  102: 
  103:         * Matching is case-sensitive,
  104:         * Regular expression comments are not allowed within patterns, and
  105:         * The '^' and '$' characters match the beginning and end of the
  106:           <string> argument, not the beginning and end of lines within
  107:           the <string> argument.
  108: 
  109:     Even more specifically, the value passed to the "flags" parameter
  110:     of ICU C function uregex_open() is 0.
  111: 
  112: 
  113: 2  COMPILATION AND USAGE
  114: 
  115:   The easiest way to compile and use the ICU extension is to build
  116:   and use it as a dynamically loadable SQLite extension. To do this
  117:   using gcc on *nix:
  118: 
  119:     gcc -shared icu.c `icu-config --ldflags` -o libSqliteIcu.so
  120: 
  121:   You may need to add "-I" flags so that gcc can find sqlite3ext.h
  122:   and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be
  123:   loaded into sqlite in the same way as any other dynamically loadable
  124:   extension.
  125: 
  126: 
  127: 3 BUGS, PROBLEMS AND SECURITY ISSUES
  128: 
  129:   3.1 The "case_sensitive_like" Pragma
  130: 
  131:     This extension does not work well with the "case_sensitive_like"
  132:     pragma. If this pragma is used before the ICU extension is loaded,
  133:     then the pragma has no effect. If the pragma is used after the ICU
  134:     extension is loaded, then SQLite ignores the ICU implementation and
  135:     always uses the built-in LIKE operator.
  136: 
  137:     The ICU extension LIKE operator is always case insensitive.
  138: 
  139:   3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro
  140: 
  141:     Passing very long patterns to the built-in SQLite LIKE operator can
  142:     cause excessive CPU usage. To curb this problem, SQLite defines the
  143:     SQLITE_MAX_LIKE_PATTERN_LENGTH macro as the maximum length of a
  144:     pattern in bytes (irrespective of encoding). The default value is
  145:     defined in internal header file "limits.h".
  146:     
  147:     The ICU extension LIKE implementation suffers from the same 
  148:     problem and uses the same solution. However, since the ICU extension
  149:     code does not include the SQLite file "limits.h", modifying
  150:     the default value therein does not affect the ICU extension.
  151:     The default value of SQLITE_MAX_LIKE_PATTERN_LENGTH used by
  152:     the ICU extension LIKE operator is 50000, defined in source 
  153:     file "icu.c".
  154: 
  155:   3.3 Collation Sequence Security Issue
  156: 
  157:     Internally, SQLite assumes that indices stored in database files
  158:     are sorted according to the collation sequence indicated by the
  159:     SQL schema. Changing the definition of a collation sequence after
  160:     an index has been built is therefore equivalent to database
  161:     corruption. The SQLite library is not very well tested under
  162:     these conditions, and may contain potential buffer overruns
  163:     or other programming errors that could be exploited by a malicious
  164:     programmer.
  165: 
  166:     If the ICU extension is used in an environment where potentially
  167:     malicious users may execute arbitrary SQL (i.e. gears), they
  168:     should be prevented from invoking the icu_load_collation() function,
  169:     possibly using the authorisation callback.

FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>