1:
2: This directory contains source code for the SQLite "ICU" extension, an
3: integration of the "International Components for Unicode" library with
4: SQLite. Documentation follows.
5:
6: 1. Features
7:
8: 1.1 SQL Scalars upper() and lower()
9: 1.2 Unicode Aware LIKE Operator
10: 1.3 ICU Collation Sequences
11: 1.4 SQL REGEXP Operator
12:
13: 2. Compilation and Usage
14:
15: 3. Bugs, Problems and Security Issues
16:
17: 3.1 The "case_sensitive_like" Pragma
18: 3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro
19: 3.3 Collation Sequence Security Issue
20:
21:
22: 1. FEATURES
23:
24: 1.1 SQL Scalars upper() and lower()
25:
26: SQLite's built-in implementations of these two functions only
27: provide case mapping for the 26 letters used in the English
28: language. The ICU based functions provided by this extension
29: provide case mapping, where defined, for the full range of
30: unicode characters.
31:
32: ICU provides two types of case mapping, "general" case mapping and
33: "language specific". Refer to ICU documentation for the differences
34: between the two. Specifically:
35:
36: http://www.icu-project.org/userguide/caseMappings.html
37: http://www.icu-project.org/userguide/posix.html#case_mappings
38:
39: To utilise "general" case mapping, the upper() or lower() scalar
40: functions are invoked with one argument:
41:
42: upper('ABC') -> 'abc'
43: lower('abc') -> 'ABC'
44:
45: To access ICU "language specific" case mapping, upper() or lower()
46: should be invoked with two arguments. The second argument is the name
47: of the locale to use. Passing an empty string ("") or SQL NULL value
48: as the second argument is the same as invoking the 1 argument version
49: of upper() or lower():
50:
51: lower('I', 'en_us') -> 'i'
52: lower('I', 'tr_tr') -> 'ı' (small dotless i)
53:
54: 1.2 Unicode Aware LIKE Operator
55:
56: Similarly to the upper() and lower() functions, the built-in SQLite LIKE
57: operator understands case equivalence for the 26 letters of the English
58: language alphabet. The implementation of LIKE included in this
59: extension uses the ICU function u_foldCase() to provide case
60: independent comparisons for the full range of unicode characters.
61:
62: The U_FOLD_CASE_DEFAULT flag is passed to u_foldCase(), meaning the
63: dotless 'I' character used in the Turkish language is considered
64: to be in the same equivalence class as the dotted 'I' character
65: used by many languages (including English).
66:
67: 1.3 ICU Collation Sequences
68:
69: A special SQL scalar function, icu_load_collation() is provided that
70: may be used to register ICU collation sequences with SQLite. It
71: is always called with exactly two arguments, the ICU locale
72: identifying the collation sequence to ICU, and the name of the
73: SQLite collation sequence to create. For example, to create an
74: SQLite collation sequence named "turkish" using Turkish language
75: sorting rules, the SQL statement:
76:
77: SELECT icu_load_collation('tr_TR', 'turkish');
78:
79: Or, for Australian English:
80:
81: SELECT icu_load_collation('en_AU', 'australian');
82:
83: The identifiers "turkish" and "australian" may then be used
84: as collation sequence identifiers in SQL statements:
85:
86: CREATE TABLE aust_turkish_penpals(
87: australian_penpal_name TEXT COLLATE australian,
88: turkish_penpal_name TEXT COLLATE turkish
89: );
90:
91: 1.4 SQL REGEXP Operator
92:
93: This extension provides an implementation of the SQL binary
94: comparision operator "REGEXP", based on the regular expression functions
95: provided by the ICU library. The syntax of the operator is as described
96: in SQLite documentation:
97:
98: <string> REGEXP <re-pattern>
99:
100: This extension uses the ICU defaults for regular expression matching
101: behaviour. Specifically, this means that:
102:
103: * Matching is case-sensitive,
104: * Regular expression comments are not allowed within patterns, and
105: * The '^' and '$' characters match the beginning and end of the
106: <string> argument, not the beginning and end of lines within
107: the <string> argument.
108:
109: Even more specifically, the value passed to the "flags" parameter
110: of ICU C function uregex_open() is 0.
111:
112:
113: 2 COMPILATION AND USAGE
114:
115: The easiest way to compile and use the ICU extension is to build
116: and use it as a dynamically loadable SQLite extension. To do this
117: using gcc on *nix:
118:
119: gcc -shared icu.c `icu-config --ldflags` -o libSqliteIcu.so
120:
121: You may need to add "-I" flags so that gcc can find sqlite3ext.h
122: and sqlite3.h. The resulting shared lib, libSqliteIcu.so, may be
123: loaded into sqlite in the same way as any other dynamically loadable
124: extension.
125:
126:
127: 3 BUGS, PROBLEMS AND SECURITY ISSUES
128:
129: 3.1 The "case_sensitive_like" Pragma
130:
131: This extension does not work well with the "case_sensitive_like"
132: pragma. If this pragma is used before the ICU extension is loaded,
133: then the pragma has no effect. If the pragma is used after the ICU
134: extension is loaded, then SQLite ignores the ICU implementation and
135: always uses the built-in LIKE operator.
136:
137: The ICU extension LIKE operator is always case insensitive.
138:
139: 3.2 The SQLITE_MAX_LIKE_PATTERN_LENGTH Macro
140:
141: Passing very long patterns to the built-in SQLite LIKE operator can
142: cause excessive CPU usage. To curb this problem, SQLite defines the
143: SQLITE_MAX_LIKE_PATTERN_LENGTH macro as the maximum length of a
144: pattern in bytes (irrespective of encoding). The default value is
145: defined in internal header file "limits.h".
146:
147: The ICU extension LIKE implementation suffers from the same
148: problem and uses the same solution. However, since the ICU extension
149: code does not include the SQLite file "limits.h", modifying
150: the default value therein does not affect the ICU extension.
151: The default value of SQLITE_MAX_LIKE_PATTERN_LENGTH used by
152: the ICU extension LIKE operator is 50000, defined in source
153: file "icu.c".
154:
155: 3.3 Collation Sequence Security Issue
156:
157: Internally, SQLite assumes that indices stored in database files
158: are sorted according to the collation sequence indicated by the
159: SQL schema. Changing the definition of a collation sequence after
160: an index has been built is therefore equivalent to database
161: corruption. The SQLite library is not very well tested under
162: these conditions, and may contain potential buffer overruns
163: or other programming errors that could be exploited by a malicious
164: programmer.
165:
166: If the ICU extension is used in an environment where potentially
167: malicious users may execute arbitrary SQL (i.e. gears), they
168: should be prevented from invoking the icu_load_collation() function,
169: possibly using the authorisation callback.
FreeBSD-CVSweb <freebsd-cvsweb@FreeBSD.org>