Return to README.tokenizers CVS log | Up to [ELWIX - Embedded LightWeight unIX -] / embedaddon / sqlite3 / ext / fts2 |
1.1 ! misho 1: ! 2: 1. FTS2 Tokenizers ! 3: ! 4: When creating a new full-text table, FTS2 allows the user to select ! 5: the text tokenizer implementation to be used when indexing text ! 6: by specifying a "tokenizer" clause as part of the CREATE VIRTUAL TABLE ! 7: statement: ! 8: ! 9: CREATE VIRTUAL TABLE <table-name> USING fts2( ! 10: <columns ...> [, tokenizer <tokenizer-name> [<tokenizer-args>]] ! 11: ); ! 12: ! 13: The built-in tokenizers (valid values to pass as <tokenizer name>) are ! 14: "simple" and "porter". ! 15: ! 16: <tokenizer-args> should consist of zero or more white-space separated ! 17: arguments to pass to the selected tokenizer implementation. The ! 18: interpretation of the arguments, if any, depends on the individual ! 19: tokenizer. ! 20: ! 21: 2. Custom Tokenizers ! 22: ! 23: FTS2 allows users to provide custom tokenizer implementations. The ! 24: interface used to create a new tokenizer is defined and described in ! 25: the fts2_tokenizer.h source file. ! 26: ! 27: Registering a new FTS2 tokenizer is similar to registering a new ! 28: virtual table module with SQLite. The user passes a pointer to a ! 29: structure containing pointers to various callback functions that ! 30: make up the implementation of the new tokenizer type. For tokenizers, ! 31: the structure (defined in fts2_tokenizer.h) is called ! 32: "sqlite3_tokenizer_module". ! 33: ! 34: FTS2 does not expose a C-function that users call to register new ! 35: tokenizer types with a database handle. Instead, the pointer must ! 36: be encoded as an SQL blob value and passed to FTS2 through the SQL ! 37: engine by evaluating a special scalar function, "fts2_tokenizer()". ! 38: The fts2_tokenizer() function may be called with one or two arguments, ! 39: as follows: ! 40: ! 41: SELECT fts2_tokenizer(<tokenizer-name>); ! 42: SELECT fts2_tokenizer(<tokenizer-name>, <sqlite3_tokenizer_module ptr>); ! 43: ! 44: Where <tokenizer-name> is a string identifying the tokenizer and ! 45: <sqlite3_tokenizer_module ptr> is a pointer to an sqlite3_tokenizer_module ! 46: structure encoded as an SQL blob. If the second argument is present, ! 47: it is registered as tokenizer <tokenizer-name> and a copy of it ! 48: returned. If only one argument is passed, a pointer to the tokenizer ! 49: implementation currently registered as <tokenizer-name> is returned, ! 50: encoded as a blob. Or, if no such tokenizer exists, an SQL exception ! 51: (error) is raised. ! 52: ! 53: SECURITY: If the fts2 extension is used in an environment where potentially ! 54: malicious users may execute arbitrary SQL (i.e. gears), they should be ! 55: prevented from invoking the fts2_tokenizer() function, possibly using the ! 56: authorisation callback. ! 57: ! 58: See "Sample code" below for an example of calling the fts2_tokenizer() ! 59: function from C code. ! 60: ! 61: 3. ICU Library Tokenizers ! 62: ! 63: If this extension is compiled with the SQLITE_ENABLE_ICU pre-processor ! 64: symbol defined, then there exists a built-in tokenizer named "icu" ! 65: implemented using the ICU library. The first argument passed to the ! 66: xCreate() method (see fts2_tokenizer.h) of this tokenizer may be ! 67: an ICU locale identifier. For example "tr_TR" for Turkish as used ! 68: in Turkey, or "en_AU" for English as used in Australia. For example: ! 69: ! 70: "CREATE VIRTUAL TABLE thai_text USING fts2(text, tokenizer icu th_TH)" ! 71: ! 72: The ICU tokenizer implementation is very simple. It splits the input ! 73: text according to the ICU rules for finding word boundaries and discards ! 74: any tokens that consist entirely of white-space. This may be suitable ! 75: for some applications in some locales, but not all. If more complex ! 76: processing is required, for example to implement stemming or ! 77: discard punctuation, this can be done by creating a tokenizer ! 78: implementation that uses the ICU tokenizer as part of its implementation. ! 79: ! 80: When using the ICU tokenizer this way, it is safe to overwrite the ! 81: contents of the strings returned by the xNext() method (see ! 82: fts2_tokenizer.h). ! 83: ! 84: 4. Sample code. ! 85: ! 86: The following two code samples illustrate the way C code should invoke ! 87: the fts2_tokenizer() scalar function: ! 88: ! 89: int registerTokenizer( ! 90: sqlite3 *db, ! 91: char *zName, ! 92: const sqlite3_tokenizer_module *p ! 93: ){ ! 94: int rc; ! 95: sqlite3_stmt *pStmt; ! 96: const char zSql[] = "SELECT fts2_tokenizer(?, ?)"; ! 97: ! 98: rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0); ! 99: if( rc!=SQLITE_OK ){ ! 100: return rc; ! 101: } ! 102: ! 103: sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC); ! 104: sqlite3_bind_blob(pStmt, 2, &p, sizeof(p), SQLITE_STATIC); ! 105: sqlite3_step(pStmt); ! 106: ! 107: return sqlite3_finalize(pStmt); ! 108: } ! 109: ! 110: int queryTokenizer( ! 111: sqlite3 *db, ! 112: char *zName, ! 113: const sqlite3_tokenizer_module **pp ! 114: ){ ! 115: int rc; ! 116: sqlite3_stmt *pStmt; ! 117: const char zSql[] = "SELECT fts2_tokenizer(?)"; ! 118: ! 119: *pp = 0; ! 120: rc = sqlite3_prepare_v2(db, zSql, -1, &pStmt, 0); ! 121: if( rc!=SQLITE_OK ){ ! 122: return rc; ! 123: } ! 124: ! 125: sqlite3_bind_text(pStmt, 1, zName, -1, SQLITE_STATIC); ! 126: if( SQLITE_ROW==sqlite3_step(pStmt) ){ ! 127: if( sqlite3_column_type(pStmt, 0)==SQLITE_BLOB ){ ! 128: memcpy(pp, sqlite3_column_blob(pStmt, 0), sizeof(*pp)); ! 129: } ! 130: } ! 131: ! 132: return sqlite3_finalize(pStmt); ! 133: }