SQLite FTS4 Internals

And comparison with FTS5


FTS4 and FTS5
● FTS4 and FTS5 are both virtual tables that
maintain a “full text index” on their contents.
● They provide similar functionality, but
– FTS4 is released and is widely used.
– FTS5 is unreleased but incorporates a few lessons
learned during FTS4's lifetime.
● Most of these slides are about FTS4, with a few
comments regarding FTS5.
Presentation Structure
1. The FTS Index
• what it contains, the types of queries it supports, tokenizers

2. The Underlying Database Tables
• what is stored on disk, how this can be configured/optimized

3. Auxiliary Functions
• what are they, what they can do and how they might be extended

4. Administration and Tuning Parameters
• 'optimize', 'automerge', 'rebuild' and other commands – what and why

5. Common Tokens and FTS5
• how common tokens cause problems for FTS4, and why they are less
of a problem with FTS5
Part 1
1. The FTS Index
• what it contains, the types of queries it supports, tokenizers
Table Creation
● An FTS index is automatically created and
populated along with each FTS table.
● Create an FTS table using:
CREATE VIRTUAL TABLE ft USING fts4(a, b);

● Populate it using regular INSERT, UPDATE and
DELETE statements:
INSERT INTO ft(rowid, a, b) VALUES(?, ?, ?);
INSERT INTO ft(docid, a, b) VALUES(?, ?, ?);
DELETE FROM ft WHERE rowid = ?;
UPDATE ft SET a=? WHERE docid=?;
Example FTS Index
INSERT INTO ft(rowid, a, b) VALUES
(1, 'Purple Cyan', 'orange blue purple cyan.'),
(2, 'Yellow', 'Orange BLUE yellow purple yellow.'),
(3, 'Purple Cyan', 'Gold purple green.'),
(4, 'Yellow', '[red purple, grey]');

Terms (the keys of a b-tree structure) -> Doclists:
“blue” -> (1: b1) (2: b1)
“cyan” -> (1: a1 b3) (3: a1)
“gold” -> (3: b0)
“green” -> (3: b2)
“grey” -> (4: b2)
“orange” -> (1: b0) (2: b0)
“purple” -> (1: a0 b2) (2: b3) (3: a0 b1) (4: b1)
“red” -> (4: b0)
“yellow” -> (2: a0 b2 b4) (4: a0)
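How such a term-to-doclist mapping could be derived from the four example rows can be sketched in Python. This is only a model: the `tokenize` helper is an assumed stand-in for a simple case-folding tokenizer, and a plain dict takes the place of the on-disk b-tree.

```python
import re
from collections import defaultdict

# The four example rows: rowid -> (column a, column b).
ROWS = {
    1: ("Purple Cyan", "orange blue purple cyan."),
    2: ("Yellow", "Orange BLUE yellow purple yellow."),
    3: ("Purple Cyan", "Gold purple green."),
    4: ("Yellow", "[red purple, grey]"),
}

def tokenize(text):
    # Assumption: tokens are alphanumeric runs, folded to lower case.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def build_index(rows):
    # term -> {rowid: ["a0", "b2", ...]} (column letter + token position)
    index = defaultdict(dict)
    for rowid in sorted(rows):
        for col_name, text in zip("ab", rows[rowid]):
            for pos, term in enumerate(tokenize(text)):
                index[term].setdefault(rowid, []).append(f"{col_name}{pos}")
    return dict(index)

index = build_index(ROWS)
for term in sorted(index):
    print(term, "->", index[term])
```

Looking up `index["cyan"]` gives the rowids 1 and 3 together with their positions, which is all a single-term MATCH query needs.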


Example Query
● So it's easy to see how FTS answers queries
for “the set of rowid values for rows that contain
'cyan'”:
SELECT rowid FROM ft WHERE ft MATCH 'cyan'

● It searches the b-tree for “cyan”, and finds:
(1: a1 b3) (3: a1)

● So it returns rowid values 1 and 3.
Tokenizers
● A “tokenizer” extracts tokens or terms from
blocks of text. e.g. it transforms:
“orange BLUE Yellow purple yellow.”
● To (note the case folding):
“orange”, “blue”, “yellow”, “purple”, “yellow”

● FTS4 and FTS5 both have a couple of built-in
tokenizers (simple, unicode61, porter).
● And an API allowing users to implement more.
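A minimal Python sketch of what a tokenizer produces, assuming tokens are plain alphanumeric runs. The function name `simple_tokens` is made up for illustration, and the built-in tokenizers have their own, more detailed rules. Each token is reported with its position and its character offsets in the input.

```python
import re

def simple_tokens(text):
    """Yield (position, token, start, end) for each alphanumeric run.

    Assumption: case folding is just str.lower(); real tokenizers may
    treat non-ASCII characters differently.
    """
    for position, m in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        yield position, m.group().lower(), m.start(), m.end()

tokens = list(simple_tokens("orange BLUE Yellow purple yellow."))
for tok in tokens:
    print(tok)
```

The offsets are what would let a function such as snippet() map a token back to its place in the original text.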
Tokenizers
● A single FTS table has a single tokenizer*.
● Used to extract tokens from both table content
and query text.
● It's important to use the same tokenizer on
content and queries, so that a query with an
upper case “C” still works:
SELECT rowid FROM ft WHERE ft MATCH 'Cyan';

● Tokenizers may also transform terms to a more
normal form – this is “stemming”.
* Not entirely true for tables that use “languageid”
Stemmer Tokenizers
● Stemmers are language specific – the built-in
“porter” tokenizer is a stemmer for English.

● With porter, “require”, “requirement”,
“requirements”, and “required” are all
considered the same term.
Custom Stemmer Tokenizers
● A tokenizer could also map common sets of
synonyms or abbreviations to a single token.
● i.e. tokenize these strings as follows:
“1st road, Somerset” -> “first”, “road”, “somerset”
“first Rd., Lancashire” -> “first”, “road”, “lancashire”

● Then, if the user runs:
SELECT … WHERE MATCH '1st Rd.'

● The tokenizer tokenizes the query as:
“first AND road”

● Which matches both rows.
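The synonym idea can be sketched in Python as follows. The `CANONICAL` table and the `tokenize` helper are hypothetical. A real custom tokenizer would be written against the FTS tokenizer API, but the mapping step would look much the same.

```python
import re

# Hypothetical synonym/abbreviation table; real ones are application-specific.
CANONICAL = {"1st": "first", "rd": "road"}

def tokenize(text):
    # Split into alphanumeric runs, fold case, then map to canonical forms.
    terms = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    return [CANONICAL.get(t, t) for t in terms]

docs = {1: "1st road, Somerset", 2: "first Rd., Lancashire"}
index = {rowid: set(tokenize(text)) for rowid, text in docs.items()}

# The same tokenizer is applied to the query; tokens are ANDed together.
query_terms = tokenize("1st Rd.")
matches = [rowid for rowid, terms in sorted(index.items())
           if all(t in terms for t in query_terms)]
print(query_terms, matches)
```

Because content and query go through the same mapping, neither side needs to know which spelling the other used.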


More Queries: AND, OR
● As well as querying for all documents
containing a specified token, the FTS index
supports logical AND and OR operations:
SELECT rowid FROM ft WHERE ft MATCH 'yellow AND grey'
SELECT rowid FROM ft WHERE ft MATCH 'yellow OR grey'

● Retrieve doclists for each token:
“grey” -> (4: b2)
“yellow” -> (2: a0 b2 b4) (4: a0)

● For “AND”, return the intersection of the two
sets of rowids (just 4). For “OR”, the union (2
and 4).
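Since doclists are ordered by rowid, both operations can be done in a single pass over the two lists. The Python below is a sketch of that merge on bare rowid lists (positions dropped), not the FTS4 code itself.

```python
def intersect(a, b):
    """Rowids present in both sorted lists (AND)."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """Rowids present in either sorted list, without duplicates (OR)."""
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j >= len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i])
            i += 1
        elif i >= len(a) or b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:  # equal rowids: emit once, advance both
            out.append(a[i])
            i += 1
            j += 1
    return out

grey, yellow = [4], [2, 4]
print(intersect(yellow, grey), union(yellow, grey))
```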
Implicit AND operators
● If there is no operator between two tokens, an
implicit AND is inserted. Equivalent:
SELECT rowid FROM ft WHERE ft MATCH 'wal performance'
SELECT rowid FROM ft WHERE ft MATCH 'wal AND performance'

● This leads to intuitive results in UIs.
Why Implicit AND is important
● Say a document contains the text:
“The sqlite3_prepare API...”

● And the query:
SELECT rowid FROM ft WHERE ft MATCH 'sqlite3_prepare'

● Depending on the tokenizer, “sqlite3_prepare”
might be one or two tokens.
● If it is two tokens, the query is equivalent to:
... MATCH 'sqlite3 AND prepare'

● Which will match.
FTS4 Has Two Query Syntaxes
● FTS4 actually supports two slightly different
query syntaxes.
● The switch:
-DSQLITE_ENABLE_FTS3_PARENTHESIS=1
enables the new syntax, which supports
parentheses and the “NOT” operator.
● Always build with this switch!
More Queries: NOT operator
● The “NOT” operator works like an SQL
EXCEPT. This:
SELECT rowid FROM ft WHERE ft MATCH 'yellow NOT grey'

● Is “all rowids for documents that contain 'yellow'
but do not contain 'grey'”.
● Same again: Retrieve doclists for each token:
“grey” -> (4: b2)
“yellow” -> (2: a0 b2 b4) (4: a0)

● Then remove the rowids found for “grey” from
those found for “yellow”, leaving just 2.
Precedence & Parentheses
● Precedence, from tightest to loosest grouping:
– NOT
– AND
– OR
● You can use parentheses. So these are the
same:
SELECT * FROM ft WHERE ft MATCH 'yellow AND grey OR red'
SELECT * FROM ft WHERE ft MATCH 'red OR yellow AND grey'
SELECT * FROM ft WHERE ft MATCH 'red OR (yellow AND grey)'

● But this is different:
SELECT * FROM ft WHERE ft MATCH '(red OR yellow) AND grey'
More Queries: Phrases
● Can also use the index for “phrase” queries:
SELECT rowid FROM ft WHERE ft MATCH '"blue yellow"'

● FTS retrieves the doclists for each separate
token:
“blue” -> (1: b1) (2: b1)
“yellow” -> (2: a0 b2 b4) (4: a0)

● Filters as for “AND” (leaving only row 2), then
filters for the phrase match: in row 2, “blue” is
at b1 and “yellow” at b2, so the phrase matches.
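The phrase filter can be sketched in Python on the two doclists above: keep only rows in both lists, then require the second token to sit at the position right after the first, in the same column. This models the idea only and covers just two-token phrases.

```python
# Doclists from the example index: rowid -> list of (column, position).
blue = {1: [("b", 1)], 2: [("b", 1)]}
yellow = {2: [("a", 0), ("b", 2), ("b", 4)], 4: [("a", 0)]}

def phrase_match(first, second):
    """Rowids where `second` immediately follows `first` in one column."""
    hits = []
    for rowid in sorted(first.keys() & second.keys()):  # the AND filter
        following = set(second[rowid])
        # The phrase filter: same column, position exactly one greater.
        if any((col, pos + 1) in following for col, pos in first[rowid]):
            hits.append(rowid)
    return hits

print(phrase_match(blue, yellow))
```

Swapping the arguments asks for the phrase "yellow blue", which no row contains.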
More Queries: NEAR
● NEAR queries are similar:
SELECT rowid FROM ft WHERE ft MATCH 'orange NEAR cyan'

● As are queries that restrict matches to a
specified column:
SELECT rowid FROM ft WHERE ft MATCH 'b:cyan'

● All implemented by extra filtering after index
entries have been loaded from disk.
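Both filters can be sketched in Python on the example doclists. The sketch assumes that NEAR means "same column, at most `max_gap` tokens in between", with a default of 10 as in FTS4; the helper names are made up.

```python
# Doclists from the example index: rowid -> list of (column, position).
orange = {1: [("b", 0)], 2: [("b", 0)]}
cyan = {1: [("a", 1), ("b", 3)], 3: [("a", 1)]}

def near(left, right, max_gap=10):
    """Rowids where both terms occur in one column, close together."""
    hits = []
    for rowid in sorted(left.keys() & right.keys()):
        if any(lc == rc and abs(lp - rp) - 1 <= max_gap
               for lc, lp in left[rowid]
               for rc, rp in right[rowid]):
            hits.append(rowid)
    return hits

def in_column(doclist, column):
    """Rowids with at least one hit in the given column (like 'b:cyan')."""
    return sorted(rowid for rowid, hits in doclist.items()
                  if any(col == column for col, _ in hits))

print(near(orange, cyan), in_column(cyan, "b"))
```

In row 1, “orange” is at b0 and “cyan” at b3, so two tokens lie between them. The row passes with the default gap but not with `max_gap=1`.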
Prefix Queries
● We can also do prefix queries:
SELECT rowid FROM ft WHERE ft MATCH 'g*'

● “all rows that contain at least one term that
begins with 'g'”
● To answer this, FTS scans and merges the
doclists in the range “gold” to “grey”:
“blue” -> (1: b1) (2: b1)
“cyan” -> (1: a1 b3) (3: a1)
“gold” -> (3: b0)
“green” -> (3: b2)
“grey” -> (4: b2)
“orange” -> (1: b0) (2: b0)
“purple” -> (1: a0 b2) (2: b3) (3: a0 b1) (4: b1)
“red” -> (4: b0)
“yellow” -> (2: a0 b2 b4) (4: a0)
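A Python sketch of the scan-and-merge step. A sorted list of terms stands in for the b-tree and `bisect` for the seek to the first term with the prefix; positions are left out of the doclists for brevity.

```python
from bisect import bisect_left

# term -> sorted rowids, taken from the example index.
INDEX = {
    "blue": [1, 2], "cyan": [1, 3], "gold": [3], "green": [3], "grey": [4],
    "orange": [1, 2], "purple": [1, 2, 3, 4], "red": [4], "yellow": [2, 4],
}
TERMS = sorted(INDEX)

def prefix_query(prefix):
    """Return (terms scanned, merged rowids) for a 'prefix*' query."""
    start = bisect_left(TERMS, prefix)      # seek to the first candidate
    scanned, rowids = [], set()
    for term in TERMS[start:]:
        if not term.startswith(prefix):     # left the range: stop scanning
            break
        scanned.append(term)
        rowids.update(INDEX[term])          # merge this term's doclist
    return scanned, sorted(rowids)

print(prefix_query("g"))
```

The cost grows with the number of terms in the range and the size of their doclists, which is what a prefix index avoids.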
Prefix Indexes
● Scanning and merging doclists can be slow.
● The “prefix=” option can be used to create
prefix indexes. e.g.:
CREATE VIRTUAL TABLE ft USING fts4(a, b, prefix="1");

● Then, as well as the main term index, FTS
maintains an index of 1-character prefixes:
“b” -> (1: b1) (2: b1)
“c” -> (1: a1 b3) (3: a1)
“g” -> (3: b0 b2) (4: b2)
“o” -> (1: b0) (2: b0)
“p” -> (1: a0 b2) (2: b3) (3: a0 b1) (4: b1)
“r” -> (4: b0)
“y” -> (2: a0 b2 b4) (4: a0)
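The 1-character prefix index can be derived from the main term index by merging the doclists of all terms that share a prefix. A Python sketch, assuming (as a simplification) that terms shorter than the prefix length are skipped:

```python
from collections import defaultdict

# Main term index from the example: term -> {rowid: [positions]}.
INDEX = {
    "blue": {1: ["b1"], 2: ["b1"]},
    "cyan": {1: ["a1", "b3"], 3: ["a1"]},
    "gold": {3: ["b0"]},
    "green": {3: ["b2"]},
    "grey": {4: ["b2"]},
    "orange": {1: ["b0"], 2: ["b0"]},
    "purple": {1: ["a0", "b2"], 2: ["b3"], 3: ["a0", "b1"], 4: ["b1"]},
    "red": {4: ["b0"]},
    "yellow": {2: ["a0", "b2", "b4"], 4: ["a0"]},
}

def build_prefix_index(index, n):
    """Merge doclists of all terms sharing the same n-character prefix."""
    merged = defaultdict(dict)
    for term, doclist in index.items():
        if len(term) < n:
            continue  # assumption: too-short terms are not indexed here
        for rowid, positions in doclist.items():
            merged[term[:n]].setdefault(rowid, []).extend(positions)
    # Sorting strings like "b0" works here only because positions are one digit.
    return {p: {r: sorted(pos) for r, pos in sorted(d.items())}
            for p, d in merged.items()}

one_char = build_prefix_index(INDEX, 1)
print(one_char["g"])
```

With this in place a query for 'g*' is a single lookup rather than a scan over “gold”, “green” and “grey”.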
Prefix Indexes
● Multiple prefix indexes can be added:
CREATE VIRTUAL TABLE ft USING fts4(a, b, prefix="1,2,3");

● Each additional prefix index is between half and
the same size on disk as the main term index.
● Adding a prefix index reduces the CPU used by
prefix queries significantly. And IO by a little.
Part 2
2. The Underlying Database Tables
• what is stored on disk, how this can be configured/optimized
Data stored on disk
● For each virtual table, FTS4 creates between 2
and 5 native tables on disk (IPK below is short
for INTEGER PRIMARY KEY):
sqlite> CREATE VIRTUAL TABLE ft USING fts4(a, b);
sqlite> .schema
CREATE VIRTUAL TABLE ft USING fts4(a, b);
CREATE TABLE 'ft_content' (docid IPK, 'c0a', 'c1b');
CREATE TABLE 'ft_segments'(blockid IPK, block BLOB);
CREATE TABLE 'ft_segdir' (level INTEGER, idx INTEGER, ....
CREATE TABLE 'ft_docsize'(docid IPK, size BLOB);
CREATE TABLE 'ft_stat'(id IPK, value BLOB);
Data stored on disk
● Big tables:
– The “%_content” table stores the actual content inserted
into the table, verbatim.
– The “%_segments” table stores (most of) the FTS index
data.
– The “%_docsize” table stores the size, in tokens, of each
column value in the table. This is used by matchinfo().
● Small tables:
– %_segdir stores a small amount of FTS index data.
– %_stat contains a single record – the sum of the
%_docsize values.
Example 1: Enron Database
● Consists of 517424 separate emails (1.4 GiB).
● sqlite3_analyzer says:
Table         1024 byte pages   % of DB
%_content     1524691           65.5%
%_segments    797885            34.3%
%_docsize     6105              0.25%
%_segdir      7                 0.0%

● After adding a prefix index (prefix=1), the FTS
index is now 1.38 times as large:
Table         1024 byte pages   % of DB
%_content     1524691           57.9%
%_segments    1103621           41.9%
%_docsize     6105              0.23%
%_segdir      16                0.0%
Example 2: POI Database
● 1.3 million rows, 28 columns, but just a few
tokens per row (most columns contain NULL):
Table         1024 byte pages   % of DB
%_content     101246            55.6%
%_segments    30803             16.9%
%_docsize     50035             27.5%
%_segdir      4                 0.0%

● The %_docsize table (unusually large here) is
only used by the matchinfo 'l' option. It can be
omitted with:
CREATE VIRTUAL TABLE ft USING fts4(a, b, matchinfo=fts3);
Compressing the %_content table
● Each column value stored in an FTS4 table
may be individually compressed.
● Application provides SQL scalar functions to
compress and uncompress values.
● Compress function takes one argument –
returns compressed version.
● Uncompress function also takes one argument
– returns uncompressed version.
Compressing the %_content table
● Configuring an FTS4 table to use
compress/uncompress scalar functions:
CREATE VIRTUAL TABLE ft USING fts4(
a, b, compress=cmp, uncompress=uncmp
);

● Then, instead of reading and writing with:
SELECT c1 AS a, c2 AS b ...
INSERT INTO %_content VALUES($rowid, ?, ?);

● It uses:
SELECT uncmp(c1) AS a, uncmp(c2) AS b ...
INSERT INTO %_content VALUES($rowid, cmp(?), cmp(?));

● May not help if using ZipVFS already.
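A pair of one-argument scalar functions of this kind could be registered from Python roughly as below. This sketch uses zlib and an ordinary table named `content` as a stand-in for %_content, so it runs without FTS4. The names `cmp` and `uncmp` follow the CREATE statement above and everything else is an assumption.

```python
import sqlite3
import zlib

def cmp(value):
    # Compress one column value; NULL is passed through unchanged.
    return None if value is None else zlib.compress(value.encode("utf-8"))

def uncmp(blob):
    # Inverse of cmp(): BLOB in, original text out.
    return None if blob is None else zlib.decompress(blob).decode("utf-8")

db = sqlite3.connect(":memory:")
db.create_function("cmp", 1, cmp)
db.create_function("uncmp", 1, uncmp)

# Ordinary table standing in for the %_content shadow table.
db.execute("CREATE TABLE content(docid INTEGER PRIMARY KEY, c1, c2)")
db.execute("INSERT INTO content VALUES(?, cmp(?), cmp(?))",
           (1, "Purple Cyan", "orange blue purple cyan."))

stored = db.execute("SELECT c1 FROM content WHERE docid=1").fetchone()[0]
a, b = db.execute(
    "SELECT uncmp(c1), uncmp(c2) FROM content WHERE docid=1").fetchone()
print(type(stored).__name__, a, b)
```

Values are compressed on the way in and uncompressed on the way out. For very short strings the compressed form can be larger than the original, so a real compress function might choose to store those unchanged.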


Contentless Tables
● The %_content table can be left out altogether,
as follows:
CREATE VIRTUAL TABLE ft USING fts4(a, b, content='');

● Works like any FTS table, except:
– UPDATE and DELETE are not supported
(because %_content is required to determine
which entries need to be removed from FTS
index).
– Reading from any column other than “rowid”
returns NULL.
External Content Tables
● FTS4 can also index content stored in regular
tables – but the index is not kept up to date
automatically.
CREATE TABLE tbl(a, b);
CREATE VIRTUAL TABLE ft USING fts4(a, b, content='tbl');

● Whenever content values are required, FTS tries
to obtain them with:
SELECT a, b FROM tbl WHERE rowid=?

● The same thing it would do if the %_content table
did exist.
External Content Tables
● To insert a row (order doesn't matter):
INSERT INTO tbl(rowid, a, b) VALUES(?,?,?);
INSERT INTO ft (rowid, a, b) VALUES(?,?,?);

● To delete a row (order matters!):
DELETE FROM ft WHERE rowid=?;
DELETE FROM tbl WHERE rowid=?;

● To update a row (order matters!):
UPDATE ft SET a=?, b=? WHERE rowid=?;
UPDATE tbl SET a=?, b=? WHERE rowid=?;
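Why the order matters for DELETE can be illustrated with a toy Python model. It assumes the index must read the old row text back from the content table to know which entries to remove. The class and its methods are hypothetical and much simpler than FTS4.

```python
import re

def tokens(*values):
    return {t.lower() for v in values for t in re.findall(r"[A-Za-z0-9]+", v)}

class ExternalContentIndex:
    """Toy model: the index holds term -> rowids; text lives in `table`."""

    def __init__(self, table):
        self.table = table   # rowid -> (a, b): the external content table
        self.terms = {}      # term -> set of rowids

    def insert(self, rowid, a, b):
        for t in tokens(a, b):
            self.terms.setdefault(t, set()).add(rowid)

    def delete(self, rowid):
        old = self.table.get(rowid)   # like SELECT a, b FROM tbl WHERE rowid=?
        if old is None:
            return                    # row already gone: nothing can be removed
        for t in tokens(*old):
            self.terms.get(t, set()).discard(rowid)

    def match(self, term):
        return sorted(self.terms.get(term, ()))

def run(delete_index_first):
    table = {1: ("Yellow", "red purple grey")}
    ft = ExternalContentIndex(table)
    ft.insert(1, *table[1])
    if delete_index_first:
        ft.delete(1)
        del table[1]
    else:
        del table[1]
        ft.delete(1)
    return ft.match("yellow")

right_order = run(delete_index_first=True)
wrong_order = run(delete_index_first=False)
print(right_order, wrong_order)
```

In the wrong order the model is left with a stale index entry for a row that no longer exists.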
External Content Tables
● The external content table doesn't actually have
to be a table. Just something (a table, a view, a
virtual table) that supports the following:
– SELECT * FROM obj WHERE rowid=?;
– SELECT * FROM obj ORDER BY rowid ASC;
– SELECT * FROM obj ORDER BY rowid DESC;
The notindexed= option
● Entire columns can be omitted from the FTS
index using the “notindexed” option:
CREATE VIRTUAL TABLE ft USING fts4(a, b, notindexed='a');

● Multiple “notindexed” options are permitted.
● Works with external content tables.
● And contentless tables too (not really useful).
Part 3
3. Auxiliary Functions
• what are they, what they can do and how they might be extended
Auxiliary Functions
● Functions run as part of FTS queries that
operate on:
– the position-lists for search terms
– the original document text,
– document sizes,
– and other things.
● FTS4 has “offsets”, “snippet” and “matchinfo”
● FTS5 has an API that allows applications to
implement custom auxiliary functions.
Auxiliary Function Example
● Say the query is:
SELECT snippet(ft) FROM ft WHERE ft MATCH 'purple AND yellow'

● snippet() returns a fragment of the document
text, located using the position lists in the
doclists:
“purple” -> (1: a0 b2) (2: b3) (3: a0 b1) (4: b1)
“yellow” -> (2: a0 b2 b4) (4: a0)

● Snippet() also accesses the original document
text (from %_content table) and the tokenizer.
The matchinfo() function
● Matchinfo exposes some of the data available to
aux. functions as an array of integers. e.g.
SELECT matchinfo(ft, 'ly') FROM ft WHERE ft MATCH 'red blue'

● Return value is an SQL blob – an array of 32-bit
integers.
● Each character in the second argument adds
one or more integers to the output blob.
The matchinfo() function
● The 'l' flag appends the size of each column in
tokens to the output.
● For each phrase/column combination, the 'y' flag
appends the number of phrase hits in the column
to the output. So:
SELECT matchinfo(ft, 'ly') FROM ft WHERE ft MATCH 'red blue'

● Returns a blob of 6 integers (2 from 'l', 4 from 'y').
● And there are many other flags too...
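On the application side such a blob could be unpacked as in this Python sketch. The sample blob is made up (it is not output from the example table), and the layout assumed for 'y' is one value per phrase/column pair, grouped by phrase.

```python
import struct

def decode_matchinfo(blob):
    """Split a matchinfo() blob into unsigned 32-bit integers.

    Assumption: integers are in the machine's native byte order.
    """
    return list(struct.unpack(f"={len(blob) // 4}I", blob))

def split_ly(values, n_cols, n_phrases):
    """Interpret the integers produced by matchinfo(ft, 'ly')."""
    lengths = values[:n_cols]                 # 'l': tokens per column
    hits = values[n_cols:]                    # 'y': hits per phrase/column
    per_phrase = [hits[p * n_cols:(p + 1) * n_cols] for p in range(n_phrases)]
    return lengths, per_phrase

# Made-up sample for a 2-column table and the 2-phrase query 'red blue'.
sample = struct.pack("=6I", 1, 5, 0, 1, 0, 2)
values = decode_matchinfo(sample)
lengths, per_phrase = split_ly(values, n_cols=2, n_phrases=2)
print(values, lengths, per_phrase)
```

A ranking function would typically combine the per-phrase hit counts with the column lengths.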
The matchinfo() function
● Matchinfo allows FTS to be extended in similar,
but more limited, ways to adding new aux.
functions – for ranking and so on.
● Tip: If you're using the 'x' option to matchinfo,
take a look at recently added option 'y'. 'y'
provides similar information, but is quicker.
Auxiliary Functions
● In general, it is easier and safer to add auxiliary
functions or matchinfo() modes than it is to add
other features to FTS4.
Part 4
4. Administration and Tuning Parameters
• 'optimize', 'automerge', 'rebuild' and other commands – what and why
Multiple Tree Structures
● Instead of a single tree structure, FTS uses an
array of trees
● This is to work around the “write amplification”
problem (see also – OTA).
● A new tree is written either:
– At the end of each transaction, or
– For large transactions, roughly once for each 1MB
of FTS index data
● When querying, FTS has to query all trees in
the array and merge the result.
Multiple Tree Structures

● New trees are added to level 0.
● Once there are 16 trees in level 0, their
contents are merged into a single big level 1
tree (and the original level 0 trees discarded).
● And once there are 16 trees in level 1, a level 2
tree... And so on.
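The level scheme behaves like a base-16 counter, which a short Python simulation makes visible. It only counts trees per level; tree contents and sizes are ignored, and the function names are made up.

```python
FANOUT = 16  # trees per level before they are merged

def add_tree(levels):
    """Add one new level-0 tree, cascading merges as levels fill up."""
    level = 0
    while True:
        if len(levels) <= level:
            levels.append(0)
        levels[level] += 1
        if levels[level] < FANOUT:
            return
        levels[level] = 0   # the 16 trees are merged and discarded...
        level += 1          # ...producing one new tree on the next level

def trees_after(n_writes):
    """Trees per level (index 0 = level 0) after n_writes new trees."""
    levels = []
    for _ in range(n_writes):
        add_tree(levels)
    return levels

for n in (15, 16, 300):
    levels = trees_after(n)
    print(n, levels, "trees to query:", sum(levels))
```

After 300 new trees the model holds 12 level-0, 2 level-1 and 1 level-2 tree, so a query has 15 trees to consult and merge.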
FTS Index Details: 'optimize'
● Querying multiple trees is slower than querying
a single tree.
● To merge all trees in an FTS index to a single
tree:
INSERT INTO ft(ft) VALUES('optimize');

● 'optimize' tends to help queries that retrieve
smaller doclists more than others.
The 'automerge' setting 1
● When a level reaches 16 trees, FTS
immediately merges them together into a single
tree.
● If the input trees are large, this might take a
long time.
● From the user's point of view, this means that
an unlucky FTS write might inexplicably take a
very long time.
The 'automerge' setting 2
● With automerge, after creating a new Level 0
tree, FTS (sometimes) does some work
towards merging existing trees too.
● New trees are still added to level 0.
● After adding a level 0 tree, also do some work
merging (say) level 1 trees into a level 2 tree.
● FTS can query the partially merged trees.
The 'automerge' setting 3
● Automerge prevents a level from ever having
as many as 16 trees, avoiding the problems
associated with large merge operations.
● Set automerge as follows:
INSERT INTO ft(ft) VALUES('automerge=4');

● The parameter (4) is the minimum number of
trees to merge at a time.
● A value of 0 turns automerge off. As does 16 or
greater.
The 'rebuild' command
● The 'rebuild' command rebuilds the FTS index
based on the current contents of the FTS table.
INSERT INTO ft(ft) VALUES('rebuild');

● For “external content” tables, the current
contents are read from the external table.
● Contentless FTS tables may not be rebuilt.
● This is useful when:
– The index may be corrupt, or
– The tokenizer has changed somehow.
Part 5
5. Common Tokens and FTS5
• how common tokens cause problems for FTS4, and why they are less
of a problem with FTS5
Large Doclists in FTS4
● Consider:
... poiFtsTable MATCH 'am faltenbach'

● The two doclists are loaded and merged to
determine the query result.
● But:
– Doclist for “am” contains 35,000 entries.
– Doclist for “faltenbach” contains 2 or 3.
● Making the query much, much slower than just:
... poiFtsTable MATCH 'faltenbach'
Large Doclists in FTS4
● Each Doclist in FTS4 is stored as a single blob.
● May only be read sequentially.
● Can be read incrementally, so:
... poiFtsTable MATCH 'am' LIMIT 10
can run without loading much data.
● But not much else can be done without loading
the entire doclist into memory.
● Large doclists cause many performance
problems.
Large Doclists in FTS5
● FTS4: the doclist for “am” is a single large blob.
● FTS5: the doclist for “am” is divided into a
sequence of blobs, and there is a b-tree to index
it by docid.
Large Doclists in FTS5
● So, when querying for:
SELECT count(*) FROM poiFtsTable
WHERE poiFtsTable MATCH 'am faltenbach'

● FTS5 effectively loads the small doclist for
'faltenbach' and then queries the b-tree to
check which of them also match 'am'.
                     FTS4                   FTS5
Memory Used          301808 (max 446392)    120704 (max 127264)
Largest Allocation   136829                 64000
Cache Misses         151                    25
Pager Heap Usage     195192                 33912
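The difference between the two strategies can be sketched with a toy cost model in Python. `bisect` on a sorted list stands in for seeking the docid b-tree. The doclist contents are invented and the counts are only indicative, not a model of either on-disk format.

```python
from bisect import bisect_left

def fts4_style(small, large):
    """Decode the whole large doclist, then intersect."""
    entries_read = len(small) + len(large)
    large_set = set(large)
    return [d for d in small if d in large_set], entries_read

def fts5_style(small, large):
    """For each docid in the small doclist, seek into the large one."""
    hits, seeks = [], 0
    for docid in small:
        seeks += 1
        k = bisect_left(large, docid)   # stands in for a b-tree seek
        if k < len(large) and large[k] == docid:
            hits.append(docid)
    return hits, seeks

# Invented data: a very common token and a rare one.
am = list(range(0, 70000, 2))           # 35,000 docids
faltenbach = [1234, 40001, 69998]       # 3 docids

print(fts4_style(faltenbach, am))
print(fts5_style(faltenbach, am))
```

Both return the same rows, but the first touches every entry of the large doclist while the second does one seek per entry of the small one.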
Another large doclist problem
● Say a table contains:
poiName              Country
Kath. Kindergarten   Deutschland
Deutsch Bank         Deutschland
Jim Knopf            Deutschland
Velo Shop Well       Deutschland
And many more rows...

● And the query is for 'poiName: de*'
● FTS4 and FTS5 both have to do a linear scan
of the huge doclist for 'de*'.
● No solution yet for this one.
Finally...
● An FTS table maintains an FTS index mapping from
each term to a list of term occurrences.
● This can be queried for terms, prefixes and
phrases. AND, OR, NOT and NEAR are supported.
● Auxiliary functions do stuff with the position list data
for each row (and sometimes all rows).
● There are actually multiple trees on disk.
● Large doclists are something to watch out for.
