
Practical full-text search in MySQL

Bill Karwin
MySQL University 2009-12-3

Me

20+ years experience
Application/SDK developer
Support, Training, Proj Mgmt
C, Java, Perl, PHP

SQL maven
MySQL, PostgreSQL, InterBase
Oracle, SQL Server, IBM DB2, SQLite

Community contributor
Zend Framework

Full Text Search

In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.
http://www.flickr.com/photos/tryingyouth/

Test Data

StackOverflow.com data dump, exported October 2009
1.5 million tuples
~1 Gigabyte

StackOverflow ER diagram

searchable text

Naive Searching
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
- Jamie Zawinski

Accuracy issue

Irrelevant or false matching words ('one', 'money', 'prone', etc.):
body LIKE '%one%'

Regular expressions in MySQL support escapes for word boundaries:
body RLIKE '[[:<:]]one[[:>:]]'
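
The difference between the two predicates can be tried outside the database. The following is a minimal Java sketch, not MySQL's implementation: the class and method names are made up for illustration, Java's `\b` stands in for MySQL's `[[:<:]]` and `[[:>:]]` markers, and the comparison is case-sensitive, which MySQL's default collations are not.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class; names are illustrative and not part of the original slides.
final class WordMatchDemo {
    // Like body LIKE '%one%': true if the term occurs anywhere, even inside a longer word.
    static boolean substringMatch(String body, String term) {
        return body.contains(term);
    }

    // Like body RLIKE '[[:<:]]one[[:>:]]': the term must stand alone as a whole word.
    static boolean wholeWordMatch(String body, String term) {
        return Pattern.compile("\\b" + Pattern.quote(term) + "\\b").matcher(body).find();
    }

    public static void main(String[] args) {
        List<String> bodies = List.of("money talks", "prone to errors", "pick one answer");
        for (String body : bodies) {
            System.out.println(body
                    + " | substring: " + substringMatch(body, "one")
                    + " | whole word: " + wholeWordMatch(body, "one"));
        }
    }
}
```

All three sample strings pass the substring test, but only the last one contains 'one' as a separate word.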

Performance issue

LIKE with wildcards:
SELECT * FROM Posts WHERE body LIKE '%performance%'
time: 22 sec

POSIX regular expressions:
SELECT * FROM Posts WHERE body RLIKE 'performance'
time: 108 sec

Why so slow?
CREATE TABLE telephone_book (
  full_name VARCHAR(50)
);
CREATE INDEX name_idx ON telephone_book (full_name);
INSERT INTO telephone_book VALUES
  ('Riddle, Thomas'),
  ('Thomas, Dean');

Why so slow?

Search for all with last name Thomas uses the index:
SELECT * FROM telephone_book WHERE full_name LIKE 'Thomas%'

Search for all with first name Thomas doesn't use the index:
SELECT * FROM telephone_book WHERE full_name LIKE '%Thomas'

Indexes don't help searching for substrings
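
Why the index helps one query and not the other can be sketched with any sorted structure. The Java sketch below uses a TreeSet as a stand-in for the ordered index; that substitution is an assumption made for illustration, not how MySQL stores name_idx, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical demo class; a TreeSet plays the role of a sorted index on full_name.
final class SortedIndexDemo {
    // Prefix search: seek to the first key >= prefix, then read in order until a key stops matching.
    static List<String> prefixSearch(NavigableSet<String> index, String prefix) {
        List<String> hits = new ArrayList<>();
        for (String key : index.tailSet(prefix, true)) {
            if (!key.startsWith(prefix)) {
                break; // keys are sorted, so no later key can start with the prefix
            }
            hits.add(key);
        }
        return hits;
    }

    // Suffix search: the sort order says nothing about how a key ends, so every key is examined.
    static List<String> suffixSearch(NavigableSet<String> index, String suffix) {
        List<String> hits = new ArrayList<>();
        for (String key : index) {
            if (key.endsWith(suffix)) {
                hits.add(key);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        NavigableSet<String> index = new TreeSet<>(List.of("Riddle, Thomas", "Thomas, Dean"));
        System.out.println("last name Thomas:  " + prefixSearch(index, "Thomas"));
        System.out.println("first name Thomas: " + suffixSearch(index, "Thomas"));
    }
}
```

The prefix search touches only the matching range of keys, while the suffix search has to visit all of them, which is the same full scan a leading-wildcard LIKE forces.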

Solutions
1. Full-Text Indexing in SQL
2. Sphinx Search
3. Apache Lucene
4. Inverted Index
5. Search Engine Service

MySQL FULLTEXT Index

Special index type for MyISAM
Integrated with SQL queries
Balances features vs. speed vs. space

MySQL FULLTEXT: Indexing

CREATE FULLTEXT INDEX PostText ON Posts(title, body, tags);

time: 15 min 6 sec

MySQL FULLTEXT: Index Caching

SET GLOBAL key_buffer_size = 600*1024*1024;
LOAD INDEX INTO CACHE Posts INDEX(PostText);

time: 11 sec

MySQL FULLTEXT: Querying

SELECT * FROM Posts
WHERE MATCH( column(s) ) AGAINST( 'query pattern' );

MATCH() must include all columns of the index, in the order defined

MySQL FULLTEXT: Natural Language Mode

Searches concepts with free text queries:

SELECT * FROM Posts
WHERE MATCH( title, body, tags )
  AGAINST('improving mysql performance' IN NATURAL LANGUAGE MODE)
LIMIT 100;

time with index: 80 milliseconds

MySQL FULLTEXT: Boolean Mode

Searches words using mini-language:

SELECT * FROM Posts
WHERE MATCH( title, body, tags )
  AGAINST('+mysql +performance' IN BOOLEAN MODE);

time with index: 50 milliseconds
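
In the boolean mini-language, a leading + marks a word that must be present, so '+mysql +performance' asks for rows containing both words. A rough way to picture that is intersecting per-word document lists. The Java sketch below is a toy in-memory model under that assumption, not MySQL's FULLTEXT implementation: it ignores stopwords, minimum word length and relevance ranking, and the class, method names and sample documents are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index; it only illustrates the "all words required" semantics.
final class RequiredTermsDemo {
    // word -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(int docId, String text) {
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Ids of documents containing every given word, comparable to '+a +b' in boolean mode.
    Set<Integer> matchAll(String... words) {
        Set<Integer> result = null;
        for (String word : words) {
            Set<Integer> docs = postings.getOrDefault(word.toLowerCase(Locale.ROOT), Set.of());
            if (result == null) {
                result = new TreeSet<>(docs); // start from the first word's documents
            } else {
                result.retainAll(docs);       // keep only documents that also contain this word
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        RequiredTermsDemo index = new RequiredTermsDemo();
        index.add(1, "Improving MySQL performance");
        index.add(2, "MySQL backup strategies");
        index.add(3, "Java performance tuning");
        System.out.println(index.matchAll("mysql", "performance"));
    }
}
```

Only document 1 contains both words, so it is the only id left after the intersection.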

Lucene

Apache Project since 2001
Apache License
Java implementation
Ports exist for other languages:
Lucy (C)
Lucene.NET (C#)
Zend_Search_Lucene (PHP)
PyLucene (Python)
Plucene (Perl)
Ferret (Ruby)

Lucene: How to use

1. Add documents to index
2. Parse query
3. Execute query

Lucene: Creating an index

Programmatic solution in Java...

time: 6 minutes, 50 seconds

Lucene: Indexing

// run any SQL query
String url = "jdbc:mysql://localhost/stackoverflow?" + "user=myappuser&password=xxxx";
Class.forName("com.mysql.jdbc.Driver");
// user and password are already part of the URL
Connection con = DriverManager.getConnection(url);
String sql = "SELECT PostId, Title, Body, Tags FROM Posts";
// cast to the Connector/J statement class so rows can be streamed instead of buffered
com.mysql.jdbc.Statement stmt = (com.mysql.jdbc.Statement) con.createStatement();
stmt.enableStreamingResults();
ResultSet rs = stmt.executeQuery(sql);

// open Lucene index writer
IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
  new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);

Lucene: Indexing

// loop over SQL result
while (rs.next()) {
  // each row is a Document with four Fields
  Document doc = new Document();
  doc.add(new Field("PostId", rs.getString("PostId"), Field.Store.YES, Field.Index.NO));
  doc.add(new Field("Title", rs.getString("Title"), Field.Store.YES, Field.Index.ANALYZED));
  doc.add(new Field("Body", rs.getString("Body"), Field.Store.YES, Field.Index.ANALYZED));
  doc.add(new Field("Tags", rs.getString("Tags"), Field.Store.YES, Field.Index.ANALYZED));
  writer.addDocument(doc);
}

// finish and close index
writer.optimize();
writer.close();

Lucene: Querying

Parse a Lucene query:

// define fields
String[] fields = new String[3];
fields[0] = "Title";
fields[1] = "Body";
fields[2] = "Tags";
// parse search query
Query q = new MultiFieldQueryParser(fields, new StandardAnalyzer()).parse("performance");

Execute the query:

Searcher s = new IndexSearcher(indexDirectory, true);
Hits h = s.search(q);

time: 120 milliseconds

Sphinx Search

Started in 2001
GPLv2 license
Good database integration:
SphinxSE storage engine for MySQL

Sphinx Search: How to use

1. Edit configuration file
2. Index the data
3. Query the index
4. Issues

Sphinx Search: sphinx.conf

source stackoverflowsrc
{
  type = mysql
  sql_host = localhost
  sql_user = myappuser
  sql_pass = xxxx
  sql_db = stackoverflow
  sql_query = SELECT PostId, Title, Body, Tags FROM Posts
  sql_query_info = SELECT * FROM Posts WHERE PostId=$id
}

Sphinx Search: sphinx.conf

index stackoverflow
{
  source = stackoverflowsrc
  path = /opt/local/var/db/sphinx/stackoverflow
}

Sphinx Search: Building index

indexer -c sphinx.conf stackoverflow

collected 1517638 docs, 1021.3 MB
sorted 171.5 Mhits, 100.0% done
total 1517638 docs, 1021342525 bytes
total 147.060 sec, 6945093.00 bytes/sec, 10319.88 docs/sec

time: 2 min 27 sec

Sphinx Search: Querying index

search -c sphinx.conf -i stackoverflow -b "sql & performance"

time: 12 milliseconds

Sphinx Search: Issues

Cost to update index = cost to build index
Build a main index plus a delta index for recent changes
Merge indexes periodically (much less costly)
But not all data fits into this model; e.g. good for a forum, but bad for a wiki

Inverted Index

many-to-many relationship for Posts and words

[Diagram: Posts, PostTags, Tags (searchable words)]

Inverted index: Updated ER Diagram

[Figure: updated ER diagram with the new tables]

Inverted index: Data definition

CREATE TABLE Tags (
  TagId SERIAL PRIMARY KEY,
  Tag VARCHAR(50) NOT NULL,
  UNIQUE KEY (Tag)
);

CREATE TABLE PostTags (
  PostId INT NOT NULL,
  -- SERIAL is BIGINT UNSIGNED, so the referencing column uses the same type
  TagId BIGINT UNSIGNED NOT NULL,
  PRIMARY KEY (PostId, TagId),
  FOREIGN KEY (PostId) REFERENCES Posts (PostId),
  FOREIGN KEY (TagId) REFERENCES Tags (TagId)
);

Inverted index: Indexing

1. Query all Posts.Tags strings: <mysql><search><performance>
2. Loop over tag strings
3. Dump two CSV files: Tags.csv, PostTags.csv
   time: 23.5 seconds
4. Load CSV files with mysqlimport
   time: 5.2 seconds
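
Steps 2 and 3 above can be sketched as follows. This is a minimal Java illustration, not the script used for the timings: the class and method names are hypothetical, the post ids in main are made-up sample values, the rows are kept in memory rather than written to Tags.csv and PostTags.csv, and tags are assumed to contain no commas or quotes, so no CSV escaping is done.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that turns Posts.Tags strings into rows for the two CSV files.
final class TagCsvBuilder {
    private static final Pattern TAG = Pattern.compile("<([^<>]+)>");

    private final Map<String, Integer> tagIds = new LinkedHashMap<>(); // Tag -> TagId, first-seen order
    private final Set<String> postTagRows = new LinkedHashSet<>();     // "PostId,TagId", no duplicates

    // Split one tag string such as <mysql><search><performance> and record its tags.
    void addPost(int postId, String tagString) {
        Matcher m = TAG.matcher(tagString);
        while (m.find()) {
            String tag = m.group(1);
            Integer tagId = tagIds.get(tag);
            if (tagId == null) {
                tagId = tagIds.size() + 1; // assign the next sequential id to a new tag
                tagIds.put(tag, tagId);
            }
            postTagRows.add(postId + "," + tagId);
        }
    }

    // Lines for Tags.csv in the form TagId,Tag.
    List<String> tagRows() {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> e : tagIds.entrySet()) {
            rows.add(e.getValue() + "," + e.getKey());
        }
        return rows;
    }

    // Lines for PostTags.csv in the form PostId,TagId.
    List<String> postTagRows() {
        return new ArrayList<>(postTagRows);
    }

    public static void main(String[] args) {
        TagCsvBuilder builder = new TagCsvBuilder();
        builder.addPost(1, "<mysql><search><performance>");
        builder.addPost(2, "<mysql><indexing>");
        builder.tagRows().forEach(System.out::println);
        builder.postTagRows().forEach(System.out::println);
    }
}
```

Assigning the ids in the application keeps the two files consistent with each other, so both can be bulk-loaded without looking up generated keys in between.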

Inverted index: Querying

SELECT p.* FROM Posts p
JOIN PostTags pt USING (PostId)
JOIN Tags t USING (TagId)
WHERE t.Tag = 'performance';

time: 250 milliseconds

Inverted Index: Is it right for you?

Best for searching selected words
Simple, portable, standard SQL
Not as fast as specialized technology, but far better than using LIKE

Search Engine Services

Search engine services: Google Custom Search Engine

http://www.google.com/cse/
even big web sites use this solution

DEMO
http://www.karwin.com/demo/gcse-demo.html

Search engine services: Is it right for you?

Your site is public and allows external indexing
Search is a non-critical feature for you
Search results are satisfactory
You need to offload search processing

Comparison: Time to Build Index

LIKE expression    none
MySQL FULLTEXT     15 min
Apache Lucene      6 min 50 sec
Sphinx Search      2 min 27 sec
Inverted index     28 sec
Google / Yahoo!    offline

Comparison: Index Storage

LIKE expression    none
MySQL FULLTEXT     466 MB
Apache Lucene      1323 MB
Sphinx Search      933 MB
Inverted index     48 MB
Google / Yahoo!    offline

Comparison: Query Speed

LIKE expression    22 seconds
MySQL FULLTEXT     50-80 ms
Apache Lucene      120 ms
Sphinx Search      12 ms
Inverted index     250 ms
Google / Yahoo!    *

Comparison: Bottom-Line

                   indexing   storage   query    solution
LIKE expression    none       none      2000x    SQL
MySQL FULLTEXT     32x        10x       6x       RDBMS
Apache Lucene      15x        27x       10x      3rd party
Sphinx Search      5x         20x       1x       3rd party
Inverted index     1x         1x        20x      SQL
Google / Yahoo!    offline    offline   *        Service

Copyright 2009 Bill Karwin

www.slideshare.net/billkarwin
Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ You are free to share - to copy, distribute and transmit this work, under the following conditions:
Attribution. You must attribute this work to Bill Karwin. Noncommercial. You may not use this work for commercial purposes. No Derivative Works. You may not alter, transform, or build upon this work.
