
Introduction to the Basics of Search and Relevancy with Apache Solr

FEATURING:

Mark Bennett, CTO


Agenda

• Prerequisites: Browser Tricks


• Web “Command Line”
• The DisMax Parser
• Boosting Formula
• Explaining “Explain”
• Check Your Index!
• Q&A
• Resources / About NIE

12/2/2009 Lucid Imagination, Inc. 2


Prerequisite:
Some Browser Tricks



Browsers Matter – install them all!

Firefox:
• Default XML Rendering (also some versions of IE)
• Lots of Plugins

IE and Safari:
• Better “Explain” copy & paste: maintains line breaks
• Better table copy and paste



Larger Firefox “Command Line”

Customize the Firefox URL box as a command line in 3 easy steps:
1. Toolbar: Right Click
2. Customize… Add New Toolbar
3. URL bar -> CLICK and DRAG



Turn off Solr HTTP Caching

• Change the <httpCaching> element in solrconfig.xml
• Disable HTTP 304 (Not Modified) responses, e.g. with never304="true"
• Turn it back on before you deploy!



Understanding Solr’s
“Web Command Line”



The “Web Command Line”
CLI CONCEPT                 SOLR EQUIVALENT
Command Prompt              URL bar
-o or --foo bar             ? or & and =
(spaces)                    +
some punctuation            %nn
output                      XML or HTML
Command line “adapter”      Curl
                            • Script files can call URLs
                            • Not built into Windows – try cygwin
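
The mapping above can be tried out with Python's standard urllib.parse module. This is only a minimal sketch: the query strings are made-up examples, not taken from the slides.

```python
from urllib.parse import quote_plus, urlencode

# Spaces become "+", and reserved punctuation becomes %nn hex escapes.
print(quote_plus("apache solr"))
print(quote_plus("name:ipod&price"))

# "?" starts the arguments, "&" separates them, "=" binds a name to its value.
params = {"q": "apache solr", "debugQuery": "true"}  # illustrative values
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```

The resulting URL can be pasted into the browser URL bar or passed to curl from a script.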



Solr “Command Line”

• Typical Base URL
  • http://localhost:8983/solr/select?...
• Basic Input (not counting dismax)
  • q = query, fq = filter query
  • df = default field
  • qt = query type (standard / dismax)
• Controlling Output (lots more!!!)
  • debugQuery = true
  • wt = “what type” (actually “writer type”)
    • standard/XML, xslt (with tr=), javabin, json…
  • fl = *,score (which fields)
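
As a rough sketch of how these parameters combine into one request, the hypothetical helper below (build_select_url is not part of Solr) assembles a select URL from the base URL on this slide. The field name "text" is assumed from the parsed query shown later in the deck.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/select"  # typical base URL from the slide


def build_select_url(query, **options):
    """Return a select URL; extra keyword arguments become URL parameters."""
    params = {"q": query}
    params.update(options)
    # safe="*," keeps fl=*,score readable instead of percent-encoding it.
    return BASE_URL + "?" + urlencode(params, safe="*,")


url = build_select_url("solr", df="text", fl="*,score", wt="json", debugQuery="true")
print(url)
```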



Example: search for “solr”

http://localhost:8983/solr/select?q=solr&debugQuery=true
With Firefox you get XML output you can expand and collapse.
With MSIE* and Safari, not so much.

* Some versions



Detailed Debug & Explain Output

http://localhost:8983/solr/select?q=solr&debugQuery=true
<str name="parsedquery">text:solr</str>

<lst name="explain">
<str name="SOLR1000">
0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
  1.4142135 = tf(termFreq(text:solr)=2)
  3.6026897 = idf(docFreq=1, numDocs=26)
  0.125 = fieldNorm(field=text, doc=13)
</str>
</lst>
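
The “product of” arithmetic in this output can be checked by hand. The sketch below assumes the classic Lucene similarity, where tf is the square root of the raw term frequency. The idf and fieldNorm values are copied from the output above rather than recomputed.

```python
import math


def tf(term_freq):
    # Classic Lucene similarity: tf is the square root of the raw term count.
    return math.sqrt(term_freq)


def field_weight(term_freq, idf, field_norm):
    # "product of:" in the explain output means the listed factors are multiplied.
    return tf(term_freq) * idf * field_norm


score = field_weight(term_freq=2, idf=3.6026897, field_norm=0.125)
print(round(score, 6))
```

The result should agree with the 0.6368716 reported for document SOLR1000, up to floating-point rounding.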



A look at the
DisMax query parser



Solr DisMax: Defined

• What is it?
  • Dis-junction: an OR across multiple fields
  • Max-imum: the best-matching field sets the score
• How do you get it?
  • Configured in:
    • solrconfig.xml and schema.xml
  • Called with:
    • qt=dismax
  • Adjusted with:
    • mm, bf, qf, pf, qs, ps, tie
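
A sketch of passing the tuning parameters on the URL instead of in solrconfig.xml. The field names and boosts are illustrative, loosely based on the debug output later in this deck, and the mm, qs and ps values are arbitrary assumptions.

```python
from urllib.parse import urlencode

# Illustrative values only; production settings usually live in solrconfig.xml.
dismax_params = {
    "q": "solr",
    "qt": "dismax",
    "qf": "name^1.2 features^1.0 text^0.5",      # query fields and per-field boosts
    "pf": "name^1.5 text^0.2",                   # phrase fields (boost when terms appear together)
    "mm": "75%",                                 # minimum share of query clauses that must match
    "qs": "0",                                   # slop for phrases typed in the query
    "ps": "0",                                   # slop for the pf phrase boost
    "tie": "0.01",                               # tie breaker applied to non-maximum fields
    "bf": "recip(rord(price),1,1000,1000)^0.3",  # boost function
}

url = "http://localhost:8983/solr/select?" + urlencode(dismax_params)
print(url)
```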



Solr DisMax: Pros and Cons

General Benefits
• Multiple Fields
• Multiple Relevancy Rules
• Great for Freshness / Popularity
Issues to be Aware of
• Tie-in between schema.xml & solrconfig.xml
• Trouble with some CJK (Chinese, Japanese, Korean)
• Limited wildcard / field / range support
• Difficult to customize and debug
• Trouble with shingles
• Understand mm!



About the “dis” and the “max”

Distributed across multiple fields
• Break up the query into words
• Each word becomes a clause over each configured field
• Like an OR, but with extra credit
Takes the maximum of each set
• Word 1 had its highest score in Title
• Word 2 was very dense in the doc body
• Adds in a tie breaker if the word matches in multiple fields
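
The “max plus tie breaker” rule can be sketched in a few lines. The per-field scores are the ones shown for document SOLR1000 in the explain output later in this deck. dismax_score is a hypothetical helper name, not a Solr API.

```python
import math


def dismax_score(field_scores, tie=0.0):
    """Best field score plus `tie` times the sum of the other field scores."""
    scores = list(field_scores)
    best = max(scores)
    others = sum(scores) - best
    return best + tie * others


# Per-field scores for the word "solr" (values from the explain output in this deck).
per_field = {
    "text": 0.026233677,
    "name": 0.17808011,
    "features": 0.03710002,
    "sku": 0.44520026,
}

score = dismax_score(per_field.values(), tie=0.01)
print(round(score, 7))
```

With tie=0.01 the result should agree with the “max plus 0.01 times others” line (0.4476144) in that output. With tie=0 only the best field counts.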



Coming soon: Extended DisMax

Improvements
• Flexible case Boolean ops: AND/and, OR/or
• Auto-escape punctuation & -> \&, etc.
• Improved Proximity Boosting (via word bigrams)
• Other changes in stop words, relevancy calc, URL arguments
How to get it
• Post 1.4 patch, planned for 1.5
• Details + Patch in JIRA: SOLR-1553
http://issues.apache.org/jira/browse/SOLR-1553
• TBD: change URL option qt=edismax (or qt=dismax )



Boosting Formulas



Boost Functions in Dismax
High Level Feature
• Numeric functions for scoring
• sum(), product(), sqrt(), log(), etc.

• Boost on recent dates, user popularity


Good Combination: Reverse-Ordinal & Reciprocal
• Position in index : ord(), reverse is: rord()
• Larger y for smaller x: recip()
How to get it
• URL parameter bf = “boost function”
• Configured in solrconfig.xml
• See http://wiki.apache.org/solr/FunctionQuery



“Freshness”: Boosting Recent Dates

WIKI EXAMPLE: recip( rord(creationDate), 1, 1000, 1000 )

recip(x,m,a,c) = a / (mx+c), with:
  slope      m = 1
  numerator  a = 1000
  intercept  c = 1000 (aka "b")

Date        Position   N-Position   Linear    recip(x,m,a,c)
            ord()      rord()       mx+c      a / (mx+c)
1/1/2000    1          120          1120      0.89286
2/1/2000    2          119          1119      0.89366
3/1/2000    3          118          1118      0.89445
…           …          …            …         …
1/1/2005    61         60           1060      0.94340
…           …          …            …         …
1/1/2009    109        12           1012      0.98814
2/1/2009    110        11           1011      0.98912
3/1/2009    111        10           1010      0.99010
4/1/2009    112        9            1009      0.99108
5/1/2009    113        8            1008      0.99206
6/1/2009    114        7            1007      0.99305
7/1/2009    115        6            1006      0.99404
8/1/2009    116        5            1005      0.99502
9/1/2009    117        4            1004      0.99602
10/1/2009   118        3            1003      0.99701
11/1/2009   119        2            1002      0.99800
12/1/2009   120        1            1001      0.99900

[Chart: recip value by date, y-axis from 0.880 to 1.000]
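
The table values follow directly from the recip definition, a / (m*x + c). A minimal sketch that recomputes three of the rows, with the rord() positions copied from the table:

```python
def recip(x, m, a, c):
    # Reciprocal function as defined on the slide: a / (m*x + c).
    return a / (m * x + c)


# (date, rord) pairs copied from the table; rord counts back from the newest doc.
rows = [("1/1/2000", 120), ("1/1/2005", 60), ("12/1/2009", 1)]

for date, rord in rows:
    print(date, f"{recip(rord, 1, 1000, 1000):.5f}")
```

Newer documents have a smaller rord() and therefore a value closer to 1.0, which is what makes this combination useful as a freshness boost.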
Sifting through
Solr’s “Explain” output



DisMax Example for “solr”
INPUT:
http://localhost:8983/solr/select?q=solr&debugQuery=true&qt=dismax

DEBUG OUTPUT: (1 OF 2)

<str name="parsedquery">
+DisjunctionMaxQuery((id:solr^10.0 | text:solr^0.5 | cat:solr^1.4 |
manu:solr^1.1 | name:solr^1.2 | features:solr | sku:solr^1.5)~0.01)
DisjunctionMaxQuery((manu_exact:solr^1.9 | features:solr^1.1 |
text:solr^0.2 | manu:solr^1.4 | name:solr^1.5)~0.01)
FunctionQuery((top(ord(popularity)))^0.5)
FunctionQuery((1000.0/(1.0*float(top(rord(price)))+1000.0))^0.3)
</str>
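
When checking which boosts actually reached the parser, it can help to pull the field^boost pairs out of the parsedquery string. The sketch below uses plain regular expressions on the first clause above. The helper names are hypothetical, and the pattern only handles simple single-term clauses like these.

```python
import re

# First DisjunctionMaxQuery clause, copied from the debug output above.
clause = (
    "+DisjunctionMaxQuery((id:solr^10.0 | text:solr^0.5 | cat:solr^1.4 | "
    "manu:solr^1.1 | name:solr^1.2 | features:solr | sku:solr^1.5)~0.01)"
)


def field_boosts(text):
    """Map each field to its boost; a clause without ^boost counts as 1.0."""
    boosts = {}
    for field, boost in re.findall(r"(\w+):\w+(?:\^([\d.]+))?", text):
        boosts[field] = float(boost) if boost else 1.0
    return boosts


def tie_breaker(text):
    """Return the value after ')~', which is the tie breaker in this output."""
    match = re.search(r"\)~([\d.]+)\)", text)
    return float(match.group(1)) if match else 0.0


boosts = field_boosts(clause)
tie = tie_breaker(clause)
print(boosts)
print(tie)
```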



DisMax explain output
for a single word query
<lst name="explain">
<str name="SOLR1000">
0.74609417 = (MATCH) sum of:
  0.4476144 = (MATCH) max plus 0.01 times others of:
    0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:
      0.04119147 = queryWeight(text:solr^0.5), product of:
        0.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
        1.4142135 = tf(termFreq(text:solr)=2)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=text, doc=13)
    0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:
      0.09885953 = queryWeight(name:solr^1.2), product of:
        1.2 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
        1.0 = tf(termFreq(name:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.5 = fieldNorm(field=name, doc=13)
    0.03710002 = (MATCH) weight(features:solr in 13), product of:
      0.08238294 = queryWeight(features:solr), product of:
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
        1.0 = tf(termFreq(features:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=features, doc=13)
    0.44520026 = (MATCH) weight(sku:solr^1.5 in 13), product of:
      0.12357441 = queryWeight(sku:solr^1.5), product of:
        1.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      3.6026897 = (MATCH) fieldWeight(sku:solr in 13), product of:
        1.0 = tf(termFreq(sku:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        1.0 = fieldNorm(field=sku, doc=13)
  0.22311316 = (MATCH) max plus 0.01 times others of:
    0.040810023 = (MATCH) weight(features:solr^1.1 in 13), product of:
      0.09062123 = queryWeight(features:solr^1.1), product of:
        1.1 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
        1.0 = tf(termFreq(features:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=features, doc=13)
    0.01049347 = (MATCH) weight(text:solr^0.2 in 13), product of:
      0.016476588 = queryWeight(text:solr^0.2), product of:
        0.2 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
        1.4142135 = tf(termFreq(text:solr)=2)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.125 = fieldNorm(field=text, doc=13)
    0.22260013 = (MATCH) weight(name:solr^1.5 in 13), product of:
      0.12357441 = queryWeight(name:solr^1.5), product of:
        1.5 = boost
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.022867065 = queryNorm
      1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
        1.0 = tf(termFreq(name:solr)=1)
        3.6026897 = idf(docFreq=1, numDocs=26)
        0.5 = fieldNorm(field=name, doc=13)
  0.06860119 = (MATCH) FunctionQuery(top(ord(popularity))), product of:
    6.0 = ord(popularity)=6
    0.5 = boost
    0.022867065 = queryNorm
  0.0067654043 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(price)))+1000.0)), product of:
    0.9861933 = 1000.0/(1.0*float(rord(price)=14)+1000.0)
    0.3 = boost
    0.022867065 = queryNorm
</str>
</lst>



“Explain” example:

...
0.026233677 = (MATCH) weight(text:solr^0.5 in 13), product of:
  0.04119147 = queryWeight(text:solr^0.5), product of:
    0.5 = boost
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.022867065 = queryNorm
  0.6368716 = (MATCH) fieldWeight(text:solr in 13), product of:
    1.4142135 = tf(termFreq(text:solr)=2)
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.125 = fieldNorm(field=text, doc=13)
0.17808011 = (MATCH) weight(name:solr^1.2 in 13), product of:
  0.09885953 = queryWeight(name:solr^1.2), product of:
    1.2 = boost
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.022867065 = queryNorm
  1.8013449 = (MATCH) fieldWeight(name:solr in 13), product of:
    1.0 = tf(termFreq(name:solr)=1)
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.5 = fieldNorm(field=name, doc=13)
0.03710002 = (MATCH) weight(features:solr in 13), product of:
  0.08238294 = queryWeight(features:solr), product of:
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.022867065 = queryNorm
  0.45033622 = (MATCH) fieldWeight(features:solr in 13), product of:
    1.0 = tf(termFreq(features:solr)=1)
    3.6026897 = idf(docFreq=1, numDocs=26)
    0.125 = fieldNorm(field=features, doc=13)
...

Highlighted on the slide: tf(termFreq(text:solr)=2) and idf(docFreq=1, numDocs=26)
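
Each weight(...) entry in this excerpt is its queryWeight times its fieldWeight, and each queryWeight is boost times idf times queryNorm. A small sketch that recomputes two of the entries from the numbers shown (the function names are hypothetical):

```python
import math

IDF = 3.6026897          # idf(docFreq=1, numDocs=26), copied from the excerpt
QUERY_NORM = 0.022867065  # queryNorm, copied from the excerpt


def query_weight(boost, idf, query_norm):
    # Query-side factors: boost * idf * queryNorm.
    return boost * idf * query_norm


def clause_weight(boost, idf, query_norm, field_weight):
    # Full clause score: queryWeight * fieldWeight.
    return query_weight(boost, idf, query_norm) * field_weight


text_score = clause_weight(0.5, IDF, QUERY_NORM, 0.6368716)
name_score = clause_weight(1.2, IDF, QUERY_NORM, 1.8013449)
print(round(text_score, 7), round(name_score, 7))
```

Up to floating-point rounding, these should agree with 0.026233677 and 0.17808011 in the excerpt. This shows that the boost enters the score only through queryWeight.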



Solr’s XSLT “debugger”
http://localhost:8983/solr/select?
q=solr
&debugQuery=true
&wt=xslt
&tr=example.xsl
&fl=*,score
&qt=dismax



Another way to view Explain data

• Solr 1.4 has Solritas
  • Various features, including toggle explain display
  • “Some assembly required…”

http://www.lucidimagination.com/blog/2009/11/04/solritas-solr-1-4s-hidden-gem/



Checking your Index and IDF



Checking what got Indexed

Bad Index = Bad Search
• Check upper / lower case and punctuation
• Bad Fields / Meta Data = Bad Facets, Filters, Sorting
Use built-in Schema Browser:
• Check each field
• Common words = low IDF (“Inverse Document Frequency”)
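
To see why common words carry less weight, here is a sketch of the classic Lucene idf formula, 1 + ln(numDocs / (docFreq + 1)). One caveat: the 3.6026897 in this deck's explain output is reproduced with a document count of 27, not the 26 printed there. That may mean the calculation counted one extra (for example deleted) document, but this is an inference and not something the slides state.

```python
import math


def classic_idf(doc_freq, num_docs):
    # Classic Lucene similarity: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))


rare = classic_idf(doc_freq=1, num_docs=27)     # term found in a single document
common = classic_idf(doc_freq=20, num_docs=27)  # hypothetical term found in most documents
print(round(rare, 7), round(common, 7))
```

A term that appears in nearly every document gets an idf close to 1, so it contributes little to ranking compared with a rare term.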



Check IDF w/ the Schema Browser
Start at the Admin Screen:
http://localhost:8983/solr/admin

Schema Browser
• select a field
• change # to see more



About NIE
New Idea Engineering



NIE Resources

Newsletter & Whitepapers:
www.ideaeng.com/current

Search Dev Newsgroup:
www.SearchDev.org

Blogs:
EnterpriseSearchBlog.com
SearchComponentsOnline.com



Finish Line / Q & A

Review & Questions

Mark Bennett mbennett@ideaeng.com


main 408-446-3460
cell 408-829-6513



Q&A

These slides and a recorded presentation are available at

bit.ly/SolrRelevancy
