Learning to Rank
for Information Retrieval
Tie-Yan Liu
Microsoft Research Asia
This Tutorial
• Learning to rank for information retrieval
– But not ranking problems in other fields.
• Supervised learning
– But not unsupervised or semi-supervised learning.
• Learning in vector space
– But not on graphs or other structured data.
• Mainly based on papers at SIGIR, WWW, ICML, and NIPS.
– Papers at other conferences and journals might not be covered
comprehensively.
Outline
• Introduction
• Learning to Rank Algorithms
– Pointwise approach
– Pairwise approach
– Listwise approach
– Analysis of the approaches
• Statistical Ranking Theory
– Query-level ranking framework
– Generalization analysis for ranking
• Benchmarking Learning to Rank methods
• Summary
Introduction
Ranking is Essential
• A query is issued against an indexed document repository $D = \{d_j\}$, and a ranking model sorts the documents and returns a ranked list.
Applications of Ranking
• Document retrieval
• Collaborative filtering
• Key term extraction
• Definition finding
• Important email routing
• Sentiment analysis
• Product rating
• Anti web spam
• ……
Scenarios of Ranking
(Document Retrieval as Example)
• Rank the documents purely according to their relevance with
regard to the query.
• Consider the relationships of similarity, website structure, and
diversity between documents in the ranking process
(relational ranking).
• Aggregate several candidate ranked lists to get a better ranked
list (meta search).
• Find whether and to what degree a property of a webpage
influences the ranking result (reverse engineering).
• …
Relevance Judgment
• Degree of relevance
– Binary: relevant vs. irrelevant
– Multiple ordered categories: Perfect > Excellent > Good >
Fair > Bad
• Pairwise preference
– Document A is more relevant than
document B
• Total order
– Documents are ranked as {A, B, C, …} according to their relevance
NDCG@k
$$\mathrm{NDCG}@k = Z_k \sum_{j=1}^{k} G\big(\pi^{-1}(j)\big)\,\eta(j), \qquad G\big(\pi^{-1}(j)\big) = 2^{l_{\pi^{-1}(j)}} - 1, \qquad \eta(j) = \frac{1}{\log(j+1)}$$
where $\pi^{-1}(j)$ is the document ranked at position $j$, $l_{\pi^{-1}(j)}$ is its relevance label, and $Z_k$ is the normalizer that makes the ideal ranking have NDCG@k equal to 1.
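As a concrete illustration of the definition above, here is a minimal Python sketch of NDCG@k (the base-2 logarithm and the function names are choices made for this sketch, not mandated by the slide):

```python
import math

def ndcg_at_k(labels, k):
    """NDCG@k for a ranked list; labels[j] is the relevance label of the
    document that the ranking model placed at position j+1."""
    def dcg(ls, k):
        # gain G = 2^label - 1, position discount eta(j) = 1 / log2(j + 1)
        return sum((2 ** l - 1) / math.log2(j + 2) for j, l in enumerate(ls[:k]))
    ideal = dcg(sorted(labels, reverse=True), k)   # ideal DCG, i.e. 1 / Z_k
    return dcg(labels, k) / ideal if ideal > 0 else 0.0

# Example: labels of the documents in the order returned by the ranker
print(ndcg_at_k([3, 2, 3, 0, 1, 2], k=5))
```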
• Query-independent
– PageRank, TrustRank, BrowseRank, etc.
• “Learning to Rank”
– In general, methods that use machine learning
technologies to solve the problem of ranking can be
called “learning to rank” methods.
Learning to Rank
• In most recent works, learning to rank is
defined as having the following two properties:
– Feature based;
– Discriminative training.
Feature Based
• Documents represented by feature vectors.
– Even if a feature is the output of an existing retrieval
model, one assumes that the parameter in the model is
fixed, and only learns the optimal way of combining these
features.
• The capability of combining a large number of
features is very promising.
– It can easily incorporate any new progress on retrieval
model, by including the output of the model as a feature.
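As a small illustration of such a feature-based representation (the feature names below are hypothetical examples, not the actual LETOR feature list), a query-document pair might be encoded as:

```python
# A query-document pair represented as a feature vector. Each feature is the
# output of an existing retrieval model or a simple statistic, with its own
# parameters held fixed; the learner only learns how to combine them.
features = {
    "bm25": 14.2,            # score of a tuned BM25 model
    "lmir_dirichlet": -5.3,  # language-model retrieval score
    "pagerank": 0.0021,      # query-independent importance
    "tf_title": 3,           # term frequency in the title stream
    "url_length": 27,
}
```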
Discriminative Training
• An automatic learning process based on the training
data,
– With the following four pillars: Input space, output space,
hypothesis space, loss function.
– Not even necessary to have a probabilistic explanation.
• This is highly demanded by real search engines:
– Every day these search engines receive a large amount of user
feedback and usage logs.
– It is very important to automatically learn from this
feedback and constantly improve the ranking mechanism.
Round robin ranking (ECML 2003)
Listwise Approach
– Listwise loss minimization: RankCosine (IP&M 2008), ListNet (ICML 2007), ListMLE (ICML 2008), …
– Direct optimization of IR measures: LambdaRank (NIPS 2006), AdaRank (SIGIR 2007), SVM-MAP (SIGIR 2007), SoftRank (LR4IR 2007), GPRank (LR4IR 2007), CCA (SIGIR 2007), …
Loss Function
• Training data: $(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)$
• Per-document loss: $L(f; x_j, y_j)$
• Three sub-categories: regression loss, classification loss, ordinal regression loss
• Regression loss: $L(f; x_j, y_j) = \big(f(x_j) - y_j\big)^2$
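A minimal sketch of this regression-style pointwise approach, assuming a linear scoring function and using scikit-learn purely for convenience (toy data, illustrative only):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row is the feature vector of one document, y[j] its relevance label.
X = np.array([[14.2, 0.3], [8.1, 0.9], [2.0, 0.1]])
y = np.array([2, 1, 0])

model = LinearRegression().fit(X, y)   # minimizes sum_j (f(x_j) - y_j)^2
scores = model.predict(X)              # documents are then ranked by f(x_j)
ranking = np.argsort(-scores)
```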
McRank
(P. Li, C. Burges, et al. NIPS 2007)
• Ordinal regression view: thresholds $b_1 \le b_2 \le \cdots$ partition the score axis into ordered categories, e.g., $b_1 \le f(x_j) \le b_2$ if $y_j = 2$, and $f(x_j) \le b_1$ if $y_j < 2$.
RankNet
(C. Burges, T. Shaked, et al. ICML 2005)
• Target probability: $\bar{P}_{u,v} = 1$ if $y_{u,v} = 1$; $\bar{P}_{u,v} = 0$ otherwise.
• Modeled probability:
$$P_{u,v}(f) = \frac{\exp\big(f(x_u) - f(x_v)\big)}{1 + \exp\big(f(x_u) - f(x_v)\big)}$$
• Cross entropy as the loss function:
$$l(f; x_u, x_v, y_{u,v}) = -\bar{P}_{u,v}\log P_{u,v}(f) - \big(1 - \bar{P}_{u,v}\big)\log\big(1 - P_{u,v}(f)\big)$$
• Use Neural Network as model, and gradient descent as
algorithm, to optimize the cross-entropy loss.
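A minimal sketch of the RankNet loss and one gradient step, using a linear scoring function in place of the neural network from the paper (names and the toy setup are illustrative):

```python
import numpy as np

def ranknet_loss(w, x_u, x_v, P_bar):
    """Cross entropy between target probability P_bar and modeled P_uv(f)."""
    diff = np.dot(w, x_u) - np.dot(w, x_v)      # f(x_u) - f(x_v)
    P_uv = 1.0 / (1.0 + np.exp(-diff))          # logistic of the score difference
    return -P_bar * np.log(P_uv) - (1 - P_bar) * np.log(1 - P_uv)

def sgd_step(w, x_u, x_v, P_bar, lr=0.1):
    """One gradient-descent step on a single pair (P_bar = 1 means x_u preferred)."""
    diff = np.dot(w, x_u) - np.dot(w, x_v)
    P_uv = 1.0 / (1.0 + np.exp(-diff))
    grad = (P_uv - P_bar) * (x_u - x_v)         # d(loss) / d(w)
    return w - lr * grad
```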
FRank
(M. Tsai, T.-Y. Liu, et al. SIGIR 2007)
RankBoost
(Y. Freund, R. Iyer, et al. JMLR 2003)
• Exponential loss:
$$L(f; x_u, x_v, y_{u,v}) = \exp\big(-y_{u,v}\,(f(x_u) - f(x_v))\big)$$
Ranking SVM
(R. Herbrich, T. Graepel, et al. , Advances in Large Margin Classifiers, 2000;
T. Joachims, KDD 2002)
• Hinge loss:
$$L(f; x_u, x_v, y_{u,v}) = \big[1 - y_{u,v}\,(f(x_u) - f(x_v))\big]_+$$
• Primal problem:
$$\min \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\sum_{u,v:\, y^{(i)}_{u,v}=1}\xi^{(i)}_{u,v}$$
$$\text{s.t. } w^{\top}\big(x^{(i)}_u - x^{(i)}_v\big) \ge 1 - \xi^{(i)}_{u,v}\ \text{ if } y^{(i)}_{u,v} = 1; \qquad \xi^{(i)}_{u,v} \ge 0,\ i = 1, \ldots, n.$$
• Interpretation: take $x_u - x_v$ as a positive instance of learning, and use an SVM to perform binary classification on these instances, to learn the model parameter $w$.
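A minimal sketch of this reduction: build difference vectors $x_u - x_v$ for preferred pairs and train a linear SVM on them. scikit-learn's LinearSVC stands in for a generic hinge-loss solver here; it is not the solver used in the original papers:

```python
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_training_set(X, y):
    """Turn per-document labels into difference vectors with binary labels."""
    diffs, labels = [], []
    for u in range(len(y)):
        for v in range(len(y)):
            if y[u] > y[v]:                    # x_u should be ranked above x_v
                diffs.append(X[u] - X[v]); labels.append(+1)
                diffs.append(X[v] - X[u]); labels.append(-1)
    return np.array(diffs), np.array(labels)

X = np.random.randn(20, 5)                     # toy features for one query
y = np.random.randint(0, 3, size=20)           # toy relevance labels
D, l = pairwise_training_set(X, y)
svm = LinearSVC(C=1.0, loss="hinge").fit(D, l) # hinge loss on score differences
scores = X @ svm.coef_.ravel()                 # rank documents by w^T x
```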
• Solutions
– Multi-hyperplane ranker
Multi-Hyperplane Ranker
(T. Qin, T.-Y. Liu, et al. SIGIR 2007)
• Testing
– Rank aggregation: $f(x) = \sum_{k,l}\alpha_{k,l}\,f_{k,l}(x)$
IR-SVM
• Introduce query-level normalization to the pairwise loss
function.
• IRSVM (Y. Cao, J. Xu, et al. SIGIR 2006; T. Qin, T.-Y. Liu, et al. IPM 2007)
$$\min \ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}\frac{1}{\tilde{m}^{(i)}}\sum_{u,v}\xi^{(i)}_{u,v}$$
where $1/\tilde{m}^{(i)}$ is the query-level normalizer and $\tilde{m}^{(i)}$ is the number of document pairs for query $i$.
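A minimal sketch of the normalizer's effect: each pair's loss is weighted by the reciprocal of the number of pairs generated from its query, so queries with many document pairs do not dominate the objective (illustrative code, not the authors' implementation):

```python
from collections import Counter

def pair_weights(pair_query_ids):
    """Weight 1/m~ for each pair, where m~ is the pair count of its query."""
    counts = Counter(pair_query_ids)
    return [1.0 / counts[q] for q in pair_query_ids]

# Pairs from query "q1" (3 pairs) and query "q2" (1 pair):
print(pair_weights(["q1", "q1", "q1", "q2"]))   # [0.333.., 0.333.., 0.333.., 1.0]
```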
Listwise Approach
• Two sub-categories:
– Listwise loss minimization: output space is a permutation $\pi_y$.
– Direct optimization of IR measures: output space is the ordered categories $y = \{y_j\}_{j=1}^{m}$.
SoftRank
(M. Taylor, J. Guiver, et al. LR4IR 2007/WSDM 2008)
• Key Idea:
– Avoid sorting by treating scores as random variables
• Steps
– Construct score distribution
– Map score distribution to rank distribution
– Compute expected NDCG (SoftNDCG) which is smooth and
differentiable.
– Optimize SoftNDCG by gradient descent.
• Score distribution: $p(s_j) = \mathcal{N}\big(s_j \mid f(x_j), \sigma_s^2\big)$
• Rank Distribution
– $p_j(r)$: the probability that document $j$ is ranked at position $r$, derived from the score distributions.
Soft NDCG
• Expected NDCG as the objective, or in the language of loss
functions,
$$L(f; x, y) = 1 - E[\mathrm{NDCG}] = 1 - Z_m\sum_{j=1}^{m}\big(2^{y_j} - 1\big)\sum_{r=0}^{m-1}\eta(r+1)\,p_j(r)$$
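To illustrate why this objective is smooth in the scores, here is a toy Monte Carlo estimate of E[NDCG] under Gaussian score distributions; note that SoftRank itself uses a closed-form rank-distribution recursion rather than sampling, so this is only a sketch of the idea:

```python
import numpy as np

def soft_ndcg(mean_scores, labels, sigma_s=1.0, n_samples=2000, rng=None):
    """Monte Carlo estimate of E[NDCG] when s_j ~ N(f(x_j), sigma_s^2)."""
    rng = rng or np.random.default_rng(0)
    labels = np.asarray(labels)
    gains = 2.0 ** labels - 1.0
    discounts = 1.0 / np.log2(np.arange(2, len(labels) + 2))
    ideal = np.sum(np.sort(gains)[::-1] * discounts)        # ideal DCG (1 / Z_m)
    total = 0.0
    for _ in range(n_samples):
        s = rng.normal(mean_scores, sigma_s)                 # sample the scores
        order = np.argsort(-s)                               # induced ranking
        total += np.sum(gains[order] * discounts) / ideal
    return total / n_samples

print(soft_ndcg([2.0, 1.0, 0.5], labels=[2, 0, 1], sigma_s=0.5))
```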
SVM-MAP
(Y. Yue, T. Finley, et al. SIGIR 2007)
$$\min_{w,\,\xi \ge 0} \ \frac{1}{2}\|w\|^2 + \frac{C}{N}\sum_{i=1}^{N}\xi^{(i)}$$
$$\text{s.t. } w^{\top}\Psi\big(y^{(i)}, x^{(i)}\big) \ge w^{\top}\Psi\big(y', x^{(i)}\big) + \Delta\big(y^{(i)}, y'\big) - \xi^{(i)}, \quad \forall\, y' \ne y^{(i)}$$
Challenges in SVM-MAP
• For AP, the true ranking is any ranking in which the relevant
documents are all ranked in front of the irrelevant ones.
• If the relevance label at each position is fixed, Δ(y, y′) stays the same; among such rankings, the second term ($w^{\top}\Psi$) is maximized when the documents are sorted by their scores in descending order.
• Strategy (with a complexity of O(mlogm))
– Sort the relevant and irrelevant documents and form a perfect ranking.
– Start with the perfect ranking and swap two adjacent relevant and
irrelevant documents, so as to find the optimal interleaving of the two
sorted lists.
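Since Δ(y, y′) in SVM-MAP is built from AP, a minimal sketch of AP itself (binary relevance labels, illustrative code) may help:

```python
def average_precision(labels_in_ranked_order):
    """AP for a ranked list of binary relevance labels (1 = relevant)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(labels_in_ranked_order, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)   # precision at this relevant doc
    return sum(precisions) / hits if hits else 0.0

# A perfect ranking (all relevant documents in front) gives AP = 1.0
print(average_precision([1, 1, 0, 0]))       # 1.0
print(average_precision([0, 1, 0, 1]))       # (1/2 + 2/4) / 2 = 0.5
```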
AdaRank
(J. Xu, H. Li. SIGIR 2007)
• Given: initial distribution $D_1$ over the queries.
• For $t = 1, \ldots, T$:
– Train a weak ranker $h_t$ on $D_t$; a weak ranker that can rank important (heavily weighted) queries more correctly will have a larger weight $\alpha_t$.
– Create $f_t(x) = \sum_{s=1}^{t}\alpha_s h_s(x)$.
– Update: $D_{t+1}(i) = \dfrac{\exp\{-E(q_i, f_t, y^{(i)})\}}{\sum_{j=1}^{N}\exp\{-E(q_j, f_t, y^{(j)})\}}$
• Output the final ranker: $f(x) = \sum_t \alpha_t h_t(x)$.
• The IR evaluation measure $E$ is embedded in the distribution update of Boosting.
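A minimal sketch of the boosting loop above. Weak rankers are single features whose per-query measures are assumed precomputed, the α_t formula follows the usual AdaRank choice, and the combined ranker's measure is approximated rather than re-evaluated, so treat this as an illustration, not the authors' code:

```python
import numpy as np

def adarank(E_weak, T=20):
    """E_weak: array of shape (n_weak, n_query); E_weak[s, i] is the measure
    (e.g. MAP or NDCG, assumed in [0, 1]) of weak ranker s on query i."""
    n_weak, n_query = E_weak.shape
    D = np.full(n_query, 1.0 / n_query)           # initial query distribution
    chosen, alphas = [], []
    for _ in range(T):
        # pick the weak ranker that performs best on the currently important queries
        s = int(np.argmax(E_weak @ D))
        alpha = 0.5 * np.log((D @ (1 + E_weak[s])) / (D @ (1 - E_weak[s]) + 1e-12))
        chosen.append(s); alphas.append(alpha)
        # measure of the combined ranker f_t, approximated per query by the
        # weighted average of its weak rankers' measures (a simplification)
        E_ft = (np.array(alphas) @ E_weak[chosen]) / (np.sum(alphas) + 1e-12)
        D = np.exp(-E_ft)
        D /= D.sum()                               # D_{t+1}(i) ∝ exp(-E(q_i, f_t))
    return chosen, alphas
```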
RankGP
(J. Yeh, J. Lin, et al. LR4IR 2007)
• Learning
– Single-population GP, which has been widely used to
optimize non-smooth, non-differentiable objectives.
• Evolution mechanism
– Crossover, mutation, reproduction, tournament selection.
• Fitness function
– IR evaluation measure (MAP).
• Representative Algorithms
– ListNet
– ListMLE
• Solution
– Ranked list ↔ permutation probability distribution
– A more informative representation of the ranked list: a permutation
and a ranked list have a one-to-one correspondence.
Luce Model:
Defining Permutation Probability
• The probability of a permutation $\pi$ is defined as
$$P(\pi \mid f) = \prod_{j=1}^{m}\frac{\varphi\big(f(x_{\pi(j)})\big)}{\sum_{k=j}^{m}\varphi\big(f(x_{\pi(k)})\big)}$$
• Example:
$$P(A \succ B \succ C \mid f) = \frac{\varphi(f(A))}{\varphi(f(A)) + \varphi(f(B)) + \varphi(f(C))}\cdot\frac{\varphi(f(B))}{\varphi(f(B)) + \varphi(f(C))}\cdot\frac{\varphi(f(C))}{\varphi(f(C))}$$
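A minimal sketch of this permutation probability with φ = exp (the usual choice in ListNet/ListMLE); the function name is illustrative:

```python
import math

def luce_probability(scores_in_pi_order):
    """P(pi | f) under the Luce model with phi = exp.
    scores_in_pi_order[j] is f(x) of the document ranked at position j+1 by pi."""
    phis = [math.exp(s) for s in scores_in_pi_order]
    prob = 1.0
    for j in range(len(phis)):
        prob *= phis[j] / sum(phis[j:])   # doc at position j beats all remaining docs
    return prob

# Example: permutation A > B > C with f(A) > f(B) > f(C)
print(luce_probability([3.0, 2.0, 1.0]))
```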
dis(f,g) = 0.46
dis(g,h) = 2.56
ListNet
(Z. Cao, T. Qin, T.-Y. Liu, et al. ICML 2007)
Limitations of ListNet
• The loss over the full permutation distribution is too complex to use in practice: O(m⋅m!)
• A top-k Luce model reduces the complexity; however,
– When k is large, the complexity is still high!
– When k is small, the information loss in the top-k Luce model will
affect the effectiveness of ListNet.
Solution
• Given the scores output by the ranking model, compute
the likelihood of the ground-truth permutation, and maximize it
to learn the model parameters.
ListMLE Algorithm
(F. Xia, T.-Y. Liu, et al. ICML 2008)
• Likelihood Loss
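A minimal sketch of the likelihood loss: the negative log-probability of the ground-truth permutation under the Luce model of the previous slides, computed from the model's scores (illustrative code):

```python
import math

def listmle_loss(scores, truth_order):
    """Negative log-likelihood of the ground-truth permutation.
    scores[i]   : f(x_i) for document i
    truth_order : document indices in ground-truth ranked order."""
    loss = 0.0
    remaining = list(truth_order)
    for doc in truth_order:
        # log of the normalizer over not-yet-placed docs, minus the placed doc's score
        denom = math.log(sum(math.exp(scores[d]) for d in remaining))
        loss += denom - scores[doc]
        remaining.remove(doc)
    return loss

print(listmle_loss(scores=[0.2, 1.5, -0.3], truth_order=[1, 0, 2]))
```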
Extension of ListMLE
• Dealing with other types of ground truth labels using
the concept of “equivalent permutation set”.
– For relevance degrees, the equivalent permutation set contains all permutations consistent with the labels:
$$\Omega_y = \big\{\pi_y \mid u < v,\ \text{if } l_{\pi_y^{-1}(u)} > l_{\pi_y^{-1}(v)}\big\}$$
– The loss is then defined over the set, by summing or maximizing:
$$L(f; x, y) = -\log \sum_{\pi_y \in \Omega_y} P\big(\pi_y \mid f(w, x)\big) \quad\text{or}\quad L(f; x, y) = -\max_{\pi_y \in \Omega_y}\log P\big(\pi_y \mid f(w, x)\big)$$
Discussions
• Advantages
– Take all the documents associated with the same
query as the learning instance
– Rank position is visible to the loss function
• Problems
– Complexity issue.
– The use of the position information is insufficient.
• Question
– What is the relationship between the loss
functions and IR evaluation measures?
Pointwise Approach
• For regression based algorithms (D. Cossock and T. Zhang,
COLT 2006)
$$1 - \mathrm{NDCG}(f) \le \frac{2}{Z_m}\Big(\sum_{j=1}^{m}\eta(j)^{2}\Big)^{1/2}\Big(\sum_{j=1}^{m}\big(f(x_j) - y_j\big)^{2}\Big)^{1/2}$$
i.e., the pointwise regression loss upper-bounds (1 − NDCG).
Pointwise Approach
• Although it seems the loss functions can bound (1-
NDCG), the constants before the losses seem too
large.
• Example: ground-truth labels $(x_1, 4), (x_2, 3), (x_3, 2), (x_4, 1)$; predicted labels/scores $(x_1, 3), (x_2, 2), (x_3, 1), (x_4, 0)$.
– $Z_m \approx 21.4$, $\mathrm{DCG}(f) \approx 21.4$, so $1 - \mathrm{NDCG}(f) = 0$: the predicted ordering is perfect.
– However, every document's label is "misclassified" ($I_{\{y_j \ne f(x_j)\}} = 1$ for all $j$), so the classification-based bound evaluates to about 1.15 or more, i.e., larger than 1, and is therefore too loose to be meaningful here.
Pairwise Approach
• Unified loss
– Model ranking as a sequence of classifications.
– Suppose the ground-truth permutation is $\{x_1 \succ x_2 \succ x_3 \succ x_4\}$. The classification tasks are:
• Step 1: $\{x_1\}$ vs. $\{x_2, x_3, x_4\}$
• Step 2: $\{x_2\}$ vs. $\{x_3, x_4\}$
• Step 3: $\{x_3\}$ vs. $\{x_4\}$
– Classifier at step $t$: $\arg\max_{k \ge t} f(x_k)$, with per-step loss $L_t(f) = I_{\{\arg\max_{k \ge t} f(x_k) \ne t\}}$.
– Unified loss: $L(f) = \sum_t \beta_t\, I_{\{\arg\max_{k \ge t} f(x_k) \ne t\}}$; its label-based variant is $\tilde{L}(f) = \sum_t \beta_t\, I_{\{l_{\arg\max_{k \ge t} f(x_k)} \ne l_t\}}$.
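A minimal sketch of this sequence-of-classifications view: at step t, the highest-scored document among those not yet placed should be the one that the ground truth puts at position t (the weights β_t and the toy example are illustrative):

```python
def unified_loss(scores, truth_order, betas=None):
    """Sum over steps t of beta_t * I{argmax of remaining scores != truth doc at t}."""
    m = len(truth_order)
    betas = betas or [1.0] * m
    loss = 0.0
    for t in range(m - 1):                        # the last position is forced
        remaining = truth_order[t:]
        predicted = max(remaining, key=lambda d: scores[d])
        loss += betas[t] * (predicted != truth_order[t])
    return loss

# Ground truth x1 > x2 > x3 > x4 (indices 0..3); the scores swap x2 and x3
print(unified_loss(scores=[3.0, 1.0, 2.0, 0.5], truth_order=[0, 1, 2, 3]))  # 1.0
```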
Pairwise Approach
(W. Chen, T.-Y. Liu, et al. 2009)
• Unified loss vs. (1-NDCG)
– When $\beta_t = G(t)\,\eta(t)/Z_m$, $\tilde{L}(f)$ is a tight bound of (1 − NDCG).
• Surrogate function of Unified loss
– After introducing weights βt, loss functions in
Ranking SVM, RankBoost, RankNet are Cost-
sensitive Pairwise Comparison surrogate functions,
and thus are consistent with and are upper bounds
of the unified loss.
– Consequently, they also upper bound (1-NDCG).
Listwise Approach
(W. Chen, T.-Y. Liu, et al. 2009)
Listwise Approach
• Direct optimization
– Many direct optimization methods try to optimize smooth
surrogates of IR measures, and this is why they are called
direct optimization.
– Question: are they really direct?
– Directness:
$$D(w, M, \tilde{M}) = \frac{1}{\sup_{x, y}\big|M(w, x, y) - \tilde{M}(w, x, y)\big|}$$
where $M$ is the IR measure and $\tilde{M}$ its surrogate.
Listwise Approach
(Y. He, T.-Y. Liu, et al. 2008)
• Direct optimization
– For SoftRank, it has been proved that the
surrogate measure can be made arbitrarily direct with
respect to NDCG by setting $\sigma_s$ sufficiently small.
– For SVM-MAP, there always exists some instances
in the input /output space that make the
surrogate measure not direct to AP on them.
• Good or bad?
Discussions
• The tools to prove the bounds are not yet
“unified”.
– What about the pointwise approach?
• Quantitative analysis between different
bounds is missing.
• Upper bounds may not mean too much.
• Directness vs. Consistency.
U-statistics
• Stability coefficient: compares the function learned from the original training data with the function learned from the training data with the j-th "sample" (a query together with its associated documents) removed.
• IRSVM
Convergence rate:
– Both algorithms: $O(n^{-1/2})$
Generalization ability:
– ListMLE > ListNet, especially when m is large.
– The linear transformation function has the best
generalization ability.
Discussion
• Generalization is not the entire story.
– Generalization is defined with respect to a
surrogate loss function L.
– Consistency further discusses whether the risk
defined by L can converge to the risk defined by
the true loss for ranking, at the large sample limit.
• What is the true loss for ranking?
– Permutation level 0-1?
– IR measures?
Document Corpora
• The “.gov” corpus
– TD2003, TD2004
– HP 2003, HP 2004
– NP 2003, NP 2004
• The OHSUMED corpus
Document Sampling
• The “.gov” corpus
– Top 1000 documents per query returned by BM25
Feature Extraction
• The “.gov” corpus
– 64 features
– TF, IDF, BM25, LMIR, PageRank, HITS, Relevance
Propagation, URL length, etc.
• The OHSUMED corpus
– 40 features
– TF, IDF, BM25, LMIR, etc.
Meta Information
• Statistical information about the corpus, such as the
number of documents, the number of streams, and
the number of (unique) terms in each stream.
• Raw information of the documents associated with
each query, such as the term frequency, and the
document length.
• Relational information, such as the hyperlink graph,
the sitemap information, and the similarity
relationship matrix of the corpora.
Finalizing Datasets
• Five-fold cross validation
http://research.microsoft.com/~LETOR/
Experimental Results
Winner Numbers
Discussions
• NDCG@1
– {ListNet, AdaRank} > {SVM-MAP, RankSVM}
> {RankBoost, FRank} > {Regression}
• NDCG@3, @10
– {ListNet} > {AdaRank, SVM-MAP, RankSVM, RankBoost}
> {FRank} > {Regression}
• MAP
– {ListNet} > {AdaRank, SVM-MAP, RankSVM}
> {RankBoost, FRank} > {Regression}
Summary
Other Works
• Ground truth mining
– targets automatically mining ground truth labels for learning to rank,
mainly from click-through logs of search engines.
• Feature engineering
– includes feature selection, dimensionality reduction, and effective
feature learning.
• Query-dependent ranking
– adopts different ranking models for different types of queries, based
on either hard query-type classification or a soft nearest-neighbor
approach.
Other Works
• Supervised rank aggregation
– learns the ranking model not to combine features, but to aggregate
candidate ranked lists.
• Semi-supervised / active ranking
– leverages the large number of unlabeled queries and documents to
improve the performance of ranking model learning.
• Relational / global ranking
– not only makes use of the scoring function for ranking, but also
considers the inter-relationships between documents to define more
complex ranking models.
Feature Engineering
• Ranking is deeply rooted in IR
– Not just new algorithms and theories.
• Feature engineering is very important
– How can the knowledge of IR accumulated over
the past half century be encoded in the features?
– Currently, this kind of work has not
received enough attention.
Ranking Theory
• A more unified view on loss functions
• True loss for ranking
• Statistical consistency for ranking
• Complexity of ranking function class
• ……
Information Retrieval ↔ Machine Learning
tyliu@microsoft.com
http://research.microsoft.com/people/tyliu/