Evaluating Recommender Systems - Explaining F-Score, Recall and Precision using a Real Data Set from Apontador
Hi all,
In this post I will introduce three metrics widely used for evaluating the utility of the recommendations produced by a recommender system: Precision, Recall and the F1 Score. The F1 Score is slightly different from the other two, since it is a measure of a test's accuracy that considers both the precision and the recall of the test to compute the final score.
Introduction to Recommender Systems Evaluation
Recommender systems are a sub-domain of information filtering techniques that attempt to recommend items such as movies, programs, music, books, and pages that are likely to be of interest to the user. The final product of a recommender system is a top list of items recommended for the user, ordered by an estimated score that represents the user's preference for each item: the higher the score, the more interested the user is expected to be. But producing the right recommendations is not trivial, and there are several studies and lines of research on evaluating the recommendations of such engines.
Therefore, when you develop a recommender engine for your problem domain, the main question after all the work is done is: "What are the best recommendations for the user?" Before looking for answers, we should investigate the question itself. What exactly do we mean by a good recommendation? And how will we know when the recommender system is producing them? The remainder of this post explains how we can evaluate such engines so that we can provide the best possible recommender, one that recommends items that the user hasn't yet seen or expressed any preference for. An optimal recommender would be one that could predict all your preferences exactly, present a set of items ranked by your future preference, and be done.
One classic way to evaluate a recommender is to measure how accurately it estimates preference values, that is, evaluating how closely the estimated preferences match the actual preferences of the user.
In the context of recommendations, 'Precision' would measure the proportion of top results that are relevant, while 'Recall' would measure the proportion of all relevant results that are included in the top results. See Figure 1 as an example illustrating those metrics.
In a formal way, we could consider documents as instances, and the task is to return a set of relevant documents given a search term; that is, to assign each document to one of two categories: relevant and not relevant. Recall is defined as the number of relevant documents (the instances that belong to the relevant category) retrieved by a search divided by the total number of existing relevant documents, while precision is defined as the number of relevant documents retrieved divided by the total number of documents retrieved by the search.
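To make those definitions concrete, here is a minimal Python sketch (my own illustration, not code from the post) computing both metrics from two sets of document IDs:

```python
# Minimal sketch: precision and recall over sets of document ids.
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                       # relevant docs retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 5 retrieved documents are relevant, out of 6 relevant documents overall:
print(precision_recall({1, 2, 3, 4, 5}, {2, 4, 5, 7, 8, 9}))  # (0.6, 0.5)
```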
Running an example
For this example I will use a real data set from Apontador, a Brazilian location-based social network where users find and rate places, available on the web and on mobile phones such as iPhones and Blackberries. I fetched the sample data set using the Apontador API, by writing a simple Python crawler to access their data (the code I used is based on the Python library developed by Apontador). The output is a comma-delimited file with 3463 samples composed of user IDs, location IDs (places) and ratings (preference values in the range 1 to 5) for each location. In the figure below you can see the structure of this file. Unfortunately, for privacy purposes, I can't show the respective usernames and place names for each place rated.
Figure: 234 user IDs, 1599 place IDs, 3463 ratings in the Apontador data set
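For readers who want to follow along, here is a minimal sketch of how such a comma-delimited file could be parsed in Python; the file name apontador_ratings.csv is hypothetical, since the actual data set is not distributed:

```python
import csv
from collections import defaultdict

# Each row of the hypothetical file: user_id, place_id, rating (1-5).
ratings = defaultdict(dict)                # ratings[user_id][place_id] = preference
with open('apontador_ratings.csv') as f:
    for user_id, place_id, rating in csv.reader(f):
        ratings[user_id][place_id] = float(rating)

places = {p for prefs in ratings.values() for p in prefs}
print(len(ratings), 'users,', len(places), 'places')  # expected: 234 users, 1599 places
```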
In the figure above, the table illustrates the difference between a set of actual and estimated preferences, and how they are translated into scores. RMSE heavily penalizes estimates that are far off, such as place 2 there, and that is considered desirable by some, since the magnitude of the difference matters: when you estimate an item at 2 stars whose actual preference is 5 stars, it is probably more than twice as 'bad' as an estimate that is off by just 1 star.
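A toy example makes the effect of the squaring visible (the ratings below are made up for illustration):

```python
import math

actual    = [5.0, 3.0, 4.0]
estimated = [2.0, 3.0, 3.0]   # the first item misses by 3 stars, the last by 1 star

squared_errors = [(a - e) ** 2 for a, e in zip(actual, estimated)]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(rmse)  # ~1.83: the 3-star miss adds nine times the squared error of the 1-star miss
```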
After running our recommender system with the training set as input, we compare its estimated preferences against the actual test data. The main code for this task can be found here at my personal repository at Github (Crab Recommender System). We use 70% of the data for training and test with the other 30%; those sets are chosen randomly.
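A minimal sketch of such a random split (this is not the exact code from the Crab repository):

```python
import random

def split_ratings(samples, train_fraction=0.7, seed=None):
    """samples: list of (user_id, place_id, rating) tuples."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]   # (training set, test set)

# For the 3463 Apontador samples this yields 2424 training and 1039 test ratings.
```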
The evaluation shows three scores as a result. The first one indicates how well the recommender (slope-one) performed; in this case we use the Root Mean Square Error (RMSE). The value 1.04 is not great, but there is so little data here to begin with. Your results may differ, since the data set is split randomly and hence the training and test sets may differ with each run.

The second and third scores are, respectively, the precision and the recall. Precision at 20 is 0.1722: on average, about 3 of the top 20 recommendations were 'good'. Recall at 20 is also 0.1722: on average, only about 17% of the 'good' items appeared among those top recommendations. But we still haven't decided what exactly a 'good' recommendation is. Intuitively, the most highly preferred items in the test set are the good recommendations, and the rest aren't.
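In code, the 'at 20' metrics for a single user could be sketched as below (my own illustration; the set `good` stands for whatever items we decide count as good recommendations):

```python
def precision_recall_at_k(recommended, good, k=20):
    """recommended: ranked list of item ids; good: set of 'good' item ids."""
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in good)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(good) if good else 0.0
    return precision, recall
```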
Let's take a look at the preferences of user 2452498672 in our sample data set. Suppose the test data holds this user's preferences for items A, B, and C, with preference values 4, 3, and 2. With these values missing from the training data, our expectation is that the recommender engine will recommend A before B, and B before C, because we know the order in which user 2452498672 prefers those items. But how will we know which items are a good idea to recommend? Consider item C, which the user doesn't seem to like much, or item B, which is just average. We could consider A a good recommendation, whereas B and C are valid, but not good, recommendations. So we conclude that it is important to set a threshold that divides good recommendations from bad. If we choose not to pick one explicitly, we can use a statistic such as the user's average preference plus one standard deviation.
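Applied to the example above (ratings 4, 3, and 2), the 'average plus one standard deviation' rule can be sketched like this:

```python
import statistics

def good_threshold(user_ratings):
    """Items rated above mean + one standard deviation count as 'good'."""
    return statistics.mean(user_ratings) + statistics.pstdev(user_ratings)

print(good_threshold([4, 3, 2]))  # ~3.82, so only item A (rated 4) counts as 'good'
```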
If we vary the number of recommended items, we can plot a graph using these two measures together (precision x recall), so that we can assess to what extent the good results blend with the bad ones. The goal is to get the precision and the recall of the recommendation lists both close to one. The figure below shows the plot of those quantities for the slope-one recommender: for each list of N recommendations, we plot a point corresponding to the precision and recall values of that list.
Figure: Precision-Recall graph
Looking at the figure, we can conclude that the slope-one approach is not particularly effective for this type of data set, maybe because of the small number of samples available to train and test the recommender, or because of the characteristics of the data set itself.
Running the example again

Running our slope-one recommender again on our sample data set, we get as a result the F1 score of 0.1722.
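The F1 score is the harmonic mean of precision and recall, which explains the number: when precision and recall are equal, their harmonic mean is that same value. A quick check in Python:

```python
# F1 is the harmonic mean of precision and recall.
def f1_score(precision, recall):
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

# With precision and recall both at 0.1722, F1 is 0.1722 as well:
print(f1_score(0.1722, 0.1722))  # ~0.1722
```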
As we can see from the result, the recommender did not perform well on this data set. That may be due to the small amount of data, or to particular characteristics of the data set that are not appropriate for this type of algorithm. More robust tests would be necessary to assess its performance and compare it with other approaches. One of the particular features of the slope-one recommender is that it can produce quick recommendations at runtime with a simple algorithm, but it takes significant time to pre-compute its internal structures before it can start.
The figure below presents our plot combining the length of the recommendation lists, represented by 'at', using our training and test data set, with the F-scores estimated for each length. The best approaches are expected to have F-scores near 1.0.

Figure: F1 score graph; the best results have an F-score near 1.0 (top right of the graph)
For the slope-one recommender, the precision and recall values produced are interesting and illustrate the power of a simple collaborative filtering technique that involves just a little bit of linear algebra. You can learn more about the slope-one recommender system here; soon I will also cover this technique in a dedicated post. What is important to take away from these evaluations is that you have to choose carefully which recommendation algorithm to use. Each algorithm has its own characteristics and properties that can interact in hard-to-predict ways with a given data set, so it is essential to evaluate as many variations of collaborative, content-based and hybrid algorithms as possible to find which one is faster or more accurate for the data set you want to work with.
Another important observation related to precision and recall tests is how we define what a 'good' recommendation is. As we have seen earlier, we must define a threshold to represent the usefulness of a recommended item, and a poor choice could really harm the evaluation of the recommender algorithm. Furthermore, these tests can be problematic when we consider recommendations of items that aren't necessarily among those the user already knows about! Imagine running such a test for a user who would love the restaurant 'Recanto Paraibano' but has never heard of it before. Of course it is a great recommendation for this user; yet, if a recommender actually returned this restaurant when recommending new places to go, it would be penalized, since the test framework can only pick good recommendations from among those already in the user's set of preferences. This is a discovery problem, a harder task for recommender engines and a special topic addressed by several researchers in the literature.
The problem is further complicated if the preferences are 'booleans', such as 'liked' or 'disliked', and don't contain preference values. This implies that the test has no notion of relative preference with which to select a subset of good items, so the best it can do is randomly select some 'liked' items as the good ones. Despite all these problems, the test is of great use, but it is not perfect, so you must understand its limitations in the context of the features of the data set available for you to work with.
In this post I presented some metrics used to evaluate the quality of the recommendations produced by a recommender engine, and explained them through examples using a real data set from the Apontador social network. All the code used here is provided at my personal repository at Github and is written in Python.
Finally, I conclude that evaluations are really important in the recommendation engine building process, and they can be used to empirically discover improvements to a recommendation algorithm.

Below are all the references used to write this post.
I hope you have enjoyed this post.
Regards,
Marcel Caraciolo
References

Herlocker, J., Konstan, J., Terveen, L., and Riedl, J. Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems, 2004.
Wikipedia, F1 Score.
Wikipedia, Precision and Recall.
Bryan Sullivan's Blog, Collaborative Filtering Made Easy, 2006.
Apontador, Apontador API.
Posted by Marcel Caraciolo at 6:47 AM
Tags: API, Apontador, collaborative filtering, Crab, crawler, data mining, database, evaluation, precision, Python, recall, recommendation systems, slope-one recommender, visualization
Comments:

Marcel Caraciolo, in reply to FBarth (May 4, 2011 at 7:28 AM):
Hi Fbarth! Thanks :D By the way, the Apontador team is quite open and helped me in the data collection step using your API. This project is amazing, and I have another great idea for Apontador; we should talk more about it sometime. It is related to Q&A and locations. Thanks for your words! Regards, Marcel
Eduardo Maan, May 4, 2011 at 8:57 AM:
(remove the top N items of a user and then try to match them with the top N recommendations), all of precision, recall and F1-score would always yield *exactly* the same value. Isn't that correct?