Evaluating Recommender Systems - Explaining F-Score, Recall and Precision using Real Data Set from Apontador
WEDNESDAY, MAY 4, 2011


Hi all,
In this post I will introduce three metrics widely used for evaluating the utility of the recommendations produced by a recommender system: Precision, Recall and the F1 Score. The F1 Score is slightly different from the other two, since it is a measure of a test's accuracy that considers both the precision and the recall of the test to compute the final score.
An Introduction to Recommender System Evaluation
Recommender systems are a sub-domain of information filtering techniques that attempt to recommend items such as movies, programs, music, books or web pages that are likely to be of interest to the user. The final product of a recommender system is a top list of items recommended for the user, ordered by an estimated score that represents the preference of the user for each item: the higher the value, the more interested the user is expected to be. But producing the right recommendations is not trivial, and there is a large body of studies and research on evaluating the recommendations such engines produce.

Therefore, when you develop a recommender engine for your problem domain, the main question after all the work is done is: "What are the best recommendations for the user?" Before looking for answers, we should investigate the question. What exactly do we mean by a good recommendation? And how will we know when the recommender system is producing them?

The remainder of this post explains how we can evaluate such engines, keeping in mind that a good recommender should suggest items the user hasn't yet seen or expressed any preference for. An optimal recommender system would be one that could predict all your preferences exactly, present a set of items ranked by your future preference, and be done.

For that, most recommender engines operate by trying to do just that: estimating ratings for some or all items the user has not rated yet. One possible way of evaluating a recommender's suggestions is therefore to evaluate the quality of its estimated preference values, that is, how closely the estimated preferences match the actual preferences of the user.

Introducing Precision and Recall Metrics


In recommender systems, the most important result for the final user is to receive an ordered list of recommendations, from best to worst. In fact, in some cases the user doesn't care much about the exact ordering of the list - a set of a few good recommendations is fine. Taking this into account when evaluating recommender systems, we can apply classic information retrieval metrics to evaluate those engines: Precision and Recall. These metrics are widely used in information retrieval and applied to domains such as search engines, which return some set of best results for a query out of many possible results. A search engine, for example, should not return irrelevant results among the top results, although it should be able to return as many relevant results as possible. We could say that 'Precision' is the proportion of top results that are relevant, considering some definition of relevant for your problem domain. 'Precision at 10' would then be this proportion judged over the top 10 results. 'Recall' would measure the proportion of all relevant results included in the top results. See Figure 1 as an example illustrating those metrics.

Precision and Recall in the context of Search Engines

In a formal way, we could consider documents as instances and the task to be returning the set of relevant documents given a search term. The task would then be to assign each document to one of two categories: relevant and not relevant. Recall is defined as the number of relevant documents (the instances that belong to the relevant category) retrieved by a search divided by the total number of existing relevant documents, while precision is defined as the number of relevant documents retrieved by a search divided by the total number of documents retrieved by the search.
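Written out in the usual information retrieval notation (the same definitions as above, just in symbols; this is the standard formulation, not anything specific to this post):

$$\text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|} \qquad \text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$$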

In recommender systems, those metrics can be adapted as follows:

"The precision is the proportion of recommendations that are good recommendations, and recall is the proportion of good recommendations that appear in top recommendations."

In the recommendations domain, a perfect precision score of 1.0 means that every item recommended in the list was good (although it says nothing about whether all good recommendations were suggested), whereas a perfect recall score of 1.0 means that all good items were suggested in the list (although it says nothing about how many bad recommendations were also in the list).
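To make the adapted definitions concrete, here is a minimal sketch in plain Python; the names (precision_at, recall_at, good_items) are illustrative and not taken from the Crab code used later in this post:

import collections  # not required here, just signalling this is plain standard Python

def precision_at(recommended, good_items, n):
    """Proportion of the top-n recommended items that are 'good'."""
    top_n = recommended[:n]
    if not top_n:
        return 0.0
    hits = sum(1 for item in top_n if item in good_items)
    return hits / float(len(top_n))

def recall_at(recommended, good_items, n):
    """Proportion of all 'good' items that appear in the top-n recommendations."""
    if not good_items:
        return 0.0
    hits = sum(1 for item in recommended[:n] if item in good_items)
    return hits / float(len(good_items))

# Toy example: 2 of the top 3 recommendations are good, and 2 of the 4 good items were retrieved.
recommended = ['A', 'B', 'C', 'D', 'E']
good_items = set(['A', 'C', 'F', 'G'])
print(precision_at(recommended, good_items, 3))  # 0.666...
print(recall_at(recommended, good_items, 3))     # 0.5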


Running an example

To show an example of evaluating a recommender system, let's test-drive a 'slope-one' recommender on a simple data set fetched from Apontador, a Brazilian location-based social networking website and mobile application. Its main goal is to help people interact with their friends and update their location using GPS-enabled mobile devices, such as iPhones and Blackberries. I fetched the sample data set using the Apontador API by writing a simple Python crawler (the code I used is based on the Python library developed by Apontador). The output is a comma-delimited file with 3463 samples composed of user IDs, location IDs (places) and ratings (preference values in the range 1 to 5) for each location. In the figure below you can see the structure of this file. Unfortunately, for privacy purposes I can't show the respective usernames and places for each place rated.
234 UserIDs, 1599 PlaceIDs, 3463 Ratings for the Apontador Data Set
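As a minimal sketch of how such a comma-delimited file could be loaded into memory (the filename and the user, place, rating column order are assumptions, since the real file isn't shown here):

import csv
from collections import defaultdict

ratings = defaultdict(dict)  # user_id -> {place_id: rating}
with open('apontador_ratings.csv') as ratings_file:
    for user_id, place_id, rating in csv.reader(ratings_file):
        ratings[user_id][place_id] = float(rating)

print(len(ratings))                                     # ~234 distinct users
print(sum(len(places) for places in ratings.values()))  # ~3463 ratings in total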

After some pre-processing for the experiments, I turned the database into a rating distribution plot. This plot is a heat map illustrating the distribution of the ratings given by the users (rows) to places (columns). The black areas represent the absence of a rating for that particular place from that user.
Ratings Matrix - Users x Items


To evaluate the accuracy of the recommender engine we must have training data and test data. Generally this has to be simulated, since for genuinely new items there is no way to know for sure how the user will like them in the future. For this, the data mining researcher must set aside a small part of the real data set as test data. These test preferences are not present in the training data fed into the recommender under evaluation, which is all the data except the test data. The recommender is then asked to estimate preferences for the missing test data, and the estimated scores are compared to the actual values.

Based on this data set split, it is simple to produce a kind of "score" for the recommender. For instance, we could compute the average difference between the estimated and actual preferences. This corresponds to another popular metric, the root-mean-square error (RMSE): the square root of the average of the squared differences between actual and estimated preference values. The best recommender is the one with the lowest RMSE, because that means its estimates differed less from the actual preference values; 0.0 would mean perfect estimation, no difference at all between the estimates and the actual values.
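A minimal sketch of that score in plain Python, assuming the estimated and actual preferences have already been paired up (this only illustrates the definition, it is not the evaluator code used below):

from math import sqrt

def rmse(estimated, actual):
    """Square root of the mean squared difference between estimated and actual preferences."""
    squared_diffs = [(e - a) ** 2 for e, a in zip(estimated, actual)]
    return sqrt(sum(squared_diffs) / float(len(squared_diffs)))

print(rmse([3.5, 4.0, 2.0], [4.0, 4.0, 5.0]))  # ~1.76; the 3-star miss dominates the score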

In the figure above, the table illustrates the difference between a set of actual and estimated preferences, and how they are translated into scores. RMSE heavily penalizes estimates that are far off, such as the one for place 2 there, and that is considered desirable by some: when you estimate an item at 2 stars whose actual preference is 5 stars, that is probably more than twice as 'bad' as an estimate that is off by just 1 star.

We then run our recommender system with the training set as input and let the evaluator compare its estimated preferences against the actual test data. The main code for this task can be found at my personal repository on Github (Crab Recommender System). We will use 70% of the data to train and test with the other 30%, and those sets are chosen randomly.
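A sketch of that random split, assuming the ratings are held as (user, place, rating) tuples; the Crab evaluator performs this step internally, so this is only to illustrate the idea:

import random

def split_dataset(samples, train_fraction=0.7):
    """Randomly shuffle the samples and split them into a training and a test set."""
    shuffled = list(samples)
    random.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

samples = [('u1', 'p1', 4.0), ('u1', 'p2', 3.0), ('u2', 'p1', 5.0), ('u2', 'p3', 2.0)]
train, test = split_dataset(samples)  # with the real data, ~70% of the 3463 ratings go to train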


Evaluating the Slope One Recommender

It shows three scores as its result. The first one indicates how well the recommender (slope-one) performed; in this case we use the Root Mean Square Error (RMSE). The value 1.04 is not great, but there is very little data here to begin with. Your results may differ since the data set is split randomly, and hence the training and test sets may differ with each run.

The second and the third scores are, respectively, the precision and recall. Precision at 20 is 0.1722, so on average about 3 of the 20 recommendations (0.1722 x 20 ≈ 3.4) were 'good'. Recall at 20 is also 0.1722, so on average only about 3 of the good recommendations appeared among those top recommended. But we still haven't decided what exactly a 'good' recommendation is. Intuitively, the most highly preferred items in the test set are the good recommendations, and the rest aren't.
Let's take a look at the preferences of user 2452498672 in our simple data set. Suppose we take as test data the preferences for items A, B and C, whose preference values are 4, 3 and 2. With these values missing from the training data, our expectation is that the recommender engine recommends A before B, and B before C, because we know the order in which user 2452498672 prefers those items. But how will we know which items are good to recommend? Consider item C, which the user doesn't seem to like much, or item B, which is just average. We could say that A is a good recommendation whereas B and C are valid, but not good, recommendations. So we conclude that it is important to set a threshold that divides good recommendations from bad. If we choose not to pick one explicitly, we can use a statistic such as the user's average preference plus one standard deviation.
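A sketch of that default threshold (the user's mean rating plus one standard deviation); the dict of preferences is just the A, B, C example above:

def good_threshold(user_ratings):
    """Items rated at or above the user's mean plus one standard deviation count as 'good'."""
    values = list(user_ratings.values())
    mean = sum(values) / float(len(values))
    std = (sum((v - mean) ** 2 for v in values) / float(len(values))) ** 0.5
    return mean + std

prefs = {'A': 4.0, 'B': 3.0, 'C': 2.0}
threshold = good_threshold(prefs)                             # 3.0 + ~0.82 = ~3.82
good = [item for item, r in prefs.items() if r >= threshold]  # only 'A' qualifies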
If we vary the number of recommended items, we can plot a graph using these two measures together (precision x recall) so that we can assess to what extent the good results blend with the bad ones. The goal is to get the precision and recall values of the recommendation lists both close to one. The figure below shows this plot for the slope-one recommender: for each list of N recommendations, we enter a point that corresponds to the precision and recall values of that list. The good precision-recall points are located in the upper-right corner of the graph, because we want both high precision and high recall. These plots are extremely useful for comparing different approaches on a particular data set.

Precision-Recall Graph

Looking at the figure, we can conclude that the slope-one approach is not particularly effective for this data set, maybe because of the small number of samples available to train/test the recommender, or because of the characteristics of the data set itself.
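A sketch of how such a curve can be traced, reusing the precision_at/recall_at helpers sketched earlier and matplotlib for the plotting; the recommendation list and the set of good items are placeholders, not the Apontador data:

import matplotlib.pyplot as plt

recommended = ['A', 'B', 'C', 'D', 'E', 'F']  # ranked recommendations for one user
good_items = set(['A', 'C', 'F'])             # items above the 'good' threshold

points = [(recall_at(recommended, good_items, n),
           precision_at(recommended, good_items, n))
          for n in range(1, len(recommended) + 1)]
recalls, precisions = zip(*points)

plt.plot(recalls, precisions, 'o-')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision x Recall for varying list sizes')
plt.show()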

Explaining F1-Score


The F-Score, or F-measure, is a measure of a statistical test's accuracy. It considers both the precision and recall of the test to compute the score. We can interpret it as a weighted average of precision and recall, where the best F1 score has value 1 and the worst has value 0.

F-Score Formula (Image from Wikipedia)
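For reference, the formula in that image is the standard harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$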

In the recommendations domain, it is a single value obtained by combining the precision and recall measures, and it indicates the overall utility of the recommendation list.

Running an example
Running our slope-one recommender again on our simple data set, we get as a result the value 0.1722.

Evaluating the Slope One Recommender

As we can see from the result, the recommender did not perform well on this data set. This may be due to the size of the data set or to characteristics of the data that are not appropriate for this type of algorithm. More robust tests would be needed to assess its performance and compare it with other approaches. One particular feature of the slope-one recommender is that it can produce quick recommendations at runtime with a simple algorithm, but it takes significant time to pre-compute its internal structures before it can start.
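Since slope-one will only get its own post later, here is a minimal sketch of the weighted slope-one idea in plain Python (an illustration of the algorithm in general, not the Crab implementation that produced the numbers above):

from collections import defaultdict

def build_diffs(ratings):
    """ratings: user_id -> {item_id: rating}. Pre-computes, for each item pair,
    the average rating difference and the number of users who rated both items."""
    freqs = defaultdict(lambda: defaultdict(int))
    diffs = defaultdict(lambda: defaultdict(float))
    for user_ratings in ratings.values():
        for i, r_i in user_ratings.items():
            for j, r_j in user_ratings.items():
                if i != j:
                    freqs[i][j] += 1
                    diffs[i][j] += r_i - r_j
    for i in diffs:
        for j in diffs[i]:
            diffs[i][j] /= freqs[i][j]
    return diffs, freqs

def predict(user_ratings, item, diffs, freqs):
    """Weighted slope-one estimate of `item` for a user, given the user's known ratings."""
    numerator, denominator = 0.0, 0
    for j, r_j in user_ratings.items():
        if j != item and freqs[item].get(j):
            numerator += (diffs[item][j] + r_j) * freqs[item][j]
            denominator += freqs[item][j]
    return numerator / denominator if denominator else None

ratings = {'u1': {'A': 5.0, 'B': 3.0, 'C': 2.0},
           'u2': {'A': 3.0, 'B': 4.0},
           'u3': {'B': 2.0, 'C': 5.0}}
diffs, freqs = build_diffs(ratings)
print(predict({'A': 2.0, 'C': 4.0}, 'B', diffs, freqs))  # estimated rating for item B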
The figure below presents our plot combining the length of the recommendation lists, represented by 'at', using our training and test data set, and the F-scores estimated for each length. The best approaches are expected to have F-scores close to 1.0.

F1 Score Graph - the best scores are near 1.0 (top right of the graph)

Conclusions and Contribution



For the slope-one recommender, the values produced for precision and recall are interesting and illustrate the power of a simple collaborative filtering technique that involves just a little bit of linear algebra. You can learn more about the slope-one recommender system here; soon I will also talk more about this technique in a dedicated post. What is important to take away from these evaluations is that you have to choose carefully which recommendation algorithm to use. Each algorithm has its own characteristics and properties that can interact in hard-to-predict ways with a given data set, so it is essential to try as many variations of collaborative, content-based and hybrid algorithms as possible in order to evaluate which one is faster or more accurate for the data set we want to work with.
Another important observation related to precision and recall tests is how we define what a 'good' recommendation is. As we have seen earlier, we must define a threshold to represent the usefulness of a recommended item, and a poor choice could really hurt the evaluation of the recommender algorithm. Furthermore, those tests can be problematic if we consider recommendations that are not necessarily among the items the user already knows about. Imagine running such a test for a user who would love the restaurant 'Recanto Paraibano'. Of course it is a great recommendation for this user, but the user has never heard of this restaurant before. If a recommender actually returned this restaurant when recommending new places to go, it would be penalized, since the test framework can only pick good recommendations from among those already in the user's set of preferences. This is a discovery problem, a harder task for recommender engines, and a special topic approached by several researchers in the literature.

The problem is further complicated if the preferences are 'boolean', such as 'liked' or 'disliked', and don't carry preference values. This implies that the test doesn't have a notion of relative preference on which to select a subset of good items, so the best it can do is to randomly select some 'liked' items as the good ones. Despite all these problems the test is of great use, but it is not perfect, so you must understand its limitations in the context of the features of the data set available to you.
In this post I presented some metrics used to evaluate the quality of the recommendations produced by a recommender engine, and explained them through some examples using a real data set from the Apontador social network. All the code used here is provided at my personal repository on Github and is all written in Python.

Finally, I conclude that evaluations are really important in the recommendation engine building process, and they can be used to empirically discover improvements to a recommendation algorithm.

Below are all the references used to write this post.

I hope you have enjoyed this post.
Regards,
Marcel Caraciolo

References
Herlocker, Evaluating Collaborative Filtering Recommender Systems, 2004.
Wikipedia, F1-Score.
Wikipedia, Precision and Recall.
Bryan Sullivan's Blog, Collaborative Filtering made easy, 2006.
Apontador, Apontador API.
POSTED BY MARCEL CARACIOLO AT 6:47 AM

TAGS API, APONTADOR, COLLABORATIVE FILTERING, CRAB, CRAWLER, DATA MINING, DATABASE,
EVALUATION, PRECISION, PYTHON, RECALL, RECOMMENDATION SYSTEMS, SLOPE ONE
RECOMMENDER, VISUALIZATION

6 COMMENTS:
FBARTH
May 4, 2011 at 7:28 AM

Marcel, congratulations! Very nice post! This kind of explanation about evaluating recommendation systems is always very useful. Besides, with real data, like the Apontador data, it gets even better!

MARCEL PINHEIRO CARACIOLO


May 4, 2011 at 8:42 AM

Hi Fbarth! Thanks :D By the way the Apontador team is quite open and helped
me in the data collection step using your API.
This project is amazing, I have another great idea for Apontador. We should
talk more about it sometime. It is related to Q&A and Locations.
Thanks for your words!
Regards,
Marcel

EDUARDO MAAN
May 4, 2011 at 8:57 AM

Hey Marcel, congratulations! I learned about your experiment via fbarth :)


I must say I'm really amazed. Good job! :)

JUAN ANTONIO NAVARRO PÉREZ


March 14, 2012 at 10:54 AM

Hmm, it seems that because of how "IRStatsRecommenderEvaluator" is defined (remove the top N items of a user and then try to match them with the top N recommendations), all of precision, recall and F1-score would always yield *exactly* the same value. Isn't that correct?


ANONYMOUS
November 10, 2014 at 1:24 AM

Thanks good post!


