
Twitrank

Nitin Dhar
Quentin Swain
Maggie Neuwald

Inspiration behind the project

• Citizen journalism and micro-blogging services are increasingly becoming accepted sources of information for unfolding and recent events.

• Being able to effectively use Twitter to find information about news events may increase the speed of information dispersal and increase the variety of witnesses and accounts of an event.

• To be effective, the search engine must be able to return the tweets most relevant to unfolding events, especially as usage increases and the number of tweets grows large.
Problem Statement

Searching for tweets relevant to events or topics can be extremely useful, but its usefulness depends on how the results are ranked and displayed to the user. Many issues must be considered when ranking tweets, such as:
• The brevity of tweets (restricted to 140 characters).
• Tweets are posted continuously, and large numbers of tweets are being posted at any given time.
• Many tweets are less informative or relevant about a news event than others.
• Many tweets contain spam, or are noisy and unrelated to the events or topics that are considered trends.
Goals of the project

Find a ranking methodology that ranks more relevant tweets higher in the results than the current approach used in Twitter's native search function.

• Baseline: Twitter ranks tweets containing the keywords the user entered based on the timestamp of the tweet, with more recent tweets ranked higher.
• Goal: to find a methodology that can rank tweets using machine-learned predictive signals, so that more relevant tweets are ranked higher, as compared to the baseline.
Examples of related work

• Time is of the Essence: Improving Recency Ranking Using Twitter Data
  • Uses Twitter URLs, text, and metadata to index new URLs, judge relevance, and expand queries through surrounding text

• Ranking Approaches for Microblog Search
  • Uses author-related features to rank tweets by relevance

• The Utility of Tweeted URLs for Web Search
  • Focuses on judging the quality of URLs in tweets for use in search engines

• On Popularity in the Blogosphere
  • Also explores author popularity to determine relevance, but focuses on blogs rather than Twitter
How our work is unique

• We consider Twitter as a source of information in its own right, not as an additional signal to assist a separate search engine or URL index

• We use graded relevance judgments for each result, rather than asking editors to choose the more relevant of two tweets

• We are focusing on finding relevant, real-time information in the domain of news events, rather than mining signals for use in general web search engines
Methodology: Overview

• We would train a model by feeding it a set of manually constructed queries

• The results of the queries would be judged and scored for relevance by "highly-trained editors" (ourselves, in this case)

• The model would then use a linear regression to find which signals are most predictive based on the relevance judgments

• Finally, the model would be run against test data to judge how effective it is at ranking tweets according to relevance
Methodology: Relevance

First, we worked on defining relevance…

1st attempt:
• Topical relevance (it must be about the event being searched)
• Proximity to the event (eye-witness accounts)
• New information (rather than emotion)
• Depth and quality of information (either in the tweet or in embedded links)
Methodology: Relevance

• In our next attempt, we clarified our goal with the following assumptions:
• Users are mostly interested in information about an event. They are only interested in opinions that include relevant information about the event.
• They are interested in both text and links. If one is subpar but the other is not, they will find some value in the result, although it may not be as valuable as a result that is completely relevant.
• "Relevant" means that the tweet includes facts about the event itself and is topically relevant to the event.
Methodology: Relevance

• These assumptions resulted in the following ranking strategy:

0 "No Relevance"
  • All information is topically irrelevant to the event.
  • The text and links are completely unrelated to the event.
  • There is no information given about the event, and any opinions are unrelated to the event itself.
  • Keywords are present but are not about the event, or the text is about something else topically.

1 "Somewhat Relevant"
  • There is some topically irrelevant information present, but there is also some relevant information.
  • Either the text or the link is unrelated to the event or uninformative.
  • Keywords may match but no relevant information is given: the tweet does not provide much information about the event, and any opinions do not contain information about the event.

2 "Mostly Relevant"
  • Although some information may not be correct or may be pure opinion, there is still mostly topically relevant information about the event.
  • Perhaps the link is incorrect or the text is mostly opinion, but the tweet is mostly about the event.
  • Keywords match and the tweet is topically relevant, or the link is relevant although there is not as much relevant text.

3 "Highly Relevant"
  • All information present is correct and relevant.
  • The text and links are completely related to the event, and new or interesting information is given.
Methodology: Defining signals

• We came up with a list of candidate signals, relying mostly on the metadata available within Twitter for the majority of them, and combining some with data related to the query:
• % of keywords in the tweet*
• the number of retweets
• the location of the user with respect to the event
• the time of the tweet with respect to the event
• the number of followers of the user and the number of people the user follows
• the status count of the user
• the favorites count of the user
• the relevance score of the tweet

*We later also added the length of the tweet as a signal, although it may act as a proxy for the keyword-percentage signal, since the search results must contain all of the query keywords.
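As a rough illustration, here is a minimal Python sketch of assembling these nine signals (including the later-added tweet length) into a feature vector for one search result. The tweet and user field names follow the Twitter API's JSON; the query structure and the distance and time helpers are our own hypothetical simplifications, not the original Twitrank code.

    import math

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two points, in kilometers."""
        p1, p2 = math.radians(lat1), math.radians(lat2)
        dp = math.radians(lat2 - lat1)
        dl = math.radians(lon2 - lon1)
        a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
        return 2 * 6371.0 * math.asin(math.sqrt(a))

    def extract_signals(tweet, user, query):
        """Map one search result to the nine candidate signals, in tuple order.

        Assumes tweet["created_at"] and query["event_time"] are parsed datetimes;
        the query fields are hypothetical.
        """
        words = tweet["text"].lower().split()
        keywords = {k.lower() for k in query["keywords"]}
        return [
            len(tweet["text"]),                                      # 1: tweet length
            100.0 * sum(w in keywords for w in words) / len(words),  # 2: % of keywords in tweet
            tweet["retweet_count"],                                  # 3: number of retweets
            haversine_km(user["lat"], user["lon"],                   # 4: user location vs. event
                         query["event_lat"], query["event_lon"]),
            abs((tweet["created_at"] - query["event_time"])          # 5: hours between tweet and event
                .total_seconds()) / 3600.0,
            user["followers_count"],                                 # 6: number of followers
            user["friends_count"],                                   # 7: number the user follows
            user["statuses_count"],                                  # 8: status count
            user["favourites_count"],                                # 9: favorites count
        ]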
Methodology: Generating queries

• We needed to construct a set of queries and results to be manually judged by editors for relevance.
• We wanted the queries to be reflective of our purpose and therefore related to current, unfolding news events.
Methodology: Generating queries
• Our initial plan for generating queries was to use Wikipedia to find current news events and extract the structured data from their infoboxes, generating data about each event from which to manually construct queries.
• We ended up abandoning this methodology for a number of reasons:
• Many events did not have infoboxes constructed for them, making it more difficult to extract the information programmatically.
• Not all unfolding events had traditional wiki pages; some were only mentioned within the page for a country or other topic.
Methodology: Generating queries
• Our 2nd attempt involved using Wikipedia's Year pages to manually construct queries based on events that occurred over the last two years.
• When constructing the queries, we would limit the results to a time window starting at the initial time of the event, then extend the end date in 24-hour increments until at least 50 tweets were returned (sketched after this list).

• We also had to abandon this methodology, for the following reason:
• Twitter only allows you to pull tweets from the past two weeks. Although the search API accepts start and end dates, it is currently limited to the past two weeks, which made all of the events we had constructed from the past unusable.
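For illustration, a minimal sketch of that window-widening loop, assuming a caller-supplied search_fn(query, since, until) that wraps the Twitter search API (a hypothetical helper, not part of the original system):

    from datetime import timedelta

    def widen_until_enough(search_fn, query, event_date, min_results=50, max_days=14):
        """Extend the search window in 24-hour steps until enough tweets return.

        search_fn is a caller-supplied function wrapping the Twitter search API;
        max_days reflects Twitter's roughly two-week search horizon.
        """
        end = event_date
        results = []
        for _ in range(max_days):
            end += timedelta(days=1)
            results = search_fn(query, since=event_date, until=end)
            if len(results) >= min_results:
                break
        return results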
Methodology: Generating queries

• Our final methodology, shaped by the constraints previously mentioned, was as follows:
• We constantly searched news sites (The New York Times, allvoices.com, and infoplease.com) to find new current events around the world.
• We would then construct a short query based on each event (due to the limited length of tweets, queries had to be kept very short to return results).
• We would also record the date, country or state, and latitude/longitude of the event to include with the query.
Using Twitrank to judge tweets

• To enable our team to work simultaneously on generating queries and judging results, we built an app called Twitrank.
• Twitrank allows you to enter data for a query, then uses a Ruby library to pull tweets from the Twitter API based on your query parameters, returning both the results and the associated metadata (signals) for each result (see the sketch below).
• Twitter also limits the number of requests you can make to its API within an hour; using multiple Twitter API accounts allowed us to retrieve more results per hour than a single account could.
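Twitrank itself uses a Ruby library, but for illustration here is a minimal Python sketch of an equivalent request against Twitter's v1.1 search endpoint; the bearer token and parameter values are placeholders, and error handling is omitted beyond the status check.

    import requests

    SEARCH_URL = "https://api.twitter.com/1.1/search/tweets.json"

    def search_tweets(bearer_token, keywords, lat, lon, radius_km, until_date):
        """Pull tweets plus the metadata we use as ranking signals for one query."""
        params = {
            "q": " ".join(keywords),                  # short keyword query
            "geocode": f"{lat},{lon},{radius_km}km",  # restrict results by location
            "until": until_date,                      # YYYY-MM-DD upper bound on tweet date
            "count": 100,                             # results per request
            "result_type": "recent",                  # Twitter's native time-ordered ranking
        }
        headers = {"Authorization": f"Bearer {bearer_token}"}
        resp = requests.get(SEARCH_URL, params=params, headers=headers)
        resp.raise_for_status()
        # Each status carries the metadata we treat as signals: retweet_count,
        # created_at, user.followers_count, user.friends_count,
        # user.statuses_count, user.favourites_count, and so on.
        return resp.json()["statuses"]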
Twitrank Screenshots

[screenshots of the Twitrank interface omitted]
System Architecture
Loading Data into Sofia-ml

• Sofia-ml used logistic regression to determine how predictive each signal is and to weight it accordingly.
• In order to load it, we first needed to transform our results data into tuples of the form:
• <class-label> 1:<value> 2:<value> 3:<value> 4:<value> 5:<value> 6:<value> 7:<value> 8:<value> 9:<value>

• Our relevance score was the dependent variable, or class label, with values 0-3 (we also tested it using values of 0, 1, 10, and 100 to weight more relevant documents more heavily)
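A minimal sketch of that transformation, assuming each judged result is a (relevance, signals) pair with the nine signal values in the tuple order below; the file name and the sofia-ml invocation in the trailing comment are illustrative:

    def to_sofia_line(relevance, signals):
        """Format one judged result as a sofia-ml (SVM-light style) training line."""
        features = " ".join(f"{i}:{v}" for i, v in enumerate(signals, start=1))
        return f"{relevance} {features}"

    def write_training_file(judged_results, path="train.txt"):
        """judged_results: iterable of (relevance, [nine signal values]) pairs."""
        with open(path, "w") as f:
            for relevance, signals in judged_results:
                f.write(to_sofia_line(relevance, signals) + "\n")

    # Training can then be run from the command line, e.g.:
    #   sofia-ml --learner_type logreg-pegasos --training_file train.txt --model_out model.txt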
Signal to Tuple Mapping

Tuple Signal
1 Tweet length
2 % of keywords in tweet
3 Number of retweets
4 Location of user with respect to event
5 Time difference of tweet with respect to the event
6 Number of followers of user
7 Number of people the user follows
8 Status count of user
9 Favorites count of user
Example Tuples

• 3 1:3 2:66 3:0 4:100 5:6 6:101 7:65 8:1189 9:37 // this one was given a relevance of 3
• 2 1:2 2:100 3:0 4:5 5:6 6:101 7:65 8:1189 9:37 // this one was given a relevance of 2
Sofia Outputs

• Example model generated from test data:
• 6.49869e-05 0.00796799 0.000791874 0.0990353 0.00182186 0.2668 0.300435 3.13506 -0.00871805
Coefficients for each signal:
Signal Coefficient
Tweet length 6.49869e-05
% of keywords in tweet 0.00796799
Number of retweets 0.000791874
Location of user with respect to event 0.0990353
Time difference of tweet with respect to the event 0.00182186
Number of followers of user 0.2668
Number of people the user follows 0.300435
Status count of user 3.13506
Favorites count of user -0.00871805
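Given such a model, ranking new results reduces to a dot product of each tweet's signal vector with the learned weights. A minimal sketch reusing the example tuples above (the linear scoring form is an assumption consistent with sofia-ml's linear models):

    def score(weights, signals):
        """Linear model score: dot product of learned weights and signal values."""
        return sum(w * x for w, x in zip(weights, signals))

    weights = [6.49869e-05, 0.00796799, 0.000791874, 0.0990353,
               0.00182186, 0.2668, 0.300435, 3.13506, -0.00871805]

    results = {
        "tweet_a": [3, 66, 0, 100, 6, 101, 65, 1189, 37],
        "tweet_b": [2, 100, 0, 5, 6, 101, 65, 1189, 37],
    }

    # Rank the results by descending model score
    ranked = sorted(results, key=lambda t: score(weights, results[t]), reverse=True)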
Calculating the DCG

• Since this is a ranking exercise, we also compared the DCG between Twitter's native ranking and our machine-learned ranking.
• To do this, we grouped the results by query and ordered them according to:
• Our ideal ranking
• Sofia's ranking
• Twitter's ranking
• We then compared the DCG at position 10 across all three, discounting with a logarithmic base of 2 as well as base 10 to see differences in the sensitivity of the discount (see the sketch below).
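A minimal sketch of that DCG@10 comparison. We use the common rel / log_base(rank + 1) form with a configurable base; whether the original used this or a gain variant is not stated, and the relevance labels below are toy values, not our actual judgments.

    import math

    def dcg_at_k(relevances, k=10, base=2):
        """DCG@k over graded relevances in rank order, discounted by log_base(rank + 1)."""
        return sum(rel / math.log(i + 1, base)
                   for i, rel in enumerate(relevances[:k], start=1))

    # Toy relevance labels (0-3) for one query group under three orderings
    sofia_order = [3, 2, 2, 1, 0, 2, 1, 0, 0, 1]
    twitter_order = [1, 0, 2, 0, 3, 1, 2, 0, 2, 1]
    ideal_order = sorted(sofia_order, reverse=True)

    for base in (2, 10):
        print(base,
              dcg_at_k(sofia_order, base=base),
              dcg_at_k(twitter_order, base=base),
              dcg_at_k(ideal_order, base=base))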
DCG Results

[Bar charts omitted: DCG@10 per query group (Groups 1-5) and in total, comparing the Sofia-ml ranking, Twitter's native ranking, and the ideal ranking; y-axis 0-16.]
Implementation and Future Work

• The challenge with this system is that it required us to hand-judge the results in order to train the model, making it difficult to implement in a production environment.
• We'd like to build it into a system that allows users to enter a query for a news event, view the sofia-ranked results, and "like" certain results, providing at least binary relevance judgments that could update the model on a regular basis.
