Bachelor Thesis
(i) the thesis comprises only my original work toward the Bachelor Degree
(ii) due acknowledgement has been made in the text to all other material used
Ahmed Soliman
XX July, 20XX
Acknowledgments
Text
Abstract
Contents
Acknowledgments
1 Introduction
2 Related Work
2.1 Content-Based location prediction
2.2 Social-Network-Based location prediction
2.2.1 Friendship-Based Methods
2.2.2 Social-Closeness-Based location prediction
2.3 Context-Based location prediction
3 Methodology
3.1 LIW Identification Algorithms
3.1.1 Heuristic-Based Approaches
3.1.2 Information Theory-Based Approach
3.2 Modeling Spatial Word Usage
4 Implementation
4.1 Data Preprocessing
4.1.1 Data Collection
4.1.2 Data Labeling
4.1.3 The Bag Of Words Model
4.1.4 Tokenization
4.1.5 Stop Words
4.1.6 Stemming and Lemmatization
4.2 Location Indicative Words Identification
4.2.1 CALGARI Algorithm Implementation
4.2.2 TF-ICF Algorithm Implementation
4.2.3 IGR Algorithm Implementation
4.3 Building a Classifier
4.3.1 Fitting a Model
4.3.2 Testing a Model
4.4 Web Service
4.4.1 User Interface
4.4.2 Back-End Engine
4.4.3 Used Technologies
5 Experimentation
5.1 Prediction Algorithms
5.2 Evaluation Metrics
5.3 Experiments and Results
5.3.1 Dataset Description
5.3.2 Experimental Setup
5.3.3 Results
5.3.4 Analysis
6 Future Work
6.1 Hierarchical Classification Models
6.2 Exploiting Social Network Data
6.3 Exploiting User Metadata
6.4 Exploiting Non-Geotagged tweets
6.5 Influence of Language on Geolocation Prediction
7 Conclusion
Appendix
A Lists
List of Abbreviations
List of Figures
List of Tables
References
Chapter 1
Introduction
The last decade has witnessed the creation of many social network platforms. These
include general-purpose ones such as Twitter, Tumblr and Facebook, location-based
ones like Foursquare and Gowalla, photo-sharing platforms like Instagram, Pinterest and
Flickr, as well as domain-specific platforms such as Yelp and LinkedIn. On these social
media platforms users may establish online friendships with others and share similar
interests in the form of texts, photos, videos or check-ins.
Among all social media platforms, Twitter is characterized by its distinctive way of
following users and posting to a timeline so that followers can see the posts.
Twitter friendships are not bidirectional; for example, users may follow public figures
without requiring them to follow back. Twitter's textual posts, known as tweets,
are limited to 140 characters. Twitter users also have the option to provide
their long-term residential addresses in their profiles. With the increasing popularity of
GPS-enabled devices such as smartphones and tablets, users may also provide their current
location while tweeting. The problem is that users may provide their location neither
in their profiles nor through GPS-enabled devices. Knowing the physical locations involved in
Twitter helps us to understand what is happening in real life and to develop various
applications, such as event detection, election prediction, news recommendation, personalized
advertising and local place recommendation systems.
Although Twitter users have the option to provide their locations, the location
information in Twitter is far from accurate or complete. A previous study of the
behaviour of Twitter users found that only 26% of a random sample of over 1 million
Twitter users revealed their city-level location in their profiles, and only 0.42% of the
tweets in this dataset were geo-tagged [4]. Moreover, these profile locations are not
always valid: only 42% of Twitter users in a random dataset reported a valid city-level
location in their profiles [9], and the remaining users provided inaccurate or even invalid
location fields; for example, one user was located in "Justin Bieber's Heart". At the
tweet level, the research firm Sysomos studied Twitter usage between mid-October and
mid-December 2009 and found that only 0.23% of tweets in that period were geo-tagged,
which indicates how sparse this information is.
The problem of predicting locations associated with objects has been studied on
Wikipedia, web pages, and general documents for decades. Intuitively, recognizing the
location of a tweet should also be possible, as users may reveal location information in
various ways; for example, users may discuss local landmarks, buildings or events, or use
local words that are only employed in their state or city. However, the nature of the
Twitter platform makes recognizing these words challenging, as Twitter users often tweet
in a very casual manner. Acronyms, misspellings, and special tokens make tweets noisy, and
techniques developed for formal documents are error-prone on tweets.
In this study, related work on the location sparseness problem in Twitter is presented
in chapter 2, followed by the methodology of the investigated approaches in chapter 3.
In chapter 4 the implementation details are discussed in depth. In chapter 5 we
highlight the experiments conducted using the investigated algorithms, the results of
these experiments, and an analysis of these results. Future research directions that
could enhance the approaches and address the challenges we faced are presented in
chapter 6. Finally, in chapter 7 the study is summarized and concluded.
Chapter 2
Related Work
This study focuses on location prediction in Twitter, and this chapter reviews prior
work related to it. Prior studies categorize the task of location prediction in Twitter
into three areas: Content-Based, Social-Network-Based, and Context-Based location
prediction approaches.
On the other hand, supervised methods have also been considered by a number
of previous studies. Cheng et al. [4] treat local word identification as a
classification problem by fitting the geographical distribution of each word with the
spatial variation model of Backstrom et al. [1]. Specifically, the spatial variation model
assumes that each word has a geographical center and a dispersion ratio. In other words,
the probability of seeing the word falls off with the distance from the geographical
center of the word, with exponential decay. After the model is fitted, its parameters are
used as word features. They then manually labeled 19,178 words in a dictionary as either
local or non-local. Finally, they trained a classification model and applied it to all
other words in the tweet dataset. Ryoo and Moon [17] apply the above methods to a Korean
dataset and achieve good results.
After identifying local words, the problem is how to use them to predict users' home
locations. Most studies model the prediction problem in a probabilistic manner: they
build probabilistic models to characterize the conditional distribution of a user's
location with respect to the content of their tweets, then decompose the model to make
predictions. A few studies adopt classification-based approaches, treating a user's
statistics about local words as features and all candidate locations as classification
labels. Hecht et al. [9] select the top 10,000 words with the highest predictability
score as local words. In their work, users are represented as feature vectors of fixed
size 10,000 and fed into a Multinomial Naive Bayes classifier for training. Similarly,
Rahimi et al. [15] apply logistic regression to users' TF-IDF vectors. Instead of
selecting local words as features, Mahmud et al. [13] adopt a hierarchical ensemble
algorithm to train two-level classifier ensembles on timezone-city or state-city
granularities. In their extended work, they propose identifying and removing travelling
users from the training data to improve the performance of home location classifiers;
travelling users are identified as users who have two tweets sent from locations more
than 161 kilometers apart. Wing and Baldridge [20] use a k-d tree to obtain adaptive grids
in their hierarchy. This leads to finer granularity for populated regions and avoids
over-representing less populated areas.
2.2 Social-Network-Based location prediction
In social science, the assumption of homophily suggests that similar people make
contact at a higher rate than dissimilar ones. The intuition here is that one's location
is very likely to coincide with the locations of one's friends. Ren et al. [16] assume
that the higher the proportion of a user's friends living at a location, the higher the
probability that the user stays at the same location. Davis et al. [6] propose the same
approach but consider only mutual friendships.
One attempt to relate friendship to the distance between home locations was made by
Backstrom et al. [1]. Although this study was conducted on Facebook, it is considered
here for its impact on Twitter-based studies. They analyze a large number of Facebook
users with known locations and their friendships, and fit the probability of two users
being friends with respect to the distance between their homes. They found that the
probability of friendship is inversely proportional to home distance. Based on this
model, given the friends of a user and their home locations, the most probable location
of the user can be found. These attempts rely on direct friendship, assuming that
friendship observed on an online social network reflects real offline friendship and
thus short home distances, which may be far from accurate. Kong et al. [12] studied this
phenomenon and found that a pair of friends has an 83% chance of living within 10
kilometers of each other if their common friends make up more than half of their
friends, and that this chance decreases to 2.4% if their common friend ratio is limited
to 10%.
Besides the work of McGee et al. [14], Compton et al. [5] also exploit mentions between
two users. They build a user mention graph and optimize unknown home locations such
that users mentioning each other are located as close as possible. Their results show
89% coverage at a median error of 6.33 km. Jurgens et al. [11] also take bidirectional
mention relationships into consideration instead of friendships. Rahimi et al. [15] find
that bidirectional mentions are too rare to be useful, so they treat unidirectional
mentions as undirected edges.
Hua et al. [10] assume that the more a user is influenced by others mentioning an
entity, the more likely the user is to mention the same entity.
Specifically, they adopt an incremental disambiguation approach. In the offline stage,
they preprocess a large number of tweets as a base system. Such preprocessing enables
them to estimate friendship-based user interest in entities in the online stage. When a
candidate entity is considered for a mention in a user's tweet, they look for other users
who once mentioned this entity. An entity is preferred if those users have good
reachability to the user who mentioned the entity in the friendship network. Efforts
were also made to achieve efficient reachability queries.
2.3 Context-Based location prediction
Mahmud et al. [13] take tweet posting time into consideration. In their dataset, all
posting times are recorded and the day is divided into time slots of equal length; users
are then viewed as distributions over tweet posting times. Since users in different time
zones exhibit time shifts in their distributions, a timezone classifier is trained with
this distribution as features. Such a classification reveals the timezones of users and
can provide a coarse estimate of users' locations.
Han et al. [2] observe that self-declared locations and timezones given as free text
are not always accurate. Informal abbreviations like "mel" (for Melbourne) may
occur. Therefore, besides tweet contents, they also include all four-grams of
self-declared locations and timezones as features to train a home location predictor.
Chapter 3
Methodology
Twitter users may reveal location information in their tweets explicitly by mentioning
location names such as countries, states, restaurants or local places. They may also
reveal it implicitly by using words that encode geographical information: words that
are used more in some regions than in others, or only in some regions. For example,
Texas residents use the word "Howdy", while Philadelphia residents call themselves
"Phillies". If we can find approaches to measure the locality of a word, then using
such words as features will help in building a city-level location classifier for tweets.
Heuristic methods are not guaranteed to be optimal or perfect, but they are sufficient
for the immediate goals. Where finding an optimal solution is impossible or impractical,
heuristic methods can be used to speed up the process of finding a satisfactory
solution. In this part two heuristic approaches are investigated.
CALGARI
This approach uses a simple heuristic algorithm called CALGARI [9]. The algorithm is
based on the intuition that a model will perform better if it is trained on words
that are more likely to be used in tweets from particular regions than in tweets from
the general population. A score is calculated for each term, and the words are then
ranked according to this score. Mathematically, the score of a term is defined as the
maximum conditional probability of the word given each class (city) being examined,
divided by the probability of the word.
The score is calculated as follows. Let score(W) be the score of a word W, f(W) the
frequency of W in our dataset, count(W, c) the number of times W appeared with class c,
S the set of all words in our dataset and C the set of all classes (city locations) in
our dataset. The score for each word is calculated as:
\[ \mathrm{score}(W) = \frac{\max_{c \in C} P(W \mid c)}{P(W)} \]
where P(W) is the probability of the presence of word W, calculated as the frequency
of the word over the total number of words:
\[ P(W) = \frac{f(W)}{|S|} \]
and P(W | c) is the conditional probability of the presence of word W given some class c,
calculated as the number of times W appeared with class c over the total number of
occurrences of all words t_j with class c, so that the numerator is evaluated as:
\[ \max_{c \in C} P(W \mid c) = \max_{i} \frac{\mathrm{count}(W, c_i)}{\sum_{j} \mathrm{count}(t_j, c_i)} \]
Now after calculating a score for each word, the algorithm sorts the words according to
the calculated score in non-decreasing order.
TF-ICF stands for term frequency - inverse city frequency. It is adapted from the
TF-IDF (term frequency - inverse document frequency) weight that is often used in
information retrieval and text mining. This weight is a statistical measure used to
evaluate how important a word is in our dataset. The intuition here is that location
indicative words should have two properties:
1. High Term Frequency (TF): The word should be observed reasonably often in tweets
generated from some city.
2. High Inverse City Frequency (ICF): The word should be local, so it occurs in a
relatively small number of cities.
We calculate TF as the number of times the word appeared in the dataset, f(W), as in
the previous section:
\[ \mathrm{TF}(W) = f(W) \]
Then we calculate ICF as the number of classes (cities) divided by cf(W), the number of
cities with tweets that include the word:
\[ \mathrm{icf}(W) = \frac{|C|}{\mathrm{cf}(W)} \]
After calculating these two parameters, words are ranked by ICF first and then by
TF, both in decreasing order, so that a word that is local but has a relatively large
number of appearances is preferred.
First let us define two important terms. The first one is Information Gain (IG). The
Information Gain is the difference in class (location) entropy due to splitting the data
on some attribute (word), so the higher the value, the greater the predictability of the
word. Given the set S of all words in our training set, the IG of a word W \in S across
all classes (cities) C is calculated as follows:
\[ IG(W) = H(c) - H(c \mid W) = H(c) + P(W) \sum_{c \in C} P(c \mid W) \log P(c \mid W) + P(\bar{W}) \sum_{c \in C} P(c \mid \bar{W}) \log P(c \mid \bar{W}) \]
where P(W) and P(\bar{W}) are the probabilities of the presence and absence of word W,
respectively, and P(c | W) and P(c | \bar{W}) are the conditional probabilities of class c
when word W is present and absent, respectively. Because H(c) is constant over all words,
only the conditional entropy given word W needs to be calculated to rank the features.
The second term is the Intrinsic Entropy Value (IV). According to observations by
Han et al. [2], local words occurring in a small number of cities usually have a low
intrinsic value, whereas non-local words have a high intrinsic value. So when words are
comparable in IG value, words with a smaller intrinsic value should be preferred
because they are more locally employed (location indicative).
Now with the two terms mentioned above, the Information Gain Ratio (IGR) is defined as
the ratio of the information gain to the intrinsic value:
\[ IGR(W) = \frac{IG(W)}{IV(W)} \]
3.2 Modeling Spatial Word Usage
In this work, a Multinomial Naive Bayes classifier is used to predict locations using a
feature set consisting of the local words included in the tweet. This classifier is based
on Bayes' theorem, shown below:
\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \]
In the non-naive Bayes approach, we look at sentences in their entirety; if a sentence
does not show up in the training set, it gets a zero probability, which makes further
calculations difficult. Naive Bayes instead assumes that every word is independent of
the others, so we look at the individual words in a sentence rather than the entire
sentence.
To make predictions, the Multinomial Naive Bayes classifier calculates the probability
of each class (city) given the set of features (local words) and selects the city with
the highest probability as the prediction result.
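The prediction rule described above can be illustrated with a minimal sketch. It is not the classifier used in this work: the function names, the toy tweets, and the add-one (Laplace) smoothing with parameter alpha are assumptions made for illustration only. Log-probabilities are summed instead of multiplying probabilities to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict

def train_mnb(samples):
    """samples: iterable of (local_words, city). Returns the count tables."""
    class_counts = Counter()             # number of tweets per city
    word_counts = defaultdict(Counter)   # word occurrences per city
    vocab = set()
    for words, city in samples:
        class_counts[city] += 1
        word_counts[city].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def predict_mnb(words, class_counts, word_counts, vocab, alpha=1.0):
    """Return the city maximizing log P(city) + sum of log P(word | city)."""
    total = sum(class_counts.values())
    best_city, best_score = None, -math.inf
    for city, n in class_counts.items():
        # Smoothed denominator: all word occurrences in the city plus alpha per vocabulary word.
        denom = sum(word_counts[city].values()) + alpha * len(vocab)
        score = math.log(n / total)
        for w in words:
            if w in vocab:               # words never seen in training are ignored
                score += math.log((word_counts[city][w] + alpha) / denom)
        if score > best_score:
            best_city, best_score = city, score
    return best_city

# Hypothetical toy data, for illustration only.
samples = [
    (["howdy", "rodeo"], "houston"),
    (["howdy"], "houston"),
    (["phillies", "cheesesteak"], "philadelphia"),
]
tables = train_mnb(samples)
print(predict_mnb(["howdy"], *tables))
```

The smoothing term keeps a single unseen word from driving a city's probability to zero, which is the same problem the naive independence assumption avoids at the sentence level.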
Chapter 4
Implementation
The main aim of this study is to detect location based purely on the content of tweets.
Every part of this task is challenging: a large amount of data is needed to train a
classifier that gives good results, and the training itself is challenging because this
volume of data requires out-of-core learning. In this chapter we present and discuss how
each part of the project operates, together with the challenges we faced when developing
each component.
The first step in establishing a good machine learning model is collecting data: the
more data we have, the more accurate the results the model can give us. In this study we
focus mainly on Twitter data. As mentioned in previous chapters, Twitter gives the option
to include geolocation (GPS coordinates) with tweets, so the main task here is to obtain
a large number of geotagged tweets. This task is challenging because geotagged tweets
are scarce, so we used multiple sources of data. The first source is a dataset covering
the period from 2016-01-15 to 2016-02-06, collected by Benjamin Bischke¹; for this
dataset, new geotagged posts were collected using a simple Python script. The second
source is an archived dataset covering August 2012 to November 2012, made available by
the archivist Jason Scott². Finally, we streamed a million more tweets covering multiple
regions of the world over one week starting on the 20th of August 2017, using the
Twitter Public API with the help of the tweepy library³.
¹ https://www.dfki.de/web/forschung/sds/mitarbeiter/base_view?uid=bebi02
² https://archive.org/details/jason_scott
4.1.4 Tokenization
Tokenization is an important step before filtering tweets. In this process the text is
broken down into individual elements (words): each sentence is split into individual
words, punctuation is removed and all letters are converted to lowercase.
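The three steps above (splitting, removing punctuation, lowercasing) can be sketched in a few lines. This is a simplified illustration, not the tokenizer used in this work; the regular expression is an assumption and keeps only ASCII letters and digits, so hashtags and mentions lose their leading symbol and non-ASCII characters are dropped.

```python
import re

# Runs of lowercase ASCII letters and digits; everything else acts as a separator.
TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize(text):
    """Lowercase the text and return its word tokens without punctuation."""
    return TOKEN_RE.findall(text.lower())

tokens = tokenize("Howdy, Houston! #Rodeo2017")
print(tokens)
```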
Stemming describes the process of transforming a word into its root form, for example
transforming "likes" to "like" and "swimming" to "swim". In contrast to stemming,
lemmatization aims to obtain the canonical (grammatically correct) forms of the words,
which are called lemmas. Lemmatization is computationally more difficult and expensive
than stemming, and in practice both stemming and lemmatization have little impact on
the performance of text classification [18].
In this section we present pseudo code for the CALGARI algorithm. Algorithm 1 presents
the main function CALGARI, which takes a word as one of its parameters and calculates
the score for that word based on the algorithm discussed in chapter 3. This function
makes use of two other functions presented in the same algorithm, COUNT and
FREQUENCY. COUNT calculates the total number of occurrences of the words in some list
of words with some class; it makes use of another function, Occurrence, which calculates
the number of times a word occurred with some class. FREQUENCY calculates the frequency
of a word in all tweets in the dataset.
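As a rough illustration of the scoring described above, the following sketch computes the CALGARI score for every word in a small labelled corpus. It is not a transcription of Algorithm 1: the function name calgari_scores, the (tokens, city) input format and the toy tweets are assumptions, and the "total number of words" is taken here to mean the total number of word occurrences in the dataset.

```python
from collections import Counter, defaultdict

def calgari_scores(tweets):
    """tweets: iterable of (tokens, city). Returns {word: max_c P(word | c) / P(word)}."""
    word_freq = Counter()             # f(W): occurrences of W in the whole dataset
    per_city = defaultdict(Counter)   # count(W, c): occurrences of W with class c
    for tokens, city in tweets:
        word_freq.update(tokens)
        per_city[city].update(tokens)

    total = sum(word_freq.values())   # total number of word occurrences
    city_totals = {c: sum(counts.values()) for c, counts in per_city.items()}

    scores = {}
    for word, freq in word_freq.items():
        p_word = freq / total
        p_max = max(per_city[c][word] / city_totals[c] for c in per_city)
        scores[word] = p_max / p_word
    return scores

# Hypothetical toy data, for illustration only.
tweets = [
    (["howdy", "the"], "houston"),
    (["the", "game"], "houston"),
    (["phillies", "the"], "philadelphia"),
]
scores = calgari_scores(tweets)
```

In this toy corpus a word used evenly across cities, such as "the", scores 1, while a word concentrated in one city scores above 1.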
In this section, the pseudo code for the TF-ICF algorithm is presented. In Algorithm 2,
each word is associated with two values, TF and ICF. The words are then sorted in
decreasing order of ICF, with ties broken by sorting on TF, as mentioned in the
explanation of the algorithm in chapter 3.
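The ranking described above can be sketched as follows. This is an illustrative version rather than Algorithm 2 itself: the function name rank_tf_icf and the toy tweets are assumptions, and the final alphabetical tie-break is added only to make the output deterministic.

```python
from collections import Counter, defaultdict

def rank_tf_icf(tweets):
    """tweets: iterable of (tokens, city). Returns words sorted by ICF, then TF, descending."""
    tf = Counter()                        # TF(W) = f(W)
    cities_with_word = defaultdict(set)   # used to derive cf(W)
    all_cities = set()
    for tokens, city in tweets:
        all_cities.add(city)
        tf.update(tokens)
        for word in set(tokens):
            cities_with_word[word].add(city)

    # icf(W) = |C| / cf(W)
    icf = {w: len(all_cities) / len(cities_with_word[w]) for w in tf}
    # Sort by ICF first, then by TF; the word itself is an assumed last tie-break.
    return sorted(tf, key=lambda w: (-icf[w], -tf[w], w))

# Hypothetical toy data, for illustration only.
tweets = [
    (["howdy", "the"], "houston"),
    (["howdy", "rodeo"], "houston"),
    (["phillies", "the"], "philadelphia"),
]
ranking = rank_tf_icf(tweets)
```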
In this section we present pseudo code for the Information Gain Ratio algorithm.
Algorithm 3 presents the main function IGR, which takes a word as one of its
parameters and calculates the Information Gain Ratio of the word by dividing the
Information Gain of the word by its Intrinsic Value. These two quantities are calculated
by the IG and IV functions respectively, as discussed in chapter 3. The IG function uses
the COUNT function presented in Algorithm 1.
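A compact sketch of the IGR computation is given below. It is not Algorithm 3: the function names and the toy tweets are hypothetical, presence and absence of a word are estimated at the tweet level, and the intrinsic value is assumed to be the entropy of the present/absent split, IV(W) = -P(W) log P(W) - P(\bar{W}) log P(\bar{W}), which is the usual definition for a gain ratio but is not spelled out in this text.

```python
import math
from collections import Counter

def _entropy(probs):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def igr(word, tweets):
    """tweets: list of (tokens, city). Returns IG(word) / IV(word)."""
    n = len(tweets)
    with_w = Counter(city for tokens, city in tweets if word in tokens)
    without_w = Counter(city for tokens, city in tweets if word not in tokens)
    n_with, n_without = sum(with_w.values()), sum(without_w.values())
    if n_with == 0 or n_without == 0:
        return 0.0  # assumed convention: IV is zero here, so the ratio is undefined

    h_c = _entropy(v / n for v in (with_w + without_w).values())
    h_c_given_w = (
        (n_with / n) * _entropy(v / n_with for v in with_w.values())
        + (n_without / n) * _entropy(v / n_without for v in without_w.values())
    )
    ig = h_c - h_c_given_w
    iv = _entropy([n_with / n, n_without / n])
    return ig / iv

# Hypothetical toy data, for illustration only.
tweets = [
    (["howdy", "the"], "houston"),
    (["howdy"], "houston"),
    (["the", "phillies"], "philadelphia"),
    (["the"], "philadelphia"),
]
```

On this toy corpus "howdy" separates the two cities perfectly and receives a higher ratio than "the", which appears in both.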
A particularly useful feature of the scikit-learn library is that it can be used to
implement an out-of-core approach, in which learning from data that does not fit into
main memory is feasible. We therefore use an online classifier that supports the
partial_fit method, which is fed with batches of samples. What remains is to keep the
feature space the same over time. For this the hashing trick is used, through
scikit-learn's FeatureHasher, which projects each sample into the same feature space
across learning steps. This is especially useful in our case, as new features (words)
may appear in each batch.
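To show why hashing keeps the feature space fixed, the sketch below maps word counts to column indices with a plain hash function. It is a stand-in for illustration only: scikit-learn's FeatureHasher uses MurmurHash3, whereas this sketch uses zlib.crc32, and the function name hash_features and the default width are assumptions.

```python
import zlib

def hash_features(word_counts, n_features=2**18):
    """Map a {word: count} dict to a sparse {column_index: count} dict."""
    row = {}
    for word, count in word_counts.items():
        # The column depends only on the word itself, never on which batch it came from.
        idx = zlib.crc32(word.encode("utf-8")) % n_features
        row[idx] = row.get(idx, 0) + count  # colliding words share a column
    return row

row = hash_features({"howdy": 2, "rodeo": 1})
```

Because the column index is a pure function of the word, a word first seen in a later batch still lands in a valid column without any vocabulary having to be stored or rebuilt.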
In the experiments presented in the next chapter, the batch size is fixed at one
thousand tweets per learning step. Algorithm 4 (Out-Of-Core-Learning) presents the
procedure that we use to implement the out-of-core approach mentioned above. The
Fit-Batch function partially fits the model to a passed batch of 1000 tweets. It makes
use of Make Feature Vector, which counts the occurrences of each feature seen in a
tweet. Finally, FeatureHasher is used to transform the feature vectors into one sparse
matrix of shape (samples, features), which is used for fitting by the function Partial Fit.
Algorithm 4 Out-Of-Core-Learning
function Fit-Batch(batch)
    labels_list ← {}
    feature_vector_list ← {}
    for tweet in batch do
        labels_list ← labels_list + label of tweet
        feature_vector_list ← feature_vector_list + Make Feature Vector(tweet)
    X ← Transform(FeatureHasher, feature_vector_list)
    Y ← labels_list
    Partial Fit(Model, X, Y)

function Make Feature Vector(tweet)
    feature_vector ← {}
    for word in tweet do
        if word in feature_set then
            feature_vector(word) ← feature_vector(word) + 1    ▷ a missing entry counts as 0
    return feature_vector
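For readers who want to relate the pseudo code to scikit-learn, a minimal sketch of the same procedure is given below. It is an illustration under stated assumptions, not the code of this work: FEATURE_SET, CITIES and the toy batches are hypothetical, and the helper names fit_batch and predict_batch are ours. FeatureHasher, MultinomialNB and partial_fit are the scikit-learn names.

```python
from collections import Counter

from sklearn.feature_extraction import FeatureHasher
from sklearn.naive_bayes import MultinomialNB

FEATURE_SET = {"howdy", "rodeo", "phillies", "cheesesteak"}  # hypothetical LIW set
CITIES = ["houston", "philadelphia"]                         # hypothetical class labels

# alternate_sign=False keeps hashed counts non-negative, as MultinomialNB expects.
hasher = FeatureHasher(n_features=2**18, input_type="dict", alternate_sign=False)
model = MultinomialNB()

def make_feature_vector(tokens):
    """Count only the tokens that belong to the selected feature set."""
    return dict(Counter(t for t in tokens if t in FEATURE_SET))

def fit_batch(batch):
    """batch: list of (tokens, city). Performs one incremental learning step."""
    X = hasher.transform([make_feature_vector(tokens) for tokens, _ in batch])
    y = [city for _, city in batch]
    # All classes must be declared up front because a batch may not contain every city.
    model.partial_fit(X, y, classes=CITIES)

def predict_batch(token_lists):
    """Return the predicted city for each tokenized tweet."""
    X = hasher.transform([make_feature_vector(tokens) for tokens in token_lists])
    return list(model.predict(X))

# Two toy batches stand in for successive batches of 1000 tweets.
fit_batch([(["howdy", "rodeo"], "houston"), (["phillies"], "philadelphia")])
fit_batch([(["howdy"], "houston"), (["cheesesteak", "phillies"], "philadelphia")])
print(predict_batch([["howdy", "yall"], ["phillies"]]))
```

Only the current batch and the model's running counts are held in memory at any time, which is what makes the approach out-of-core.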
To test the model, the predictions for the testing dataset are compared with the actual
labels. The same big-data challenge appears again, so testing is done in the same manner
as in the previous section, by calling the prediction procedure presented in
Algorithm 5 on batches of 1000 tweets each. The result is then evaluated according to
the evaluation metrics presented in chapter 5, using the list of predicted labels from
Predict and the list of actual labels.
Algorithm 5 Out-Of-Core-Testing
function Test-Batch(batch)
    labels_list ← {}
    feature_vector_list ← {}
    for tweet in batch do
        labels_list ← labels_list + label of tweet
        feature_vector_list ← feature_vector_list + Make Feature Vector(tweet)
    X ← Transform(FeatureHasher, feature_vector_list)
    Y ← labels_list
    return Predict(Model, X), Y

function Make Feature Vector(tweet)
    feature_vector ← {}
    for word in tweet do
        if word in feature_set then
            feature_vector(word) ← feature_vector(word) + 1    ▷ a missing entry counts as 0
    return feature_vector
1. User Interface
2. Back-End Engine
The user interface provides a stream of tweets from our dataset; the username and
profile image of the user are shown beside the tweet text. Through this page, the
user can use the prediction service by hovering over the text and pressing the geolocate
icon (Figure 4.2). A modal containing a map then pops up, showing the estimated
location and, if provided, the actual location of the tweet, as well as the error distance
between the correct city location and the estimated location (Figure 4.3). The user
interface provides the user with two options:
1. ALL DATASET: Shows all the tweets included in the dataset; by scrolling down,
the user can load more tweets in an infinite-scroll manner (Figure 4.1).
2. SEARCH: Gives the user the option to search for tweets that include some hashtag
text, and shows the tweets in the same infinite-scroll manner used in the ALL DATASET
page (Figures 4.4 and 4.5).
The back-end of this service consists of a controller and a prediction engine. The flow
of the system is shown in Figure 4.6. The controller is the main bridge between the user
interface and the back-end prediction engine. When the user chooses to geolocate
some tweet, the index of this tweet is sent to the controller. The controller makes a
request to the back-end engine with the tweet information and listens for the response
containing the geolocation prediction of the chosen tweet. The coordinates are then
passed to the google-maps directive to visualize the location on Google Maps in the user
interface. The prediction engine takes a tweet as input, uses the same procedure as in
the testing phase in subsection 4.3.2 to get the city label, and returns the GPS
coordinates of the city center to the controller.
In this part of the work, the MEAN stack (MongoDB, Express, AngularJS and NodeJS) is
used for developing the web service. The user interface is written in HTML (HyperText
Markup Language) and CSS (Cascading Style Sheets), using the Materialize CSS library⁷.
For embedding Google Maps in the user interface and integrating it with AngularJS,
ng-map, an AngularJS directive for showing maps, is used⁸.
⁷ http://www.materializecss.com
⁸ https://ngmap.github.io/
Figure 4.3: Google-maps modal showing actual and estimated locations as well as error
distance
Figure 4.4: Search Page with option to type Hashtag to get tweets with that hashtag
Figure 4.5: Search result when typing some hashtag in the search option
Chapter 5
Experimentation
2. MNB - TF-ICF: A Multinomial Naive Bayes classifier trained using feature set
chosen according to TF-ICF algorithm.
3. MNB - IGR: A Multinomial Naive Bayes classifier trained using location indicative
words feature set chosen according to IGR algorithm.
In these two algorithms, 40% of the extracted feature set was used for training, because
the memory requirements grow as the feature set size is increased.
1. Acc@161: The proportion of predictions that are within 161 kilometers (100 miles)
of the correct city-level location. This relaxed version of accuracy captures near-miss
predictions.
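As an illustration of how such distance-based metrics can be computed from predicted and actual city coordinates, consider the sketch below. It is not the evaluation code of this work: the function names and the example coordinates are assumptions, and great-circle distance is computed with the haversine formula on a spherical Earth of radius 6371 km.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

def haversine_km(a, b):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def evaluate(predicted, actual, threshold_km=161.0):
    """Return (accuracy within threshold_km, median error distance in km)."""
    errors = [haversine_km(p, a) for p, a in zip(predicted, actual)]
    acc = sum(e <= threshold_km for e in errors) / len(errors)
    return acc, median(errors)

# Approximate city-center coordinates, used here only as example inputs.
HOUSTON = (29.76, -95.37)
PHILADELPHIA = (39.95, -75.17)
acc, med = evaluate([HOUSTON, HOUSTON], [HOUSTON, PHILADELPHIA])
```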
Table 5.2: Results for Tweet-level Geolocation Prediction. The bolded results indicate
the best performing algorithms (highest value for accuracy and lowest value for median
error) for the testing set.
5.3.3 Results
Table 5.2 shows the tweet-level geolocation prediction results in terms of
accuracy at 161 kilometers and median error distance. The results show that
MNB-IGR achieves the best result in terms of both Acc@161 and Median Error Distance,
which is also the case in the study [8], where the IGR feature selection method achieved
the best results. The table also shows that MNB-TF-ICF achieves a lower Acc@161 but a
better Median Error Distance than MNB-CALGARI.
5.3.4 Analysis
Although the investigated algorithms have been tested in previous studies with
satisfactory results, the results of the experiments conducted in this work were lower
than expected. In this section some of the reasons are presented:
1. Dataset Size: The dataset size affects the performance of the location indicative
words identification process as well as the fitting of a classifier. Previous work that
showed better results used larger datasets. For example, Bo Han et al. used a dataset
of around 38 million tweets; with this large amount of data they were able to achieve
better results using two of the location indicative words identification algorithms
presented in this work, as shown in Table 5.3.
Table 5.3: Results for Geolocation Prediction in the work of Bo Han et al. using a
larger dataset of 38 million tweets
4. Lack of Information: Because a tweet is limited to 140 characters, it contains
few words and therefore often no location indicative words that could help the
prediction task when only the tweet is given as input. The information here is
more limited than in the user-level location prediction task, where all of a
user's tweets are provided. It is therefore worthwhile to exploit the input with
reasonable redundancy, for example by including all rare n-grams, even those
occurring just three times.
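The idea of keeping rare n-grams as features can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the feature extraction used in this work: it assumes whitespace tokenization, unigrams and bigrams only, and a minimum corpus count of three; the names ngrams and build_vocabulary are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n_max=2):
    """Yield all n-grams of length 1..n_max as space-joined strings."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_vocabulary(tweets, n_max=2, min_count=3):
    """Keep every n-gram that occurs at least min_count times in the corpus."""
    counts = Counter()
    for text in tweets:
        # Simplistic tokenization (an assumption): lowercase, split on whitespace.
        counts.update(ngrams(text.lower().split(), n_max))
    return {gram for gram, count in counts.items() if count >= min_count}


# Hypothetical toy corpus.
tweets = [
    "cairo tower tonight",
    "near cairo tower",
    "cairo tower view",
    "hello world",
]
vocabulary = build_vocabulary(tweets)
```

With this low threshold, the bigram "cairo tower" is retained as a feature although it occurs only three times, while n-grams seen once or twice are dropped.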
Chapter 6
Future Work
This study addresses a very challenging task, and its results could be improved in
various directions. This chapter presents some future work that could be discussed
and researched.
Conclusion
Twitter, as one of the most popular online social networks, provides a virtual world
in which users can establish friendships and share their daily interests. Real-life
locations are involved in every corner of the Twitter world. If locations are provided
in a correct form, they can facilitate various applications and benefit users in real life.
Because the locations provided by Twitter users are incomplete and inaccurate,
extensive research effort has been spent on the geolocation sparseness problem in Twitter.
In this study, research efforts on location prediction in Twitter are presented
and categorized into three categories: Content-Based, Social-Network-Based and
Context-Based. Focusing on content-based approaches, the study investigates
identifying location indicative words and using them as features to predict tweet
locations. Three algorithms that try to identify these words are tested using a
Multinomial Naive Bayes classifier. In addition, a web service that visualizes the
geolocation prediction of tweets is implemented to make the prediction task more
interactive.
The experiments showed that information-theoretic methods achieve the best results
in terms of accuracy at a threshold distance and median error distance. The study
also presented an analysis to explain the less-than-expected results and concluded
that performance is affected by factors such as dataset size, the noisy nature of
tweets, misleading tweets from travelling users, and the lack of information in
tweet content.
Finally, the study presents some future research directions to enhance performance
and overcome challenges in the tweet location prediction task, such as implementing
hierarchical classification models, exploiting social network data, exploiting user
metadata and non-geotagged tweets, and building language-specific classifiers.
Appendix
Appendix A
Lists
List of Figures
List of Tables
Bibliography
[1] Lars Backstrom, Eric Sun, and Cameron Marlow. Find me if you can: Improving
geographical prediction with social and spatial proximity. In Proceedings of the 19th
International Conference on World Wide Web, WWW '10, pages 61-70, New York,
NY, USA, 2010. ACM.
[2] Bo Han, Paul Cook, and Timothy Baldwin. Geolocation prediction in social media
data by finding location indicative words. In Proceedings of COLING, pages
1045-1062, 2012.
[3] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller,
Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler,
Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gael Varoquaux.
API design for machine learning software: experiences from the scikit-learn project.
In ECML PKDD Workshop: Languages for Data Mining and Machine Learning,
pages 108-122, 2013.
[4] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. You are where you tweet: a
content-based approach to geo-locating twitter users. pages 759-768, 2010.
[5] Ryan Compton, David Jurgens, and David Allen. Geotagging one hundred million
twitter accounts with total variation minimization. In Big Data (Big Data), 2014
IEEE International Conference on, pages 393-401. IEEE, 2014.
[6] Clodoveu A Davis Jr, Gisele L Pappa, Diogo Renno Rocha de Oliveira, and Filipe
de L Arcanjo. Inferring the location of twitter messages based on user relationships.
Transactions in GIS, 15(6):735-751, 2011.
[7] Bo Han, Paul Cook, and Timothy Baldwin. A stacking-based approach to twitter
user geolocation prediction. In ACL (Conference System Demonstrations), pages
7-12, 2013.
[8] Bo Han, Paul Cook, and Timothy Baldwin. Text-based twitter user geolocation
prediction. Journal of Artificial Intelligence Research, 49:451-500, 2014.
[9] Brent Hecht, Lichan Hong, Bongwon Suh, and Ed H Chi. Tweets from Justin Bieber's
heart: the dynamics of the location field in user profiles. In Proceedings of the
SIGCHI conference on human factors in computing systems, pages 237-246. ACM,
2011.
[10] Wen Hua, Kai Zheng, and Xiaofang Zhou. Microblog entity linking with social
temporal context. In Proceedings of the 2015 ACM SIGMOD International Conference
on Management of Data, pages 1761-1775. ACM, 2015.
[11] David Jurgens, Tyler Finethy, James McCorriston, Yi Tian Xu, and Derek Ruths.
Geolocation prediction in twitter using social networks: A critical analysis and review
of current practice. ICWSM, 15:188-197, 2015.
[12] Longbo Kong, Zhi Liu, and Yan Huang. Spot: Locating social media users based
on social network context. Proceedings of the VLDB Endowment, 7(13):1681-1684,
2014.
[13] Jalal Mahmud, Jeffrey Nichols, and Clemens Drews. Where is this tweet from?
inferring home locations of twitter users. ICWSM, 12:511-514, 2012.
[14] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather:
Homophily in social networks. Annual review of sociology, 27(1):415-444, 2001.
[15] Afshin Rahimi, Duy Vu, Trevor Cohn, and Timothy Baldwin. Exploiting text
and network context for geolocation of social media users. arXiv preprint
arXiv:1506.04803, 2015.
[16] Kejiang Ren, Shaowu Zhang, and Hongfei Lin. Where are you settling down:
Geo-locating twitter users based on tweets and social networks. In Asia Information
Retrieval Symposium, pages 150-161. Springer, 2012.
[17] KyoungMin Ryoo and Sue Moon. Inferring twitter user locations with 10 km
accuracy. In Proceedings of the 23rd International Conference on World Wide Web,
pages 643-648. ACM, 2014.
[18] Michal Toman, Roman Tesar, and Karel Jezek. Influence of word normalization on
text classification. Proceedings of InSciT, 4:354-358, 2006.
[19] Olivier Van Laere, Jonathan Quinn, Steven Schockaert, and Bart Dhoedt. Spatially
aware term selection for geotagging. IEEE transactions on Knowledge and Data
Engineering, 26(1):221-234, 2014.
[20] Benjamin Wing and Jason Baldridge. Hierarchical discriminative classification for
text-based geolocation. In EMNLP, pages 336-348, 2014.
[21] Yuto Yamaguchi, Toshiyuki Amagasa, Hiroyuki Kitagawa, and Yohei Ikawa. Online
user location inference exploiting spatiotemporal correlations in social streams. In
Proceedings of the 23rd ACM International Conference on Conference on Information
and Knowledge Management, pages 1139-1148. ACM, 2014.
[22] Yiming Yang and Jan O Pedersen. A comparative study on feature selection in text
categorization. In Icml, volume 97, pages 412-420, 1997.