Location Detection Over Social Media

Media Engineering and Technology Faculty
German University in Cairo
Bachelor Thesis
Author: Ahmed Soliman

Supervisors: Sarah Elkasrawy
Submission Date: XX July, 20XX

This is to certify that:
(i) the thesis comprises only my original work toward the Bachelor Degree
(ii) due acknowledgement has been made in the text to all other material used
Ahmed Soliman
XX July, 20XX
Acknowledgments
Text
Abstract
Geographical location is needed in many applications such as local search, event
detection, trend analysis, and personalized advertising and recommendations. Although
social network platforms like Twitter have attracted a large number of users who
provide huge volumes of data, the location information attached to these data is sparse.
In this study, some approaches to automatically identifying the locations associated with
tweets are discussed, followed by an investigation of three methods for selecting and
extracting words that encode geographic information (location indicative words).
Furthermore, we apply these methods to build tweet-level geolocation predictors using a
dataset of over four million tweets. The implemented work is visualized through a web
application that uses streamed tweets and visualizes the geolocation predictions on maps.
Finally, we list future research directions.
Contents
Acknowledgments
1 Introduction
2 Related Work
2.1 Content-Based location prediction
2.2 Social-Network-Based location prediction
2.2.1 Friendship-Based Methods
2.2.2 Social-Closeness-Based location prediction
2.3 Context-Based location prediction
3 Methodology
3.1 LIW Identification Algorithms
3.1.1 Heuristic-Based Approaches
3.1.2 Information Theory-Based Approach
3.2 Modeling Spatial Word Usage
4 Implementation
4.1 Data Preprocessing
4.1.1 Data Collection
4.1.2 Data Labeling
4.1.3 The Bag Of Words Model
4.1.4 Tokenization
4.1.5 Stop Words
4.1.6 Stemming and Lemmatization
4.2 Location Indicative Words Identification
4.2.1 CALGARI Algorithm Implementation
4.2.2 TF-ICF Algorithm Implementation
4.2.3 IGR Algorithm Implementation
4.3 Building a Classifier
4.3.1 Fitting a Model
4.3.2 Testing a Model
4.4 Web Service
4.4.1 User Interface
4.4.2 Back-End Engine
4.4.3 Used Technologies
5 Experimentation
5.1 Prediction Algorithms
5.2 Evaluation Metrics
5.3 Experiments and Results
5.3.1 Dataset Description
5.3.2 Experimental Setup
5.3.3 Results
5.3.4 Analysis
6 Future Work
6.1 Hierarchical Classification Models
6.2 Exploiting Social Network Data
6.3 Exploiting User Metadata
6.4 Exploiting Non-Geotagged tweets
6.5 Influence of Language on Geolocation Prediction
7 Conclusion
Appendix
A Lists
List of Abbreviations
List of Figures
List of Tables
References
Chapter 1
Introduction
The last decade has witnessed the creation of many social network platforms. These
include general-purpose platforms such as Twitter, Tumblr and Facebook, location-based
ones like Foursquare and Gowalla, photo-sharing platforms like Instagram, Pinterest and
Flickr, as well as domain-specific platforms such as Yelp and LinkedIn. On these social
media platforms, users may establish online friendships with others and share similar
interests in the form of texts, photos, videos or check-ins.
Among all social media platforms, Twitter is characterized by its distinctive way of
following users and posting to a timeline so that followers can see the posts.
Twitter friendships are not bidirectional; for example, users may follow public figures
without requiring them to follow back. Twitter posts, known as tweets, are textual
and limited to only 140 characters. Twitter users also have the option to provide
their long-term residential addresses in their profiles. With the increasing popularity of
GPS-enabled devices such as smartphones and tablets, users may also provide their current
location while tweeting. The problem is that many users provide locations neither
in their profiles nor through GPS-enabled devices. Knowing the physical locations involved
in Twitter helps us understand what is happening in real life and develop various
applications, such as event detection, election prediction, news recommendation,
personalized advertising and local place recommendation systems.
Although Twitter users have the option to provide their locations, the location
information on Twitter is far from accurate or complete. A previous study that
investigated the behaviour of Twitter users found that only 26% of a random sample
of over 1 million Twitter users revealed their city-level location in
their profiles, and only 0.42% of the tweets in that dataset were geo-tagged[4]. Moreover,
these profile locations are not always valid: only 42% of Twitter users
in a random dataset reported a valid city-level location in their profiles [9], while
the remaining users provided inaccurate or even invalid location fields; for example,
one user was located in "Justin Bieber's Heart". At the tweet level, the
research firm Sysomos studied Twitter usage between mid-October and mid-December
2009 and found that only 0.23% of tweets in that period were geo-tagged, which is
a good indicator of how sparse this information is.
The problem of predicting the locations associated with objects has been studied on
Wikipedia, web pages, and general documents for decades. Intuitively, recognizing tweet
locations should also be possible, as users may reveal their location in various ways;
for example, users may discuss local landmarks, buildings or events, or mention local words
that are only used in their state or city. However, the nature of the Twitter platform
makes recognizing these words challenging, as Twitter users often tweet in a very
casual manner. Acronyms, misspellings, and special tokens make tweets noisy, and
techniques developed for formal documents are error-prone on tweets.
In this study, related work on the location sparseness problem
in Twitter is presented in chapter 2, followed by the methodology of the investigated
approaches in chapter 3. In chapter 4 the implementation details are discussed
in depth. Chapter 5 then highlights the experiments conducted using the investigated
algorithms, the results of these experiments, and an analysis of the results. Future
research directions that could enhance the approaches and address the challenges we faced
are presented in chapter 6. Finally, in chapter 7 the study is summarized
and concluded.
Chapter 2
Related Work
In this study we focus on location prediction in Twitter. This chapter reviews a variety
of prior work related to this study. Prior studies categorize the task of
location prediction in Twitter into three areas: Content-Based, Social-Network-Based, and
Context-Based location prediction approaches.
2.1 Content-Based location prediction

When sending tweets, users may casually reveal their home locations through certain
words in the content of the tweet. Previous research has put great effort into finding
approaches that identify these words and use them as features for predicting the location
of users' posts. For example, Laere et al.[19] proposed two types of local word selection
methods. One leverages Kernel Density Estimation, which smooths term
occurrences, and the other is based on Ripley's K statistic, which measures a term's
geographical deviation. Ren et al.[16] and Han et al.[2] were both inspired
by Inverse Document Frequency (IDF): they proposed Inverse Location Frequency (ILF)
and Inverse City Frequency (ICF) to measure the locality of words by finding words that
are distributed over fewer locations. Besides these information retrieval methods, some
studies exploit information theory to find local words, under the assumption that local
words should be more biased than ordinary ones. For example, Han et al.[2] used information
gain ratio and maximum entropy, and Yamaguchi et al.[21] used K-L divergence. Hecht
et al.[9] proposed a score for each word that measures its locality, similar
to information theory-based measures. Mahmud et al.[13] apply heuristic rules to
select local words. Han et al.[2] compare statistical, information theory-based and
heuristic-based methods of local word selection and observe the effect of the method
used on location prediction accuracy.
Supervised methods have also been considered by a number
of previous studies. Cheng et al.[4] treat local word identification as a
classification problem. First, they fit the geographical distribution of each word with
the spatial variation model of Backstrom et al.[1]. Specifically, this model assumes that
each word has a geographical center and a dispersion ratio. In other words, the probability
of seeing the word decreases as the distance from the word's geographical center
grows. After the model is fit, its parameters are used as
word features. Second, they manually labeled 19,178 words in a dictionary as either local
or non-local. Finally, they train a classification model and apply it to all other words in
the tweet dataset. Ryoo and Moon[17] apply the above methods to a Korean dataset and
achieve good results.
After identifying local words, the problem is how to use them to predict a user's home
location. Most studies model the prediction problem in a probabilistic manner: they build
probabilistic models that characterize the conditional distribution of a user's location
with respect to the content of their tweets, then decompose the model to make predictions.
A few studies adopt classification-based approaches, treating a user's statistics
about local words as features and all candidate locations as classification labels. Hecht et
al.[9] select the top 10,000 words with the highest predictability score as local words. In
their work, users are represented as feature vectors of fixed size 10,000 and fed into a
Multinomial Naive Bayes classifier for training. Similarly, Rahimi et al.[15] apply logistic
regression to users' TF-IDF vectors. Instead of selecting local words as features, Mahmud et
al.[13] adopt a hierarchical ensemble algorithm to train two-level classifier ensembles
on timezone-city or state-city granularities. In their extended work, they propose identifying
and removing travelling users from the training data to improve the performance of home
location classifiers; travelling users are identified as users who have two tweets sent
from locations more than 161 kilometers apart. Wing and Baldridge[20] use a K-D tree
to achieve adaptive grids in their hierarchy. This leads to better granularity for populated
regions and avoids over-representing less populated areas.
2.2 Social-Network-Based location prediction

In addition to posting tweets, the other major activities that users engage in on Twitter
are establishing connections and interacting with friends. Like tweet content, social
relationships between users may be location indicative. It has also been argued in previous
studies that social closeness, which is based on interactions between users, is more reliable
for estimating locations. Social-Network-Based methods can be divided into two
categories: Friendship-Based Methods and Social-Closeness-Based Methods.
2.2.1 Friendship-Based Methods
In social science, the assumption of homophily suggests that similar people make contact
at a higher rate than dissimilar ones. The intuition here is that one's location is
very likely to coincide with the locations of one's friends. Ren et al.[16] assume that the
higher the proportion of a user's friends living at a location, the higher the probability
that the user lives at the same location. Davis et al.[6] propose the same approach but only
consider mutual friendships.
One attempt to relate friendship to the distance between home locations was
made by Backstrom et al.[1]. Although their study was conducted on Facebook, it is
considered here for its impact on Twitter-based studies. They analyze
a large number of Facebook users with known locations and their friendships, and
fit the probability of two users being friends as a function of the distance between
their homes. They found that the probability of friendship is inversely proportional to
home distance. Based on this model, given the friends of a user and their home locations,
the most probable location of the user can be found. These attempts depend on direct
friendship and assume that friendship observed on an online social network reflects real
off-line friendship and thus short home distances, which may be far from accurate.
Kong et al.[12] studied this phenomenon and found that a pair of friends has
an 83% chance of living within 10 kilometers of each other if their common friends make up
more than half of their friends, and that this chance decreases to 2.4% if their common
friend ratio is limited to 10%.
2.2.2 Social-Closeness-Based location prediction
The friendship-based methods in the previous subsection rely on the friendships available
in the Twitter network. Kong et al.[12] observed that social closeness, or how familiar two
users are with each other in real life, is a better indicator of home proximity but is not
easy to estimate. Many studies find that the inverse proportion model proposed by
Backstrom et al.[1] on Facebook does not hold for Twitter; for example, McGee et al.[14]
observed that friendship probability with respect to home distance on Twitter follows a
bimodal distribution.
In the Twitter network, mentions are a form of user interaction other than
establishing friendships. When users mention each other or converse
with each other, they are likely to have a closer relationship or to share similar
interests in real life as well, and such friends are more valuable
for predicting locations. McGee et al.[14] showed this in an analysis of a U.S. Twitter
dataset, finding that users mentioning and actively chatting with each other
also indicates home location proximity.
Besides the work of McGee et al.[14], Compton et al.[5] also exploit mentions between
users. They build a user mention graph and optimize the unknown home locations such
that users mentioning each other are located as close together as possible. Their results
show that they achieve 89% coverage at a median error of 6.33 km. Jurgens et al.[11] also
take bidirectional mention relationships into consideration instead of friendships. Rahimi et
al.[15] find that bidirectional mentions are too rare to be useful, so they treat
unidirectional mentions as undirected edges.
Hua et al.[10] assume that the more a user is influenced
by others mentioning an entity, the more likely the user is to mention the same entity.
Specifically, they adopt an incremental disambiguation approach. In the offline stage,
they preprocess a large number of tweets as a base system. Such preprocessing enables
them to estimate friendship-based user interest in entities in the online stage. When a
candidate entity is considered for a mention in a user's tweet, they look for other users who
have mentioned this entity. An entity is preferred if those users have good reachability to
the user who mentioned the entity in the friendship network. Efforts were also made to
achieve efficient reachability queries.
2.3 Context-Based location prediction

A tweet is more than a short piece of content. When a tweet is posted,
it is associated with its posting timestamp. With the pervasiveness of GPS-enabled
devices like cell phones and tablets, users may also optionally publish their current
locations as geotags on tweets. Finally, users may complete their profiles to include
information such as home cities, timezones, and personal web sites. All of the above
information provides context that helps us better understand tweets. A user's day-to-day
tweets can be interpreted more precisely if all such data are accessible. Since timestamps,
geotags, and user profiles serve as contextual data for tweets, we refer to them as tweet
context.
Mahmud et al.[13] take tweet posting time into consideration. In their dataset, all posting
times are recorded and the day is divided into equal-length time slots; users are
then viewed as distributions of tweet posting times. Since users in different time zones
exhibit time shifts in their distributions, a timezone classifier is trained with this
distribution as features. Such classification reveals the timezones of users and could
thus narrow users' locations down to a wide region.
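The posting-time representation described above can be illustrated with a short Python sketch. It is not the implementation of Mahmud et al.[13]; the function name `posting_time_distribution`, the default of 24 slots, and the sample timestamps are assumptions made for illustration only.

```python
from collections import Counter
from datetime import datetime

def posting_time_distribution(timestamps, n_slots=24):
    """Fraction of a user's tweets falling in each equal-length slot of the day."""
    if 86400 % n_slots != 0:
        raise ValueError("n_slots must divide the 86400 seconds of a day evenly")
    slot_len = 86400 // n_slots
    counts = Counter()
    for ts in timestamps:
        seconds_into_day = ts.hour * 3600 + ts.minute * 60 + ts.second
        counts[seconds_into_day // slot_len] += 1
    total = sum(counts.values())
    return [counts[i] / total if total else 0.0 for i in range(n_slots)]

# Hypothetical posting times of one user (all in the same reference time zone).
sample_times = [
    datetime(2017, 8, 20, 1, 30),
    datetime(2017, 8, 20, 1, 45),
    datetime(2017, 8, 21, 13, 0),
    datetime(2017, 8, 22, 23, 59),
]
distribution = posting_time_distribution(sample_times)
```

The resulting vector has one entry per time slot and sums to one, so it could serve directly as the feature vector of a timezone classifier.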
Han et al.[2] observe that self-declared locations and timezones, given as
free text, are not always accurate. Informal abbreviations like "mel" (for Melbourne) may
occur. Therefore, besides tweet content, they also include all four-grams of self-declared
locations and timezones as features to train a home location predictor.
Chapter 3
Methodology
Twitter users may reveal location information in tweets explicitly by mentioning
location names such as countries, states, restaurants or local places. They
may also reveal this information implicitly by using words that encode
geographical information. Such words may be used more in some particular regions
than in others, or may be used only in some regions. For example, Texas residents use
the word "Howdy", while people in Philadelphia call themselves "Phillies". If we can
find approaches to measure the locality of a word, then using these words as features
will help in building a city-level location classifier for tweets.
3.1 LIW Identification Algorithms

In our work we investigate algorithms that try to eliminate location-irrelevant words, i.e.,
to identify the local words that are used in tweets generated from particular places.
Since local words, unlike stop words, cannot be enumerated in advance, methods for
identifying local words are investigated.
3.1.1 Heuristic-Based Approaches
Heuristic methods are not guaranteed to be optimal or perfect, but they are sufficient for
the immediate goals. Where finding an optimal solution is impossible or impractical,
heuristic methods can be used to speed up the process of finding a satisfactory solution.
In this part two heuristic approaches are investigated.
CALGARI
In this approach a simple heuristic algorithm called CALGARI[9] is used. This
algorithm is based on the intuition that a model will perform better if it is trained on
words that are more likely to be used in tweets from particular regions than in tweets from
the general population. A score is calculated for each term, and the words are then ranked
according to this score. Mathematically, the score for each term is defined as the
maximum conditional probability of the word given each of the classes (cities) being
examined, divided by the probability of the word.
This score is calculated as follows.
Let score(W) be a function that takes a word W and calculates its score,
f(W) be the frequency of word W in our dataset, count(W, c) be a function
that counts how many times word W appeared with class c, S be the set of all words
in our dataset, and C be the set of all classes (city locations) in our dataset. The score
for each word is calculated as:
$$\mathrm{score}(W) = \frac{\max_{c \in C} P(W \mid c)}{P(W)}$$
where P(W) is the probability of the presence of the word, calculated as the frequency
of the word over the total number of words:
$$P(W) = \frac{f(W)}{|S|}$$
and P(W | c) is the conditional probability of the presence of word W given some class c,
calculated as the number of times word W appeared with class c over the total number of
occurrences of all words with class c. The numerator is therefore evaluated as:
$$\max_{c \in C} P(W \mid c) = \max_{i} \frac{\mathrm{count}(W, c_i)}{\sum_{j} \mathrm{count}(t_j, c_i)}$$
where t_j ranges over the words in S and c_i over the classes in C.
Now after calculating a score for each word, the algorithm sorts the words according to
the calculated score in non-decreasing order.
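The CALGARI score can be sketched in a few lines of Python. This is an illustrative sketch rather than the thesis implementation: the function name `calgari_scores`, the input format of (tokens, city) pairs, the toy tweets, and the reading of |S| as the total number of word occurrences are all assumptions.

```python
from collections import Counter, defaultdict

def calgari_scores(labeled_tweets):
    """Map each word to max_c P(W | c) / P(W) for (tokens, city) pairs."""
    word_freq = Counter()                 # f(W): occurrences of W in the dataset
    class_counts = defaultdict(Counter)   # count(W, c): occurrences of W with class c
    for tokens, city in labeled_tweets:
        word_freq.update(tokens)
        class_counts[city].update(tokens)
    # Assumption: |S| is taken as the total number of word occurrences.
    total_words = sum(word_freq.values())
    class_totals = {c: sum(cnt.values()) for c, cnt in class_counts.items()}
    scores = {}
    for word, freq in word_freq.items():
        p_word = freq / total_words
        max_conditional = max(
            class_counts[c][word] / class_totals[c] for c in class_counts
        )
        scores[word] = max_conditional / p_word
    return scores

# Hypothetical toy data: two cities, four tokenized tweets.
toy_tweets = [
    (["howdy", "yall", "traffic"], "houston"),
    (["howdy", "rodeo"], "houston"),
    (["traffic", "subway", "bagel"], "new_york"),
    (["traffic", "subway"], "new_york"),
]
toy_scores = calgari_scores(toy_tweets)
```

On this toy data, a word used only in one city ("howdy") receives a higher score than a word spread across both cities ("traffic"), which matches the intuition behind the heuristic.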
Term Frequency - Inverse City Frequency
TF-ICF stands for term frequency - inverse city frequency. It is adapted from TF-IDF
(term frequency - inverse document frequency), which is often used in information
retrieval and text mining. This weight is a statistical measure used to evaluate how
important a word is in our dataset. The intuition here is that location indicative words
should have two properties:
1. High Term Frequency (TF): The word should be observed reasonably often in tweets
generated from some city.
2. High Inverse City Frequency (ICF): The word should be local, so it occurs in a
relatively small number of cities.
We calculate TF as the number of times the word appears in the dataset, i.e., the
frequency f(W) defined in the previous section:
$$\mathrm{TF}(W) = f(W)$$
Then we calculate ICF as the number of classes (cities) divided by the number of
cities with tweets that include the word, cf(W):
$$\mathrm{icf}(W) = \frac{|C|}{\mathrm{cf}(W)}$$
After calculating these two values, words are ranked by ICF first, and then by
TF, in decreasing order, so that a word that is local but has a relatively large number of
appearances is preferred.
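A minimal Python sketch of this ranking follows. The function name `tf_icf_ranking`, the (tokens, city) input format, and the toy tweets are assumptions for illustration; the sort key simply encodes "ICF first, then TF, both decreasing".

```python
from collections import Counter, defaultdict

def tf_icf_ranking(labeled_tweets):
    """Rank words by ICF, breaking ties by TF, both in decreasing order."""
    tf = Counter()                        # TF(W) = f(W)
    cities_with_word = defaultdict(set)   # used to derive cf(W)
    cities = set()
    for tokens, city in labeled_tweets:
        cities.add(city)
        tf.update(tokens)
        for word in set(tokens):
            cities_with_word[word].add(city)
    icf = {w: len(cities) / len(cities_with_word[w]) for w in tf}
    return sorted(tf, key=lambda w: (icf[w], tf[w]), reverse=True)

# Hypothetical toy data: two cities, four tokenized tweets.
icf_tweets = [
    (["howdy", "yall", "traffic"], "houston"),
    (["howdy", "rodeo"], "houston"),
    (["traffic", "subway", "bagel"], "new_york"),
    (["traffic", "subway"], "new_york"),
]
icf_ranking = tf_icf_ranking(icf_tweets)
```

Here "traffic" is the most frequent word but appears in both cities, so it ends up last, while the frequent single-city words "howdy" and "subway" lead the ranking.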
3.1.2 Information Theory-Based Approach
In addition to the heuristic methods described in the previous section, we also discuss an
information-theoretic feature selection method, as such methods have proved to be
effective in text classification tasks, e.g., Information Gain (IG) in the work of Yang and
Pedersen[22]. In addition, Han et al.[2] reported that this method achieved the
best results in the location detection task [7].
First, let us define two important terms. The first is Information Gain (IG). The
Information Gain is the difference in class (location) entropy due to splitting the data on
some attribute (word), so the higher the value, the greater the predictability of the word.
Given the set S of all words in our training set, the IG of a word W ∈ S across all classes
(cities) C is calculated as follows:
$$IG(W) = H(c) - H(c \mid W)$$
$$-H(c \mid W) = P(W) \sum_{c \in C} P(c \mid W) \log P(c \mid W) + P(\bar{W}) \sum_{c \in C} P(c \mid \bar{W}) \log P(c \mid \bar{W})$$
where $P(W)$ and $P(\bar{W})$ are the probabilities of the presence and absence of word W,
respectively, and $P(c \mid W)$ and $P(c \mid \bar{W})$ are the conditional probabilities of
class c when word W is present and absent, respectively. Because H(c) is constant over all
words, only the conditional entropy given word W needs to be calculated to rank the
features.
The second term is the Intrinsic Entropy Value (IV). Local words
occurring in a small number of cities usually have a low intrinsic value according to
observations by Han et al.[2], whereas non-local words have a high intrinsic value. So when
words are comparable in IG, words with a smaller intrinsic value should be preferred,
because this means that they are used more locally (location indicative).
$$IV(W) = -P(W) \log P(W) - P(\bar{W}) \log P(\bar{W})$$
With the two terms defined above, the Information Gain Ratio (IGR) is defined as
the ratio of the information gain to the intrinsic value:
$$IGR(W) = \frac{IG(W)}{IV(W)}$$
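The IG, IV and IGR definitions above can be checked numerically with a small Python sketch. Several choices here are assumptions rather than details given in the text: probabilities are estimated at the tweet level (a word is "present" if it occurs at least once in a tweet), natural logarithms are used, words that occur in every tweet get a score of 0 because their intrinsic value is 0, and the names `igr_scores` and `_plogp` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def _plogp(p):
    """p * log(p), with the usual convention that 0 * log(0) = 0."""
    return p * math.log(p) if p > 0 else 0.0

def igr_scores(labeled_tweets):
    """Information gain ratio per word, from tweet-level presence/absence."""
    tweets = [(set(tokens), city) for tokens, city in labeled_tweets]
    n = len(tweets)
    city_counts = Counter(city for _, city in tweets)
    present = defaultdict(Counter)  # present[w][c]: tweets from c that contain w
    for tokens, city in tweets:
        for word in tokens:
            present[word][city] += 1
    class_entropy = -sum(_plogp(k / n) for k in city_counts.values())  # H(c)
    scores = {}
    for word, per_city in present.items():
        n_with = sum(per_city.values())
        n_without = n - n_with
        p_with, p_without = n_with / n, n_without / n
        neg_cond_entropy = 0.0  # accumulates -H(c | W)
        for city, n_city in city_counts.items():
            if n_with:
                neg_cond_entropy += p_with * _plogp(per_city[city] / n_with)
            if n_without:
                neg_cond_entropy += p_without * _plogp(
                    (n_city - per_city[city]) / n_without
                )
        info_gain = class_entropy + neg_cond_entropy
        intrinsic = -_plogp(p_with) - _plogp(p_without)
        scores[word] = info_gain / intrinsic if intrinsic > 0 else 0.0
    return scores

# Hypothetical toy data: two cities, four tokenized tweets.
igr_tweets = [
    (["howdy", "yall", "traffic"], "houston"),
    (["howdy", "rodeo"], "houston"),
    (["traffic", "subway", "bagel"], "new_york"),
    (["traffic", "subway"], "new_york"),
]
igr_result = igr_scores(igr_tweets)
```

In this toy example "howdy" perfectly separates the two cities, so its information gain equals its intrinsic value and its ratio is 1, while "traffic" appears in both cities and scores lower.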
3.2 Modeling Spatial Word Usage

After local words are identified, the aim is to use these words to predict a tweet's
location. In this work, a probabilistic model is used to characterize the conditional
distribution of a tweet's location with respect to its content (local words included).
A Multinomial Naive Bayes classifier is used to predict locations using a feature
set consisting of the local words included in the tweet. This classifier is based on Bayes'
theorem, shown below:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
Without the naive assumption, we would look at sentences in their entirety; if a sentence
does not show up in the training set, we get a zero probability, which makes
further calculations difficult. Naive Bayes instead assumes that every word is
independent of the others, so we look at the individual words in a sentence instead of
the entire sentence.
To make predictions, the Multinomial Naive Bayes classifier calculates the probability
of each class (city) given the set of features (local words) and selects the city with the
highest probability as the prediction result.
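The prediction rule can be made concrete with a tiny pure-Python classifier. This is a teaching sketch, not the classifier used in this work: the class name `TinyMultinomialNB`, the add-one (Laplace) smoothing controlled by `alpha`, the choice to ignore words outside the training vocabulary, and the toy data are all assumptions.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Minimal multinomial Naive Bayes over tokenized documents."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # assumed Laplace smoothing, avoids zero probabilities

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for cnt in self.word_counts.values() for w in cnt}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_log_prob = None, -math.inf
        for label, n_label in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            denom = total + self.alpha * len(self.vocab)
            log_prob = math.log(n_label / n_docs)       # log P(city)
            for word in tokens:
                if word in self.vocab:                   # skip unseen words
                    count = self.word_counts[label][word]
                    log_prob += math.log((count + self.alpha) / denom)
            if log_prob > best_log_prob:
                best_label, best_log_prob = label, log_prob
        return best_label

# Hypothetical toy training set of local words per tweet.
train_docs = [["howdy", "rodeo"], ["howdy", "yall"],
              ["subway", "bagel"], ["subway", "traffic"]]
train_labels = ["houston", "houston", "new_york", "new_york"]
nb_model = TinyMultinomialNB().fit(train_docs, train_labels)
predicted_city = nb_model.predict(["howdy", "traffic"])
```

Working in log space keeps the product of many small word probabilities from underflowing, which matters once tweets are scored against thousands of candidate cities.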
Chapter 4
Implementation
The main aim of this study is to detect location based purely on the content of tweets.
Every part of this task is challenging; for example, a large amount of data is needed
to train a classifier that gives good results, and the training itself is challenging
because the size of the data requires out-of-core learning. In this chapter we
present and discuss how each part of the project operates, along with the
challenges we faced when developing each component.
4.1 Data Preprocessing

The first requirement for getting good results from a machine learning model is having
clean and well-preprocessed data. In this section we present every step in making the
data ready for the training and testing phases.
4.1.1 Data Collection
The first step in establishing a good machine learning model is collecting data: the
more data we have, the more accurate the results the model can give. In this study we
focus mainly on Twitter data. As mentioned in previous chapters, Twitter gives the option
to include a geolocation (GPS coordinates) with tweets, so the main task here is to obtain
a large number of geotagged tweets. This task is challenging because geotagged tweets are
scarce, so we used multiple sources of data. The first source is a dataset covering
the period from 2016-01-15 to 2016-02-06, collected by Benjamin Bischke¹; for this dataset,
new geotagged posts were collected using a simple Python script. The second source is
an archived dataset from August 2012 to November 2012 by
the archivist Jason Scott². Finally, we streamed a further million tweets covering multiple
regions of the world over one week starting on the 20th of August 2017, using the Twitter
Public API with the help of the tweepy library³.
1 https://www.dfki.de/web/forschung/sds/mitarbeiter/base_view?uid=bebi02
2 https://archive.org/details/jason_scott
4.1.2 Data Labeling

To make the data ready for training and testing, the tweets need to be labeled with
unique identifiers that map to locations, as these labels are used as the target
output for each input tweet. A Python script labels the tweets by resolving the
coordinates provided with each tweet to a city using a reverse geocoding library⁴. In
this library, a K-D tree is populated with cities that have a population of more than one
thousand; the source of the data is GeoNames⁵. When the library is called with GPS
coordinates, it returns a city name, which is mapped to a unique identifier that labels the
queried tweet.
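The labeling step can be sketched as follows. To keep the example runnable without external data, the reverse geocoder is passed in as a function; `make_labeler` and `fake_reverse_geocode` are hypothetical names, and the stand-in lookup below is not the real library. In practice the injected function could wrap the reverse geocoding library, for example by returning the `name` field of the first result of its `search` call, but that wiring is an assumption here.

```python
def make_labeler(reverse_geocode):
    """Build a function that maps GPS coordinates to an integer city label.

    reverse_geocode: callable (lat, lon) -> city name. It is injected so that
    the id assignment can be shown without the real geocoding data.
    """
    city_ids = {}

    def label(lat, lon):
        city = reverse_geocode(lat, lon)
        # A city seen for the first time receives the next free integer id.
        return city_ids.setdefault(city, len(city_ids))

    return label, city_ids

def fake_reverse_geocode(lat, lon):
    """Stand-in for a real reverse geocoder (assumption, illustration only)."""
    return "Cairo" if lat > 0 else "Melbourne"

label_tweet, city_ids = make_labeler(fake_reverse_geocode)
first_label = label_tweet(30.04, 31.24)     # resolved to "Cairo"
second_label = label_tweet(-37.81, 144.96)  # resolved to "Melbourne"
third_label = label_tweet(30.05, 31.25)     # "Cairo" again, same id as before
```

Keeping the city-to-id dictionary alongside the labeler makes it possible to translate predicted labels back into city names later.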
4.1.3 The Bag Of Words Model

Before fitting models with machine learning, tweets need to be represented as
feature vectors. A commonly used model in natural language processing is the bag-of-words
model. The idea is to create a vocabulary of all the distinct words that occur in the
training set, where each word is represented by a unique id that is mapped to the number
of times it occurs.
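As a minimal sketch of the bag-of-words representation, the following Python builds a vocabulary of word ids and turns a tokenized tweet into a sparse count vector. The function names `build_vocabulary` and `to_feature_vector`, the dictionary-based sparse format, and the sample tokens are assumptions for illustration.

```python
from collections import Counter

def build_vocabulary(tokenized_tweets):
    """Assign a unique integer id to every distinct word in the training set."""
    vocab = {}
    for tokens in tokenized_tweets:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_feature_vector(tokens, vocab):
    """Sparse bag-of-words vector: word id -> number of occurrences."""
    counts = Counter(t for t in tokens if t in vocab)
    return {vocab[t]: c for t, c in counts.items()}

bow_vocab = build_vocabulary([["howdy", "from", "texas"], ["texas", "rodeo"]])
bow_vector = to_feature_vector(["texas", "texas", "unknown"], bow_vocab)
```

Words that were never seen during training, such as "unknown" above, are simply dropped, so every vector stays within the fixed vocabulary.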
4.1.4 Tokenization
Tokenization is an important step before filtering tweets. In this process the
text is broken down into individual elements (words): sentences are
split into individual words, punctuation is removed, and all letters are converted to
lowercase.
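A simple tokenizer along these lines can be sketched with a regular expression. The function name `tokenize` and the rule of keeping only runs of ASCII letters and digits are assumptions; a tokenizer tuned for tweets would likely treat hashtags, mentions and emoticons more carefully.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

sample_tokens = tokenize("Howdy from Houston, Texas!")
```

Punctuation never matches the pattern, so it is discarded as a side effect of the split.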
4.1.5 Stop Words

Stop words are words that are very common in a text corpus and are thus
not considered informative (e.g., words such as so, and, or, the). One approach to
removing stop words from tweets is to take a language-specific stop word dictionary,
create a list of these words, and then filter out of the tweets any word that belongs
to that list. In our study the stop word list from the NLTK⁶ corpus is used.
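The filtering step itself is a one-line membership test, sketched below. The tiny inline `STOP_WORDS` set is a stand-in so that the example runs without downloading corpus data; it is not the NLTK list used in this study, and the function name `remove_stop_words` is hypothetical.

```python
# Stand-in stop word list (assumption); the study uses the NLTK corpus list.
STOP_WORDS = frozenset({"so", "and", "or", "the", "a", "is", "in", "of", "to"})

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t not in stop_words]

filtered_tokens = remove_stop_words(["the", "rodeo", "is", "in", "houston"])
```

Using a set rather than a list keeps each lookup constant-time, which matters when millions of tweets are filtered.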
3 http://www.tweepy.org/
4 https://github.com/thampiman/reverse-geocoder
5 http://download.geonames.org/export/dump/
6 http://www.nltk.org
4.1.6 Stemming and Lemmatization
Stemming describes the process of transforming a word into its root form, for example
transforming "likes" to "like" and "swimming" to "swim". In contrast to stemming,
lemmatization aims to obtain the canonical (grammatically correct) forms of words,
which are called lemmas. Lemmatization is computationally more difficult and expensive
than stemming, and in practice both stemming and lemmatization have little impact on
the performance of text classification[18].
4.2 Location Indicative Words Identification

As mentioned in chapter 3, not all words contribute equally to a good classifier:
some words are more location indicative than others. Location indicative
words are words that encode some geographical information. In order to
rank words according to their location indicativeness, the three algorithms described in
chapter 3 are implemented.
4.2.1 CALGARI Algorithm Implementation
In this section we present pseudocode for the CALGARI algorithm. In Algorithm 1,
the main function CALGARI takes a word as one of its parameters
and calculates the score for that word based on the algorithm discussed in chapter 3. This
function makes use of two other functions presented in the same algorithm,
COUNT and FREQUENCY. COUNT calculates the total
number of occurrences of the words in a given list of words with a given class; it makes
use of another function, Occurrence, which calculates the number of times a word
occurs with a given class. FREQUENCY calculates the frequency
of a word over all tweets in the dataset.
Algorithm 1 CALGARI Algorithm

function CALGARI(word, classes, total number of words)
    max probability ← 0
    for class in classes do
        conditional probability ← count(word, class) / count(W, class)
        max probability ← max(conditional probability, max probability)
    probability of word ← frequency(word) / total number of words
    score of word ← max probability / probability of word
    return score of word
function COUNT(words, class)
    count ← 0
    for word in words do
        if word appeared with class then
            count ← count + Occurrence(word, class)
    return count
function FREQUENCY(word)
    W ← words in tweets
    count ← 0
    for w in W do
        if word is w then
            count ← count + 1
    return count
function Make Feature Set(classes, words in tweets)
    W ← words in tweets
    rank ← {}
    for w in W do
        rank(w) ← CALGARI(w, classes, |W|)
    sort(rank)
    return rank
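The scoring in Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function name calgari_scores, the input format (pairs of token list and city label) and the toy data are all hypothetical. It counts words once per class instead of rescanning the corpus for every word, but computes the same ratio of the maximum class-conditional probability to the overall word probability.

```python
from collections import Counter, defaultdict


def calgari_scores(labelled_tweets):
    """Score every word as max_c P(word | c) / P(word).

    labelled_tweets: iterable of (tokens, city_label) pairs (assumed format).
    """
    per_class = defaultdict(Counter)  # city -> word counts inside that city
    overall = Counter()               # word counts over the whole dataset
    for tokens, city in labelled_tweets:
        per_class[city].update(tokens)
        overall.update(tokens)

    total_words = sum(overall.values())
    class_totals = {c: sum(cnt.values()) for c, cnt in per_class.items()}

    scores = {}
    for word, freq in overall.items():
        # Highest conditional probability of the word over all classes.
        max_cond = max(per_class[c][word] / class_totals[c] for c in per_class)
        scores[word] = max_cond / (freq / total_words)
    return scores


# Toy data (hypothetical): "traffic" occurs in both cities, the rest in one only.
data = [
    (["nile", "traffic"], "cairo"),
    (["nile", "koshary"], "cairo"),
    (["traffic", "tube"], "london"),
    (["tube", "rain"], "london"),
]
scores = calgari_scores(data)
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

In this toy dataset the word shared by both cities receives the lowest score, which is the behaviour the ranking is meant to capture.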
4.2.2 TF-ICF Algorithm Implementation
In this section, the pseudo code for the TF-ICF algorithm is presented. In
Algorithm 2, each word is associated with two values, TF and ICF. The words are
then sorted in decreasing order of ICF, with ties broken by TF, as mentioned in
the explanation of the algorithm in chapter 3.
Algorithm 2 TF-ICF Algorithm

function ICF(word, classes, total number of words)
    cf ← 0
    for class in classes do
        if word in tweet with location class then
            cf ← cf + 1
    return |classes| / cf
function TF(word)
    W ← words in tweets
    count ← 0
    for w in W do
        if word is w then
            count ← count + 1
    return count
function Make Feature Set(classes, words in tweets)
    W ← words in tweets
    rank ← {}
    for w in W do
        rank(w) ← (ICF(w, classes, |W|), TF(w))
    sort(rank)
    return rank
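A compact Python sketch of the ranking in Algorithm 2 is given below. It is an illustration only: the function name tf_icf_ranking, the input format and the toy data are assumptions. ICF is computed as the number of classes divided by the number of classes in which the word occurs, and the sort key is the pair (ICF, TF) in decreasing order, so TF only breaks ties.

```python
from collections import Counter, defaultdict


def tf_icf_ranking(labelled_tweets):
    """Rank words by decreasing ICF, breaking ties by decreasing TF.

    labelled_tweets: iterable of (tokens, city_label) pairs (assumed format).
    """
    classes_with_word = defaultdict(set)  # word -> set of cities it occurs in
    tf = Counter()                        # word -> total number of occurrences
    classes = set()
    for tokens, city in labelled_tweets:
        classes.add(city)
        tf.update(tokens)
        for token in tokens:
            classes_with_word[token].add(city)

    icf = {w: len(classes) / len(cs) for w, cs in classes_with_word.items()}
    return sorted(tf, key=lambda w: (icf[w], tf[w]), reverse=True)


# Toy data (hypothetical).
data = [
    (["nile", "traffic"], "cairo"),
    (["nile", "koshary"], "cairo"),
    (["traffic", "tube"], "london"),
    (["tube", "rain"], "london"),
]
ranking = tf_icf_ranking(data)
```

Words that occur in a single city and occur often come first; the word seen in every city comes last.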
4.2.3 IGR Algorithm Implementation
In this section we present pseudo code for the Information Gain Ratio
algorithm. Algorithm 3 presents the main function IGR, which takes a word as one
of its parameters and calculates the Information Gain Ratio of the word by
dividing the Information Gain of the word by its Intrinsic Value. These two
quantities are calculated by the IG and IV functions respectively, as discussed
in chapter 3. The IG function uses the COUNT function presented in Algorithm 1.
Algorithm 3 IGR Algorithm

function IGR(word, classes, total number of words)
    Information Gain ← IG(word, classes, total number of words)
    Intrinsic Value ← IV(word, classes, total number of words)
    Information Gain Ratio ← Information Gain / Intrinsic Value
    return Information Gain Ratio
function IG(word, classes, total number of words)
    probability of word ← frequency(word) / total number of words
    sum1 ← 0
    sum2 ← 0
    for class in classes do
        appearance probability ← count(word, class) / count(W, class)
        appearance probability ← appearance probability / probability of word
        sum1 ← sum1 + appearance probability × log2(appearance probability)
        absence probability ← (1 - appearance probability) / (1 - probability of word)
        sum2 ← sum2 + absence probability × log2(absence probability)
    Information Gain ← probability of word × sum1 + (1 - probability of word) × sum2
    return Information Gain
function IV(word, classes, total number of words)
    probability of word ← frequency(word) / total number of words
    appearance entropy ← -probability of word × log2(probability of word)
    absence entropy ← -(1 - probability of word) × log2(1 - probability of word)
    return appearance entropy + absence entropy
function Make Feature Set(classes, words in tweets)
    W ← words in tweets
    rank ← {}
    for w in W do
        rank(w) ← IGR(w, classes, |W|)
    sort(rank)
    return rank
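The information gain ratio can also be sketched directly from the definitions in chapter 3, IG(W) = H(c) - H(c|W) divided by the intrinsic value of W. The sketch below is an assumption-laden illustration rather than a transcription of Algorithm 3: probabilities are estimated from tweet-level presence or absence of a word (not from raw word counts), and the names entropy, igr_scores and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def igr_scores(labelled_tweets):
    """Information gain ratio per word, estimated from word presence in tweets."""
    tweets = [(set(tokens), city) for tokens, city in labelled_tweets]
    n = len(tweets)
    class_counts = Counter(city for _, city in tweets)
    h_c = entropy(k / n for k in class_counts.values())

    vocab = set().union(*(tokens for tokens, _ in tweets))
    scores = {}
    for word in vocab:
        with_word = Counter(city for tokens, city in tweets if word in tokens)
        n_w = sum(with_word.values())
        if n_w == n:
            # Word present in every tweet: intrinsic value is 0, no information.
            scores[word] = 0.0
            continue
        p_w = n_w / n
        h_given_w = entropy(k / n_w for k in with_word.values())
        h_given_not_w = entropy(
            (class_counts[c] - with_word[c]) / (n - n_w) for c in class_counts
        )
        info_gain = h_c - (p_w * h_given_w + (1 - p_w) * h_given_not_w)
        intrinsic_value = entropy([p_w, 1 - p_w])
        scores[word] = info_gain / intrinsic_value
    return scores


# Toy data (hypothetical).
data = [
    (["nile", "traffic"], "cairo"),
    (["nile", "koshary"], "cairo"),
    (["traffic", "tube"], "london"),
    (["tube", "rain"], "london"),
]
igr = igr_scores(data)
```

In the toy data a word that perfectly separates the two cities gets the maximum ratio, while a word spread evenly over both cities gets a ratio of zero.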
4.3 Building a Classifier

After extracting features and ranking them according to location
indicativeness, a Multinomial Naive Bayes classifier is trained on the training
set, observing only the words that belong to the feature set (the location
indicative words). In this study, the implementation of the Multinomial Naive
Bayes algorithm provided by the scikit-learn library is used [3].
4.3.1 Fitting a Model
A key advantage of the scikit-learn library is that it can be used to implement
an out-of-core approach, in which learning from data that does not fit into main
memory is feasible. We therefore use an online classifier that supports a
partial fit method, which is fed with batches of samples. The remaining task is
to keep the feature space the same over time. For this, the hashing trick is
applied using scikit-learn's FeatureHasher, which projects each sample into the
same feature space across learning steps. This is especially useful in our case,
as new features (words) may appear in each batch.
In the experiments presented in the next chapter, the batch size is fixed at one
thousand tweets per learning step. Algorithm 4 (Out-Of-Core-Learning) presents
the procedure used to implement the out-of-core approach mentioned above. Its
Fit-Batch function partially fits the model to a batch of 1000 tweets. It makes
use of Make Feature Vector, which counts the occurrences of each feature-set
word seen in a tweet. Finally, FeatureHasher transforms the feature vectors into
one sparse matrix of shape (samples, features), which is passed to Partial Fit.
Algorithm 4 Out-Of-Core-Learning

function Fit-Batch(batch)
    labels list ← {}
    feature vector list ← {}
    for tweet in batch do
        labels list ← labels list + label of tweet
        feature vector list ← feature vector list + Make Feature Vector(tweet)
    X ← Transform(FeatureHasher, feature vector list)
    Y ← labels list
    Partial Fit(Model, X, Y)
function Make Feature Vector(tweet)
    feature vector ← {}
    for word in tweet do
        if word in feature set then
            feature vector(word) ← feature vector(word) + 1
    return feature vector
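The procedure in Algorithm 4 maps onto scikit-learn roughly as sketched below. This is a hedged illustration, not the thesis code: FEATURE_SET, CLASSES, fit_batch, predict_batch and the toy batches are hypothetical, and the n_features value is an arbitrary assumption. Two details are worth noting. MultinomialNB.partial_fit needs the full list of classes on its first call, and FeatureHasher is created with alternate_sign=False so that hashed counts stay non-negative, which Multinomial Naive Bayes requires.

```python
from collections import Counter

from sklearn.feature_extraction import FeatureHasher
from sklearn.naive_bayes import MultinomialNB

FEATURE_SET = {"nile", "koshary", "tube", "rain"}  # hypothetical LIW feature set
CLASSES = ["cairo", "london"]                       # all city labels, known up front

# alternate_sign=False keeps hashed counts non-negative, as MultinomialNB requires.
hasher = FeatureHasher(n_features=2**18, input_type="dict", alternate_sign=False)
model = MultinomialNB()


def make_feature_vector(tokens):
    """Count occurrences of feature-set words in one tokenised tweet."""
    return dict(Counter(t for t in tokens if t in FEATURE_SET))


def fit_batch(batch):
    """Partially fit the model on one batch of (tokens, label) pairs."""
    X = hasher.transform(make_feature_vector(tokens) for tokens, _ in batch)
    y = [label for _, label in batch]
    model.partial_fit(X, y, classes=CLASSES)


def predict_batch(batch_tokens):
    """Predict a city label for each tokenised tweet in the batch."""
    X = hasher.transform(make_feature_vector(tokens) for tokens in batch_tokens)
    return list(model.predict(X))


# Two tiny batches stand in for the 1000-tweet batches used in the experiments.
fit_batch([(["nile", "traffic"], "cairo"), (["tube", "rain"], "london")])
fit_batch([(["nile", "koshary"], "cairo"), (["traffic", "tube"], "london")])
predictions = predict_batch([["nile"], ["tube", "rain"]])
```

Because the hasher is stateless, the same object can be reused for every training and testing batch without ever holding the full vocabulary in memory.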
4.3.2 Testing a Model
To test the model, the predictions for the testing dataset are compared with the
actual labels. The same big-data challenge appears again, so testing is done in
the same manner as in the previous section, by calling the prediction procedure
presented in Algorithm 5 on batches of 1000 tweets each. The result is then
evaluated according to the evaluation metrics presented in chapter 5, using the
predicted label list returned by Predict and the actual label list.
Algorithm 5 Out-Of-Core-Testing

function Test-Batch(batch)
    labels list ← {}
    feature vector list ← {}
    for tweet in batch do
        labels list ← labels list + label of tweet
        feature vector list ← feature vector list + Make Feature Vector(tweet)
    X ← Transform(FeatureHasher, feature vector list)
    Y ← labels list
    return Predict(Model, X), Y
function Make Feature Vector(tweet)
    feature vector ← {}
    for word in tweet do
        if word in feature set then
            feature vector(word) ← feature vector(word) + 1
    return feature vector
4.4 Web Service

In this section, the proposed geolocation system is presented. The core structure of
the system consists of two main parts:
1. User Interface
2. Back-End Engine
4.4.1 User Interface
The user interface provides a stream of tweets from the dataset we used. The
username and profile image of the author are shown beside the tweet text. From
this page, the user can invoke the prediction service by hovering over the text
and pressing the geolocate icon (Figure 4.2). A modal containing a map then pops
up, showing the estimated location, the actual location of the tweet if
provided, and the error distance between the correct city location and the
estimated location (Figure 4.3). The user interface provides the user with two
options:
1. ALL DATASET: Shows all the tweets included in the dataset. By scrolling down,
the user gets more tweets in an infinite-scroll manner (Figure 4.1).
2. SEARCH: Lets the user search for tweets that include a given hashtag, and
shows the matching tweets in the same infinite-scroll manner used in the ALL
DATASET page (Figures 4.4 and 4.5).
4.4.2 Back-End Engine
The back-end of this service consists of a controller and a prediction engine.
The flow of the system is shown in Figure 4.6. The controller is the main bridge
between the user interface and the back-end prediction engine. When the user
chooses to geolocate a tweet, the index of this tweet is sent to the controller.
The controller makes a request to the back-end engine with the tweet information
and listens for the response containing the geolocation prediction of the chosen
tweet. The coordinates are then passed to the google-maps directive, which
visualizes the location on Google Maps in the user interface.
The prediction engine takes a tweet as input, uses the same procedure as in the
testing phase (subsection 4.3.2) to obtain the city label, and returns the GPS
coordinates of the city center to the controller.
4.4.3 Used Technologies
In this part of the work, the MEAN stack (MongoDB, Express, AngularJS and
NodeJS) is used to develop the web service. The user interface is written in
HTML (HyperText Markup Language) and CSS (Cascading Style Sheets), using the
Materialize CSS library7. To embed Google Maps in the user interface and
integrate it with AngularJS, ng-map8, an AngularJS directive for showing maps,
is used.
7 http://www.materializecss.com
8 https://ngmap.github.io/
Figure 4.1: Stream of Tweets from the dataset
Figure 4.2: On hovering on text, option to geolocate the tweet
Figure 4.3: Google-maps modal showing actual and estimated locations as well as error distance
Figure 4.4: Search Page with option to type Hashtag to get tweets with that hashtag
Figure 4.5: Search result when texting some hashtag in search option
Figure 4.6: Flow of the geolocation prediction web service

Chapter 5
Experimentation
5.1 Prediction Algorithms

In this study we experiment with three main prediction algorithms:
1. MNB - CALGARI: A Multinomial Naive Bayes classifier trained using the
location indicative words feature set chosen according to the CALGARI algorithm.
2. MNB - TF-ICF: A Multinomial Naive Bayes classifier trained using the feature
set chosen according to the TF-ICF algorithm.
3. MNB - IGR: A Multinomial Naive Bayes classifier trained using the location
indicative words feature set chosen according to the IGR algorithm.
In these algorithms, 40% of the extracted feature set was used for training,
because memory requirements grow as the feature set size increases.
5.2 Evaluation Metrics

To evaluate the three algorithms mentioned in the previous section, the
following evaluation metrics are used:
1. Acc@161: The proportion of predictions that are within 161 kilometers (100
miles) of the correct city-level location. This relaxed version of accuracy
captures near-miss predictions.
2. Median Error Distance: The median prediction error distance, measured in
kilometers between the predicted city centres and the true geolocation embedded
in tweets. The median is preferred over the mean because it is less sensitive to
extremely incorrect predictions, for example a user located in the United
Kingdom who is predicted to be in Australia, whereas the mean increases
substantially because of a small number of extreme misclassifications.
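Both metrics above can be computed from pairs of predicted and true coordinates, as in the following sketch. It is a simplified illustration under assumptions: the names haversine_km and evaluate are hypothetical, distances are great-circle distances on a sphere of mean Earth radius, and a single reference point per tweet is used for both metrics, whereas the definitions above distinguish the correct city location (for Acc@161) from the geolocation embedded in the tweet (for the median error).

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def evaluate(predicted, actual, threshold_km=161.0):
    """Return (Acc@threshold, median error distance in km)."""
    errors = [haversine_km(p, t) for p, t in zip(predicted, actual)]
    accuracy = sum(e <= threshold_km for e in errors) / len(errors)
    return accuracy, median(errors)


CAIRO = (30.0444, 31.2357)
ALEXANDRIA = (31.2001, 29.9187)
LONDON = (51.5074, -0.1278)

# Three predictions: one exact, one near miss beyond the threshold, one far off.
acc, med = evaluate(
    predicted=[CAIRO, CAIRO, LONDON],
    actual=[CAIRO, ALEXANDRIA, CAIRO],
)
```

The single far-off prediction barely moves the median, while it would dominate a mean, which is the robustness argument made above.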
5.3 Experiments and Results

In this section, the dataset, the experimental setup and the key results of the
three proposed algorithms are highlighted.
5.3.1 Dataset Description

Modifications were applied to the collected dataset. The classes in the
collected dataset were not represented equally, so models trained on this data
would suffer from the class imbalance problem and report misleading
classification accuracy. Because the collected dataset was relatively large,
resampling it into a more balanced dataset was the best choice, so the dataset
was under-sampled to obtain a balanced representation of each class. Table 5.1
shows the summary statistics of the training and testing datasets.
Table 5.1: Dataset description.

Set No. of Tweets
Training 3,389,130
Testing 378,570
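The under-sampling step described in this subsection can be sketched as follows. The thesis does not state the target size per class, so sampling every class down to the size of the smallest class is an assumption, as are the function name undersample, the fixed random seed and the toy data.

```python
import random
from collections import defaultdict


def undersample(labelled_tweets, seed=0):
    """Randomly keep the same number of tweets for every class.

    Assumption: each class is reduced to the size of the smallest class.
    """
    by_class = defaultdict(list)
    for tweet, city in labelled_tweets:
        by_class[city].append((tweet, city))

    target = min(len(items) for items in by_class.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for items in by_class.values():
        balanced.extend(rng.sample(items, target))
    rng.shuffle(balanced)  # avoid long runs of a single class in later batches
    return balanced


# Toy data (hypothetical): five Cairo tweets and two London tweets.
data = [("c%d" % i, "cairo") for i in range(5)] + [("l%d" % i, "london") for i in range(2)]
balanced = undersample(data)
```

Shuffling after sampling matters for batch-wise training, since a batch that contains only one class gives the online classifier a skewed update.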
5.3.2 Experimental Setup

The experimental setup consists of two main phases. In the first phase, the
training dataset is used to extract the various feature sets based on the three
algorithms mentioned, and the geolocation predictors (MNB-CALGARI, MNB-TF-ICF
and MNB-IGR) are then trained using the out-of-core learning procedure described
in chapter 4. In the second phase, the predictors are run on the testing dataset
using the out-of-core testing procedure described in chapter 4.
Table 5.2: Results for Tweet-level Geolocation Prediction. The bolded results indicate
the best performing algorithms (highest value for accuracy and lowest value for median
error) for the testing set.
Algorithm      Acc@161   Median Error Distance (km)
MNB-CALGARI    0.0465    7260.5573
MNB-TF-ICF     0.0254    7034.6405
MNB-IGR        0.0468    6058.6531
5.3.3 Results
Table 5.2 shows the tweet-level geolocation prediction results in terms of
accuracy within 161 kilometers and median error distance. The results show that
MNB-IGR achieves the best result in terms of both Acc@161 and Median Error
Distance. This agrees with the study [8], where the IGR feature selection method
also achieved the best results. The table also shows that MNB-TF-ICF achieves a
lower Acc@161 but a better Median Error Distance than MNB-CALGARI.
5.3.4 Analysis
Although the investigated algorithms have been tested in previous studies with
satisfactory results, the results of the experiments conducted in this work were
lower than expected. This section presents some of the reasons:
1. Dataset Size: The dataset size affects the performance of the location
indicative words identification process as well as the fitting of a classifier.
Previous work that showed better results used larger datasets. For example, Bo
Han et al. used a dataset of around 38 million tweets; with this large amount of
data they were able to achieve better results using two of the location
indicative words identification algorithms presented in this work, as shown in
Table 5.3.
Table 5.3: Results for Geolocation Prediction in work done by Bo Han et al using larger
dataset of 38 Million Tweets
Algorithm      Acc@161   Median Error Distance (km)
MNB-TF-ICF     0.359     533
MNB-IGR        0.450     260
2. Tweeting Nature: The characteristics of Twitter pose challenges for the task
of geolocation prediction. Users often write tweets in a very casual manner.
Acronyms, misspellings, and special tokens make tweets very noisy, and
techniques developed for formal documents are error-prone on tweets. In
addition, the 140-character limit makes tweets short, which limits the number of
local words seen in each tweet.
3. Misleading Travelling Users: In this work, the geolocation embedded in a
tweet is assumed to be the location of the tweet. Users who generate tweets from
different geolocations are therefore misleading both for extracting location
indicative words and for training a classifier. Identifying and removing
travelling users from the training data could improve the performance of
location classifiers. According to the work of Mahmud et al. [13], a person is
considered to be travelling if any two of her tweets were sent from locations
more than 100 miles apart.
4. Lack of Information: Because a tweet is limited to 140 characters, it does
not include many words and so may not contain location indicative words that
would help the prediction task when only the tweet is given as input. The
information here is more limited than in the user-level location prediction
task, where all of a user's tweets are provided. It is therefore worthwhile to
exploit the input with reasonable redundancy, for example by including all rare
n-grams, even those occurring just three times.
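The travelling-user criterion mentioned in item 3 of the list above (any two tweets of a user sent from locations more than 100 miles apart) can be sketched as a filter over the training data. This is an illustrative sketch only: the function names, the input format of (user id, coordinates) pairs and the toy data are assumptions, and distances use the haversine formula on a spherical Earth.

```python
import math
from collections import defaultdict
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)
MILES_TO_KM = 1.609344


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def travelling_users(tweets, threshold_miles=100.0):
    """Return the users having any two tweets farther apart than the threshold.

    tweets: iterable of (user_id, (lat, lon)) pairs (assumed format).
    """
    by_user = defaultdict(list)
    for user, coords in tweets:
        by_user[user].append(coords)

    limit_km = threshold_miles * MILES_TO_KM
    return {
        user
        for user, points in by_user.items()
        if any(haversine_km(p, q) > limit_km for p, q in combinations(points, 2))
    }


# Toy data (hypothetical): user "a" tweets from Cairo and Alexandria, "b" stays in Cairo.
tweets = [
    ("a", (30.0444, 31.2357)),
    ("a", (31.2001, 29.9187)),
    ("b", (30.0444, 31.2357)),
    ("b", (30.0600, 31.2500)),
]
movers = travelling_users(tweets)
training = [(u, c) for u, c in tweets if u not in movers]
```

The pairwise check is quadratic in the number of tweets per user, which is acceptable for typical users but may need a cheaper bound, such as a bounding-box test, for very active accounts.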
Chapter 6
Future Work
This study addresses a very challenging task, and its results could be improved
in various directions. This chapter presents some future work that could be
discussed and researched.
6.1 Hierarchical Classification Models

A hierarchical classifier is a type of classifier that applies classification at
a low level to specific pieces of data, then combines the individual pieces and
classifies them at a higher level in an iterative manner. These classifiers are
becoming increasingly popular. One example is the stacking approach [1], in
which training consists of two steps: the first is a Multinomial Naive Bayes
classifier and the second is a logistic regression learner. Such models rely on
the power of the hierarchical structure itself rather than on the computational
abilities of the individual components. Using these classifiers could improve
the accuracy of our results by combining multiple learners.
6.2 Exploiting Social Network Data

Although retrieving social network data is not a trivial task, exploiting this
data could result in a significant improvement in the accuracy of predictors.
Social network data such as friendships on Facebook or followers on Twitter
could be location indicative. For example, Backstrom et al. (2010) [1]
investigated the probability of friendship as a function of distance and found
that the probability of friendship is proportional to the inverse of distance.
We can thus exploit followers and friends and combine their location information
with the current predictors in order to improve geolocation accuracy.
6.3 Exploiting User Metadata

Incorporating user metadata could provide a valuable source of geographic
information beyond the information extracted from tweets [7]. User metadata
contains some fields that are related to geographic information, such as
location, timezone, description and username. Unlike the social network data
mentioned in the previous section, retrieving metadata is not a challenging
task, as it is included in the JSON object provided by the Twitter Streaming
API. However, these fields are dynamic, as they can be changed by the user over
time. Another problem with this data is that users are free to write whatever
information they choose. Overcoming these problems could give a good chance to
exploit this data and improve the prediction task.
One of the fields provided in the metadata is timezone, which could be a
powerful tool. In a previous study (Mahmud et al. 2012), the relation between
timezone and average tweet volumes is used to train a classifier, so combining
the predictors with other learners that use timezone information could enhance
predictions.
6.4 Exploiting Non-Geotagged tweets

To enhance the prediction accuracy of the predictors or to increase the feature
set size, the training set should be enlarged, but the scarcity of geotagged
tweets prevents this enlargement. Collecting further tweets from users who have
posted geotagged tweets, whose location is known in advance (from the GPS
coordinates included in their geotagged tweets), gives a good chance to enlarge
the dataset and thus enhance performance.
Although research on text-based geolocation has used geotagged data for
evaluation, the ultimate goal of this research is to predict the locations of
users whose location is not known. Because geotagged tweets are typically sent
via GPS-enabled devices such as smartphones, while non-geotagged tweets are sent
from a wider range of devices, there could be differences in the content of
geotagged and non-geotagged tweets. Exploiting non-geotagged data for evaluation
as well, so that the dataset represents users through a combination of geotagged
and non-geotagged tweets, could therefore produce better results.
6.5 Influence of Language on Geolocation Prediction

Twitter is considered a multilingual medium, so in order to make use of the huge
volume of data provided by Twitter, tweets in all languages have to be
geolocated. A series of experiments in previous research [2] has shown the
influence of language on geolocation prediction. Among the top 10 languages
found on Twitter, English is shown to be the most difficult language for user
geolocation, because English is the most global language. The best proposed way
to perform multilingual geolocation prediction is to train language-specific
models and geolocate users based on their primary language.
Chapter 7
Conclusion
Twitter, as one of the most popular online social networks, provides a virtual
world in which users can establish friendships and share their daily interests.
Real-life locations are involved in every corner of the Twitter world. If
locations are provided in a correct form, they can facilitate various
applications and benefit users in real life. Because the locations provided by
Twitter users are incomplete and inaccurate, extensive research effort has been
spent on the geolocation sparseness problem in Twitter.
In this study, research efforts on location prediction in Twitter are presented
and categorized into three categories: Content-Based, Social-Network-Based and
Context-Based. Focusing on content-based approaches, the identification of
location indicative words for use as features in predicting tweet locations is
investigated. Three algorithms that try to identify these words are tested using
a Multinomial Naive Bayes classifier, followed by the implementation of a web
service that visualizes the geolocation prediction of tweets to make the
prediction task more interactive.
The experiments conducted showed that information-theoretic methods achieve the
best results in terms of accuracy at a threshold distance and median error
distance. The study also presented an analysis to explain the less-than-expected
results and concluded that performance is affected by factors such as dataset
size, the noisy nature of tweets, misleading travelling users, and the lack of
information in tweet content.
Finally, the study presents some future research directions to enhance
performance and overcome challenges in the tweet location prediction task, such
as implementing hierarchical classification models, exploiting social network
data, exploiting user metadata and non-geotagged tweets, and building
language-specific classifiers.
Appendix
Appendix A
Lists
List of Figures
4.1 Stream of Tweets from the dataset
4.2 On hovering on text, option to geolocate the tweet
4.3 Google-maps modal showing actual and estimated locations as well as error distance
4.4 Search Page with option to type Hashtag to get tweets with that hashtag
4.5 Search result when texting some hashtag in search option
4.6 Flow of the geolocation prediction web service
List of Tables
5.1 Dataset description.
5.2 Results for Tweet-level Geolocation Prediction. The bolded results indicate the best performing algorithms (highest value for accuracy and lowest value for median error) for the testing set.
5.3 Results for Geolocation Prediction in work done by Bo Han et al using larger dataset of 38 Million Tweets
Bibliography
[1] Lars Backstrom, Eric Sun, and Cameron Marlow. Find me if you can: Improving geographical prediction with social and spatial proximity. In Proceedings of the 19th International Conference on World Wide Web, WWW '10, pages 61-70, New York, NY, USA, 2010. ACM.
[2] Bo Han, Paul Cook, and Timothy Baldwin. Geolocation prediction in social media data by finding location indicative words. In Proceedings of COLING, pages 1045-1062, 2012.
[3] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gael Varoquaux. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108-122, 2013.
[4] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. You are where you tweet: a content-based approach to geo-locating twitter users. pages 759-768, 2010.
[5] Ryan Compton, David Jurgens, and David Allen. Geotagging one hundred million twitter accounts with total variation minimization. In Big Data (Big Data), 2014 IEEE International Conference on, pages 393-401. IEEE, 2014.
[6] Clodoveu A Davis Jr, Gisele L Pappa, Diogo Renno Rocha de Oliveira, and Filipe de L Arcanjo. Inferring the location of twitter messages based on user relationships. Transactions in GIS, 15(6):735-751, 2011.
[7] Bo Han, Paul Cook, and Timothy Baldwin. A stacking-based approach to twitter user geolocation prediction. In ACL (Conference System Demonstrations), pages 7-12, 2013.
[8] Bo Han, Paul Cook, and Timothy Baldwin. Text-based twitter user geolocation prediction. Journal of Artificial Intelligence Research, 49:451-500, 2014.
[9] Brent Hecht, Lichan Hong, Bongwon Suh, and Ed H Chi. Tweets from Justin Bieber's heart: the dynamics of the location field in user profiles. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 237-246. ACM, 2011.
[10] Wen Hua, Kai Zheng, and Xiaofang Zhou. Microblog entity linking with social temporal context. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1761-1775. ACM, 2015.
[11] David Jurgens, Tyler Finethy, James McCorriston, Yi Tian Xu, and Derek Ruths. Geolocation prediction in twitter using social networks: A critical analysis and review of current practice. ICWSM, 15:188-197, 2015.
[12] Longbo Kong, Zhi Liu, and Yan Huang. Spot: Locating social media users based on social network context. Proceedings of the VLDB Endowment, 7(13):1681-1684, 2014.
[13] Jalal Mahmud, Jeffrey Nichols, and Clemens Drews. Where is this tweet from? inferring home locations of twitter users. ICWSM, 12:511-514, 2012.
[14] Miller McPherson, Lynn Smith-Lovin, and James M Cook. Birds of a feather: Homophily in social networks. Annual review of sociology, 27(1):415-444, 2001.
[15] Afshin Rahimi, Duy Vu, Trevor Cohn, and Timothy Baldwin. Exploiting text and network context for geolocation of social media users. arXiv preprint arXiv:1506.04803, 2015.
[16] Kejiang Ren, Shaowu Zhang, and Hongfei Lin. Where are you settling down: Geo-locating twitter users based on tweets and social networks. In Asia Information Retrieval Symposium, pages 150-161. Springer, 2012.
[17] KyoungMin Ryoo and Sue Moon. Inferring twitter user locations with 10 km accuracy. In Proceedings of the 23rd International Conference on World Wide Web, pages 643-648. ACM, 2014.
[18] Michal Toman, Roman Tesar, and Karel Jezek. Influence of word normalization on text classification. Proceedings of InSciT, 4:354-358, 2006.
[19] Olivier Van Laere, Jonathan Quinn, Steven Schockaert, and Bart Dhoedt. Spatially aware term selection for geotagging. IEEE transactions on Knowledge and Data Engineering, 26(1):221-234, 2014.
[20] Benjamin Wing and Jason Baldridge. Hierarchical discriminative classification for text-based geolocation. In EMNLP, pages 336-348, 2014.
[21] Yuto Yamaguchi, Toshiyuki Amagasa, Hiroyuki Kitagawa, and Yohei Ikawa. Online user location inference exploiting spatiotemporal correlations in social streams. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 1139-1148. ACM, 2014.
[22] Yiming Yang and Jan O Pedersen. A comparative study on feature selection in text categorization. In Icml, volume 97, pages 412-420, 1997.
