
International Journal of Applied Engineering Research, ISSN 0973-4562 Vol.7 No.11 (2012)
Research India Publications; http://www.ripublication.com/ijaer.htm


Twitter Spamming: Techniques And Defence Approaches

Arun Kumar R
Electronics & Computer Engineering Department
Indian Institute of Technology Roorkee
Roorkee, India
rampurearun@gmail.com

Sandeep Kumar
Electronics & Computer Engineering Department
Indian Institute of Technology Roorkee
Roorkee, India
sgargfec@iitr.ernet.in


Abstract:
The rapid growth of social networking sites has had a huge
impact on today's society and on the Web platform. Social
networking sites have grown quickly in both size and
popularity in recent years, and among them Twitter is the
fastest growing. Its popularity attracts many spammers, who
infiltrate legitimate users' accounts with large amounts of
spam. As the volume of data on Twitter grows, detecting spam
in real time has become a challenging task for researchers as
well as for Twitter itself, and a great deal of work is being
carried out on spam detection. Twitter spam detection covers
both detecting spammers and detecting spam links posted by
users. In this paper we give a hierarchy of spamming
techniques, defence approaches, and the evasion tactics
adopted by spammers to evade detectors.
Keywords- Spam, Twitter, Machine Learning, Social
Networking


INTRODUCTION
The web has changed the way we inform and get informed.
Online Social Networking (OSN) sites provide a new form of
communication in today's world. As OSNs have grown, the
number of spam messages has also increased. Spam is an
inseparable part of the web. It has taken various forms since
it first appeared in 1978 as e-mail spam, up to the currently
emerging social networking spam on Facebook, LinkedIn and
Twitter. Twitter is a micro-blogging service, founded in 2006,
where users can post 140-character messages called tweets.
The goal of Twitter is to allow friends to communicate and
stay connected through the exchange of short messages. Unlike
Facebook and MySpace, Twitter is directed, meaning that a
user can follow another user, but the second user is not
required to follow back. Most accounts are public and can be
followed without requiring the owner's approval.
Because public Twitter accounts can be followed without the
owner's permission, spammers can easily follow legitimate
users as well as other spammers.
The term spam refers to unsolicited bulk messages, and a
spammer is one who spreads these messages. Spam is becoming
an increasing problem on Twitter, as on other online social
networking sites. Spammers use Twitter as a tool to post
malicious links, send unsolicited messages to legitimate
users, and hijack trending topics. Tweet spammers are driven
by several goals, such as spreading advertisements to
generate sales, disseminating pornography, viruses or
phishing links, or simply compromising the system's
reputation. Spammers are efficient at spreading spam because
they follow many Twitter users in the hope that those users
will follow them back; following back is regarded as proper
etiquette on Twitter [1]. Spam not only pollutes real-time
search, it also consumes extra resources from users and
systems [8]. Most importantly, spam wastes human attention,
the most valuable resource in the current era of the World
Wide Web.
Given that spammers are increasingly arriving on Twitter, the
success of real-time search services and mining tools relies
on the ability to distinguish valuable tweets from spam.


RELATED WORK
Spam detection has an extensive scope of research, covering
the identification of spammers or spam, the prevention of
spamming, and counterbalancing its effects on media, society,
etc. Kwak et al. [2] present an exhaustive and qualitative
study of the behaviour of Twitter user accounts, such as the
variations in the number of followers and followings for
normal users and spammers.
Cha et al. [3] designed alternative metrics to measure the
influence of Twitter accounts. M. McCord and M. Chuah [5]
show that user-based and content-based features, which are
shaped by Twitter policies, can be used to distinguish
between spammers and legitimate users on Twitter. The
usefulness of these features for spammer detection is
evaluated on a collected Twitter dataset using traditional
classifiers such as Random Forest, Naive Bayesian, Support
Vector Machine, and K-Nearest Neighbor.
Benevenuto et al. [11] investigate different tradeoffs in
classifying spammers rather than individual tweets
containing spam, and the impact of different attribute sets.
They also show that classifier performance changes with the
feature set selected. Nikita Spirin [9] shows the importance
of URL-derived feature sets in detecting spammers. Yang et
al. [4] focus more on analysing the evasion tactics utilized
by current Twitter spammers, and the authors designed new
machine learning
features to detect Twitter spammers more effectively. In
addition, the authors formalize the robustness of 24
detection features.


TWITTER TAXONOMY
Twitter has its own taxonomy, which this section defines.
a) Tweets: Short messages of at most 140 characters, used
as the communication tool.
b) Followers: A user's followers are the set of users who
receive a tweet when it is posted. If a user posts a tweet
on his home page, all of his followers receive the same
tweet on their home pages too.
c) Friends: Friends are the set of users an account
subscribes to in order to obtain access to status updates.
d) Hashtags: These are indicated by a # symbol and are
combined with keywords to indicate a topic of interest.
A hashtag becomes popular when many people use it.
e) Trending Topics: These are the popular hashtags. They
appear on the main Twitter page, which can significantly
increase the number of tweets containing that topic.
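
The follower and friend relations defined above form a
directed graph. The following is a minimal sketch of that data
structure; the class name FollowGraph and its methods are
hypothetical and are not part of any Twitter API.

```python
from collections import defaultdict


class FollowGraph:
    """Directed follow relation: an edge user -> target means 'user follows target'."""

    def __init__(self):
        self._friends = defaultdict(set)    # user -> accounts the user follows
        self._followers = defaultdict(set)  # user -> accounts following the user

    def follow(self, user, target):
        # Following is one-way: the target is not required to follow back.
        self._friends[user].add(target)
        self._followers[target].add(user)

    def friends(self, user):
        return set(self._friends.get(user, ()))

    def followers(self, user):
        return set(self._followers.get(user, ()))


graph = FollowGraph()
graph.follow("alice", "bob")
graph.follow("carol", "bob")
```

Here bob has two followers and no friends, which illustrates
that a follow does not create a link in the opposite direction.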

TWITTER SPAMMING TECHNIQUES
Twitter Spamming techniques can be divided into two
categories:

A. Profile-Based Spamming Techniques
Follow Spam: Follow spam is the act of following a
large number of people, not because the user is actually
interested in their tweets, but simply to gain attention,
get views of the user's own profile (and possibly clicks
on URLs therein), or (ideally) to get followed back.
Automated programs make this task easier, allowing an
account to follow thousands of users within seconds. In
extreme cases, these automated accounts follow so many
people that they threaten the performance of the entire
system. In less extreme cases, they simply annoy thousands
of legitimate users, who get a notification about the new
follower only to find that its interest may not be
entirely sincere. Such accounts can be identified by
checking the tweets they post and examining their
behaviour. Figure 1 shows an instance of follow spam.


Figure 1: Instance of Follow Spam Technique[7]

Mention Spam: Spammers mention the username of a
targeted user at the start of a tweet, which grabs the
targeted user's attention.

B. Content-Based Spamming Techniques
Trend Abuse: Twitter's API provides a list of the top
trends per hour. Spammers include these trending topics
in their tweets so that the tweets show up in the
timeline for those topics, annoying the users who follow
them, since tweets from public accounts can be seen by
anyone on Twitter. Figure 2 shows two typical instances:
(a) trend abuse and (b) multi-trend spam.


(a)Trend Abuse Scenario


(b)Multi-Trend Abuse Scenario

Figure 2: Instances of Trend abuse spamming techniques [11]
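
One simple way to flag the multi-trend pattern of Figure 2(b)
is to count how many currently trending hashtags a single
tweet carries. The sketch below is illustrative only: the
function names, the example trend list and the threshold of
three are assumptions, not values taken from Twitter or from
the cited work.

```python
import re

HASHTAG_RE = re.compile(r"#(\w+)")


def trending_hashtags_in(tweet, trends):
    """Return the trending hashtags (lower-cased, without '#') used in a tweet."""
    used = {tag.lower() for tag in HASHTAG_RE.findall(tweet)}
    return used & {t.lower().lstrip("#") for t in trends}


def looks_like_multi_trend_abuse(tweet, trends, threshold=3):
    """Hypothetical heuristic: several unrelated trends in one tweet is suspicious."""
    return len(trending_hashtags_in(tweet, trends)) >= threshold


# Invented example data.
trends = ["#WorldCup", "#Oscars", "#iPhone", "#MondayBlues"]
spam_tweet = "Cheap pills here http://spam.example/p #worldcup #oscars #iphone"
normal_tweet = "Watching the #Oscars tonight"
```

With these inputs the first tweet is flagged and the second is
not. A real detector would combine such a count with other
features rather than rely on it alone.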

Trend Setting: Here spammers post a large number of
tweets containing a specific word, making that word or
hashtag a new trending topic.
Fake Re-tweets: In this technique spammers take
advantage of Twitter's re-tweet convention to make it
appear that a spammer's tweet was originally published
by another user. These can be identified using Twitter's
search capabilities, where re-tweets can be distinguished
from original tweets.
Embedding Popular Search Terms: In this technique
spammers include popular search terms in their tweets.
When a user searches for those terms, the spam tweets
are displayed in the result set, which is an annoying
experience for a legitimate user who does not get the
expected results.
Direct Message: This is the traditional spamming
technique, where spammers send a personal message to
the targeted user directly.
New spamming techniques are still emerging today; the
techniques explained above are the ones most commonly
used by spammers.


TWITTER SPAM DETECTION
Twitter itself offers users several options for reporting spam
messages or spammers. These include reporting a user as a
spammer by clicking the "Report @username as Spam" button
under the Actions section of a profile's sidebar, reporting a
tweet link as spam, and blocking a suspicious user. Twitter
also provides guidelines for identifying a spammer and a list
of don'ts for its users [10].
Detection techniques for Twitter spam can be classified into
two categories:

A. Detecting Spammers (nodes)
Spammers are detected by applying machine learning
algorithms to data extracted through various data mining
techniques, using the features specified in feature sets. This
approach is carried out in three steps:
a) Crawling Twitter data and building a labelled collection:
Data about Twitter users can be crawled using different
approaches. Twitter provides several APIs, such as the
REST API, the Streaming API, and the Search API. The
data to crawl is determined by the selected feature set.
The collected data is then manually labelled as spam or
non-spam by examining each user's recent tweets and
timeline. The links in each user's tweets are examined
and checked manually. As this is a time-consuming
process, labelling is done for a small set of data, on the
basis of the selected feature set.
b) Construction of feature sets: One of the crucial and
time-consuming tasks in web spam detection systems is
feature extraction, which is usually accomplished after
crawling and during the indexing phase. If fewer features
are used to detect spam pages, computational cost is
saved and the performance of the system therefore
improves. Automated feature selection techniques from
data mining provide an effective method for selecting the
most predictive features from the many available.
After features are selected, their effectiveness can be
assessed by comparing the accuracy of classification
algorithms applied to only the selected features against
the accuracy obtained with all features. Ideally, the
smaller feature set reaches the same or a higher level of
performance.
The feature sets are classified as follows. Types of
features:

1) Graph-based Features [11]:
#friends
#followers
Reputation score [5] (#friends/#followers)
Users within a certain distance in the social graph
2) Content-based Features:
#Duplicate Tweets
#HTTP Links
#Replies and Mentions
#Trending Topics
3) URL-based Features:
#tweets containing spammy URLs
fraction of tweets containing spammy URLs
#spammy URLs
#unique URLs
Various other types of feature sets, e.g. user-based,
tweet-based, timeline-based and neighbourhood-based
features, are used depending on the requirements of the
detection system.
c) Classifying spammers and non-spammers using machine
learning algorithms: After selecting features, various
classification algorithms are applied to the dataset and
their performance is measured. The results are used to
compare the effectiveness of different feature selection
methods. The classification task can use algorithms such
as neural networks, Support Vector Machines (SVM), the
Naive Bayesian classifier, decision trees, and logistic
regression. The performance of the detection process
depends on the right combination of feature sets and
machine learning algorithm.
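
The three steps above can be sketched end to end on a toy
scale. The example below is only an illustration under stated
assumptions: the labelled accounts are invented rather than
crawled, the URL blacklist is hypothetical, only three of the
listed features are computed, and a small hand-written
Gaussian Naive Bayes stands in for the classifiers named in
step c). The variance floor of 1e-3 is likewise an arbitrary
smoothing choice.

```python
import math
import re
from collections import defaultdict

URL_RE = re.compile(r"https?://\S+")


def extract_features(friends, followers, tweets, blacklist):
    """Turn one account into a numeric feature vector."""
    n = max(len(tweets), 1)
    spammy = sum(any(u in blacklist for u in URL_RE.findall(t)) for t in tweets)
    return [
        friends / max(followers, 1),           # reputation: #friends / #followers
        (len(tweets) - len(set(tweets))) / n,  # fraction of duplicate tweets
        spammy / n,                            # fraction of tweets with spammy URLs
    ]


def fit_gaussian_nb(X, y):
    """Estimate the prior, feature means and feature variances of each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-3  # floor keeps variance > 0
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict(model, row):
    """Return the label with the highest log-posterior for one feature vector."""
    def log_posterior(params):
        prior, means, variances = params
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=lambda label: log_posterior(model[label]))


BLACKLIST = {"http://spam.example/a"}  # hypothetical blacklist
labelled = [  # (friends, followers, tweets, manual label): invented toy data
    (2000, 100, ["Buy http://spam.example/a"] * 4, "spammer"),
    (1500, 60, ["Buy http://spam.example/a"] * 2
     + ["hi", "Deal http://spam.example/a"], "spammer"),
    (1800, 90, ["Win http://spam.example/a"] * 3 + ["hello"], "spammer"),
    (150, 200, ["good morning", "lunch", "see http://ok.example/b", "bye"],
     "legitimate"),
    (300, 250, ["nice day", "reading", "coffee", "match tonight"], "legitimate"),
    (80, 120, ["hello", "post http://ok.example/c", "thanks", "weekend"],
     "legitimate"),
]
X = [extract_features(f, fo, tw, BLACKLIST) for f, fo, tw, _ in labelled]
y = [label for *_, label in labelled]
model = fit_gaussian_nb(X, y)

new_spam = extract_features(
    1900, 80, ["Sale http://spam.example/a"] * 3 + ["ok"], BLACKLIST)
new_legit = extract_features(
    120, 150, ["morning", "news http://ok.example/d", "bye"], BLACKLIST)
```

On this toy data predict(model, new_spam) returns "spammer"
and predict(model, new_legit) returns "legitimate". With six
hand-made accounts this only shows the mechanics of the
pipeline and says nothing about accuracy on real data.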

B. Detecting Spam (tweets, links)
In this paper we focus on detecting spammers; detecting spam
links falls under the class of web spam detection.


EVASION TACTICS
In spite of the efforts of many researchers and of Twitter to
detect and avoid spam (as discussed in section V), spammers
adopt evasion tactics to escape these detection methods. This
section discusses the classification of these tactics and the
methods adopted for evasion.

A. Evasion Techniques
The main evasion tactics utilized by spammers to evade
existing detection approaches can be categorized into the
following two types:
a) Profile-based Feature Evasion Tactics:
A common intuition for discovering Twitter spam
accounts comes from an account's basic profile
information, such as the number of followers and the
number of tweets, since these indicators usually reflect
the account's reputation. To evade such profile-based
detection features, spammers mainly utilize tactics
including gaining more followers and posting more
tweets.
Gaining More Followers: In general, the popularity of a
user can be measured through the number of followers of
that account. A higher number of followers commonly
implies that more users trust the account and would like
to receive information from it. Thus, many profile-based
detection features, such as number of followers, fofo
ratio (the ratio of the number of accounts an account is
following to the number of its followers) and reputation
score, are built
based on this number (number of followers). To evade
these features or break through Twitter's 2,000
Following Limit Policy [8], spammers mainly adopt the
following strategies to gain more followers. The first
approach is to purchase followers from websites. These
websites charge a fee and then use a group of Twitter
accounts to follow their customers. The specific methods
of providing these accounts may differ from site to site.
The second approach is to exchange followers with other
users. This method is usually assisted by a third-party
website, which uses existing customers' accounts to
follow new customers' accounts. Since this method only
requires Twitter accounts to follow several other
accounts to gain more followers without paying any fee,
Twitter spammers can get around the referral clause by
creating more fake accounts. In addition, Twitter
spammers can gain followers by using fake accounts they
have created themselves: they create a number of fake
accounts and then follow their spam accounts with them.
Figure 3 shows an existing website from which users can
directly buy followers.



Figure 3: Example of twitter followers online trading
websites [6]
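
The effect of purchased followers on the fofo ratio can be
shown with a short calculation. The account sizes below are
hypothetical and chosen only to make the arithmetic easy to
follow.

```python
def fofo_ratio(following, followers):
    """Ratio of the number of accounts followed to the number of followers."""
    return following / followers if followers else float("inf")


# Hypothetical spam account before and after buying 1,950 followers.
before = fofo_ratio(following=2000, followers=50)
after = fofo_ratio(following=2000, followers=50 + 1950)
```

The ratio drops from 40.0 to 1.0, so a detector that relies on
a high fofo ratio alone would no longer separate this account
from a legitimate one with balanced follow counts.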

Posting More Tweets: Tweet-based features are also
widely used in existing Twitter spammer detection
approaches. To evade them, spammers can post more
tweets at regular intervals to behave more like
legitimate accounts, typically by using public tweeting
tools or software.
b) Content-based Feature Evasion Tactics:
The percentage of tweets containing URLs is an
effective indicator of spam accounts, and is utilized in
work such as [9]. Many existing approaches design
content-based features such as tweet similarity (the
number of posted tweets having similar semantic meaning)
[6] and duplicate tweet count (the number of duplicate
tweets posted) [6] to detect spam accounts. To evade such
content-based detection features, spammers mainly
utilize tactics including mixing normal tweets and
posting heterogeneous tweets.
Mixing Normal Tweets: Spammers can utilize this
tactic to evade content-based features such as URL ratio
(the ratio of the number of posted tweets that contain a
link to the total number of tweets posted), unique URL
ratio (the ratio of the number of unique URLs posted to
the total number of URLs posted), and hashtag ratio (the
ratio of tweets containing hashtags to the total number
of tweets posted) [9]. By using this tactic, spammers
dilute their spam tweets and make their accounts more
difficult to distinguish from legitimate accounts.
Figure 4 shows an instance of such a scenario.


Figure 4: Instance of mixing Normal Tweets.
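
The dilution effect can be illustrated by computing the three
ratios defined above before and after normal tweets are mixed
in. This is a minimal sketch: the tweets are invented, and the
regular expressions are a simplification of real URL and
hashtag parsing.

```python
import re

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")


def content_ratios(tweets):
    """URL ratio, unique URL ratio and hashtag ratio as defined in the text."""
    n = max(len(tweets), 1)  # avoid division by zero for an empty timeline
    urls = [u for t in tweets for u in URL_RE.findall(t)]
    return {
        "url_ratio": sum(bool(URL_RE.search(t)) for t in tweets) / n,
        "unique_url_ratio": len(set(urls)) / len(urls) if urls else 0.0,
        "hashtag_ratio": sum(bool(HASHTAG_RE.search(t)) for t in tweets) / n,
    }


# Invented example timelines.
spam = ["Win a phone http://spam.example/x #free"] * 4
normal = ["Good morning everyone", "Traffic is terrible today",
          "Just had lunch", "Reading a good book"]

only_spam = content_ratios(spam)
mixed = content_ratios(spam + normal)
```

Mixing in four normal tweets halves the URL ratio and the
hashtag ratio (from 1.0 to 0.5), while the unique URL ratio
stays at 0.25 because the same link is still repeated.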

Posting Heterogeneous Tweets: Spammers can post
heterogeneous tweets to evade content-based features
such as tweet similarity [9] and duplicate tweet count
[9]. Spammers can utilize public tools to convert a few
different spam tweets into hundreds of variant tweets
that carry the same semantic meaning in different words.
Figure 5 shows an online tool that does this job.


Figure 5: Scenario of posting heterogeneous tweets and Spin
Bot [6].
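
The sketch below illustrates why reworded tweets weaken these
two features. It uses an exact-match duplicate count and, as a
stand-in for semantic similarity, a simple word-overlap
(Jaccard) measure; that substitution and the example tweets
are assumptions made for illustration.

```python
from itertools import combinations


def duplicate_count(tweets):
    """Number of tweets that exactly repeat an earlier tweet."""
    return len(tweets) - len(set(tweets))


def jaccard(a, b):
    """Word-set overlap between two tweets, in the range [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def mean_pairwise_similarity(tweets):
    """Average Jaccard similarity over all pairs of tweets."""
    pairs = list(combinations(tweets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0


# Invented example: the same spam message posted verbatim and after rewording.
original = ["Get cheap followers now at http://spam.example"] * 3
spun = [
    "Get cheap followers now at http://spam.example",
    "Obtain low-cost fans today at http://spam.example",
    "Buy inexpensive followers immediately at http://spam.example",
]
```

The verbatim tweets give a duplicate count of 2 and a mean
similarity of 1.0, whereas the reworded set gives a duplicate
count of 0 and a much lower word-overlap score, even though
all three tweets advertise the same link.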


CONCLUSIONS AND FUTURE WORK
In this paper we have categorised and discussed various types
of spamming techniques, the general approach to detecting
spammers, and the categories of evasion tactics that target
the features used by
detectors. With the spamming techniques and detection
methods explained in the previous sections, one should be
able to:
1. Identify instances of spam,
2. Prevent spammers, and
3. Counterbalance the effect of spamming.
Detecting spam and counterbalancing its impact is a
never-ending story; it is just a matter of how fast one can
detect spam accurately. Removing spam completely is a myth.
This work can be further extended by building an expert
system to detect spammers on Twitter using minimal and robust
feature sets that impose a cost on evasion.


ACKNOWLEDGMENT
I would like to acknowledge the contribution of the late Dr.
Anjali Sardana, Assistant Professor, Electronics and
Computers Department, IIT Roorkee, whose guidance was
indispensable throughout the course of this work.

REFERENCES

[1] K. Beck, "Analyzing Tweets to Identify Malicious
Messages," in IEEE International Conference on
Electro/Information Technology (EIT), pp. 1-5, May
15-17, 2011.
[2] H. Kwak, C. Lee, H. Park, and S. Moon, "What is
Twitter, a Social Network or a News Media?" in Int'l
World Wide Web Conference (WWW '10), 2010.
[3] M. Cha, H. Haddadi, F. Benevenuto, and K.
Gummadi, "Measuring User Influence in Twitter: The
Million Follower Fallacy," in Int'l AAAI Conference on
Weblogs and Social Media (ICWSM), 2010.
[4] Chao Yang, Robert Chandler Harkreader, and Guofei Gu,
"Die Free or Live Hard? Empirical Evaluation and
New Design for Fighting Evolving Twitter
Spammers," RAID 2011, pp. 318-337, Springer-Verlag
Berlin Heidelberg, 2011.
[5] M. McCord and M. Chuah, "Spam Detection on Twitter
Using Traditional Classifiers," in Proc. 8th
International Conference on Autonomic and Trusted
Computing, ATC 2011, pp. 175-186, Banff, Canada,
September 2-4, 2011.
[6] Alex Hai Wang, "Don't Follow Me: Spam Detection
in Twitter," in Proc. International Conference on
Security and Cryptography (SECRYPT), 2010, pp. 1-10,
July 26-28, 2010.
[7] "Social Signals and SEO - Can Facebook and Twitter
help my SEO?", http://thesemblog.com/tag/twitter/,
March 2011.
[8] DITESCO, "The 2000 Following Limit Policy On
Twitter,"
http://twittnotes.com/2009/03/2000-following-limit-on-twitter.html,
March 10, 2009.
[9] Nikita Spirin, "Mutually Reinforcing Spam Detection
on Twitter and Web," in VIII All-Russian scientific
conference "Microsoft Technologies in the theory and
practice of programming", pp. 1-7, Saint-Petersburg,
Russia, 2011.
[10] Twitter, "Twitter Rules,"
https://support.twitter.com/articles/18311-the-twitter-rules,
2012.
[11] F. Benevenuto, Gabriel Magno, Tiago Rodrigues,
and Virgílio Almeida, "Detecting Spammers on
Twitter," in Seventh Annual Collaboration, Electronic
messaging, Anti-Abuse and Spam Conference (CEAS 2010),
July 13-14, 2010, Redmond, US.
