
Spammer Detection and Fake User Identification on Social Networks

Abstract
Social networking sites host millions of users worldwide. Users' interaction with these sites, such as Twitter and Facebook, has a profound effect on daily life, and sometimes a negative one. Large social networks have become a platform for malicious parties to disseminate huge amounts of misinformation. Twitter, for example, is one of the most commonly used platforms and therefore attracts a disproportionate volume of spam. Fake users send unsolicited tweets promoting services or websites, which not only affects legitimate users but also interferes with normal use of the service. The spam-detection strategies surveyed here are therefore assessed by their ability to find spammers and fake users, and the presented strategies are compared on a variety of factors, such as user features, content features, graph features, structure features, and time features. We hope this study will be a valuable resource for researchers seeking an overview of recent developments in Twitter spam detection. This project identifies spammers and fake users using machine learning techniques: several algorithms are applied to our datasets, and the algorithm with the best precision and accuracy is selected for spam detection.

Introduction
The term social media has taken the Internet by storm for more than a decade. It is the fastest-growing technology since its inception and is, above all, a medium that allows interaction. Web 2.0 is the technology that underlies social media, and widely used platforms include Facebook, Twitter, blogs, wikis, LinkedIn, and Instagram. Social media is at the top of the agenda for many business executives today. The general idea of our proposed framework is to model a given review dataset as a Heterogeneous Information Network (HIN) and to map spam detection onto a HIN classification problem. In particular, we model the review dataset as a HIN in which reviews are connected through different node types (such as features and users). A weighting algorithm is then employed to calculate each feature's importance (or weight), and these weights are used to compute the final labels for reviews using both unsupervised and supervised approaches. We propose the NetSpam framework, a novel network-based approach that models review networks as heterogeneous information networks. Its classification step uses different metapath types, which is new in the spam-detection domain. A weighting method for spam features determines the relative importance of each feature and shows how effective each one is at separating spam from normal reviews. NetSpam also improves on the state of the art in terms of time complexity, which depends strongly on the number of features used to identify a spam review; concentrating on the features with the largest weights therefore detects fake reviews more easily and with lower time complexity.
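The following is a minimal illustrative sketch of this feature-weighting idea, not the actual NetSpam implementation: each spam feature is weighted by how well it separates labelled spam from genuine reviews, and a review's score is the weighted combination of its feature values. The feature names and toy data are assumptions for illustration only.

# Minimal sketch (not the actual NetSpam algorithm): weight each spam feature
# by how strongly it separates labelled spam from genuine reviews, then score
# each review as the weighted combination of its feature values.
import numpy as np

def feature_weights(X, y):
    """Weight each feature by the gap between its mean value on spam
    (y == 1) and genuine (y == 0) reviews, normalised to sum to 1."""
    gaps = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    return gaps / gaps.sum()

def spam_scores(X, weights):
    """Combine the (0..1 scaled) feature values into one score per review."""
    return X @ weights

# Toy example: rows are reviews, columns are hypothetical features such as
# rating deviation, posting burstiness, and review-length extremity.
X = np.array([[0.9, 0.8, 0.7],
              [0.1, 0.2, 0.3],
              [0.8, 0.9, 0.6]])
y = np.array([1, 0, 1])            # 1 = spam, 0 = genuine (toy labels)

w = feature_weights(X, y)
print(spam_scores(X, w))           # higher score = more likely spam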

Literature review
In recent years, review spam detection has received significant attention in both business
and academia due to the potential impact fake reviews can have on consumer behaviour and
purchasing decisions. This survey covers machine learning techniques and approaches that have
been proposed for the detection of online spam reviews. Supervised learning is the most
frequent machine learning approach for performing review spam detection; however, obtaining
labelled reviews for training is difficult and manual identification of fake reviews has poor
accuracy. This has led to many experiments using synthetic or small datasets. Features extracted
from review text (e.g., bag of words, POS tags) are often used to train spam detection classifiers.
An alternative approach is to extract features related to the metadata of the review, or features
associated with the behavior of the users who write the reviews. Disparities in the performance of classifiers on different datasets suggest that review spam detection may benefit from additional cross-domain experiments to help develop more robust classifiers. Multiple experiments have shown that incorporating several types of features can yield higher classifier performance than using any single type of feature.
One of the most notable observations of current research is that experiments should use real-world data whenever possible. Despite being used in many studies, synthetic or artificially generated datasets have been shown to give a poor indication of performance on real-world data. As it is difficult to procure accurately labelled real-world datasets, unsupervised and semi-supervised methods are of interest. While unsupervised and semi-supervised methods are currently unable to match the performance of supervised learning methods, research is limited and results are inconclusive, warranting further investigation. One possibility for a less labour-intensive means of generating labelled training data is to find and label duplicate reviews, as multiple studies have shown that duplication, or near duplication, of review content is a strong indicator of review spam. Another
data related concern is that real world data may be highly class imbalanced, as there are
currently many more truthful than fake reviews online. This could be addressed through data
sampling and ensemble learning techniques. A final concern related to quality of data is the
presence of noise, particularly class noise due to mislabeled instances. Ensemble methods, and
experiments with different levels of class noise, could be used to evaluate the impact of noise on
performance and how its effects may be reduced.
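As one illustration of the data-sampling idea mentioned above, the following minimal sketch randomly undersamples the majority (truthful) class so that both classes are equally represented; the arrays are toy data and the approach is only one of several possible remedies.

# Minimal class-imbalance sketch: randomly undersample the majority class.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((1000, 5))                 # 1000 reviews, 5 toy features
y = np.array([0] * 950 + [1] * 50)        # 0 = truthful, 1 = fake (imbalanced)

fake_idx = np.where(y == 1)[0]
truthful_idx = rng.choice(np.where(y == 0)[0], size=len(fake_idx), replace=False)
balanced_idx = np.concatenate([fake_idx, truthful_idx])

X_bal, y_bal = X[balanced_idx], y[balanced_idx]
print(np.bincount(y_bal))                 # [50 50]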
As review text is an important source of information and tens of thousands of text
features can easily be generated based on this text, high dimensionality can be an issue.
Additionally, millions of reviews are available to be used to train classifiers, and training
classifiers from a large, highly dimensional dataset is computationally expensive and potentially
impractical. Despite this, feature selection techniques have received little attention. Many
experiments have avoided the issue by extracting only a small number of features, avoiding the
use of n-grams, or by limiting the number of features through alternative means such as using
term frequencies to determine what n-grams are included as features. Further work needs to be
conducted to establish how many features are required and what types of features are the most
beneficial. Feature selection should not be considered optional when training a classifier in a big
data domain with potential for high feature dimensionality. Additionally, we could find no
studies that incorporated distributed or streaming implementations for learning from Big Data
into their spam detection frameworks.
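As an illustration of limiting the number of text features, the following minimal sketch keeps only the k most informative n-gram count features according to a chi-squared test; the documents, labels, and value of k are assumptions for illustration only.

# Minimal feature-selection sketch: n-gram counts filtered by a chi-squared test.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["best hotel ever amazing stay", "great great great best best",
        "room was clean staff polite", "breakfast was average but fine"]
labels = [1, 1, 0, 0]                     # 1 = fake review, 0 = genuine (toy)

X = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs)
X_reduced = SelectKBest(chi2, k=10).fit_transform(X, labels)
print(X.shape, "->", X_reduced.shape)     # fewer columns after selection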

Problem statement
The profile data in social networks consists of two main parts, static and dynamic. The former is information set explicitly by the user, while the latter is observed by the system and results from the user's activity on the social network. Static data typically includes users' demographics and interests, while dynamic data relates to user activity and position in the social network. Most existing research solutions depend on both static and dynamic data, which makes them inapplicable to social networks that expose only a small number of static profile fields and no dynamic profile details to the public. Because of such privacy policies and very restricted information visibility, none of the existing practical or theoretical means of fake-profile detection can be applied directly. Therefore, the goal of this research is to identify an approach for detecting spammers and fake profiles in social networks.

Methodology

Data preprocessing:

Real-world datasets are usually very large, with many rows and columns, but the data is not always tabular: it can come in many forms, such as images, audio and video files, or structured tables. A machine cannot interpret images, video, or raw text directly; it only understands 1s and 0s, so the data must first be converted into a suitable numerical form.
Steps in data preprocessing:
Data cleaning: filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies.
Data integration: combining several databases, data files, or data sets.
Data transformation: performing aggregation and normalization so that values are scaled to a common range.
Data reduction: producing a much smaller summary of the dataset that still yields essentially the same analytical results.
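A minimal sketch of these preprocessing steps with pandas and scikit-learn is given below; the file name users.csv and the column names (followers, tweets, account_age_days) are assumptions for illustration and are not the project's actual schema.

# Illustrative preprocessing sketch; file and column names are assumed.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

df = pd.read_csv("users.csv")             # hypothetical dataset

# Data cleaning: fill missing numeric values and drop extreme outliers.
df = df.fillna(df.median(numeric_only=True))
df = df[df["tweets"] < df["tweets"].quantile(0.99)].copy()

# Data transformation: scale the numeric features to a common 0..1 range.
numeric_cols = ["followers", "tweets", "account_age_days"]
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])

# Data reduction: project the scaled features onto fewer dimensions.
reduced = PCA(n_components=2).fit_transform(df[numeric_cols])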
Stop words:
Stop words are common English words that add little meaning to a sentence; they can be safely ignored without losing the sense of the sentence. For example, given the query "how to make a veg cheese sandwich", a search engine that matches every term would look for pages containing "how", "to", "make", "a", "veg", "cheese", and "sandwich". Because "how", "to", and "a" are so common in English, matching on them returns many pages that have nothing to do with the recipe. If these words are removed (stopped) and retrieval focuses on the keywords "veg", "cheese", and "sandwich", the results are far more relevant.
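A minimal sketch of stop-word removal for this query, assuming NLTK is installed and its English stop-word list has been downloaded:

# Minimal stop-word removal sketch using NLTK's English stop-word list.
import nltk
nltk.download("stopwords", quiet=True)    # fetch the corpus if not present
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
query = "how to make a veg cheese sandwich"
keywords = [w for w in query.split() if w not in stop_words]
print(keywords)                           # ['make', 'veg', 'cheese', 'sandwich']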

Tokenization:
Tokenization is the process of splitting a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The resulting list of tokens is then used as input for further processing, such as text mining and parsing. Tokenization is useful both in linguistics (where it is a form of text segmentation) and as lexical analysis in computer science and engineering.
It is sometimes hard to define what is meant by the term "word", since tokenization happens at the word level. In practice a tokenizer relies on simple heuristics, for instance: tokens are separated by whitespace characters, such as a line break or a space, or by punctuation characters; every contiguous string of alphabetic characters forms one token, and likewise for numbers; whitespace and punctuation may or may not be included in the resulting list of tokens.
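A minimal sketch of whitespace- and punctuation-based tokenization using Python's re module; the sample sentence is illustrative.

# Minimal tokenization sketch following the heuristic described above:
# runs of letters or digits form one token, punctuation marks are kept as
# separate tokens, and whitespace only separates tokens.
import re

text = "Follow us now - win a FREE iPhone!"
tokens = re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)
print(tokens)   # ['Follow', 'us', 'now', '-', 'win', 'a', 'FREE', 'iPhone', '!']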

Bag of words
Bag of Words (BOW) is a method of extracting features from text documents; these features can then be used to train machine learning algorithms. Bag of Words builds a vocabulary of all the unique words present in all the documents of the training dataset and represents each document by the counts of those words.
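A minimal sketch of Bag of Words using scikit-learn's CountVectorizer on two toy documents (note that the default tokenizer drops single-letter words such as "a"):

# Minimal Bag of Words sketch: learn a vocabulary and count word occurrences.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["win a free prize now", "meeting moved to friday"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)         # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray())                         # word counts per document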
Classic Classifiers
Classification is a form of data analysis that extracts models describing important data classes. A classifier, or model, is constructed to predict class labels, for example labelling a loan application as risky or safe. Data classification is a two-step process:
- a learning step, in which the classification model is constructed from labelled training data, and
- a classification step, in which the model assigns class labels to new data.
A minimal sketch of these two steps is shown below.
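The sketch uses scikit-learn; the toy loan data (income and debt in thousands) is invented purely for illustration and does not come from the project's datasets.

# The two steps of classification in a minimal scikit-learn sketch.
from sklearn.tree import DecisionTreeClassifier

X_train = [[60, 10], [25, 30], [80, 5], [20, 40]]   # [income, debt], toy values
y_train = ["safe", "risky", "safe", "risky"]

# Learning step: construct the classification model from labelled data.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Classification step: predict class labels for unseen loan applications.
print(model.predict([[70, 8], [22, 35]]))           # e.g. ['safe' 'risky']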
Naïve Bayes:
Naïve Bayes is a supervised learning algorithm based on Bayes' theorem: the probability of an event occurring in the future is estimated from how often the same event has been observed previously. The "naïve" part is the simplifying assumption that the features are independent of each other given the class. The technique is well suited to classifying spam emails because word probabilities play the main role: if a word occurs frequently in spam but rarely in ham, an email containing that word is likely to be spam. For this reason Naïve Bayes has become a standard technique for email filtering, provided the filter is trained well enough to work effectively. The classifier computes the probability of each class and outputs the class with the maximum probability. It generally gives accurate results and is used in many applications, most notably spam filtering.
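A minimal sketch of Naïve Bayes spam filtering, combining CountVectorizer word counts with scikit-learn's MultinomialNB; the tiny training set is illustrative only.

# Minimal Naive Bayes spam-filtering sketch: word counts feed a MultinomialNB
# model, which outputs the most probable class for each message.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["win a free prize now", "claim your free reward",
               "meeting moved to friday", "see you at lunch"]
train_labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["free prize waiting"]))   # expected: ['spam']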

Tools
Software tool: PyCharm 2020.2.1
Programming language: Python
Operating System: Windows

