MIM (Mobile Instant Messaging) Classification Using Term Frequency-Inverse Document Frequency (TF-IDF) and Bayesian Algorithm

International Journal of Modern Research in Engineering & Management (IJMREM)
||Volume|| 2||Issue|| 2 ||Pages|| 01-05 || February 2019|| ISSN: 2581-4540
Kashaf-u-Duja (1), Muhammad Bux Alvi (2), Tariq Jameel Saifullah Khanzada (3), Nisha Kumari (4)
(1, 4) Institute of Information and Communication Technology, Mehran University of Engineering and Technology, Jamshoro
(2) Department of Computer Systems Engineering, The Islamia University of Bahawalpur
(3) Department of Computer Systems Engineering, Mehran University of Engineering and Technology, Jamshoro
---------------------------------------------------ABSTRACT------------------------------------------------------
This study focuses on binary sentiment classification at the aspect level and develops a hybrid sentiment
classification framework for WhatsApp MIMs (Mobile Instant Messages). The work is carried out in two phases,
a training phase and a testing phase. In the training phase, 75% of the data is used as the training dataset.
Preprocessing techniques (tokenization, stop-word removal, case normalization, punctuation removal and
stemming) are applied to obtain a cleaner dataset to be used as input. After TF-IDF is applied for feature
weighting, the output is sent to the classifier. In the second phase, the classifier is tested on the remaining
25% of the data. The Bernoulli Naïve Bayesian classifier, a modified form of the traditional Naïve Bayesian
classifier, is used to classify sentiments. The dataset contains 417 messages in total, of which 244 are labeled
positive and 173 negative. The proposed model achieves an accuracy of 81.73%, about 12 percentage points
higher than the 69.23% accuracy of the baseline classification model.
KEYWORDS: Mobile Instant Messages (MIMs), Naïve Bayesian, Sentiment classification, TF-IDF,
WhatsApp
-------------------------------------------------------------------------------------------------------------------------------------------
Date of Submission: 30 January 2019 Date of Acceptance: 03 February 2019
-------------------------------------------------------------------------------------------------------------------------------------------
I. INTRODUCTION
Web development has changed human interaction and communication substantially and has prompted huge and
quick growth in user-generated data [4]. It is estimated that 95% of available data is unstructured. To
extract information and create knowledge from such raw resources, the data needs to be processed properly and
analyzed correctly, because the knowledge present in text data is not directly accessible to computers [1]. With
the striking growth of social media platforms such as Facebook, Twitter, WhatsApp and WeChat, more and more
people post texts online on different platforms to express their opinions on social issues and share their reviews
[5]. Significant consideration has focused on examining this data in terms of the sentiment it conveys, which has
resulted in the emergence of the sentiment analysis research field. It involves the computational analysis of user-
generated data, such as reviews, to determine its orientation (positive, negative or neutral). There are two main
reasons to automate sentiment analysis: first, the abundance of online data is beyond human analysis; and
second, public opinion is a significant consideration when governments, institutions, and individuals are making
decisions [4]. The use of WhatsApp text data introduces additional problems such as word shortening,
neologisms, and spelling variations. Traditional machine learning methods have proved inadequate to accomplish
the task.
To address this problem, we propose a methodology based on binary sentiment classification at the aspect level.
This work focuses on developing a hybrid sentiment classification framework for WhatsApp MIMs that combines
recursive preprocessing with machine learning to achieve higher accuracy on a closed-domain dataset of 417
messages obtained from a WhatsApp group. The dataset is labeled manually and consists of 244 positive and 173
negative opinions. Preprocessing is applied to obtain cleaner data for better accuracy, and the Naïve Bayesian
machine learning algorithm is used to develop the model and test its suitability.
www.ijmrem.com IJMREM Page 1

II. LITERATURE REVIEW

Reference [1] proposes a novel hybrid method with a recursive preprocessing approach for sentiment analysis of
online Twitter data consisting of 6090 tweets. The dataset is labeled manually with 3111 positive, 1114 negative
and 1865 neutral tweets. Multinomial Naïve Bayesian, Linear SVM and Neural Network algorithms are used to
develop different hybrid models and test their suitability. Bag-of-words, TF-IDF and N-gram are used as feature
engineering models. The hold-out splitting method is used to evaluate accuracy, with 80% of the data used for
training and 20% for testing. The model achieves 86.18% overall accuracy against a baseline accuracy of 82%.
Reference [2] compares six commonly used preprocessing techniques on two Twitter datasets for sentiment
analysis. The recommended preprocessing techniques are lemmatization, replacing repetitions of punctuation,
replacing contractions, and removing numbers. For classic machine learning sentiment analysis, the winning
combination consists of five techniques: replacing URLs and user mentions, replacing contractions, removing
numbers, replacing repetitions of punctuation, and lemmatization.
Reference [3] uses preprocessing techniques and merges 10 existing sentiment lexicons into a high-coverage
lexical resource (HCLr). Seven classifiers are evaluated, among which SVM performs best with 34.16%. The
second-best classifier is boosted Naïve Bayesian, with an overall accuracy of 30.61%.
Reference [4] proposes a two-phase hybrid method. The first phase, contextual analysis, consists of
preprocessing techniques, while the second phase, ensemble clustering, consists of feature extraction and
unsupervised machine learning algorithms. The sentiment lexicon SentiWordNet 3.0 is used to measure the
strength of each term’s polarity. Applying the contextual analysis procedures increased the accuracy rate by an
average of 3.0%. Feature weighting schemes, including TF-IDF, enhance the performance by 5-20%.
III. METHODOLOGY
Fig. 1 shows the methodology used in this paper, which comprises 12 steps that are explained below.
Figure 1 MIMs classification model
Data Collection: We created a WhatsApp group named “Internet; Good or Bad” consisting of 15 members. A total
of 417 messages were collected and manually labeled, 244 as “Favor” and 173 as “Against”. A copy of the group
chat history was extracted using the email chat feature as a “.txt” document, which was then converted into a
“.csv” file for use [8].
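The conversion from the exported text file to CSV can be sketched as follows. This is a minimal illustration, not the procedure used in this study: the line layout "<date>, <time> - <sender>: <message>", the regular expression, and the function names parse_chat and to_csv are all assumptions, and real exports vary by device and locale.

```python
import csv
import io
import re

# Assumed export layout: "<date>, <time> - <sender>: <message>".
LINE_RE = re.compile(
    r"^(?P<date>[^,]+), (?P<time>[^-]+) - (?P<sender>[^:]+): (?P<text>.*)$"
)

def parse_chat(lines):
    """Turn exported chat lines into a list of row dictionaries."""
    rows = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            rows.append({k: v.strip() for k, v in match.groupdict().items()})
        elif rows:
            # A line without a header continues the previous message.
            rows[-1]["text"] += " " + line.strip()
    return rows

def to_csv(rows):
    """Serialize parsed rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["date", "time", "sender", "text"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical sample lines, not taken from the study's dataset.
sample = [
    "1/5/19, 9:15 PM - Member A: Internet is useful for learning.",
    "1/5/19, 9:17 PM - Member B: It wastes a lot of time",
    "and distracts students.",
]
rows = parse_chat(sample)
csv_text = to_csv(rows)
```

A label column ("Favor" or "Against") would still have to be added by hand, since the labeling in this study is manual.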
Tokenization: The process of breaking down the corpus into individual elements [6]. It is also termed word
segmentation [1].

Figure 2 MIM after tokenization
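As an illustration of this step, a message can be split into word and punctuation tokens with a regular expression. This is only a sketch: the pattern and the function name tokenize are assumptions, and the sample message is hypothetical.

```python
import re

def tokenize(message):
    """Split a message into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", message)

tokens = tokenize("Internet is good, but it wastes time!")
print(tokens)
```

Punctuation marks are kept as separate tokens here so that a later step can remove them.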
Removing Stop Words: Stop words are words that appear commonly in text but carry little meaning, such
as so, and, or, the … [2]. There are 153 English-language stop words that need to be removed because they
are insignificant in most datasets [1].
Figure 3 MIM after removing stop words
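Stop-word filtering can be sketched as a set lookup. The set below is a small illustrative subset chosen for this example, not the 153-word list referred to above, and the function name remove_stop_words is hypothetical. The comparison is done on the lower-cased token because case normalization comes later in the pipeline.

```python
# Illustrative subset only; the full stop-word list is not reproduced here.
STOP_WORDS = {"so", "and", "or", "the", "is", "it", "but", "a"}

def remove_stop_words(tokens):
    """Drop tokens whose lower-cased form is a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

filtered = remove_stop_words(
    ["Internet", "is", "good", ",", "but", "it", "wastes", "time", "!"]
)
print(filtered)
```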
Case Normalization: An irreversible process that converts the terms into lower case [1].
Figure 4 MIM after case normalization
Removing Punctuation: A classic technique in information retrieval and data mining that removes punctuation
marks from the text [2].
Figure 5 MIM after removing punctuation
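Case normalization and punctuation removal can be sketched together with the standard library. The function names are hypothetical, and only the ASCII punctuation in string.punctuation is handled; emoji and non-ASCII symbols, which are common in chat data, would need extra treatment.

```python
import string

def normalize_case(tokens):
    """Convert every token to lower case."""
    return [t.lower() for t in tokens]

def remove_punctuation(tokens):
    """Strip ASCII punctuation and drop tokens that become empty."""
    table = str.maketrans("", "", string.punctuation)
    stripped = (t.translate(table) for t in tokens)
    return [t for t in stripped if t]

cleaned = remove_punctuation(
    normalize_case(["Internet", "good", ",", "wastes", "time", "!"])
)
print(cleaned)
```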
Stemming: Converts each word into its root form. It is effective for polarity detection [1] and generally yields
good results [2].
Figure 6 MIM after stemming
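The idea of stemming can be illustrated with a deliberately crude suffix stripper. This is not the Porter algorithm or any stemmer named in this paper. The suffix list, the minimum stem length and the function name crude_stem are assumptions made for the example.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def crude_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

stems = [crude_stem(w) for w in ["internet", "good", "wastes", "time", "learning"]]
print(stems)
```

In practice a library stemmer would be preferable, since rule sets like this one over-strip or under-strip many words.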
Term Frequency-Inverse Document Frequency (TF-IDF): A commonly used scoring scheme that evaluates the
importance of a token in a document and ultimately in the given dataset. It can be used to remove stop words,
punctuation, and the most frequent and least frequent tokens successfully [1]. Term frequency measures how
frequently a term occurs in a document. The inverse document frequency factor decreases the weight of terms
that occur very frequently in the document set and increases the weight of terms that occur rarely [7].
Mathematically [4],
$$\mathrm{tfidf}_{i,j} = tf_{i,j} \times idf_i \qquad (1)$$
Where,
▪ $tf_{i,j}$ is the normalized term frequency of term i in document j
▪ $idf_i$ is the inverse document frequency of term i.
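Equation (1) can be sketched in a few lines of Python. The exact normalization and logarithm are not specified in this paper, so the sketch assumes tf is the term count divided by the document length and idf is the natural logarithm of N/df, where N is the number of documents and df the number of documents containing the term. The function name tf_idf and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dictionary per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            # Assumed variant: (count / length) * ln(N / df).
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["internet", "good", "learn"],
    ["internet", "wast", "time"],
    ["internet", "good"],
]
weights = tf_idf(docs)
```

Under this variant a term that occurs in every document, such as "internet" above, receives weight zero, while a term confined to one document receives the largest weight.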
Bernoulli’s Naïve Bayesian: Naïve Bayesian is a probabilistic classifier, widely used in sentiment classification,
that applies Bayes' theorem to calculate the probability of a data sample belonging to a specific class. The
classifier assumes that all features are completely independent of each other [3]. The probability of a sample
belonging to a class can be computed using the following formula.

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \qquad (2)$$
Where,
▪ P (c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
▪ P (c) is the prior probability of class.
▪ P (x|c) is the likelihood which is the probability of predictor given class.
▪ P (x) is the prior probability of predictor.
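A small worked example of Equation (2) follows. The class priors are taken from the label counts of the dataset (244 of 417 and 173 of 417). The two likelihood values are hypothetical numbers chosen only for illustration, and P(x) is obtained by summing over both classes.

```python
def posterior(prior, likelihood, evidence):
    """Bayes' rule: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence

# Priors from the label counts reported for the dataset.
prior_favor = 244 / 417
prior_against = 173 / 417

# Hypothetical likelihoods of one message x under each class.
like_favor = 0.02
like_against = 0.05

# P(x) by the law of total probability.
evidence = like_favor * prior_favor + like_against * prior_against

post_favor = posterior(prior_favor, like_favor, evidence)
post_against = posterior(prior_against, like_against, evidence)
print(round(post_favor, 3), round(post_against, 3))
```

Although "favor" has the larger prior, the higher assumed likelihood under "against" makes that class the more probable one for this message.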
The Bernoulli Naïve Bayesian algorithm is a modified form of traditional Naïve Bayesian, where the weight of
each term is equal to 1 if it exists in the sentence and 0 if not [2].
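The Bernoulli variant can be sketched from scratch as below. This is an illustrative implementation under stated assumptions, not the code used in this study: Laplace smoothing with alpha = 1 is assumed, each message is reduced to the set of terms it contains, and the function names and toy messages are hypothetical.

```python
import math

def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Estimate class priors and per-class term presence probabilities."""
    vocab = sorted({t for d in docs for t in d})
    model = {"vocab": vocab, "prior": {}, "prob": {}}
    for c in sorted(set(labels)):
        class_docs = [set(d) for d, y in zip(docs, labels) if y == c]
        model["prior"][c] = len(class_docs) / len(docs)
        # Smoothed probability that a term is present in a class-c message.
        model["prob"][c] = {
            t: (sum(t in d for d in class_docs) + alpha)
            / (len(class_docs) + 2 * alpha)
            for t in vocab
        }
    return model

def predict(model, doc):
    """Return the class with the highest log posterior score."""
    present = set(doc)
    scores = {}
    for c, prior in model["prior"].items():
        score = math.log(prior)
        for t in model["vocab"]:
            p = model["prob"][c][t]
            # Present terms contribute p, absent terms contribute (1 - p).
            score += math.log(p if t in present else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy training data.
train_docs = [
    ["internet", "good", "learn"],
    ["internet", "help", "learn"],
    ["internet", "wast", "time"],
    ["internet", "bad", "time"],
]
train_labels = ["favor", "favor", "against", "against"]
model = train_bernoulli_nb(train_docs, train_labels)
print(predict(model, ["good", "learn"]), predict(model, ["wast", "time"]))
```

Because only presence or absence is used, any positive TF-IDF weight is treated as 1 by a classifier of this kind, and the absence of a term also contributes evidence.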
IV. TOOLS AND TECHNOLOGIES

Python 3.0 (Anaconda Python Distribution) is used to obtain the results of the model. The Python libraries
NumPy (Numerical Python), NLTK (Natural Language Toolkit), scikit-learn and Matplotlib are used for
scientific computing (arrays, mathematical calculations), preprocessing, machine learning and plotting
(graphs etc.) respectively.
Figure 7 Tools and technologies used
V. RESULTS AND DISCUSSION

Hold-out splitting is used to evaluate the accuracy of the proposed model, with 75% of the data used for training
and 25% used for testing the classifier. The model attains an accuracy of 81.73% against a baseline accuracy of
69.23%. The results show that the proposed hybrid binary sentiment classification model with preprocessing
techniques achieves satisfactory results, with an accuracy about 12 percentage points higher than the baseline.
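The hold-out evaluation can be sketched as follows. The shuffling seed, the rounding of the test-set size upwards, and the function names are assumptions. The figure of 413 messages assumes the split is made after the 4 repeated messages are removed, which is not stated explicitly in this paper.

```python
import math
import random

def hold_out_split(items, test_fraction=0.25, seed=0):
    """Shuffle the items and return (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Assumed: 417 messages minus 4 repeats, identified here by index only.
train, test = hold_out_split(range(413))
print(len(train), len(test))

# Inference, not a reported figure: with 104 test messages, 85 and 72
# correct predictions give the two accuracies quoted in the text.
proposed = round(100 * 85 / 104, 2)
baseline = round(100 * 72 / 104, 2)
print(proposed, baseline)
```

Under these assumptions the test set holds 104 messages, and the reported accuracies are consistent with 85 and 72 correct predictions respectively, a difference of 12.5 percentage points.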
In Fig. 8, GRAPH 1(a) shows the message count: there were 417 messages in total, of which 244 are labeled as
favor and 173 as against. GRAPH 1(b) shows the counts after the elimination of 4 repeated messages, which left
240 favor messages.
Figure 8 Impact of preprocessing at message level
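Removing repeated messages can be sketched as an order-preserving de-duplication. How repeats were identified in this study is not described, so the sketch assumes two messages are repeats when their text is identical after trimming whitespace and lower-casing. The function name drop_repeats and the sample messages are hypothetical.

```python
def drop_repeats(messages):
    """Keep the first occurrence of each message text, preserving order."""
    seen = set()
    unique = []
    for label, text in messages:
        key = text.strip().lower()  # assumed notion of "repeated"
        if key not in seen:
            seen.add(key)
            unique.append((label, text))
    return unique

# Hypothetical labeled messages.
messages = [
    ("Favor", "Internet helps learning"),
    ("Against", "It wastes time"),
    ("Favor", "internet helps learning "),
]
unique = drop_repeats(messages)
print(len(messages), len(unique))
```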

In Fig. 9, GRAPH 2(a) shows the results before preprocessing, while GRAPH 2(b) shows the results after
preprocessing. The preprocessing techniques trim lengthy and verbose messages down to the important, useful
tokens, yielding a cleaner dataset and better results.
Figure 9 Impact of preprocessing at token level
VI. CONCLUSION AND FUTURE WORK

The proposed model is based on binary sentiment classification at the aspect level and develops a hybrid
sentiment classification framework with preprocessing techniques to process a WhatsApp MIM dataset. A machine
learning technique is used to develop a sentiment classification model with the TF-IDF feature weighting
scheme. The model attains satisfactory results compared to the baseline accuracy. For future work, it is
suggested to enlarge the dataset, since more data is expected to yield better accuracy. Furthermore, more
preprocessing techniques could be applied in a well-ordered winning combination to extract significant features
for sentiment classification.
REFERENCES
[1] Alvi, M.B., Mahoto, A.N., Alvi, M., Unar, M.A., & Shaikh, M.A., Hybrid Classification Model for Twitter
Data - A Recursive Preprocessing Approach, 5th International Multi-topic ICT Conference (IMTIC),
2018, 1-6
[2] Symeonidis, S., Effrosynidis, D., & Arampatzis, A., A comparative evaluation of pre-processing
techniques and their interactions for twitter sentiment analysis, Expert Systems with Applications, 110,
2018, 298-310
[3] Abdi, A., Shamsuddin, S. M., Hasan, S., & MD, J. P., Machine learning-based multi-documents
sentiment-oriented summarization using linguistic treatment, Expert Systems with Applications, 109,
2018, 66-85
[4] Al-Sharuee, M. T., Liu, F., & Pratama, M., Sentiment analysis: An automatic contextual analysis and
ensemble clustering approach and comparison, Data and Knowledge Engineering, 115, 2018, 194-213
[5] Liu, Y., Bi, J.W., & Fan, Z.P., Multi-class sentiment classification: The experimental comparisons of
feature selection and machine learning algorithms, Expert Systems with Applications, 80, 2017, 323-339
[6] Faraz, A., An elaboration of text categorization and automatic text classification through mathematical
and graphical modeling, An International Journal (CSEIJ), 5(2), 2015, 239-248
[7] Ahmed, I., Guan, D., & Chung, C.T., SMS Classification Based on Naïve Bayes Classifier and Apriori
Algorithm Frequent Itemset, International Journal of Machine Learning and Computing, 4(2), 2014
[8] Patil, S., WhatsApp Group Data Analysis with R, International Journal of Computer Applications,
154(4), 2016, 31-36
[9] Tang, Y., & Hew, K.F., Is mobile instant messaging (MIM) useful in education? Examining its
technological, pedagogical, and social affordances, Educational Research Review, 21, 2017, 85-104
[10] Appel, O., Chiclana, F., Carter, J., & Fujita, H., A hybrid approach to the sentiment analysis problem at
the sentence level, Knowledge-Based Systems, 108, 2016, 110-124
[11] Katz, G., Ofek, N., & Shapira, B., ConSent: Context-based sentiment analysis, Knowledge-Based
Systems, 84, 2015, 162-178
[12] Fersini, E., Messina, E., & Pozzi, F. A., Sentiment analysis: Bayesian Ensemble Learning, Decision
Support Systems, 68, 2014, 26-38
