
Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 143 (2018) 378–386
www.elsevier.com/locate/procedia
8th International Conference on Advances in Computing and Communication (ICACC-2018)
Sarcasm classification: A novel approach by using Content Based Feature Selection Method

H M Keerthi Kumar a,∗, B S Harish b

a JSS Research Foundation, JSS TI Campus, Mysuru, India
b Department of Information Science and Engineering, Sri Jayachamarajendra College of Engineering, Mysuru, India

Abstract

In recent decades, social media sites such as Twitter, Facebook, and review sites produce a huge amount of textual information posted by many users. Users tend to express their sentiment in the form of sarcastic utterances. A sarcastic utterance usually shifts the polarity of the text from negative to positive and likewise. The automatic classification of sarcastic utterances present in text is a very challenging task: it requires a system that can detect the content based text properties or features present in sarcastic utterances. In this regard, this paper proposes a novel approach to classify sarcastic text using a content based feature selection method. The proposed approach consists of a two stage feature selection method to select the most representative features. In the first stage, conventional feature selection methods such as CHI-square, Information Gain (IG) and Mutual Information (MI) are used to select a relevant feature subset. The selected feature subset is further refined in the second stage, where the k-means clustering algorithm is used to select the most representative feature among similar features. The selected features are classified using two classifiers, Support Vector Machine (SVM) and Random Forest (RF). The proposed approach outperforms existing methods in terms of Precision, Recall and F-measure on the Amazon product review dataset.

© 2018 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Selection and peer-review under responsibility of the scientific committee of the 8th International Conference on Advances in Computing and Communication (ICACC-2018).

Keywords: Sarcasm; Classification; Feature Selection; k-means clustering.

1. Introduction
In the internet era, the growing use of social media sites such as Twitter, Facebook, and review sites produces a large amount of textual information. This textual information serves as a vital source for identifying the public's or users' opinion or sentiment towards a political party, a product or an event [20]. The sentiments expressed by users take the form of positive, negative or neutral polarity. Textual information from social media sites plays a crucial role in decision support systems and for individual decision makers [7]. The process of automatically identifying the sentiment in a text is known as Sentiment Analysis (SA).
∗ Corresponding author.
E-mail address: hmkeerthikumar@gmail.com, bsharish@sjce.ac.in

1877-0509 © 2018 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (https://creativecommons.org/licenses/by-nc-nd/4.0/)
Selection and peer-review under responsibility of the scientific committee of the 8th International Conference on Advances in Computing and Communication (ICACC-2018).
10.1016/j.procs.2018.10.409

Sentiment analysis (SA) is considered a classification task, which classifies text into positive, negative or neutral polarity. Many characteristics influence the SA process on social media sites, namely: (a) character limits in blogs, (b) the use of slang words, and (c) the use of non-literal language such as irony/sarcasm, among many more. The major challenge in SA is the presence of irony/sarcasm in text, which literally shifts the polarity of the sentences. Sarcasm is a sophisticated form of sentiment which acts as an interfering factor that can flip the polarity of the text [15]. Sarcasm is often characterized as ironic content used to insult, mock, or amuse. The process of identifying and classifying sarcastic content present in a text is known as sarcasm detection/classification [2].
Recently, automatic sarcasm detection has attracted the attention of the research fraternity from both the Machine Learning (ML) and Natural Language Processing (NLP) domains [6][19]. NLP based approaches use linguistic features and lexicon corpora to understand the subjective information. On the other hand, ML approaches use supervised or unsupervised learning techniques to understand sarcastic sentences based on labeled or unlabeled text.
In sarcasm detection, linguistic and content based text properties or features play a vital role in deciding whether a text is sarcastic or not. Linguistic features concentrate more on punctuation, patterns, hyperbole, ellipsis, etc. On the other hand, content based text properties rely more on the terms or words that appear in the text. In this paper, content based text properties are used to extract more discriminative terms or features to classify a text into sarcastic or non-sarcastic content.
Social media sites contain large volumes of textual data, which exceed a human's ability to understand and handle them. ML plays an important role in discovering patterns in such high dimensional data. It is a challenging task for ML algorithms to find relevant and non-redundant data in social media sites, which exhibit the characteristics of variety, veracity and volume. While performing data preprocessing and representation, a huge number of unfamiliar terms or features are collected. These unfamiliar features may include irrelevant and redundant features, which significantly increase the computational cost of the learning process. Whenever the dimensionality of the data is high, its computational cost will also be high, which in turn tends to degrade the accuracy of the ML algorithm. Hence, the curse of dimensionality is addressed by selecting relevant and non-redundant features. The process of selecting relevant and non-redundant features is known as feature selection. Feature selection methods become more scalable, reliable and accurate by selecting discriminative features from the unfamiliar relevant features.
Many research communities [15][14][4][5][10] have concentrated their work on feature selection methods such as CHI-square [17], Information Gain (IG) [11], Mutual Information (MI) [13] and many more on social media textual data. Tripathy et al. [18] presented classification of sentiment reviews using n-gram features. The method extracts features based on uni-grams, bi-grams, tri-grams and their combinations with the Term Frequency - Inverse Document Frequency (TF-IDF) schema. The features are classified using machine learning algorithms such as Naïve Bayes (NB), Maximum Entropy (ME), Stochastic Gradient Descent (SGD), and Support Vector Machine (SVM). Mukherjee et al. [12] extracted features based on content words, function words, part-of-speech tags and their combinations to detect sarcastic content. The method tests a range of different feature sets using NB and Fuzzy C-means (FCM) to classify tweets into sarcastic or non-sarcastic content over Twitter data.
Reganti et al. [14] described automatic sarcasm detection in tweets, product reviews and newswire articles. The method extracts generic features based on a lexicon along with baseline features. Baseline features such as character n-grams, word n-grams and word skipgrams are extracted and combined with the lexicon features. The features are classified using ensemble classifiers such as Logistic Regression (LR), Decision Tree (DT) and Random Forest (RF). Buschmeier et al. [3] described the impact of features in the classification of irony in product reviews. The method extracts text features based on Bag-of-Words (BoW) features and lexicon based features, and classifies them using SVM, LR, DT, RF and NB classifiers. Similarly, Filatova [9] demonstrated sentiment flow shifts (from negative to positive and likewise) in sarcasm detection. The method captures the presence of sarcasm in product reviews and identifies the shift in polarity of the sentiment label assigned to the product reviews. The method classifies polarity using K-Nearest Neighbor (KNN), SVM (Linear and Radial Basis Function), DT, RF, AdaBoost and LR classifiers.
Chakraborty et al. [4] provide an insight into the process of learning representations or features using Deep Learning on the Word2vec model. The model uses two diverse variants, the Continuous Bag-of-Words model (CBOW) and Skip-Gram, on IMDB movie reviews. A comparative analysis is drawn on the performance of the k-means and k-means++ clustering algorithms. Similarly, Chormunge and Jena [5] proposed correlation based feature selection with clustering for high dimensional data. The method eliminates irrelevant features using the k-means clustering method, and non-redundant features are then selected from each cluster by a correlation measure. The method is compared with renowned feature selection methods such as ReliefF and Information Gain (IG) using the NB classifier. Sarkar et al. [16] presented a novel feature selection method for text classification. The method uses two layers of feature selection, based on the CHI-square feature selection method and a clustering technique, to select more discriminative features. The extracted feature subsets are classified using NB, DT, SVM and KNN classifiers.
In the literature, many researchers have used various feature selection methods to select relevant features from a high dimensional feature space. The existing work concentrates on dimensionality reduction by selecting relevant features using feature selection methods. The main objective of this paper is to present a two stage feature selection method to select the most representative features. The feature selection method selects the relevant feature subset from the high dimensional feature space. However, the selected feature subset may contain features which convey similar information. Features exhibiting similar information can be grouped, and the most representative features can be selected. For this reason, the proposed work applies a two stage feature selection method to select representative features. In the first stage, conventional feature selection methods such as CHI-square, IG and Mutual Information (MI) are used to select a relevant feature subset. The feature subset selected by the conventional methods is further refined in the second stage, where the k-means clustering algorithm is used to select the most representative feature among similar features. The selected features are classified using SVM and RF classifiers on the Amazon product review dataset [8].
The rest of the paper is organized as follows: Section 2 depicts the methodology of the proposed model. The detailed experimental results and discussion are presented in section 3. Finally, the paper concludes and states the future scope in section 4.

2. Methodology

This section gives a detailed description of the methodology of the proposed two stage feature selection method. The proposed model is used to select the more representative features to classify text into sarcastic or non-sarcastic content. The block diagram of the proposed model is presented in Fig. 1.

Fig. 1: Block diagram of proposed Feature Selection Method

The various steps involved in the proposed method are as follows. Firstly, the raw text data is preprocessed using various preprocessing techniques. Further, features are extracted and represented using a uni-gram representation with the term frequency schema. The extracted features are high dimensional and need to be reduced or refined using feature selection methods. The proposed feature selection method is applied to select the more discriminative features and uses two stages of feature selection. In the first stage, conventional feature selection methods such as CHI-square (χ2), Information Gain (IG) and Mutual Information (MI) are used to select discriminative features. The selected features are further refined in the second stage, where the k-means clustering algorithm is used to select the most representative features from the feature subset selected in the first stage. The k-means clustering algorithm is one of the simplest and most popular clustering techniques; it groups similar information into k groups or clusters. Hence, in this work the k-means clustering algorithm is used to select the most representative feature of each group. Finally, the feature subset obtained using the proposed feature selection method is classified using two classifiers, SVM and RF.
2.1. Preprocessing

During preprocessing, non-trivial and less informative terms, which do not contribute to the classification process, are eliminated. Initially, words are converted into lower case and then various preprocessing techniques are applied. In this work, the product reviews are preprocessed by eliminating stop words, digits and punctuation. The preprocessed reviews are represented using the uni-gram representation model with the term frequency (tf) schema. Let m be the number of reviews and n the number of terms or features represented in the Term Document Matrix (TDM). The entry TDMij indicates the term frequency tf of the jth feature in the ith review.
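To make this step concrete, the following is a minimal Python sketch of the preprocessing and uni-gram term-frequency representation. The paper does not publish its code; scikit-learn, the regex-based cleaning rules and the sample reviews are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: lower-casing, digit/punctuation removal, stop-word
# elimination, and a uni-gram term-frequency TDM (assumptions, see above).
import re
from sklearn.feature_extraction.text import CountVectorizer

def clean(review: str) -> str:
    review = review.lower()                    # convert to lower case
    review = re.sub(r"\d+", " ", review)       # eliminate digits
    review = re.sub(r"[^a-z\s]", " ", review)  # eliminate punctuation
    return review

reviews = ["This product is SO great that it broke on day 1!",
           "Works exactly as described. Very happy with it."]

# Uni-gram representation with raw term frequency (tf); English stop
# words are removed as part of preprocessing.
vectorizer = CountVectorizer(preprocessor=clean, stop_words="english")
TDM = vectorizer.fit_transform(reviews)        # shape: (m reviews, n features)
print(TDM.shape)
```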
2.2. Proposed Feature Selection Method

The proposed feature selection method consists of two stages. In the first stage, conventional feature selection methods are used to select relevant feature subsets: CHI-square (χ2), Information Gain (IG) and Mutual Information (MI). These feature selection methods are widely used to select relevant feature subsets in the text processing domain. The CHI-square (χ2) test is used to test whether the occurrence of a specific term and the occurrence of a specific class are independent. IG gives the information gained by knowing the value of an attribute, which is the difference between the entropy of the distribution before the split and the entropy of the distribution after the split. Similarly, MI calculates the mutual dependence of two random variables.
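For reference, the standard forms of these scores for a term t and class c, as commonly defined in the text categorization literature (the paper itself does not reproduce the formulas), are:

```latex
\chi^{2}(t,c) = \frac{N\,(AD - CB)^{2}}{(A+C)(B+D)(A+B)(C+D)},
\qquad
MI(t,c) = \log \frac{P(t,c)}{P(t)\,P(c)},

IG(t) = -\sum_{c} P(c)\log P(c)
        + P(t)\sum_{c} P(c \mid t)\log P(c \mid t)
        + P(\bar{t})\sum_{c} P(c \mid \bar{t})\log P(c \mid \bar{t}),
```

where A is the number of documents of class c containing t, B the number of documents of other classes containing t, C the number of documents of class c without t, D the number of documents of other classes without t, and N = A + B + C + D.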
The outcome of each feature selection method is a score (S) for each feature. The scores (S) are arranged in descending order so that the most relevant features, those with the highest scores, can be selected. The feature subsets are selected by fixing a threshold value (T) empirically. The feature subset selected by a conventional feature selection method is represented using a Term Document Matrix TDM (m × T), where m represents the number of reviews and T indicates the number of features selected. The selected feature subset (T) may contain features which convey similar information. To select the most representative features among the feature subset (T), the second stage feature selection method is applied. Algorithm 1 depicts the first stage feature selection method.

Algorithm 1: First stage of feature selection method

Data: Term Document Matrix TDM (m × n), m number of reviews, n total number of features, feature scores S, selected feature set F
Result: Term Document Matrix TDM (m × T)
Initialize threshold value T;
Step 1: S = FS-method[TDM]   ▷ compute a score for each feature using the CHI-square, MI or IG feature selection method
Step 2: S = {s1, s2, ..., sn}
Step 3: Sort S in descending order
Step 4: Select the first T features based on S
Step 5: F = {F1, F2, ..., FT}

The feature subset selected by the first stage is represented in a Term Document Matrix TDM (m × T), where m represents the number of reviews and T indicates the number of selected features. Further, we transpose it to TDM′ (T × m), where each row represents one of the T features and each column one of the m reviews.
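A minimal sketch of this first stage (Algorithm 1), assuming scikit-learn's chi2 scorer and a toy term-frequency matrix; mutual_info_classif could be substituted for the MI configuration, and the threshold here stands in for the T = 10000 the paper fixes empirically.

```python
# First stage (Algorithm 1): score every feature, sort the scores in
# descending order, and keep the top-T features (toy data, see above).
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
TDM = rng.integers(0, 5, size=(100, 500))   # toy tf matrix: 100 reviews x 500 features
labels = rng.integers(0, 2, size=100)       # 1 = sarcastic, 0 = non-sarcastic
T = 50                                      # threshold; the paper uses T = 10000

scores, _ = chi2(TDM, labels)               # Step 1: score S for each feature
top_T = np.argsort(scores)[::-1][:T]        # Steps 3-4: sort descending, select first T
TDM_T = TDM[:, top_T]                       # TDM (m x T)
TDM_t = TDM_T.T                             # transpose TDM' (T x m) for the second stage
```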
In the second stage, a similarity based algorithm is used to select the most representative features. In this work, the similarity based k-means clustering algorithm is used to select the most representative feature from each of the k clusters. The most representative feature of each cluster is selected as the feature nearest to its cluster center. The algorithm works iteratively, assigning each feature to one of the k clusters based on feature similarity. Determining the optimal number of clusters (k) is one of the open questions in clustering. In this work, the number of clusters k is initialized to a threshold value NC, where NC is varied from √T to T/2 as mentioned in [16]. Algorithm 2 elucidates the second stage of the feature selection method.

Algorithm 2: Second stage of feature selection method

Data: Term Document Matrix TDM′ (T × m), T total number of features, m number of reviews, number of clusters NC varied from √T to T/2
Result: Term Document Matrix TDM (m × NC)
Initialize NC cluster centers, t = 0, F = { }, V0 = {V01, V02, ..., V0NC}
repeat
    Step 1: for i ← 1 to NC do
    Step 2:     for j ← 1 to T do
                    Dij = dist(Vti, Fj)   ▷ compute the distance between cluster center Vti and feature Fj (the Euclidean norm is used)
                end
            end
    Step 3: Assign each Fj to the nearest cluster center vi
    Step 4: Update the cluster centers:
    Step 5: for i ← 1 to NC do
                vi(t+1) = ( Σj∈i Fj ) / |vi|   ▷ mean of the features assigned to cluster i; |vi| is the cluster size
            end
until ||v(t) − v(t−1)|| < ξ, t = t + 1;
Step 6: for i ← 1 to NC do
Step 7:     Fj = argminj∈i dist(vi, Fji)   ▷ find the representative feature for each cluster center
Step 8:     F = F ∪ {Fj}
        end

Further, the feature subset F, which consists of NC features, is represented in a Term Document Matrix TDM (m × NC), where m is the number of reviews and NC is the number of features (NC < T < n).
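A minimal sketch of this second stage, assuming scikit-learn's KMeans in place of the hand-rolled iteration of Algorithm 2 (KMeans performs the assignment and center-update loop internally); the toy matrix and NC value are illustrative.

```python
# Second stage (Algorithm 2): cluster the T selected features (rows of
# TDM') and keep, for each cluster, the feature nearest its centroid.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.default_rng(1)
TDM_t = rng.random((50, 100))            # toy TDM': T = 50 features x m = 100 reviews
NC = 10                                  # number of clusters; varied from sqrt(T) to T/2 in the paper

km = KMeans(n_clusters=NC, n_init=10, random_state=0).fit(TDM_t)
# Index of the feature vector closest to each cluster center (Euclidean):
rep = pairwise_distances_argmin(km.cluster_centers_, TDM_t)
TDM_NC = TDM_t[rep].T                    # final representation: (m x NC)
print(TDM_NC.shape)                      # (100, 10)
```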

2.3. Classification

In order to determine the efficacy of the proposed feature selection method, two classifiers are used: the linear Support Vector Machine (SVM) and Random Forest (RF). SVM is a statistical classification approach which can be used for both classification and regression challenges [21]. RF, on the other hand, is an automatic learning technique which combines the concepts of random subspaces and bagging [1]. These two classifiers are widely used in the classification of sarcastic content [3][14].
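As an illustration, the two classifiers could be instantiated as follows. The linear kernel and the 100 trees follow the text (sections 2.3 and 3.2); scikit-learn and all other hyper-parameters are assumptions, not the authors' stated configuration.

```python
# The two classifiers named above, as scikit-learn estimators.
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

svm_clf = LinearSVC()           # linear SVM: maximum-margin statistical classifier [21]
rf_clf = RandomForestClassifier(
    n_estimators=100,           # bagging: each tree fits a bootstrap sample [1]
    max_features="sqrt",        # random subspaces: random feature subset per split
)
```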

3. Experimental Result and Discussion

This section describes the details of the experiment conducted to evaluate the proposed feature selection method.

3.1. Dataset

The experiments are conducted on the Amazon product reviews dataset created by [8] using the crowdsourcing platform Amazon Mechanical Turk. The dataset consists of 1,254 Amazon product reviews: 437 sarcastic and 817 non-sarcastic. Each entry in the dataset contains a star rating and the review text; star ratings range from 1 star to 5 stars, and the review text is in English.
H M Keerthi Kumar et al. / Procedia Computer Science 143 (2018) 378–386 383
6 H M Keerthi Kumar and B S Harish / Procedia Computer Science 00 (2018) 000–000

3.2. Experimental setup

In this work, experiments are conducted using an 80:20 split on the Amazon product review dataset [8]. The various preprocessing techniques are applied to the dataset. Once the dataset is preprocessed, the terms are represented using unigrams with the term frequency (tf) schema, yielding 20,985 distinct features. Further, the proposed feature selection method is applied to these 20,985 distinct features. The proposed feature selection method consists of two stages. In the first stage, feature subsets of between 1,000 and 20,000 features are selected empirically; during this process, the 10,000-feature subset yields promising results compared to the other subsets. In the second stage, k-means clustering is used to select the most representative feature subset from the first stage output. The number of clusters (k) is varied from 1,000 to 5,000 as explained in section 2.2. The feature subsets obtained from the second stage are classified using the linear Support Vector Machine (SVM) and Random Forest (RF) classifiers. The number of trees in the RF classifier is set to 100 empirically. The performance of the proposed feature selection method is evaluated using classification accuracy and F-measure as metrics.
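A sketch of this evaluation protocol, with placeholder data standing in for the selected-feature matrix and labels: the 80:20 split, the 100-tree RF and the metrics follow the text, while everything else (scikit-learn, random data, default hyper-parameters) is assumed.

```python
# Evaluation sketch for Section 3.2: 80:20 train/test split, then
# accuracy and F-measure on the held-out 20% for both classifiers.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
X = rng.random((1254, 3000))              # placeholder for the m x NC selected-feature matrix
y = rng.integers(0, 2, size=1254)         # placeholder sarcasm labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
for clf in (LinearSVC(), RandomForestClassifier(n_estimators=100)):
    pred = clf.fit(X_tr, y_tr).predict(X_te)
    print(type(clf).__name__,
          "accuracy:", round(accuracy_score(y_te, pred), 3),
          "F-measure:", round(f1_score(y_te, pred), 3))
```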

3.3. Experimental results

To report the performance of the various classifiers, experiments vary the number of features from 1,000 to 20,000 using the CHI-square (χ2), Information Gain (IG) and Mutual Information (MI) feature selection methods; 10,000 features are found to yield competitive accuracy compared to the other feature set sizes. Hence, the experiments in the first stage of the feature selection method are conducted with 10,000 features. Table 1 shows the performance of the classifiers on the original Term Document Matrix (TDM) and the various feature selection methods. MI with 10,000 features gives 75.60% classification accuracy with a 0.675 F-measure, higher than the other feature selection methods, using the SVM classifier. Similarly, IG with 10,000 features gives 72.40% classification accuracy with an F-measure of 0.595 for the RF classifier.

Table 1: Performance of the classifiers on various feature selection methods

                                    SVM                      RF
Feature       No. of Features   Accuracy   F-measure   Accuracy   F-measure
TDM only      20985             68.40      0.597       69.60      0.539
CHI-square    10000             73.60      0.638       70.40      0.551
IG            10000             73.60      0.646       72.40      0.595
MI            10000             75.60      0.675       72.00      0.581

Further, the k-means clustering algorithm is applied to the 10,000 features obtained from the feature selection methods. As explained in section 2.2, the number of clusters is varied from 1,000 to 5,000. Fig. 2 depicts the classification accuracy of the proposed method for various numbers of clusters using the SVM and RF classifiers. From Fig. 2 (a), we observe that CHI-square with k-means achieves its maximum accuracy for the 4500-feature set and IG with k-means for the 5000-feature set using the SVM classifier. Similarly, MI with k-means yields the maximum accuracy overall for the 5000-feature set, compared to the other feature sets and combinations, using the SVM classifier. On the other hand, Fig. 2 (b) depicts the results of the various feature selection methods for different numbers of clusters in k-means using the RF classifier. MI with k-means yields the maximum accuracy for the 3000-feature subset, compared to CHI-square with k-means at 3000 features and IG with k-means at 4000 features, using the RF classifier.
Further, Table 2 depicts the performance of the proposed approach (feature selection + clustering) using the SVM and RF classifiers. The proposed method (MI + k-means) achieves the highest classification accuracy, 79.60% with a 0.752 F-measure, with 5000 features using the SVM classifier. On the other hand, the proposed method (MI + k-means) achieves 77.20% classification accuracy and a 0.696 F-measure with 3000 features using the RF classifier. From Table 2, the proposed method (MI + k-means) outperforms the other feature selection combinations in terms of number of features, classification accuracy and F-measure using the SVM and RF classifiers.
Table 3 summarizes the results of the proposed method in terms of the reduction in the number of features along with the improvement in classification accuracy. The proposed method (MI + k-means) exhibits an 86% reduction, to 3000 features from the original 20,985, which in turn enhances the classification accuracy by 11% using the RF classifier. On the other hand, the proposed method (MI + k-means) achieves a 76% reduction in the number of features (to 5000) with an increase in classification accuracy of 16% using the SVM classifier.

Fig. 2: Classification accuracy of proposed feature selection methods with various number of cluster using (a) SVM and (b) RF classifiers

Table 2: Comparison of the proposed method with various combinations of feature selection method

                                      SVM                                  RF
Methods                No. of Features  Accuracy  F-measure   No. of Features  Accuracy  F-measure
CHI-square + Cluster   4500             75.20     0.695       3000             74.40     0.649
IG + Cluster           5000             76.40     0.711       4000             75.20     0.668
MI + Cluster           5000             79.60     0.752       3000             77.20     0.696

Table 3: Summary of feature reduction and classification accuracy improvement

Methods                        % of Feature Reduction   % of Improvement in Classification Accuracy
SVM [CHI-square]               52%                      8%
SVM [CHI-square + Cluster]     79%                      10%
SVM [IG]                       52%                      8%
SVM [IG + Cluster]             76%                      12%
SVM [MI]                       52%                      11%
SVM [MI + Cluster]             76%                      16%
RF [CHI-square]                52%                      1%
RF [CHI-square + Cluster]      86%                      7%
RF [IG]                        52%                      4%
RF [IG + Cluster]              81%                      8%
RF [MI]                        52%                      3%
RF [MI + Cluster]              86%                      11%

3.4. Comparisons with existing methods

From the literature, Buschmeier et al. [3] and Reganti et al. [14] address sarcasm/irony detection on the product review dataset [8] using SVM and RF classifiers. Similar sets of experiments are conducted here, and the comparison results are presented in Table 4. The same dataset has been used for various purposes with different approaches [9]; hence, the proposed approach is compared with [3] and [14] in Table 4. The proposed method (MI + k-means) with the 5000-feature set exhibits the maximum performance in terms of Precision, Recall and F-measure using the SVM classifier. Similarly, the proposed method (MI + k-means) with 3000 features shows the maximum performance in Precision, Recall and F-measure using the RF classifier.
Table 4 elucidates the comparison of the proposed method with existing methods. In Reganti et al. [14], baseline features such as character n-grams, word n-grams and word skipgrams are used, but the total number of features is not stated clearly; hence, the proposed method is compared using the F-measure metric obtained from the SVM and RF classifiers. From Table 4, it can be concluded that the proposed method outperforms the existing methods in terms of Precision, Recall and F-measure using the SVM and RF classifiers.

Table 4: Performance comparison with existing methods on product review dataset

                                                              SVM                                                  RF
Methods                          Feature Representation  Total no. of features  Precision  Recall  F-measure  Total no. of features  Precision  Recall  F-measure
Buschmeier et al. [3]            Bag of words            21,773                 0.672      0.618   0.641      21,773                 0.704      0.217   0.329
Reganti et al. [14]              Baseline Features       –                      –          –       0.728      –                      –          –       0.665
Proposed method (MI + Cluster)   Unigram                 5000                   0.801      0.736   0.752      3000                   0.822      0.683   0.696

3.5. Result Analysis and Discussion

The feature selection method reduces the high dimensional feature space by selecting a relevant feature set from the original features based on scores. However, the selected feature subset may contain features which convey similar information. To select the most representative features among the selected subset, the two stage feature selection method is applied. The advantage of the proposed two stage method is that, among features conveying similar information, only the most representative ones are kept. The proposed method combines a conventional feature selection method with the similarity based k-means clustering algorithm. The performance of the proposed method (conventional feature selection with k-means) depends on the feature set selected in the first stage, which in turn depends on the feature selection algorithm used. In this experiment, the conventional feature selection methods CHI-square (χ2), Information Gain (IG) and Mutual Information (MI) are used. It is noticeable from Table 1 that the first stage, using a conventional feature selection method, reduces the high dimensional feature space, which in turn increases the classification accuracy of the classifiers. The MI and IG feature selection methods yield the maximum accuracy using the SVM and RF classifiers, respectively. CHI-square (χ2) measures the lack of independence between a specific term and a specific class. IG gives the information gained by knowing the value of an attribute, which is the difference between the entropy of the distribution before the split and the entropy of the distribution after the split. Similarly, MI calculates the mutual dependence of two random variables. In the second stage, the k-means clustering algorithm groups similar features into a specified number of clusters. Here, features are grouped based on their similarities, so that features conveying similar information fall in the same cluster. The most representative feature of each cluster, the one nearest to the cluster center, is selected.
Fig. 2 (a) and 2 (b) present the variation of classification accuracy from the 1000- to the 5000-feature set using the SVM and RF classifiers. MI with k-means consistently outperforms the other feature selection combinations using the SVM and RF classifiers. MI with k-means yields promising results compared to the other combinations because MI compares the probability of observing a feature and a class together (the joint probability) with the probabilities of observing the feature and the class independently. For this reason, MI brings together features that convey similar information. On the other hand, k-means groups this similar information, which in turn reduces the feature space while increasing the classification accuracy. Hence, the combination of these methods exhibits the maximum accuracy with both classifiers. Table 2 depicts the comparison of the various feature selection methods combined with the k-means algorithm; from these results, the proposed method reduces the feature set size while increasing the classification accuracy. A crucial observation is noted in Table 3, which summarizes the percentage reduction in features and the improvement in classification accuracy of the proposed method. Finally, the proposed method is compared with the existing methods in Table 4, from which it can be concluded that the proposed method outperforms the existing methods in classifying the sarcastic content present in the product review dataset.

4. Conclusion and Future work

In this work, we propose a novel approach to classify sarcastic text using a content based feature selection method. The proposed approach consists of a two stage feature selection method to select the most representative features. In the first stage, conventional feature selection methods such as CHI-square, IG and Mutual Information (MI) are used to select a relevant feature subset. The selected feature subset is further refined in the second stage, where the k-means clustering algorithm is used to select the most representative feature among similar features. The selected features are classified using SVM and RF classifiers. The results of the proposed approach (MI + clustering) outperform the existing methods in terms of Precision, Recall and F-measure on a benchmark dataset.
In future, the proposed work can be extended to (a) n-gram (bi-gram, tri-gram) representations, (b) various other feature selection methods and (c) variants of the k-means clustering technique along with different classifiers to enhance the classification accuracy. Further, the proposed approach can also be extended to various fields such as text classification, sentiment analysis, information retrieval and many more.

Acknowledgements

H M Keerthi Kumar has been financially supported by UGC under Rajiv Gandhi National Fellowship (RGNF) Let-
ter no: F1-17.1/2016-17/RGNF-2015-17-SC-KAR-6370/(SA-III Website), JSSRF (University of Mysore), Karnataka,
India.

References

[1] Al Amrani, Y., Lazaar, M., El Kadiri, K.E., 2018. Random forest and support vector machine based hybrid approach to sentiment analysis.
Procedia Computer Science 127, 511–520.
[2] Bharti, S., Vachha, B., Pradhan, R., Babu, K., Jena, S., 2016. Sarcastic sentiment detection in tweets streamed in real time: a big data approach.
Digital Communications and Networks 2, 108–121.
[3] Buschmeier, K., Cimiano, P., Klinger, R., 2014. An impact analysis of features in a classification approach to irony detection in product
reviews, in: Proceedings of the 5th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pp. 42–49.
[4] Chakraborty, K., Bhattacharyya, S., Bag, R., Hassanien, A.E., 2018. Comparative sentiment analysis on a set of movie reviews using deep
learning approach, in: International Conference on Advanced Machine Learning Technologies and Applications, Springer. pp. 311–318.
[5] Chormunge, S., Jena, S., 2018. Correlation based feature selection with clustering for high dimensional data. Journal of Electrical Systems
and Information Technology, doi:https://doi.org/10.1016/j.jesit.2017.06.004.
[6] Dave, A.D., Desai, N.P., 2016. A comprehensive study of classification techniques for sarcasm detection on textual data, in: Electrical,
Electronics, and Optimization Techniques (ICEEOT), International Conference on, IEEE. pp. 1985–1991.
[7] Fersini, E., Messina, E., Pozzi, F.A., 2014. Sentiment analysis: Bayesian ensemble learning. Decision support systems 68, 26–38.
[8] Filatova, E., 2012. Irony and sarcasm: Corpus generation and analysis using crowdsourcing, in: LREC, Citeseer. pp. 392–398.
[9] Filatova, E., 2017. Sarcasm detection using sentiment flow shifts, in: FLAIRS-30, Association for the Advancement of Artificial Intelligence.
pp. 264–269.
[10] Harish, B., Revanasiddappa, M., 2017. A comprehensive survey on various feature selection methods to categorize text documents. Interna-
tional Journal of Computer Applications 164, 1–7. doi:10.5120/ijca2017913711.
[11] Lee, C., Lee, G.G., 2006. Information gain and divergence-based feature selection for machine learning-based text categorization. Information
processing & management 42, 155–165.
[12] Mukherjee, S., Bala, P.K., 2017. Sarcasm detection in microblogs using naïve bayes and fuzzy clustering. Technology in Society 48, 19–27.
[13] Novovičová, J., Malík, A., Pudil, P., 2004. Feature selection using improved mutual information for text classification, in: Joint IAPR Interna-
tional Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer.
pp. 1010–1017.
[14] Reganti, A.N., Maheshwari, T., Kumar, U., Das, A., Bajpai, R., 2016. Modeling satire in english text for automatic detection, in: Data Mining
Workshops (ICDMW), 2016 IEEE 16th International Conference on, IEEE. pp. 970–977.
[15] Riloff, E., Qadir, A., Surve, P., De Silva, L., Gilbert, N., Huang, R., 2013. Sarcasm as contrast between a positive sentiment and negative
situation, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 704–714.
[16] Sarkar, S.D., Goswami, S., Agarwal, A., Aktar, J., 2014. A novel feature selection technique for text classification using naive bayes. Interna-
tional scholarly research notices 2014, 1–10. doi:http://dx.doi.org/10.1155/2014/717092.
[17] Song, F., Liu, S., Yang, J., 2005. A comparative study on text representation schemes in text categorization. Pattern analysis and applications
8, 199–209.
[18] Tripathy, A., Agrawal, A., Rath, S.K., 2016. Classification of sentiment reviews using n-gram machine learning approach. Expert Systems
with Applications 57, 117–126.
[19] Wallace, B.C., 2015. Computational irony: A survey and new perspectives. Artificial Intelligence Review 43, 467–483.
[20] Wang, G., Sun, J., Ma, J., Xu, K., Gu, J., 2014. Sentiment classification: The contribution of ensemble learning. Decision support systems 57,
77–93.
[21] Witten, I.H., Frank, E., Hall, M.A., Pal, C.J., 2016. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann.
