
ABUSIVE CONTENT DETECTION

USING SENTIMENT ANALYSIS


Under the guidance of:
Dr.P.Radhika Raju

Project Team:
Patil Rahul Reddy(16001A0501)
Mummadi Ruthwick Reddy(16001A0550)
Muppidi Snigdha(16001A0554)
INTRODUCTION
Sentiment analysis is the systematic evaluation of people’s opinions and
emotions. It is currently an active research area in Natural Language
Processing (NLP) and Text Mining.
Sentiment analysis is used to keep the spread of false news in check,
to remove abusive content, to understand customer experience,
and to monitor social media.
The number of social media users is increasing daily, so the
need for sentiment analysis cannot be overemphasized.
WHY?
Many people use social media nowadays, and abusive content is rising
along with that use; as a result, many people are being
affected.
To handle these kinds of scenarios, an application that can detect
abusive content in text data needs to be developed.
SCOPE:
This model works only with emotion-oriented information-seeking systems.
This model works only with text data, not with multimedia data.
WORKING OF SENTIMENT ANALYSIS
IMPLEMENTATION:

Sentiment analysis is treated here as a classification task: a classifier is
fed with data and returns the corresponding category, such as
abusive or non-abusive.
Features are extracted from the data using feature extraction
techniques such as TF-IDF (Term Frequency-Inverse Document
Frequency).
Sentiment analysis is done by supervised classification algorithms
such as logistic regression, which are fed with these features.
Once training is done, the trained model can classify the data into
the corresponding categories.
NORMALIZING AND CLEANING:
In normalizing, we replace the short forms of words with
their full forms.
E.g., ‘u’ is replaced with ‘you’.
In the cleaning stage, we remove all stop words and punctuation.
E.g., stop words include ‘you’, ‘is’, ‘they’, etc.
Pre-processing of the data happens in these two stages.
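The two stages above can be sketched in a few lines of Python. This is only an illustration: the `ABBREVIATIONS` and `STOP_WORDS` tables and the function names are hypothetical, and a real system would use much larger word lists.

```python
import re
import string

# Hypothetical lookup tables for illustration only.
ABBREVIATIONS = {"u": "you", "r": "are", "ur": "your", "pls": "please"}
STOP_WORDS = {"you", "is", "they", "are", "a", "an", "the", "your"}


def normalize(text):
    """Lowercase the text and expand known short forms, word by word."""
    return re.sub(
        r"\b\w+\b",
        lambda match: ABBREVIATIONS.get(match.group(0), match.group(0)),
        text.lower(),
    )


def clean(text):
    """Strip punctuation, then drop stop words; returns a list of tokens."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]


def preprocess(text):
    """Run both stages: normalizing first, then cleaning."""
    return clean(normalize(text))


tokens = preprocess("U r an idiot, pls stop!")
```

Normalizing runs before cleaning so that an expanded form such as ‘you’ is still caught by the stop-word filter.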
TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF):
TF-IDF is a statistical measure that evaluates how relevant a word is to a
document in a collection of documents.
This is done by multiplying two metrics: how many times a word appears in a
document, and the inverse document frequency of the word across a set of
documents.
Term frequency is the frequency of the word in the document.
Inverse document frequency measures how rare the word is across the collection: it is
usually computed as the logarithm of the total number of documents divided by the
number of documents that contain the word.
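As a rough illustration of the definition above, the following sketch computes TF-IDF weights by hand. It assumes the plain textbook form (relative term frequency multiplied by the log of total documents over documents containing the term); the function name `tf_idf` and the toy documents are made up for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    tf  = count of the term in the document / number of tokens in the document
    idf = log(total documents / documents containing the term)
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["you", "are", "stupid"],
    ["you", "are", "kind"],
    ["have", "a", "nice", "day"],
]
scores = tf_idf(docs)
# "stupid" occurs in one document, "you" in two, so "stupid" gets the higher weight.
```

Library implementations such as scikit-learn's `TfidfVectorizer` apply a smoothed IDF and normalize each document vector by default, so their numbers will differ from this plain version.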
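CHI-SQUARE TEST:
A Chi-Square test is a test of statistical significance for categorical variables.
The data should be in the form of frequencies or counts of a particular category
and not in percentages.
Here it is used to select the features most strongly associated with the
abusive and non-abusive categories.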
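To make the test concrete, here is a minimal sketch of the Pearson chi-square statistic for a single word, using a 2x2 table of counts (word present or absent against abusive or non-abusive). The function name and the counts are invented for illustration; the project may well have used a library routine instead.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 table of counts.

                     abusive   non-abusive
    term present        a          b
    term absent         c          d
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        # A row or column is empty, so no association can be measured.
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator


# Made-up counts: the term appears mostly in abusive comments.
associated = chi_square_2x2(30, 5, 20, 45)
# Made-up counts: the term is spread evenly across both categories.
independent = chi_square_2x2(10, 10, 20, 20)
```

A larger statistic means the word's presence depends more strongly on the category, which is what makes it a useful feature to keep.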
LOGISTIC REGRESSION:
Logistic regression is a classification algorithm used to assign observations to a
discrete set of classes.
Logistic regression comes in different types, for example binary logistic
regression, where there are only two possible outcomes.
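A minimal sketch of binary logistic regression, trained with plain batch gradient descent, is shown below. The function names, the learning rate, the epoch count, and the two-feature toy data are all assumptions made for illustration, not the settings used in the project.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = len(features), len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Prediction error for this sample: predicted probability minus label.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (abusive) if the predicted probability is at least 0.5, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy data: [weight of an abusive term, weight of a polite term]; 1 = abusive.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.8]]
y = [1, 1, 0, 0]
weights, bias = train_logistic_regression(X, y)
train_predictions = [predict(weights, bias, x) for x in X]
```

In practice the feature vectors would be the selected TF-IDF weights rather than two hand-picked numbers.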
WORK FLOW OF ALGORITHM

INPUT
Text data for training and testing

Data Cleaning
Tokenization
Abbreviation Treatment
Stop Words Removal
Bad-words Synonyms Mapping
Punctuation Removal

APPLICATION FLOW
TF-IDF Transformation
Chi Square Feature Selection
Modelling
Classification
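The flow above (TF-IDF transformation, chi-square feature selection, modelling, classification) could be wired together with scikit-learn roughly as follows. This is a sketch under assumptions, not the project's actual code: the toy comments and labels are made up, and `k=5` is chosen only because the toy vocabulary is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy comments; 1 = abusive, 0 = non-abusive.
comments = [
    "you are a stupid idiot",
    "shut up you moron",
    "what a pathetic loser",
    "thank you for your help",
    "have a wonderful day",
    "great game well played",
]
labels = [1, 1, 1, 0, 0, 0]

model = make_pipeline(
    TfidfVectorizer(stop_words="english"),  # cleaning + TF-IDF transformation
    SelectKBest(chi2, k=5),                 # chi-square feature selection
    LogisticRegression(),                   # modelling
)
model.fit(comments, labels)

# Classification of unseen text; each prediction is 0 or 1.
predictions = model.predict(["you are an idiot", "have a great day"])
```

Keeping all steps in one pipeline means the same vocabulary and the same selected features are reused when new comments are classified.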
RESULT
INSIGHTS:
53% of comments that contain abusive words are not actually abusive.
In one of every five comments, variants of abusive words are used to insult rather
than the abusive words themselves.
Typing errors are a common part of chat, but the model penalizes them heavily
when they resemble abusive words.
SUMMARY:
The model has an accuracy of 91.2% on training data and 81% on
cross-validation.
Logistic Regression was found to be the most suitable model in
comparison to the popularly used Naïve Bayes and SVM.
1500 relevant features are selected using the Chi-Square test.

[Figures: Common Chat Words; Abusive Word Variants]


CONCLUSION:
This application helps reduce negativity and the spread of revolutionary and
terror-related thoughts.
Reference:
Monkeylearn.com
Datacamp.com
Kaggle.com
Towardsdatascience.com
Scikit-learn.org
THANK YOU
