
All India Shri Shivaji Memorial Society’s

College of Engineering

PROJECT REPORT

ON

MOVIE REVIEWS SENTIMENT ANALYSIS

IN FULFILLMENT OF

MINI PROJECT

IN

DATA MINING AND WAREHOUSE

Year 2019-2020

SUBMITTED

BY

RAHUL HIPPARKAR

AKSHATA KADAM

NOORUL AMIN

Under the Guidance Of

Prof. D. M. UJALAMBKAR
All India Shri Shivaji Memorial Society’s

College of Engineering

Certificate
This is to certify that the report has been submitted by the following students:

Mrs. Akshata Kadam Roll no: - 16CO024

Mr. Noorul Amin Roll no: - 16CO041

Mr. Rahul Hipparkar Roll no: - 16CO046

This project work has been completed by final-year students of the Computer Department as a
part of team work prescribed by Savitribai Phule University. We have guided and assisted the students
for the above work, which has been found Satisfactory/Good/Very Good.

Signature of Signature of

Guide H.O.D

(Prof. D. M. Ujalambkar) (Dr. D. P. Gaikwad)

Name & Signature of External Examiner

DECLARATION

We declare that this written submission represents our ideas in our own words, and that where others'
ideas or words have been included, we have adequately cited and referenced the original sources. We
also declare that we have adhered to all principles of academic honesty and integrity and have not
misrepresented, fabricated, or falsified any idea, data, fact, or source in our submission. We understand
that any violation of the above will be cause for disciplinary action by the institute and can also evoke
penal action from the sources which have not been properly cited or from whom proper permission
has not been taken when needed.

Akshata Kadam

Noorul Amin

Rahul Hipparkar

ABSTRACT

Text mining (deriving information from text) is a wide field which has gained popularity with the huge
volume of text data being generated. A number of applications, such as sentiment analysis, document
classification, topic classification, text summarization, and machine translation, have been automated
using machine learning models.
Sentiment analysis is the most common text classification tool. It analyses an incoming
message and tells whether the underlying sentiment is positive, negative, or neutral.

ACKNOWLEDGEMENT

We wish to take this opportunity to express our sincere gratitude to all those who aided us in one
way or another during our project.

We owe our profound gratitude to our Project Guide Prof. D. M. Ujalambkar, who took keen
interest in our project work and guided us all along by providing all the necessary information for
completing this project.

We heartily thank our H.O.D. Prof. D. P. Gaikwad for providing us with all the necessary support
and guidance, which helped us a lot in our project.

We are thankful and fortunate to have received constant encouragement, support and guidance
from all the teaching staff of the Department of Computer, which helped us throughout this period.

Finally, we express our indebtedness to all who have directly or indirectly contributed towards our
project.

CONTENTS

Sr. No.   Title

1         Introduction
2         System Requirements
3         Data Mining
4         Text Mining
5         RapidMiner
6         Support Vector Machine
7         Process
8         Snapshots
9         Conclusion
10        References

INTRODUCTION

Sentiment analysis is the contextual mining of text to identify and extract subjective
information in source material. It helps a business understand the social sentiment of its brand,
product, or service while monitoring online conversations. However, analysis of social media streams is
usually restricted to basic sentiment analysis and count-based metrics. This is akin to just scratching
the surface and missing out on the high-value insights that are waiting to be discovered. So what
should a brand do to capture that low-hanging fruit?
With the recent advances in deep learning, the ability of algorithms to analyse text has improved
considerably. Creative use of advanced artificial intelligence techniques can be an effective tool for
doing in-depth research.
Sentiment analysis is the most common text classification tool. It analyses an incoming
message and tells whether the underlying sentiment is positive, negative, or neutral.

Text mining, also referred to as text data mining, roughly equivalent to text analytics, is the process
of deriving high-quality information from text. High-quality information is typically derived through
the devising of patterns and trends through means such as statistical pattern learning.

Text mining usually involves the process of structuring the input text (usually parsing, along
with the addition of some derived linguistic features and the removal of others, and subsequent insertion
into a database), deriving patterns within the structured data, and finally evaluation and interpretation
of the output.

SYSTEM REQUIREMENTS

Hardware requirements:

• 4 GB RAM
• 500 GB HDD (minimum)

Software requirements:

• RapidMiner 9.3
• Java JRE 8 or above
• Text Mining plugin installed on RapidMiner

Data Mining

“Data Mining” refers to the extraction of useful information from bulk data or data warehouses.
The term itself is a little confusing. In coal or diamond mining, the result of the
extraction process is coal or diamond. In data mining, however, the result of the extraction process is not
data. Instead, it is the patterns and knowledge gained at the end of the
extraction process. In that sense, Data Mining is also known as Knowledge Discovery or Knowledge
Extraction.

Gregory Piatetsky-Shapiro coined the term “Knowledge Discovery in Databases” in 1989. However,
the term ‘data mining’ became more popular in the business and press communities. Currently, Data
Mining and Knowledge Discovery are used interchangeably.

Nowadays, data mining is used almost everywhere that a large amount of data is stored and
processed. For example, banks typically use data mining to find prospective customers who
could be interested in credit cards, personal loans, or insurance. Since banks have the transaction
details and detailed profiles of their customers, they analyze this data to find patterns
that help them predict which customers could be interested in such products.

Main Purpose of Data Mining

The information gathered from data mining helps to uncover hidden patterns and predict future trends
and behaviors, allowing businesses to make informed decisions.
Technically, data mining is the computational process of analyzing data from different perspectives,
dimensions, and angles and categorizing or summarizing it into meaningful information.
Data mining can be applied to any type of data, e.g. data warehouses, transactional databases,
relational databases, multimedia databases, spatial databases, time-series databases, and the World Wide
Web.
Applications of Data Mining
1. Financial Analysis
2. Biological Analysis
3. Scientific Analysis
4. Intrusion Detection
5. Fraud Detection
6. Research Analysis

Real life example of Data Mining – Market Basket Analysis

Market Basket Analysis is a technique for carefully studying the purchases made by customers
in a supermarket. It is applied to identify the items that are bought together by a
customer. For example, if a person buys bread, what are the chances that he/she will also purchase butter?
This analysis helps companies promote offers and deals, and it is carried out with the help of data
mining.
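
The bread-and-butter question above can be made concrete with two standard measures, support and confidence. The following is a minimal sketch in Python; the baskets and the function names are hypothetical and are not part of the project dataset.

```python
# Hypothetical transactions: each set is one customer's basket (illustrative data only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated chance of buying `consequent` given that `antecedent` was bought."""
    both = support(set(antecedent) | set(consequent), baskets)
    return both / support(antecedent, baskets)

bread_support = support({"bread"}, transactions)
rule_confidence = confidence({"bread"}, {"butter"}, transactions)
print(f"support(bread) = {bread_support:.2f}")
print(f"confidence(bread -> butter) = {rule_confidence:.2f}")
```

In this toy data, four of the five baskets contain bread and three of those four also contain butter, so the rule "bread implies butter" has a confidence of about 0.75.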

Text Mining

Text mining, also referred to as text data mining, roughly equivalent to text
analytics, is the process of deriving high-quality information from text. High-quality information is
typically derived through the devising of patterns and trends through means such as statistical pattern
learning. Text mining usually involves the process of structuring the input text (usually parsing, along
with the addition of some derived linguistic features and the removal of others, and subsequent insertion
into a database), deriving patterns within the structured data, and finally evaluation and interpretation
of the output.
The term text analytics describes a set of linguistic, statistical, and machine
learning techniques that model and structure the information content of textual sources for business
intelligence, exploratory data analysis, research, or investigation. The term is roughly synonymous with
text mining; indeed, Ronen Feldman modified a 2000 description of "text mining" in 2004 to describe
"text analytics". The latter term is now used more frequently in business settings while "text mining" is
used in some of the earliest application areas, dating to the 1980s, notably life-sciences research and
government intelligence.
Applications
Security applications
Many text mining software packages are marketed for security applications, especially
monitoring and analysis of online plain text sources such as Internet news, blogs, etc. for national
security purposes. It is also involved in the study of text encryption/decryption.
Software applications
Text mining methods and software are also being researched and developed by major firms,
including IBM and Microsoft, to further automate the mining and analysis processes, and by different
firms working in the area of search and indexing in general as a way to improve their results. Within
the public sector, much effort has been concentrated on creating software for tracking and
monitoring terrorist activities.
Online media applications
Text mining is being used by large media companies, such as the Tribune Company, to clarify
information and to provide readers with greater search experiences, which in turn increases site
"stickiness" and revenue. Additionally, on the back end, editors are benefiting by being able to share,
associate and package news across properties, significantly increasing opportunities to monetize
content.
Business and marketing applications
Text mining is starting to be used in marketing as well, more specifically in analytical customer
relationship management. Coussement and Van den Poel (2008) apply it to improve predictive
analytics models for customer churn (customer attrition). Text mining is also being applied in stock
returns prediction.
Sentiment analysis
Sentiment analysis may involve analysis of movie reviews for estimating how favorable a
review is for a movie. Such an analysis may need a labeled data set or labeling of the affectivity of
words. Resources for affectivity of words and concepts have been made for WordNet and ConceptNet.
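
As an illustration of what "labeling of the affectivity of words" can mean in practice, the sketch below scores a review by summing word polarities from a small lexicon. The lexicon, its scores, and the function names are assumptions made for this example; they are not taken from WordNet or ConceptNet, and this is not the method used in this project (which trains an SVM on labeled reviews).

```python
# Hypothetical affect lexicon: word -> polarity score (illustrative values only).
LEXICON = {"great": 1.0, "enjoyable": 0.5, "boring": -1.0, "awful": -1.0, "dull": -0.5}

def lexicon_score(review):
    """Sum the polarity of known words; unknown words contribute 0."""
    words = review.lower().split()
    return sum(LEXICON.get(word.strip(".,!?"), 0.0) for word in words)

def lexicon_label(review):
    """Map the summed score to positive, negative, or neutral."""
    score = lexicon_score(review)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_label("A great and enjoyable film!"))
print(lexicon_label("Boring plot, awful acting."))
```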

RapidMiner

RapidMiner is a data science software platform developed by the company of the same name
that provides an integrated environment for data preparation, machine learning, deep learning, text
mining, and predictive analytics. It is used for business and commercial applications as well as for
research, education, training, rapid prototyping, and application development and supports all steps of
the machine learning process including data preparation, results visualization, model validation and
optimization. RapidMiner is developed on an open core model. The RapidMiner Studio Free Edition,
which is limited to 1 logical processor and 10,000 data rows, is available under
the AGPL license. Commercial pricing starts at $5,000 and is available from the developer.
RapidMiner uses a client/server model with the server offered either on-premises or in public
or private cloud infrastructures.
According to Bloor Research, RapidMiner provides 99% of an advanced analytical solution
through template-based frameworks that speed delivery and reduce errors by nearly eliminating the
need to write code. RapidMiner provides data mining and machine learning procedures including: data
loading and transformation (Extract, transform, load (ETL)), data preprocessing and visualization,
predictive analytics and statistical modeling, evaluation, and deployment. RapidMiner is written in the
Java programming language. RapidMiner provides a GUI to design and execute analytical workflows.
Those workflows are called “Processes” in RapidMiner and they consist of multiple “Operators”. Each
operator performs a single task within the process, and the output of each operator forms the input of
the next one. Alternatively, the engine can be called from other programs or used as an API. Individual
functions can be called from the command line. RapidMiner provides learning schemes, models and
algorithms and can be extended using R and Python scripts.
RapidMiner functionality can be extended with additional plugins which are made available via
RapidMiner Marketplace. The RapidMiner Marketplace provides a platform for developers to create
data analysis algorithms and publish them to the community.
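
The idea of a Process as a chain of Operators, where each output becomes the next input, can be sketched in a few lines of Python. This is only an analogy: the function names below are hypothetical stand-ins and do not correspond to the RapidMiner API or to real RapidMiner operators.

```python
from functools import reduce

def run_process(operators, data):
    """Feed `data` through each operator in order, passing each output to the next."""
    return reduce(lambda result, operator: operator(result), operators, data)

# Three toy operators (hypothetical stand-ins, not RapidMiner operators).
def to_lowercase(text):
    return text.lower()

def tokenize(text):
    return text.split()

def count_tokens(tokens):
    return len(tokens)

process = [to_lowercase, tokenize, count_tokens]
result = run_process(process, "An Unexpectedly Good Movie")
print(result)
```

Reordering or swapping the entries of `process` changes the workflow without touching the individual operators, which is the property the graphical designer relies on.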

Support Vector Machine

In machine learning, support-vector machines (SVMs, also support-vector networks)
are supervised learning models with associated learning algorithms that analyze data used
for classification and regression analysis. Given a set of training examples, each marked as belonging
to one or the other of two categories, an SVM training algorithm builds a model that assigns new
examples to one category or the other, making it a non-probabilistic binary linear classifier (although
methods such as Platt scaling exist to use SVM in a probabilistic classification setting). An SVM model
is a representation of the examples as points in space, mapped so that the examples of the separate
categories are divided by a clear gap that is as wide as possible. New examples are then mapped into
that same space and predicted to belong to a category based on the side of the gap on which they fall.
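
For a linear SVM, "the side of the gap" is decided by the sign of w·x + b, and the width of the gap is measured by the distance from the closest training point to the separating hyperplane. The sketch below shows both calculations for a hand-picked hyperplane; the points, labels, weights, and function names are assumptions chosen for illustration, and no training is performed.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane x falls on."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labelled point to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm for x, y in zip(points, labels))

# Hypothetical 2-D training points and a hand-picked separating hyperplane.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]
w, b = (1.0, 1.0), -2.0

margin = geometric_margin(w, b, points, labels)
print(predict(w, b, (2.5, 1.0)), predict(w, b, (0.5, 0.5)))
print(f"margin = {margin:.4f}")
```

An SVM training algorithm searches for the w and b that make this margin as large as possible while keeping every training point on its correct side.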
In addition to performing linear classification, SVMs can efficiently perform a non-linear
classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional
feature spaces.
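
The kernel trick can be checked numerically on a small case. The sketch below uses a degree-2 polynomial kernel (chosen here for illustration; the report does not state which kernel was used) and shows that evaluating the kernel in the original 2-D space gives the same number as a dot product in an explicit 3-D feature space.

```python
import math

def poly_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel: (x . z) ** 2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def feature_map(x):
    """Explicit 3-D feature map whose dot product equals poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

x, z = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly_kernel(x, z)                       # computed in the original 2-D space
explicit_value = dot(feature_map(x), feature_map(z))   # computed in the mapped 3-D space
print(kernel_value, explicit_value)
```

The kernel never builds the mapped vectors, which is why SVMs can work with very high-dimensional feature spaces at modest cost.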
When data are unlabelled, supervised learning is not possible, and an unsupervised
learning approach is required, which attempts to find natural clustering of the data to groups, and then
map new data to these formed groups. The support-vector clustering algorithm, created by Hava
Siegelmann and Vladimir Vapnik, applies the statistics of support vectors, developed in the support
vector machines algorithm, to categorize unlabeled data, and is one of the most widely used clustering
algorithms in industrial applications.

Process

Step 1:-

i. Retrieve the dataset
ii. Remove unnecessary attributes
iii. Set the role of the ‘Type’ attribute to label
iv. Convert the attributes from type Nominal to Text
v. Apply pre-processing operations such as Tokenization, Lowercase Conversion & Removal of
Stop words.
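
The pre-processing in Step 1 (v) is done with RapidMiner operators in this project. As a rough equivalent, the same three operations can be sketched in plain Python; the stop word list and the function name are assumptions for illustration and are much smaller than a real stop word list.

```python
import re

# A small illustrative stop word list (an assumption, not RapidMiner's list).
STOP_WORDS = {"a", "an", "the", "is", "was", "and", "of", "this", "it", "to"}

def preprocess(text):
    """Tokenize on non-letter characters, lowercase, and drop stop words."""
    tokens = re.findall(r"[A-Za-z]+", text)
    lowered = [token.lower() for token in tokens]
    return [token for token in lowered if token not in STOP_WORDS]

tokens = preprocess("This movie was a GREAT surprise, and the acting is superb!")
print(tokens)
```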

Step 2:-

i. Pass the sentence vectors generated in Step 1 to the Cross Validation operator
ii. Inside it, train a Support Vector Machine on the training dataset & check its performance on the
testing dataset using the Apply Model & Performance operators.
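
Cross validation repeatedly splits the examples into a training part and a testing part so that every example is tested exactly once. The sketch below shows only the splitting logic; the number of folds is an assumption (the report does not state it), and shuffling and stratification are left out for brevity.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_examples) if i < start or i >= start + size]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 5))
for train, test in folds:
    # In the real process, the SVM is trained on `train` and evaluated on `test` here.
    print("train:", train, "test:", test)
```

The performance figures reported for a cross validation are then averages over the folds, which is where the "+/-" spread comes from.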

Step 3:-

i. Create a document
ii. Process the document using tokenization, lowercase conversion & removal of stop words
iii. The initial wordlist from Step 1 is the additional input.

Step 4:-

i. The model trained with the old texts is applied to the new document.
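
Steps 3 and 4 work because the new document is turned into a vector over the same wordlist that was built from the training texts, so the trained model sees features in the positions it expects. A minimal sketch of that idea follows; it uses raw term counts as the vector values, which is an assumption, since the report does not say which weighting was configured, and all names and documents are hypothetical.

```python
def build_wordlist(tokenized_docs):
    """Sorted vocabulary of every token seen in the training documents."""
    return sorted({token for doc in tokenized_docs for token in doc})

def vectorize(tokens, wordlist):
    """Term-count vector over a fixed wordlist; words outside the list are ignored."""
    return [tokens.count(word) for word in wordlist]

# Hypothetical, already pre-processed training documents.
training_docs = [["great", "acting"], ["boring", "plot"], ["great", "plot"]]
wordlist = build_wordlist(training_docs)

# A new document is vectorized against the existing wordlist, not a fresh one.
new_doc = ["great", "great", "soundtrack"]
new_vector = vectorize(new_doc, wordlist)
print(wordlist)
print(new_vector)
```

The word "soundtrack" does not appear in the wordlist, so it is dropped: the model can only use words it saw during training.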

Snapshots

Fig 1: Process

Fig 2: Cross Validation

Fig 3: Testing Dataset

Fig 4: Prediction

Fig 5: Accuracy and Performance Vector

CONCLUSION

• The accuracy of the model is 79.30% +/- 2.02%
• The precision of the model is 80.92% +/- 0.82%
• The recall of the model is 78.10% +/- 5.85%
• Confusion Matrix:

True:    positive    negative
ham:     814         232
spam:    182         772
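
The reported figures can be cross-checked against the confusion matrix. The sketch below assumes that rows are predicted classes, columns are true classes, and that precision and recall refer to the second class; these are assumptions about how the matrix is laid out, not something the report states.

```python
# Counts copied from the confusion matrix above.
# Assumption: rows = predicted class, columns = true class,
# and the second class is the one precision and recall are reported for.
matrix = [[814, 232],
          [182, 772]]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total
precision = matrix[1][1] / (matrix[1][0] + matrix[1][1])
recall = matrix[1][1] / (matrix[0][1] + matrix[1][1])

print(f"accuracy  = {accuracy:.2%}")
print(f"precision = {precision:.2%}")
print(f"recall    = {recall:.2%}")
```

Under these assumptions the pooled counts reproduce the reported accuracy (79.30%) and precision (80.92%). The pooled recall comes out near 76.89% rather than 78.10%, which would be consistent with the reported value being an average of per-fold recalls and not a figure computed from the pooled matrix.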

REFERENCES

• https://www.youtube.com/watch?v=g5hI6wmdijM
• https://www.youtube.com/watch?v=kq61oFXD4YI
• https://www.youtube.com/watch?v=tI_0ZexuHvY
• https://www.youtube.com/watch?v=_zEnAWAUesQ
• https://www.youtube.com/watch?v=27RQRUR7Ubc
• https://medium.com/machine-learning-101/chapter-2-svm-support-vector-machine-theory-f0812effc72
• https://www.kaggle.com/saurav133/spam-ham-prediction-nlp
