Table of Contents
1. Project Vision
3. Hypothesis
4. Data Processing
4.1. Data
5. Feature Selection
5.1. Dataset
8.2. Decision Tree
11. Project Management
12. List of Queries
1. Project Vision
In today's fast-growing world there are many businesses: startups, growing companies, and well-established firms. For every business, its rating is vital to its survival in the market. This rating is given by the users who enjoy the goods and services of a business. A user expresses his experience with a business in the form of reviews and star ratings through many platforms, the most famous of which is Yelp. A review can be positive, negative, or neutral. The aim of our project is to build a classifier that classifies any given review into a star-rating label (-1, 0, or 1). We planned to use various data mining models to classify reviews into user star-rating labels, applying various model tuning techniques to attain optimal classification accuracy.
3. Hypothesis
Since this is classification based on text, the words in the reviews are the features we should consider in order to classify them correctly. For example, a review containing words like "good", "excellent", "awesome", or "yumm food" should be classified into the positive class label. We planned to concentrate on these words and apply transformations such as stop-word removal and stemming, with the help of different tools, to make the best possible use of those words so that reviews are classified correctly. We intended to concentrate mainly on Bayesian algorithms, as they perform well for text classification.
We also intended to use combinations of words called bigrams, for example "very good" or "yum yum". In general, a user's view of a business is expressed mostly in combinations of words, so we expected that using bigrams would give the model good accuracy.
Other features like business id and user id can improve accuracy when used individually, and they should not be used in combination.
We discuss in the results below how model learning is affected by each of the approaches in this hypothesis.
4. Data Processing
4.1. Data
We obtained the data from http://www.yelp.com/dataset_challenge. It contains 40,000 businesses, 1.3 million reviews, and 250,000 users. The data was in JSON format, and we pre-processed it and converted it into .CSV format to obtain the review text, class labels, and other features. Many irrelevant fields, such as neighborhoods and votes, were removed, and only the required ones were kept. Initially each review has a star-rating class label from 1 to 5; we reduced these labels to negative(-1) for 1 and 2 stars, neutral(0) for 3 stars, and positive(1) for 4 and 5 stars. The figure below shows the distribution of all the reviews over the modified class labels.
[Figure: graph showing the number of reviews per modified class label (NEGATIVE, NEUTRAL, POSITIVE); counts: 748,188; 213,509; 163,761]
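A minimal HiveQL sketch of this star-to-label reduction (illustrative only; the actual relabeling was done during preprocessing, and the reviews table shown here is the one defined in Section 5):

SELECT review_id,
       CASE
         WHEN stars <= 2 THEN -1   -- 1 and 2 stars become negative
         WHEN stars = 3  THEN 0    -- 3 stars become neutral
         ELSE 1                    -- 4 and 5 stars become positive
       END AS label
FROM reviews;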
The dataset consists of three types of JSON objects; a condensed view of the fields of interest:

{ 'type': 'business',
  'city': (city),
  'state': (state),
  'latitude': latitude,
  'longitude': longitude }

{ 'type': 'user',
  'average_stars': (floating point average, like 4.31),
  'elite': [(years_elite)],
  'compliments': { (compliment_type): (count) } }

{ 'type': 'review', ... }
4.4.2. Removing Special Characters, Numerics, and Other-Language Words
The main field of interest is review_text, the text a user entered as a review of a business, together with its stars. The text contains special characters, newline characters, and characters from other languages. These were removed with the PHP script below.
// Assumes an open MySQL connection (legacy mysql_* API) and $stopWords, an array of stop words.
$i = 0;
$optString = "";
$result = mysql_query("select business_id, user_id, review_id, text, stars, review_count from reviews");
while ($rows = mysql_fetch_array($result)) {
    $i++;
    // Keep only letters and spaces; drop numerics, special and other-language characters.
    $text = preg_replace("/[^a-z ]/i", " ", $rows['text']);
    $text = str_replace("\n", " ", $text);
    // Process text to remove stop words.
    $text = explode(" ", $text);
    $processed_text = "";
    foreach ($text as $s) {
        $s = strtolower($s);
        // Strict comparison (===): array_search() returns index 0 for the first
        // stop word, which a loose comparison (==) would wrongly treat as "not found".
        if ($s != "" && array_search($s, $stopWords) === false)
            $processed_text .= $s . " ";
    }
    $optString .= "'" . $rows['business_id'] . "',";
    $optString .= "'" . $rows['user_id'] . "',";
    $optString .= "'" . $rows['review_id'] . "',";
    $optString .= "'" . $processed_text . "',";
    $optString .= $rows['stars'] . ",";
    $optString .= $rows['review_count'];
    $optString .= "\n";
    // Flush the accumulated rows to the CSV file every 1000 records.
    if ($i % 1000 == 0) {
        $fd = fopen("reviews_DetailedStopWords.csv", "a+");
        fwrite($fd, $optString);
        fclose($fd);
        $optString = "";
        echo "$i\n";
    }
}
5. Feature Selection
5.1. Dataset
After data preprocessing, the dataset is in clean, structured CSV format with the required columns. This file is loaded into HDFS, and a `reviews` table is created over the dataset. This table is used for feature selection.
$> hadoop fs -put review.csv
$> hive
HIVE> create table reviews(business_id String, review_id String, user_id String, review_text String, stars int)
row format delimited
fields terminated by ','
lines terminated by '\n';
HIVE> load data inpath 'review.csv' into table reviews;
HIVE> select * from reviews LIMIT 10;
/* Displays list of columns in correct format */
In this phase we applied the NGRAM tokenizer, which converts the text into n-grams (unigrams, bigrams, trigrams). On top of this we applied the Attribute Selection filter, which uses the InfoGainAttributeEval function to evaluate the worth of each attribute by measuring its information gain with respect to the class; this gave 70.02% correctly classified instances. The data sample has 25,000 instances, of which 66% is training data and the remainder is test data.
[Figure: sample of the feature dataset with columns Business Id, User Id, Bigrams, and Review Text]
We then apply the model on the test data to classify the stars of each review as -1, 0, or 1, which implies a negative, neutral, or positive review. This model has shown an accuracy of 69.5%.
The queries Bigrams1, Numerics, Bigrams_1, Bigrams_stag_1, and Bigrams_stag_2 (listed in Section 12) were used to design the model. This is the final model after tuning. Below is an example of the model implementation, showing classification with unigrams.
Training Data

Word        Frequency   Star   Probability
Good        100         1      0.33
Excellent   50          1      0.16
Bad         100         -1     0.5
Good        10          -1     0.05
Nice        15          0      0.1

Class totals

Star   Total Count
1      300
-1     200
0      150
The queries bigrams_test_1_1 and stats (Section 12) are used to compute the results. In the example above, the probability of each word is calculated from the frequency of the word and the total word count of its class. The word "Good" occurs under both the 1 and -1 star ratings; based on its probabilities, "Good" is classified as +1 star. Below is how a review is classified based on its text.
Test Data

Review_id   Word    Count in reviews   New_Star   Original_Star
1           Good    10                 1          1
2           Bad     30                 -1         -1
2           Worst   40                 -1         -1
For review id 1, the counts of its words are considered and the review is classified with star rating 1. Similarly, review id 2 is classified with star rating -1.
I considered only businesses with at least 10 reviews and users with at least 10 reviews. I've divided the data at each business and user level. For example, if there are 100 reviews for a specific business, then 66 reviews go to the training data and the rest to the test data when I consider business id as a feature, and this division happens for every business id. A similar process is repeated for user id, and also when the features business id and user id are considered together, as sketched below.
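One way this per-business 66/34 split could be expressed in Hive (a sketch only, not the query actually used in the report; the train_data table name is an assumption):

CREATE TABLE train_data AS
SELECT business_id, user_id, review_id, review_text, stars
FROM (
  SELECT r.*,
         row_number() OVER (PARTITION BY business_id ORDER BY review_id) AS rn,  -- position within this business
         count(*)     OVER (PARTITION BY business_id) AS n                       -- total reviews for this business
  FROM reviews r
) t
WHERE rn <= 0.66 * n;   -- first 66% of each business's reviews go to training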
In Hive, I was not able to produce an ROC measure for the result metrics.
f) It is difficult to design and maintain this model: any change requires many implications to be considered; for example, if a query is changed, what is the effect on the result? One has to be very careful while making changes.
g) There are a lot of predefined functions in HIVE, and new extensions can be easily accommodated by writing a UDF (User Defined Function).
Action                                  Naïve Bayes   Naïve Bayes Multinomial   ROC for NBM
1  Initial Dataset                      44%           46%                       0.47
2  Refining of Training and Test Data   +7%           N/A                       0.54
3  Change of Stemmer                    +0.5%         N/A                       0.55
4  Ngrams: Unigrams                     +3%           +3.5%                     0.59
   Ngrams: Bigrams                      +7%           +7.5%                     0.68
   Ngrams: Trigrams                     +2%           +2%                       0.60
5  Business id and User id              +2%           +1%                       0.70
   Business id                          +2%           +3%                       0.74
   User id                              +5%           +5%                       0.74
6  Bag of words                         +5%           +4%                       0.75
7  Overall Accuracy on 100,000 records  ~73%          ~76%                      0.78
8  Accuracy on complete dataset         ~75%          N/A                       N/A

(Values after row 1 are changes in accuracy relative to the previous step.)

Discussion of results:
1. In this stage no filters are applied; these are the initial model results.
2. The default set of training data I selected was skewed positive, so sampling the training data helped me increase the accuracy of my model. In Naïve Bayes Multinomial, randomization is handled automatically by WEKA (using its randomizer).
3. Changing the stemmer from the Lovins stemmer to the Porter stemmer showed a slight increase in accuracy. The Lovins stemmer is not in the default stemmers list in WEKA, so it could not be used with Naïve Bayes Multinomial.
4. Of the three n-gram settings, bigrams gave me the optimal results in both models, so I went ahead and implemented bigrams.
5. Using business id and user id together in the probability model reduced accuracy. This might be because the full outer join, which joins records on business id and user id, searches for specific instances, and instances present in the training dataset might not be available in the test dataset; in this case Naïve Bayes Multinomial gave the better result. Used separately, business id and user id each increased accuracy, and accuracy was highest when user id alone was considered. From this I understand it is similar to per-user sentiment analysis, because a user will use the same sort of text to express his feelings; I also saw that a small number of users wrote a large number of reviews, so using user id as a feature definitely explains the increase in accuracy.
6. Initially I used 1000 words for the frequency count, but using 3000 words increased accuracy.
8. I was not able to fit the complete dataset in WEKA memory, even after allocating 6GB to WEKA.
The probability of a review d belonging to a class c is computed as

P(c | d) ∝ P(c) * Π(1 ≤ k ≤ nd) P(tk | c)

where P(tk | c) is the conditional probability of a term tk occurring in a review of class c. We interpret P(tk | c) as a measure of how much evidence tk contributes to c being the correct class. P(c) is the prior probability of a review occurring in class c. If a review's terms do not provide clear evidence for one class versus another, we choose the class that has the higher prior probability.
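In other words, a review is assigned to the class with the maximum a posteriori probability; stated explicitly (the standard multinomial Naive Bayes decision rule):

cmap = argmax over c of [ P(c) * Πk P(tk | c) ]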
We used WEKA to implement the model. Initially, the dataset containing the features review text, business id, review id, user id, and the class label is fed to the tool. Preprocessing is done, and the features are converted into word vectors or n-grams based on the filters applied. The model is then run on them.
Naive Bayes Multinomial

Step               Accuracy (%)   ROC
Initial            48             0.46
Stopwords          53             0.51
Stemmer            54             0.52
Unigrams           59             0.58
                   65             0.64
                   65.23          0.64
Bigrams            72             0.73
Trigrams           63             0.62
User id            74             0.75
Business id        74             0.75
                   78             0.79
                   79             0.81
Overall Accuracy   79.49          0.83
Observation: In the Naive Bayes Multinomial model, varying the minimum term frequency and using the bigrams feature improved performance drastically. The reason is that the data contains many bigrams, and these settings concentrate the model on the most frequent words in the reviews.
6. Increasing the WordsToKeep count from 1000 to 5000 or 10000 depending on the size of the
dataset.
Naive Bayes Multinomial Text

Step               Accuracy (%)   ROC
Initial            54             0.53
Stopwords          56             0.57
Stemmer            59             0.60
Unigrams           61             0.62
                   66             0.70
                   65             0.64
Bigrams            73             0.75
Trigrams           64             0.66
User id            77             0.79
Business id        77             0.79
Overall Accuracy   79.6           0.84
Observation: The same reasons given for the Naive Bayes Multinomial model are responsible for the drastic increase in the accuracy of this model. From the comparison of results above, we can say that the Naïve Bayes Multinomial Text model has slightly higher accuracy (by about 0.11%) than the Naïve Bayes Multinomial model. The reason could be that the Naïve Bayes Multinomial Text model carries out some extra processing, which gives it slightly higher accuracy than the Naïve Bayes Multinomial model.
The value of k should be chosen according to the data: a larger k reduces the effect of noise on the classification, but makes the boundaries between the classes less distinct.
Observation: The results obtained are best when k = 5. The reason could be that an odd k avoids ties: when classifying into more than two groups, or when using an even value of k, it might be necessary to break a tie in the number of nearest neighbors.
Comparing the KNN-specific parameters, namely the Euclidean and Manhattan distance metrics, we observed that Euclidean distance gave better results.
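For reference, the two distance metrics compared are the standard definitions over feature vectors x and y:

Euclidean: d(x, y) = sqrt( Σi (xi - yi)^2 )
Manhattan: d(x, y) = Σi |xi - yi|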
Parameter      Value   Accuracy   ROC
Binary Split   TRUE    71.54      0.72
numFolds       10      69.93      0.73
useLaplace     TRUE    69.98      0.82
Observation:
For the decision tree, the best results are obtained when the Laplace parameter is set to true, which increased the ROC to 0.819.
Varying the minimum term frequency parameter drastically affected performance: words which are not repeated frequently, and which are not useful for classification, are ignored.
The review text mostly contains bigrams like "very good" and "feeling awesome", so using the bigram feature for classifying the reviews helped a lot.
Using additional features like user id and business id increased performance. For example, if a user gives mostly positive reviews across different businesses, the next review given by him for any other business is most likely positive; and if a business has mostly positive reviews, the next incoming review is most likely positive. For these features to work, a user's reviews must be present in both the training and the test set, and the same holds for businesses.
The Naïve Bayes Multinomial model has almost the same accuracy as the above model for the same reasons, but the Multinomial Naïve Bayes Text model performs some extra processing which increases its accuracy.
10. Conclusion
The graph above clearly shows that Naïve Bayes Multinomial Text gives the optimal accuracy of ~80%. It can be observed that, going from left to right, the Naïve Bayes models started with accuracies of 40% and 55%, whereas KNN and Decision Tree started with better accuracy; we therefore expected better results from those models, but they did not turn out as expected. Accuracy increased as we included features and filters. Among the n-grams, bigrams showed the best results, so we used bigrams for further processing. Among the features business id, user id, and both together, we observed the best accuracy when only user id was considered. For KNN, accuracy was good with k=5, and using KNN-specific settings like Euclidean distance raised the results considerably. For the Decision Tree, setting the Laplace value increased the ROC. Including these model-specific settings improved the accuracy of those two models slightly.
Overall, Naïve Bayes Multinomial Text classification gave good accuracy when compared with the other models.
11. Project Management
11.2. Self-Assessment:
- Everyone on the team contributed equally. There was no total dependency on, or delay from, anyone in the team.
- Everyone was equally active and enthusiastic to learn something new.
- Before taking any decision, we made sure that everyone was clear about the requirements and expected output. We followed a process of Knowledge Transfer and Reverse Knowledge Transfer to make sure that everyone was on the same page.
- There were a lot of discussions in the initial phase of the project so that everything would go without any hurdles at the end.
- Everyone on the team has decent knowledge of different tools and technologies, such as Pentaho Data Integration, WEKA, MySQL, Java, PHP, and Big Data components, so whenever any sort of decision had to be made, there was always someone to address it.
- Everyone used the Asana project management tool actively.
- Domain knowledge.
- Increasing awareness and usability of tools among all the members of the team.
12. List of Queries
12.1. Bigrams1
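-- Builds the bigram vocabulary: for each bigram, keep the class (star) in which it is
-- most frequent, and retain only the highest-frequency bigrams (slno < 5000).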
CREATE TABLE bigrams as
SELECT word, star, frequency
FROM
(
SELECT word, star, frequency, rank() over(order by frequency desc) as slno
FROM
(
SELECT word,
CASE
WHEN pos_count >= neg_count AND pos_count >= nut_count THEN 1
WHEN neg_count >= nut_count THEN -1
ELSE 0
END AS star,
CASE
WHEN pos_count >= neg_count AND pos_count >= nut_count THEN pos_count
WHEN neg_count >= nut_count THEN neg_count
ELSE nut_count
END AS frequency
FROM (
SELECT
distinct
CASE WHEN neg.gram.ngram[0] IS NOT NULL THEN concat(neg.gram.ngram[0]," ",neg.gram.ngram[1])
WHEN nut.gram.ngram[0] IS NOT NULL THEN concat(nut.gram.ngram[0]," ",nut.gram.ngram[1])
ELSE concat(pos.gram.ngram[0]," ",pos.gram.ngram[1]) -- fall back to the positive side when neg and nut are absent
END AS word,
CASE WHEN pos.gram.estfrequency IS NULL THEN 0 ELSE pos.gram.estfrequency END AS pos_count,
CASE WHEN neg.gram.estfrequency IS NULL THEN 0 ELSE neg.gram.estfrequency END AS neg_count,
CASE WHEN nut.gram.estfrequency IS NULL THEN 0 ELSE nut.gram.estfrequency END AS nut_count
FROM
bigrams_neg as neg FULL OUTER JOIN bigrams_nut as nut
on neg.gram.ngram = nut.gram.ngram
FULL OUTER JOIN bigrams_pos as pos
on pos.gram.ngram = coalesce(neg.gram.ngram, nut.gram.ngram) -- match pos against whichever of neg/nut is present
) as a
) as b
) as c where slno < 5000 and word is not null;
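The source tables bigrams_neg, bigrams_nut, and bigrams_pos are not listed in the report. A minimal sketch of how they could be built with Hive's sentences() and ngrams() functions (the train_data source, the 5000 cutoff, and the column shape are assumptions inferred from the query above):

CREATE TABLE bigrams_pos AS
SELECT explode(grams) AS gram      -- one row per struct <ngram:array<string>, estfrequency:double>
FROM (
  SELECT ngrams(sentences(lower(review_text)), 2, 5000) AS grams
  FROM train_data
  WHERE stars = 1                  -- repeat with stars = -1 / 0 for bigrams_neg / bigrams_nut
) t;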
12.2. Numerics
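-- Counts the number of vocabulary bigrams per class (star); used as the probability denominator in Bigrams_1.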
CREATE TABLE numberics AS
SELECT star, count(*) as count from bigrams
GROUP BY star;
12.3. Bigrams_1
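-- Converts each bigram's class frequency into a probability: frequency / class count.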
CREATE TABLE bigrams_1 AS
SELECT word, bi.star, frequency, frequency/count AS prob
FROM
bigrams as bi INNER JOIN numberics AS num
ON bi.star = num.star;
12.4. Bigrams_stag_1
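-- For each test review and each class, sums the probabilities of the review's bigrams found in the vocabulary.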
CREATE TABLE bigrams_stag_1 AS
SELECT
test_bi.review_id as review_id
, bi.star as star
, sum(prob) as prob_sum
FROM
bigrams_1 bi INNER JOIN test_data_bigrams_0 test_bi
on bi.word = test_bi.word
GROUP BY
test_bi.review_id, bi.star;
12.5. Bigrams_stag_2
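-- Keeps, for each test review, the highest per-class probability sum.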
CREATE TABLE bigrams_stag_2 AS
SELECT review_id, max(prob_sum) as prob_max FROM bigrams_stag_1
GROUP BY review_id;
12.6. Bigrams_test_1_1
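-- Joins the winning class back to the test data, pairing the predicted star (new_star) with the original star.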
CREATE TABLE bigram_test_1_1 AS
SELECT test.review_id, new_star, test.stars as original_star
FROM
(
SELECT
stag1.review_id AS review_id,
star AS new_star
FROM
bigrams_stag_1 AS stag1 INNER JOIN bigrams_stag_2 AS stag2
on
stag1.review_id = stag2.review_id
and stag1.prob_sum = stag2.prob_max
) a INNER JOIN test_data test
on a.review_id = test.review_id;
12.7. Stats
This gives final statistics which shows number of correctly classified instances and wrongly classified
instances.
SELECT
stats, COUNT(*)
FROM(
SELECT
CASE
WHEN new_star = original_star THEN 1
ELSE 0
END as stats,
new_star, original_star
FROM bigram_test_1_1) res
GROUP BY res.stats;
select original_star, count(*) from bigram_test_1_1 group by original_star;
select new_star, count(*) from bigram_test_1_1 group by new_star;
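From the same table, the overall accuracy can be computed directly; a small follow-up query (not part of the original list):

SELECT SUM(CASE WHEN new_star = original_star THEN 1 ELSE 0 END) / COUNT(*) AS accuracy
FROM bigram_test_1_1;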