Table of Contents
1. Project Vision
3. Hypothesis
4. Data Processing
4.1. Data
5. Feature Selection
5.1. Dataset
8.2. Decision Tree
11. Project Management
12. List of Queries
1. Project Vision
In today's fast-growing world there are many businesses: startups, growing companies, and well-established firms. For every business, its rating is vital to its survival in the market. This rating is given by the users who enjoy the goods and services of a business. A user expresses his experience with a business in the form of reviews and star ratings through many platforms, the most famous of which is Yelp. A review can be positive, negative, or neutral. The aim of our project is to build a classifier that classifies any given review into a star-rating label (-1, 0, or 1). We planned to use various data mining models to classify reviews into user star-rating labels, applying various model tuning techniques to attain optimal classification accuracy.
3. Hypothesis
Since this is classification based on text, the words in the reviews are the features we should consider in order to classify them correctly. For example, a review containing words like "good", "excellent", "awesome", or "yumm food" should be classified into the positive class label. We planned to concentrate on these words and apply transformations such as stop-word removal and stemming, with the help of different tools, to make the best possible use of those words so that reviews are classified correctly. We intended to concentrate mainly on Bayesian algorithms, as they perform well for text classification.
We also intended to use combinations of words called bigrams, for example "very good" or "yum yum". In general, a user's view of a business is expressed mostly in combinations of words, so we expected that using bigrams would give the model good accuracy.
Other features like business id and user id can improve accuracy when used individually, and they should not be used in combination.
We discuss in the results below how model learning is affected by each of the approaches in this hypothesis.
4. Data Processing
4.1. Data
We obtained the data from http://www.yelp.com/dataset_challenge. It contains 40,000 businesses, 1.3 million reviews, and 250,000 users. The data was in JSON format, and we pre-processed it and converted it into .CSV format to obtain the review text, class labels, and other features. Many irrelevant fields, such as neighborhoods and votes, were removed, and only the required ones were kept. Initially each review has a star-rating class label from 1 to 5; we reduced these labels to negative(-1) for 1 and 2 stars, neutral(0) for 3 stars, and positive(1) for 4 and 5 stars. The figure below shows the distribution of all the reviews over the modified class labels.
[Figure: graph showing the number of reviews per modified class label (NEGATIVE, NEUTRAL, POSITIVE); counts: 748,188; 213,509; 163,761]
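A minimal HiveQL sketch of this star-to-label reduction (illustrative only; the actual relabeling was done during preprocessing, and the reviews table shown here is the one defined in Section 5):

SELECT review_id,
       CASE
         WHEN stars <= 2 THEN -1   -- 1 and 2 stars become negative
         WHEN stars = 3  THEN 0    -- 3 stars become neutral
         ELSE 1                    -- 4 and 5 stars become positive
       END AS label
FROM reviews;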
The dataset consists of three types of JSON objects; a condensed view of the fields of interest:

{ 'type': 'business',
  'city': (city),
  'state': (state),
  'latitude': latitude,
  'longitude': longitude }

{ 'type': 'user',
  'average_stars': (floating point average, like 4.31),
  'elite': [(years_elite)],
  'compliments': { (compliment_type): (count) } }

{ 'type': 'review', ... }
4.4.2. Removing Special Characters, Numerics, and Other-Language Words
The main field of interest is review_text, the text a user entered as a review of a business, together with its stars. The text contains special characters, newline characters, and characters from other languages. These were removed with the PHP script below.
// Assumes an open MySQL connection (legacy mysql_* API) and $stopWords, an array of stop words.
$i = 0;
$optString = "";
$result = mysql_query("select business_id, user_id, review_id, text, stars, review_count from reviews");
while ($rows = mysql_fetch_array($result)) {
    $i++;
    // Keep only letters and spaces; drop numerics, special and other-language characters.
    $text = preg_replace("/[^a-z ]/i", " ", $rows['text']);
    $text = str_replace("\n", " ", $text);
    // Process text to remove stop words.
    $text = explode(" ", $text);
    $processed_text = "";
    foreach ($text as $s) {
        $s = strtolower($s);
        // Strict comparison (===): array_search() returns index 0 for the first
        // stop word, which a loose comparison (==) would wrongly treat as "not found".
        if ($s != "" && array_search($s, $stopWords) === false)
            $processed_text .= $s . " ";
    }
    $optString .= "'" . $rows['business_id'] . "',";
    $optString .= "'" . $rows['user_id'] . "',";
    $optString .= "'" . $rows['review_id'] . "',";
    $optString .= "'" . $processed_text . "',";
    $optString .= $rows['stars'] . ",";
    $optString .= $rows['review_count'];
    $optString .= "\n";
    // Flush the accumulated rows to the CSV file every 1000 records.
    if ($i % 1000 == 0) {
        $fd = fopen("reviews_DetailedStopWords.csv", "a+");
        fwrite($fd, $optString);
        fclose($fd);
        $optString = "";
        echo "$i\n";
    }
}
5. Feature Selection
5.1. Dataset
After data preprocessing, the dataset is in clean, structured CSV format with the required columns. This file is loaded into HDFS, and a `reviews` table is created over the dataset. This table is used for feature selection.
$> hadoop fs -put review.csv
$> hive
HIVE> create table reviews(business_id String, review_id String, user_id String, review_text String, stars int)
row format delimited
fields terminated by ','
lines terminated by '\n';
HIVE> load data inpath 'review.csv' into table reviews;
HIVE> select * from reviews LIMIT 10;
/* Displays list of columns in correct format */
In this phase we applied the NGRAM tokenizer, which converts the text into n-grams (unigrams, bigrams, trigrams). On top of this we applied the Attribute Selection filter, which uses the InfoGainAttributeEval function to evaluate the worth of each attribute by measuring its information gain with respect to the class; this gave 70.02% correctly classified instances. The data sample has 25,000 instances, of which 66% is training data and the remainder is test data.
[Figure: sample of the feature dataset with columns Business Id, User Id, Bigrams, and Review Text]
We then apply the model on the test data to classify the stars of each review as -1, 0, or 1, which implies a negative, neutral, or positive review. This model has shown an accuracy of 69.5%.
The queries Bigrams1, Numerics, Bigrams_1, Bigrams_stag_1, and Bigrams_stag_2 (listed in Section 12) were used to design the model. This is the final model after tuning. Below is an example of the model implementation, showing classification with unigrams.
Training Data

Word        Frequency   Star   Probability
Good        100         1      0.33
Excellent   50          1      0.16
Bad         100         -1     0.5
Good        10          -1     0.05
Nice        15          0      0.1

Class totals

Star   Total Count
1      300
-1     200
0      150
The queries bigrams_test_1_1 and stats (Section 12) are used to compute the results. In the example above, the probability of each word is calculated from the frequency of the word and the total word count of its class. The word "Good" occurs under both the 1 and -1 star ratings; based on its probabilities, "Good" is classified as +1 star. Below is how a review is classified based on its text.
Test Data

Review_id   Word    Count in reviews   New_Star   Original_Star
1           Good    10                 1          1
2           Bad     30                 -1         -1
2           Worst   40                 -1         -1
For review id 1, the counts of its words are considered and the review is classified with star rating 1. Similarly, review id 2 is classified with star rating -1.
I considered only businesses with at least 10 reviews and users with at least 10 reviews. I've divided the data at each business and user level. For example, if there are 100 reviews for a specific business, then 66 reviews go to the training data and the rest to the test data when I consider business id as a feature, and this division happens for every business id. A similar process is repeated for user id, and also when the features business id and user id are considered together, as sketched below.
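One way this per-business 66/34 split could be expressed in Hive (a sketch only, not the query actually used in the report; the train_data table name is an assumption):

CREATE TABLE train_data AS
SELECT business_id, user_id, review_id, review_text, stars
FROM (
  SELECT r.*,
         row_number() OVER (PARTITION BY business_id ORDER BY review_id) AS rn,  -- position within this business
         count(*)     OVER (PARTITION BY business_id) AS n                       -- total reviews for this business
  FROM reviews r
) t
WHERE rn <= 0.66 * n;   -- first 66% of each business's reviews go to training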
In Hive, I was not able to produce an ROC measure for the result metrics.
f) It is difficult to design and maintain this model: any change requires many implications to be considered; for example, if a query is changed, what is the effect on the result? One has to be very careful while making changes.
g) There are a lot of predefined functions in HIVE, and new extensions can be easily accommodated by writing a UDF (User Defined Function).
Action                                  Naïve Bayes   Naïve Bayes Multinomial   ROC for NBM
1  Initial Dataset                      44%           46%                       0.47
2  Refining of Training and Test Data   +7%           N/A                       0.54
3  Change of Stemmer                    +0.5%         N/A                       0.55
4  Ngrams: Unigrams                     +3%           +3.5%                     0.59
   Ngrams: Bigrams                      +7%           +7.5%                     0.68
   Ngrams: Trigrams                     +2%           +2%                       0.60
5  Business id and User id              +2%           +1%                       0.70
   Business id                          +2%           +3%                       0.74
   User id                              +5%           +5%                       0.74
6  Bag of words                         +5%           +4%                       0.75
7  Overall Accuracy on 100,000 records  ~73%          ~76%                      0.78
8  Accuracy on complete dataset         ~75%          N/A                       N/A

(Values after row 1 are changes in accuracy relative to the previous step.)

Discussion of results:
1. In this stage no filters are applied; these are the initial model results.
2. The default set of training data I selected was skewed positive, so sampling the training data helped me increase the accuracy of my model. In Naïve Bayes Multinomial, randomization is handled automatically by WEKA (using its randomizer).
3. Changing the stemmer from the Lovins stemmer to the Porter stemmer showed a slight increase in accuracy. The Lovins stemmer is not in the default stemmers list in WEKA, so it could not be used with Naïve Bayes Multinomial.
4. Of the three n-gram settings, bigrams gave me the optimal results in both models, so I went ahead and implemented bigrams.
5. Using business id and user id together in the probability model reduced accuracy. This might be because the full outer join, which joins records on business id and user id, searches for specific instances, and instances present in the training dataset might not be available in the test dataset; in this case Naïve Bayes Multinomial gave the better result. Used separately, business id and user id each increased accuracy, and accuracy was highest when user id alone was considered. From this I understand it is similar to per-user sentiment analysis, because a user will use the same sort of text to express his feelings; I also saw that a small number of users wrote a large number of reviews, so using user id as a feature definitely explains the increase in accuracy.
6. Initially I used 1000 words for the frequency count, but using 3000 words increased accuracy.
8. I was not able to fit the complete dataset in WEKA memory, even after allocating 6GB to WEKA.
The probability of a review d belonging to a class c is computed as

P(c | d) ∝ P(c) * Π(1 ≤ k ≤ nd) P(tk | c)

where P(tk | c) is the conditional probability of a term tk occurring in a review of class c. We interpret P(tk | c) as a measure of how much evidence tk contributes to c being the correct class. P(c) is the prior probability of a review occurring in class c. If a review's terms do not provide clear evidence for one class versus another, we choose the class that has the higher prior probability.
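In other words, a review is assigned to the class with the maximum a posteriori probability; stated explicitly (the standard multinomial Naive Bayes decision rule):

cmap = argmax over c of [ P(c) * Πk P(tk | c) ]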
We used WEKA to implement the model. Initially, the dataset containing the features review text, business id, review id, user id, and the class label is fed to the tool. Preprocessing is done, and the features are converted into word vectors or n-grams based on the filters applied. The model is then run on them.
Naive Bayes Multinomial

Step               Accuracy (%)   ROC
Initial            48             0.46
Stopwords          53             0.51
Stemmer            54             0.52
Unigrams           59             0.58
                   65             0.64
                   65.23          0.64
Bigrams            72             0.73
Trigrams           63             0.62
User id            74             0.75
Business id        74             0.75
                   78             0.79
                   79             0.81
Overall Accuracy   79.49          0.83
Observation: In the Naive Bayes Multinomial model, varying the minimum term frequency and using the bigrams feature improved performance drastically. The reason is that the data contains many bigrams, and these settings concentrate the model on the most frequent words in the reviews.
6. Increasing the WordsToKeep count from 1000 to 5000 or 10000 depending on the size of the
dataset.
Naive Bayes Multinomial Text

Step               Accuracy (%)   ROC
Initial            54             0.53
Stopwords          56             0.57
Stemmer            59             0.60
Unigrams           61             0.62
                   66             0.70
                   65             0.64
Bigrams            73             0.75
Trigrams           64             0.66
User id            77             0.79
Business id        77             0.79
Overall Accuracy   79.6           0.84
Observation: The same reasons given for the Naive Bayes Multinomial model are responsible for the drastic increase in the accuracy of this model. From the comparison of results above, we can say that the Naïve Bayes Multinomial Text model has slightly higher accuracy (by about 0.11%) than the Naïve Bayes Multinomial model. The reason could be that the Naïve Bayes Multinomial Text model carries out some extra processing, which gives it slightly higher accuracy than the Naïve Bayes Multinomial model.
The value of k should be chosen according to the data: a larger k reduces the effect of noise on the classification, but makes the boundaries between the classes less distinct.
Observation: The results obtained are best when k = 5. The reason could be that an odd k avoids ties: when classifying into more than two groups, or when using an even value of k, it might be necessary to break a tie in the number of nearest neighbors.
Comparing the KNN-specific parameters, namely the Euclidean and Manhattan distance metrics, we observed that Euclidean distance gave better results.
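For reference, the two distance metrics compared are the standard definitions over feature vectors x and y:

Euclidean: d(x, y) = sqrt( Σi (xi - yi)^2 )
Manhattan: d(x, y) = Σi |xi - yi|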
Parameter      Value   Accuracy   ROC
Binary Split   TRUE    71.54      0.72
numFolds       10      69.93      0.73
useLaplace     TRUE    69.98      0.82
Observation:
For the decision tree, the best results are obtained when the Laplace parameter is set to true, which increased the ROC to 0.819.
Varying the minimum term frequency parameter drastically affected performance: words which are not repeated frequently, and which are not useful for classification, are ignored.
The review text mostly contains bigrams like "very good" and "feeling awesome", so using the bigram feature for classifying the reviews helped a lot.
Using additional features like user id and business id increased performance. For example, if a user gives mostly positive reviews across different businesses, the next review given by him for any other business is most likely positive; and if a business has mostly positive reviews, the next incoming review is most likely positive. For these features to work, a user's reviews must be present in both the training and the test set, and the same holds for businesses.
The Naïve Bayes Multinomial model has almost the same accuracy as the above model for the same reasons, but the Multinomial Naïve Bayes Text model performs some extra processing which increases its accuracy.
10. Conclusion
The graph above clearly shows that Naïve Bayes Multinomial Text gives the optimal accuracy of ~80%. It can be observed that, going from left to right, the Naïve Bayes models started with accuracies of 40% and 55%, whereas KNN and Decision Tree started with better accuracy; we therefore expected better results from those models, but they did not turn out as expected. Accuracy increased as we included features and filters. Among the n-grams, bigrams showed the best results, so we used bigrams for further processing. Among the features business id, user id, and both together, we observed the best accuracy when only user id was considered. For KNN, accuracy was good with k=5, and using KNN-specific settings like Euclidean distance raised the results considerably. For the Decision Tree, setting the Laplace value increased the ROC. Including these model-specific settings improved the accuracy of those two models slightly.
Overall, Naïve Bayes Multinomial Text classification gave good accuracy when compared with the other models.
11. Project Management
11.2. Self-Assessment:
- Everyone on the team contributed equally. There was no total dependency on, or delay from, anyone in the team.
- Everyone was equally active and enthusiastic to learn something new.
- Before taking any decision, we made sure that everyone was clear about the requirements and expected output. We followed a process of Knowledge Transfer and Reverse Knowledge Transfer to make sure that everyone was on the same page.
- There were a lot of discussions in the initial phase of the project so that everything would go without any hurdles at the end.
- Everyone on the team has decent knowledge of different tools and technologies, such as Pentaho Data Integration, WEKA, MySQL, Java, PHP, and Big Data components, so whenever any sort of decision had to be made, there was always someone to address it.
- Everyone used the Asana project management tool actively.
- Domain knowledge.
- Increasing awareness and usability of tools among all the members of the team.
12. List of Queries
12.1. Bigrams1
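-- Builds the bigram vocabulary: for each bigram, keep the class (star) in which it is
-- most frequent, and retain only the highest-frequency bigrams (slno < 5000).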
CREATE TABLE bigrams as
SELECT word, star, frequency
FROM
(
SELECT word, star, frequency, rank() over(order by frequency desc) as slno
FROM
(
SELECT word,
CASE
WHEN pos_count >= neg_count AND pos_count >= nut_count THEN 1
WHEN neg_count >= nut_count THEN -1
ELSE 0
END AS star,
CASE
WHEN pos_count >= neg_count AND pos_count >= nut_count THEN pos_count
WHEN neg_count >= nut_count THEN neg_count
ELSE nut_count
END AS frequency
FROM (
SELECT
distinct
CASE WHEN neg.gram.ngram[0] IS NOT NULL THEN concat(neg.gram.ngram[0]," ",neg.gram.ngram[1])
WHEN nut.gram.ngram[0] IS NOT NULL THEN concat(nut.gram.ngram[0]," ",nut.gram.ngram[1])
ELSE concat(pos.gram.ngram[0]," ",pos.gram.ngram[1]) -- fall back to the positive side when neg and nut are absent
END AS word,
CASE WHEN pos.gram.estfrequency IS NULL THEN 0 ELSE pos.gram.estfrequency END AS pos_count,
CASE WHEN neg.gram.estfrequency IS NULL THEN 0 ELSE neg.gram.estfrequency END AS neg_count,
CASE WHEN nut.gram.estfrequency IS NULL THEN 0 ELSE nut.gram.estfrequency END AS nut_count
FROM
bigrams_neg as neg FULL OUTER JOIN bigrams_nut as nut
on neg.gram.ngram = nut.gram.ngram
FULL OUTER JOIN bigrams_pos as pos
on pos.gram.ngram = coalesce(neg.gram.ngram, nut.gram.ngram) -- match pos against whichever of neg/nut is present
) as a
) as b
) as c where slno < 5000 and word is not null;
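The source tables bigrams_neg, bigrams_nut, and bigrams_pos are not listed in the report. A minimal sketch of how they could be built with Hive's sentences() and ngrams() functions (the train_data source, the 5000 cutoff, and the column shape are assumptions inferred from the query above):

CREATE TABLE bigrams_pos AS
SELECT explode(grams) AS gram      -- one row per struct <ngram:array<string>, estfrequency:double>
FROM (
  SELECT ngrams(sentences(lower(review_text)), 2, 5000) AS grams
  FROM train_data
  WHERE stars = 1                  -- repeat with stars = -1 / 0 for bigrams_neg / bigrams_nut
) t;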
12.2. Numerics
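-- Counts the number of vocabulary bigrams per class (star); used as the probability denominator in Bigrams_1.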
CREATE TABLE numberics AS
SELECT star, count(*) as count from bigrams
GROUP BY star;
12.3. Bigrams_1
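-- Converts each bigram's class frequency into a probability: frequency / class count.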
CREATE TABLE bigrams_1 AS
SELECT word, bi.star, frequency, frequency/count AS prob
FROM
bigrams as bi INNER JOIN numberics AS num
ON bi.star = num.star;
12.4. Bigrams_stag_1
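-- For each test review and each class, sums the probabilities of the review's bigrams found in the vocabulary.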
CREATE TABLE bigrams_stag_1 AS
SELECT
test_bi.review_id as review_id
, bi.star as star
, sum(prob) as prob_sum
FROM
bigrams_1 bi INNER JOIN test_data_bigrams_0 test_bi
on bi.word = test_bi.word
GROUP BY
test_bi.review_id, bi.star;
12.5. Bigrams_stag_2
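-- Keeps, for each test review, the highest per-class probability sum.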
CREATE TABLE bigrams_stag_2 AS
SELECT review_id, max(prob_sum) as prob_max FROM bigrams_stag_1
GROUP BY review_id;
12.6. Bigrams_test_1_1
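-- Joins the winning class back to the test data, pairing the predicted star (new_star) with the original star.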
CREATE TABLE bigram_test_1_1 AS
SELECT test.review_id, new_star, test.stars as original_star
FROM
(
SELECT
stag1.review_id AS review_id,
star AS new_star
FROM
bigrams_stag_1 AS stag1 INNER JOIN bigrams_stag_2 AS stag2
on
stag1.review_id = stag2.review_id
and stag1.prob_sum = stag2.prob_max
) a INNER JOIN test_data test
on a.review_id = test.review_id;
12.7. Stats
This gives final statistics which shows number of correctly classified instances and wrongly classified
instances.
SELECT
stats, COUNT(*)
FROM(
SELECT
CASE
WHEN new_star = original_star THEN 1
ELSE 0
END as stats,
new_star, original_star
FROM bigram_test_1_1) res
GROUP BY res.stats;
select original_star, count(*) from bigram_test_1_1 group by original_star;
select new_star, count(*) from bigram_test_1_1 group by new_star;
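From the same table, the overall accuracy can be computed directly; a small follow-up query (not part of the original list):

SELECT SUM(CASE WHEN new_star = original_star THEN 1 ELSE 0 END) / COUNT(*) AS accuracy
FROM bigram_test_1_1;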