
Mandrill Tweet Analysis

Agenda
• Introduction
• Approach
– Removing Extraneous Punctuation
– Splitting on Spaces
– Counting Tokens and Calculating Probabilities
– Model formulation and usage
• Advantages and Disadvantages of Naïve Bayes
Introduction
• Mandrill is a transactional e-mail
product for software developers
• Sends one-to-one e-mails such as receipts
and password resets
• Can track whether the e-mails have
been opened and viewed

• Twitter API used to get tweets mentioning
the product Mandrill
• The first tweet is about Spark Mandrill from a
Super Nintendo game -- Irrelevant
• The second tweet is about a band called
Mandrill -- Irrelevant
• The third tweet is about the product
Mandrill -- Relevant
Removing Extraneous Punctuation
• Before creating a bag of words from a tweet, we
lowercase everything and replace most punctuation
with spaces.
• For lowercasing – LOWER(A2)
• For removing extraneous punctuation, each formula
reads the previous column's output:
• SUBSTITUTE(B2,". "," ")
• SUBSTITUTE(C2,": "," ")
• SUBSTITUTE(D2,"?"," ")
• SUBSTITUTE(E2,"!"," ")
• SUBSTITUTE(F2,";"," ")
• SUBSTITUTE(G2,","," ")
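The cleanup steps above can be sketched in Python (the function name is illustrative, not part of the spreadsheet):

```python
def clean_tweet(tweet: str) -> str:
    """Lowercase a tweet and replace most punctuation with spaces.

    Mirrors the spreadsheet formulas: "." and ":" are replaced only
    when followed by a space (so URLs like http://... survive),
    while "?", "!", ";" and "," are always replaced.
    """
    text = tweet.lower()
    for punct in (". ", ": "):   # only punctuation followed by a space
        text = text.replace(punct, " ")
    for punct in "?!;,":         # always replaced
        text = text.replace(punct, " ")
    return text
```

Note that replacing "! " with two characters' worth of spaces can leave double spaces; these are harmless because empty tokens are dropped when splitting.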
Splitting on Spaces
• To count how many times each word is used, we need all tweet
words in a single column.
• Assuming a maximum of 30 words per tweet, copy-paste the 150
processed tweets into 4,500 rows (150 × 30).
• First word of the first tweet in row 2, second word in row 152,
and so on
• Column B indicates the position of each successive space between
words:
• Place 0 in B2:B151
• From B152 down: =IFERROR(FIND(" ",A152,B2+1),LEN(A152)+1)
• Column C extracts single tokens from tweets:
• =IFERROR(MID(A2,B2+1,B152-B2-1),".")
• Column D gives the length of each token:
• =LEN(C2)
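The same split-and-pad layout can be sketched in Python (function and parameter names are illustrative):

```python
def split_tweet(tweet: str, max_words: int = 30) -> list[str]:
    """Split a cleaned tweet on spaces, padding with "." up to
    max_words tokens so every tweet fills the same number of rows,
    as in the 4,500-row spreadsheet layout."""
    words = [w for w in tweet.split(" ") if w]  # drop empty tokens
    return words + ["."] * (max_words - len(words))
```

The "." padding plays the role of the IFERROR fallback: rows past a tweet's last word hold a single-character token that the length filter later discards.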
Counting Tokens and Calculating Probabilities
For app tokens and non-app tokens:
Before we can find the conditional probability of the tokens, we need to follow
these steps:
1. Create a pivot table of token counts.
2. Uncheck the lengths 0, 1, 2 and 3 to remove the smaller tokens, which add
essentially no value (here, for simplicity, we do not remove stop words).
3. Apply additive (Laplace) smoothing: add 1 to the count of every token. This
accommodates the assumption of having seen each rare word (if any) at
least once.
4. Sum the new column (Column C) created by adding one to each count; this
gives the new grand total token count.
5. Find the probability of each token in a column (Column D):
Probability of the token = (value of Column C)/(grand total token
count)
6. Take the natural log of the probability (Column E):
LN(P) = LN(value in Column D)
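Steps 3–6 can be sketched in Python, one call per class (the function name is illustrative):

```python
import math
from collections import Counter

def token_log_probs(tokens: list[str]) -> dict[str, float]:
    """Add-one (additive) smoothing over one class's token counts,
    returning natural-log probabilities, as in Columns C-E."""
    smoothed = {tok: count + 1 for tok, count in Counter(tokens).items()}
    grand_total = sum(smoothed.values())  # new grand total token count
    return {tok: math.log(count / grand_total)
            for tok, count in smoothed.items()}
```

Tokens absent from the training set are not in this table; they are handled at prediction time with the 1/(grand total) fallback described in the next section.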
Model formulation and usage
Take the test set of 20 tweets: 10 about the app and 10 others:
1. Remove the extraneous punctuation from the tweets.
2. Convert the prepped tweet text to columns using the Text to Columns wizard.
3. Apply VLOOKUP to each tweet token to get its probability of
occurrence in app tweets and in other tweets.
1. To handle rare words, wrap the VLOOKUP in ISNA; for a missing
token, the probability of occurrence is taken as 1/(total token count).
2. Wrap the whole formula in an IF statement that checks the token
length, to filter out tokens of length 3 or less.
4. Calculate the sum of the log conditional probabilities for app and other.
5. Classify each test tweet by comparing the probability sums assigned by the
two sets (app and other).
6. Compare the original class value and predicted class value to check the
accuracy of the model developed.
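The scoring and classification steps can be sketched as follows (function names and the length cutoff mirror the spreadsheet, but are otherwise illustrative):

```python
import math

def tweet_score(tokens, log_probs, grand_total):
    """Sum log probabilities over a tweet's tokens (length > 3 only);
    unseen tokens fall back to ln(1/grand_total), mirroring the
    ISNA branch of the VLOOKUP formula."""
    rare = math.log(1 / grand_total)
    return sum(log_probs.get(tok, rare)
               for tok in tokens if len(tok) > 3)

def classify(tokens, app_lp, app_total, other_lp, other_total):
    """Label a tweet "app" or "other" by comparing the two sums."""
    if tweet_score(tokens, app_lp, app_total) > \
       tweet_score(tokens, other_lp, other_total):
        return "app"
    return "other"
```

Summing logs instead of multiplying raw probabilities avoids numerical underflow when many small probabilities are combined, which is why the model stores LN(P) rather than P.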
Naïve Bayes Algorithm

Advantages
• Easy to implement
• Fast
• If the independence assumption holds, it works more efficiently than other algorithms
• Requires less training data
• Highly scalable
• Best suited for text classification problems

Disadvantages
• The strong assumption that the features are independent, which is hardly true in real-life applications
• Chances of loss of accuracy
THANK YOU
