I'm Jacob

Data science graduate student by day, pretty much the same thing at night.

Last Updated: 02/2017

Poor Donald - his tweets keep getting more negative
10 Feb 2017

Last summer, David Robinson did this interesting text analysis of Donald Trump's tweets and found
that the angrier ones came from Android (which Trump is known to use). But he didn't consider
how Trump's emotional state varies over time, and he certainly couldn't have considered what
impact the election and recent events would have on Trump.

Using the twitteR package and the tidyverse ecosystem (plus tidytext), this is actually a very simple
analysis.

For starters, pulling Trump's recent tweets is very simple (a timeline request returns at most
roughly the last 3,200 tweets; the call below asks for 3,100):

library(twitteR)
library(tidyverse)
library(tidytext)

source("~/twitter_key.R")

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)


## [1] "Using direct authentication"
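
The post does not show what twitter_key.R holds. A plausible sketch, assuming it simply defines the four variables passed to setup_twitter_oauth() (the values here are placeholders, and the file layout is a guess rather than the actual file):

```r
# Hypothetical contents of ~/twitter_key.R. The variable names match the
# arguments passed to setup_twitter_oauth(); the values are placeholders.
api_key             <- "your-consumer-key"
api_secret          <- "your-consumer-secret"
access_token        <- "your-access-token"
access_token_secret <- "your-access-token-secret"
```

Keeping the keys in a separate file outside the project means they never end up in the published post or its repository.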

trump <- userTimeline("realDonaldTrump",
                      n = 3100,
                      includeRts = TRUE,
                      excludeReplies = FALSE) %>%
  twListToDF() %>%
  as_tibble()

And then we have a tidy tibble with Trump's tweets:

glimpse(trump)

## Observations: 3,099
## Variables: 16
## $ text <chr> "Heading to Joint Base Andrews on #MarineOne wit...
## $ favorited <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ favoriteCount <dbl> 77699, 85576, 71312, 220083, 64348, 84125, 62284...
## $ replyToSN <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ created <time> 2017-02-10 23:24:51, 2017-02-10 13:35:50, 2017-...
## $ truncated <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, ...
## $ replyToSID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ id <chr> "830195857530183684", "830047626414477312", "830...
## $ replyToUID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ statusSource <chr> "<a href=\"http://twitter.com/download/iphone\" ...
## $ screenName <chr> "realDonaldTrump", "realDonaldTrump", "realDonal...
## $ retweetCount <dbl> 21473, 19779, 15069, 64363, 10082, 14185, 11294,...
## $ isRetweet <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ retweeted <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ longitude <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ latitude <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

Using tidytext, it is straightforward to unnest and tokenize the words in the body of the tweets:

words <- trump %>%
  select(id, statusSource, retweetCount, favoriteCount, created, isRetweet,
         text) %>%
  unnest_tokens(word, text)
words

## # A tibble: 57,239 x 7
## id
## <chr>
## 1 830195857530183684
## 2 830195857530183684
## 3 830195857530183684
## 4 830195857530183684
## 5 830195857530183684
## 6 830195857530183684
## 7 830195857530183684
## 8 830195857530183684
## 9 830195857530183684
## 10 830195857530183684
## # ... with 57,229 more rows, and 6 more variables: statusSource <chr>,
## # retweetCount <dbl>, favoriteCount <dbl>, created <time>,
## # isRetweet <lgl>, word <chr>
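
To make the tokenizing step concrete, here is a minimal sketch on a one-row toy data frame. The text is made up for illustration (it is not one of the pulled tweets), and the names toy and toy_words are hypothetical:

```r
library(dplyr)
library(tidytext)

# A made-up one-row stand-in for the `trump` tibble (not a real tweet).
toy <- data.frame(
  id   = "1",
  text = "The media is SO dishonest. Sad!",
  stringsAsFactors = FALSE
)

# unnest_tokens() splits the text column into one lowercase word per row
# and drops punctuation; the other columns are repeated for every word.
toy_words <- toy %>%
  unnest_tokens(word, text)

toy_words$word
```

Each word ends up on its own row with the tweet's id repeated, which is why the words tibble above has far more rows than there are tweets.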

Given what David Robinson found, we might want to convert the statusSource variable into a flag for
whether it was posted via an Android device:

words <- words %>%
  mutate(android = stringr::str_detect(statusSource, "Android")) %>%
  select(-statusSource)
words

## # A tibble: 57,239 x 7
## id retweetCount favoriteCount created
## <chr> <dbl> <dbl> <time>
## 1 830195857530183684 21473 77699 2017-02-10 23:24:51
## 2 830195857530183684 21473 77699 2017-02-10 23:24:51
## 3 830195857530183684 21473 77699 2017-02-10 23:24:51
## 4 830195857530183684 21473 77699 2017-02-10 23:24:51
## 5 830195857530183684 21473 77699 2017-02-10 23:24:51
## 6 830195857530183684 21473 77699 2017-02-10 23:24:51
## 7 830195857530183684 21473 77699 2017-02-10 23:24:51
## 8 830195857530183684 21473 77699 2017-02-10 23:24:51
## 9 830195857530183684 21473 77699 2017-02-10 23:24:51
## 10 830195857530183684 21473 77699 2017-02-10 23:24:51
## # ... with 57,229 more rows, and 3 more variables: isRetweet <lgl>,
## # word <chr>, android <lgl>
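
As a quick check of how that flag behaves, here is a small sketch on two example statusSource strings. Both strings are assumptions: the first follows the pattern visible in the glimpse() output, the second is an assumed Android analogue.

```r
library(stringr)

# Assumed example values for statusSource (not copied from the data).
status_source <- c(
  "<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>",
  "<a href=\"http://twitter.com/download/android\" rel=\"nofollow\">Twitter for Android</a>"
)

# str_detect() is case-sensitive, so with these strings the match comes from
# the "Twitter for Android" label, not the lowercase "android" in the URL.
android <- str_detect(status_source, "Android")
android
```

If the source strings only ever contained the lowercase form, the pattern would need to be wrapped in stringr::regex("android", ignore_case = TRUE) to match.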

Let's now code the tweets using the AFINN sentiment lexicon:

words <- words %>%
  inner_join(get_sentiments("afinn"))

## Joining, by = "word"

words

## # A tibble: 5,093 x 8
## id retweetCount favoriteCount created
## <chr> <dbl> <dbl> <time>
## 1 830047626414477312 19779 85576 2017-02-10 13:35:50
## 2 830047626414477312 19779 85576 2017-02-10 13:35:50
## 3 830042498806460417 15069 71312 2017-02-10 13:15:27
## 4 829721019720015872 10082 64348 2017-02-09 15:58:01
## 5 829721019720015872 10082 64348 2017-02-09 15:58:01
## 6 829689436279603206 14185 84125 2017-02-09 13:52:31
## 7 829689436279603206 14185 84125 2017-02-09 13:52:31
## 8 829689436279603206 14185 84125 2017-02-09 13:52:31
## 9 829689436279603206 14185 84125 2017-02-09 13:52:31
## 10 829689436279603206 14185 84125 2017-02-09 13:52:31
## # ... with 5,083 more rows, and 4 more variables: isRetweet <lgl>,
## # word <chr>, android <lgl>, score <int>
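
The drop from 57,239 rows to 5,093 happens because inner_join() keeps only the words that appear in the lexicon. A minimal sketch of that behaviour, and of the per-tweet averaging used in the plots, with a hand-made toy lexicon (the scores are illustrative and not looked up from AFINN; all names are hypothetical):

```r
library(dplyr)

# Toy token table: two made-up tweets, one word per row.
toy_words <- data.frame(
  id   = c("1", "1", "1", "2", "2"),
  word = c("the", "great", "win", "so", "bad"),
  stringsAsFactors = FALSE
)

# Hand-made stand-in for the lexicon; scores are illustrative only.
toy_lexicon <- data.frame(
  word  = c("great", "win", "bad"),
  score = c(3L, 4L, -3L),
  stringsAsFactors = FALSE
)

# Words without a lexicon entry ("the", "so") are dropped by the join.
scored <- inner_join(toy_words, toy_lexicon, by = "word")

# One sentiment value per tweet: the mean score of its matched words.
per_tweet <- scored %>%
  group_by(id) %>%
  summarize(sentiment = mean(score))

per_tweet
```

One caveat, as far as I know: more recent tidytext releases fetch AFINN through the textdata package and name the column value rather than score, so the code in this post may need that rename on a current installation.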

And now let's see how the typical sentiment of those tweets has varied from April 2016 (the midst of
the Republican primaries) to the present:

words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%
  ggplot(aes(x = created, y = sentiment)) +
  geom_smooth() +
  geom_vline(xintercept = as.numeric(as.POSIXct("2017-01-20"))) +
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-11-08"))) +
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-05-03"))) +
  labs(x = "Date", y = "Mean Afinn Sentiment Score")

The vertical lines denote the date he became the presumptive Republican nominee (May 3rd 2016), the
date of the election (Nov 8th 2016) and inauguration day (Jan 20th 2017). Things aren't looking up
for Trump. He seems to be more angry/sad/negative now than at any prior point during the past year.

What if we group the tweets by whether or not they came from Android?

words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created, android) %>%
  summarize(sentiment = mean(score)) %>%
  ggplot(aes(x = created, y = sentiment, color = android)) +
  geom_smooth() +
  geom_vline(xintercept = as.numeric(as.POSIXct("2017-01-20"))) +
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-11-08"))) +
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-05-03"))) +
  labs(x = "Date", y = "Mean Afinn Sentiment Score")

We see the general trend that David Robinson identified - the Android tweets tended to be more
negative than those from the other platforms. It is interesting that, right before the election, they
were more positive than the tweets presumed to be by staff. Also, we can see the non-Android tweets
were more positive during the transition than the Android tweets, which clearly became more negative.
Perhaps the limits of Presidential powers are stricter than he expected. It is interesting that the
Android tweets are now more negative than positive, the first time this has occurred.

Interestingly, there seems to be no effect of being positive/negative on the number of retweets

words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created, android) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  ggplot(aes(x = sentiment, y = retweetCount, color = android)) +
  geom_smooth() +
  geom_point() +
  scale_y_log10() +
  labs(x = "Mean Afinn Sentiment Score", y = "Number of Retweets")

## Joining, by = "id"

or the number of favorites

words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created, android) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  ggplot(aes(x = sentiment, y = favoriteCount, color = android)) +
  geom_smooth() +
  geom_point() +
  scale_y_log10() +
  labs(x = "Mean Afinn Sentiment Score", y = "Number of Favorites")

## Joining, by = "id"

that a tweet gets.

Regression analysis of the Android tweets suggests that a negative tweet gets significantly more
retweets, but also that this effect wears off (very, very slightly) with time:

words %>%
  filter(isRetweet == FALSE, android) %>%
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  lm(log(retweetCount) ~ created * (sentiment < 0), data = .) %>%
  summary()

## Joining, by = "id"

##
## Call:
## lm(formula = log(retweetCount) ~ created * (sentiment < 0), data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7744 -0.3806 0.0005 0.3576 3.2661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.012e+02 3.942e+00 -25.679 < 2e-16 ***
## created 7.488e-08 2.680e-09 27.939 < 2e-16 ***
## sentiment < 0TRUE 1.959e+01 6.086e+00 3.219 0.00132 **
## created:sentiment < 0TRUE -1.313e-08 4.135e-09 -3.175 0.00154 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5923 on 1198 degrees of freedom
## Multiple R-squared: 0.5195, Adjusted R-squared: 0.5183
## F-statistic: 431.7 on 3 and 1198 DF, p-value: < 2.2e-16
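
The coefficients look tiny because created is a date-time, which R stores as seconds since 1970-01-01. A rough back-of-the-envelope sketch of what they imply, using the rounded values printed above (the helper negative_effect and the UTC time zone are assumptions made for illustration; the original model used whatever time zone the session was in):

```r
# Coefficients copied from the summary() output above (rounded as printed).
b_created     <- 7.488e-08   # change in log(retweets) per second
b_negative    <- 1.959e+01   # shift for tweets with mean sentiment < 0
b_interaction <- -1.313e-08  # change in that shift per second

# Time trend expressed per day instead of per second.
per_day <- b_created * 60 * 60 * 24

# Hypothetical helper: estimated log-scale effect of a negative tweet
# at a given date (UTC assumed).
negative_effect <- function(date) {
  t <- as.numeric(as.POSIXct(date, tz = "UTC"))
  b_negative + b_interaction * t
}

effect_may_2016 <- negative_effect("2016-05-03")
effect_feb_2017 <- negative_effect("2017-02-10")

# Back-transform to a multiplicative effect on the retweet count.
exp(c(effect_may_2016, effect_feb_2017))
```

Because the main effect and the interaction term nearly cancel, the result is sensitive to the rounding of the printed coefficients, so the numbers should be read as approximate. The direction is what matters: the boost for negative tweets is positive but shrinking over the period.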

A similar pattern exists for the number of favorites:

words %>%
  filter(isRetweet == FALSE, android) %>%
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  lm(log(favoriteCount) ~ created * (sentiment < 0), data = .) %>%
  summary()

## Joining, by = "id"

##
## Call:
## lm(formula = log(favoriteCount) ~ created * (sentiment < 0),
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75782 -0.35691 -0.00795 0.33800 2.48914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.176e+02 3.452e+00 -34.068 < 2e-16 ***
## created 8.689e-08 2.347e-09 37.020 < 2e-16 ***
## sentiment < 0TRUE 1.435e+01 5.329e+00 2.692 0.00721 **
## created:sentiment < 0TRUE -9.648e-09 3.621e-09 -2.664 0.00781 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5187 on 1198 degrees of freedom
## Multiple R-squared: 0.6525, Adjusted R-squared: 0.6517
## F-statistic: 749.9 on 3 and 1198 DF, p-value: < 2.2e-16

The positive and negative words used in the Android postings varied between the period before the
election, the transition, and the time after Trump was sworn in:

words %>%
  filter(android) %>%
  mutate(phase = ifelse(as.POSIXct("2016-11-08") > created, "Before the election",
                 ifelse(as.POSIXct("2017-01-20") > created, "Transition",
                        "In the White House"))) %>%
  group_by(phase, pos_sentiment = score >= 0, word) %>%
  count() %>%
  group_by(phase, pos_sentiment) %>%
  filter(word != "no") %>%
  top_n(3, wt = n) %>%
  arrange(pos_sentiment, phase, desc(n))

## Source: local data frame [18 x 4]
## Groups: phase, pos_sentiment [6]
##
## phase pos_sentiment word n
## <chr> <lgl> <chr> <int>
## 1 Before the election FALSE bad 62
## 2 Before the election FALSE dishonest 27
## 3 Before the election FALSE rigged 25
## 4 In the White House FALSE bad 10
## 5 In the White House FALSE fake 10
## 6 In the White House FALSE ban 6
## 7 Transition FALSE bad 13
## 8 Transition FALSE wrong 11
## 9 Transition FALSE dishonest 10
## 10 Before the election TRUE great 175
## 11 Before the election TRUE thank 69
## 12 Before the election TRUE big 54
## 13 In the White House TRUE great 8
## 14 In the White House TRUE big 5
## 15 In the White House TRUE win 5
## 16 Transition TRUE great 56
## 17 Transition TRUE big 16
## 18 Transition TRUE win 14
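
The nested ifelse() that assigns the three phases can be hard to read. An equivalent sketch using dplyr's case_when(), shown on made-up timestamps (the names toy, election and inauguration, and the UTC time zone, are assumptions for illustration):

```r
library(dplyr)

# Cut-off dates used in the post; UTC is assumed here.
election     <- as.POSIXct("2016-11-08", tz = "UTC")
inauguration <- as.POSIXct("2017-01-20", tz = "UTC")

# Made-up timestamps, one per phase.
toy <- data.frame(
  created = as.POSIXct(c("2016-06-01", "2016-12-01", "2017-02-01"), tz = "UTC")
)

# case_when() checks the conditions in order and uses the first that matches.
toy_phases <- toy %>%
  mutate(phase = case_when(
    created < election     ~ "Before the election",
    created < inauguration ~ "Transition",
    TRUE                   ~ "In the White House"
  ))

toy_phases
```

The conditions read top to bottom in the same order as the phases occurred, which makes it easier to add a further cut-off later.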

We have the "fake news" to thank for the debut of "fake" after he was sworn in. At least the election
was no longer rigged after he won it.

Tags: r statistics text tidytext
