Last summer, David Robinson published an interesting text analysis of Donald Trump's tweets and found
that the angrier ones came from Android (which Trump is known to use). But he didn't consider
how Trump's emotional state varies over time, and he certainly couldn't have considered what
impact the election and recent events would have on Trump.
Using the twitteR package and the tidyverse ecosystem (plus tidytext), this is actually a very simple
analysis.
For starters, pulling Trump's tweets (up to the last 3,200) is very simple:
library(twitteR)
library(tidyverse)
library(tidytext)
source("~/twitter_key.R")
glimpse(trump)
## Observations: 3,099
## Variables: 16
## $ text <chr> "Heading to Joint Base Andrews on #MarineOne wit...
## $ favorited <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ favoriteCount <dbl> 77699, 85576, 71312, 220083, 64348, 84125, 62284...
## $ replyToSN <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ created <time> 2017-02-10 23:24:51, 2017-02-10 13:35:50, 2017-...
## $ truncated <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, ...
## $ replyToSID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ id <chr> "830195857530183684", "830047626414477312", "830...
## $ replyToUID <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ statusSource <chr> "<a href=\"http://twitter.com/download/iphone\" ...
## $ screenName <chr> "realDonaldTrump", "realDonaldTrump", "realDonal...
## $ retweetCount <dbl> 21473, 19779, 15069, 64363, 10082, 14185, 11294,...
## $ isRetweet <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ retweeted <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ longitude <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ latitude <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
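The step that creates `trump` is not shown above. A minimal sketch of how it might look, assuming `twitter_key.R` has already authenticated the session (for example via `setup_twitter_oauth()`); the helper name `fetch_timeline` is hypothetical and not part of the original post:

```r
# Hypothetical helper: a reconstruction, not the original fetch code.
# Assumes the session is already authenticated by twitter_key.R.
fetch_timeline <- function(user = "realDonaldTrump", n = 3200) {
  if (!requireNamespace("twitteR", quietly = TRUE)) {
    stop("the twitteR package is required")
  }
  # userTimeline() returns a list of status objects for one account
  tweets <- twitteR::userTimeline(user, n = n, includeRts = TRUE)
  # twListToDF() flattens that list into a data frame, one row per tweet
  tibble::as_tibble(twitteR::twListToDF(tweets))
}

# Needs valid credentials and network access, so it is left commented out:
# trump <- fetch_timeline()
```

Including retweets (`includeRts = TRUE`) is a guess based on the later `filter(isRetweet == FALSE)` steps, which would be unnecessary otherwise.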
Using tidytext, it is straightforward to unnest and tokenize the words in the body of the tweets:
## # A tibble: 57,239 x 7
## id
## <chr>
## 1 830195857530183684
## 2 830195857530183684
## 3 830195857530183684
## 4 830195857530183684
## 5 830195857530183684
## 6 830195857530183684
## 7 830195857530183684
## 8 830195857530183684
## 9 830195857530183684
## 10 830195857530183684
## # ... with 57,229 more rows, and 6 more variables: statusSource <chr>,
## # retweetCount <dbl>, favoriteCount <dbl>, created <time>,
## # isRetweet <lgl>, word <chr>
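The tokenizing code itself is not shown. Here is a small sketch of the idea on a toy data frame (the two rows are made up for illustration), using `unnest_tokens()` from tidytext:

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the `trump` data frame (hypothetical rows)
toy <- tibble(
  id   = c("1", "2"),
  text = c("Make America great again", "Sad! Fake news")
)

# One row per word; unnest_tokens() lowercases and strips punctuation,
# and drops the original `text` column by default
toy_words <- toy %>%
  unnest_tokens(word, text)

toy_words

# With the real data the call was presumably along these lines:
# words <- trump %>%
#   select(id, statusSource, text, retweetCount, favoriteCount,
#          created, isRetweet) %>%
#   unnest_tokens(word, text)
```

Each tweet's `id` is repeated once per word, which is why the output above shows the same id on every one of the first ten rows.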
Given what David Robinson found, we might want to convert the statusSource variable into a flag for
whether it was posted via an Android device:
## # A tibble: 57,239 x 7
## id retweetCount favoriteCount created
## <chr> <dbl> <dbl> <time>
## 1 830195857530183684 21473 77699 2017-02-10 23:24:51
## 2 830195857530183684 21473 77699 2017-02-10 23:24:51
## 3 830195857530183684 21473 77699 2017-02-10 23:24:51
## 4 830195857530183684 21473 77699 2017-02-10 23:24:51
## 5 830195857530183684 21473 77699 2017-02-10 23:24:51
## 6 830195857530183684 21473 77699 2017-02-10 23:24:51
## 7 830195857530183684 21473 77699 2017-02-10 23:24:51
## 8 830195857530183684 21473 77699 2017-02-10 23:24:51
## 9 830195857530183684 21473 77699 2017-02-10 23:24:51
## 10 830195857530183684 21473 77699 2017-02-10 23:24:51
## # ... with 57,229 more rows, and 3 more variables: isRetweet <lgl>,
## # word <chr>, android <lgl>
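The flagging step is not shown either. A sketch of one way to do it, assuming the client name appears somewhere in the `statusSource` string; the two example strings below are illustrative values, not copied from the data:

```r
library(dplyr)

# Toy rows with illustrative statusSource values (hypothetical)
toy <- tibble(
  id = c("1", "2"),
  statusSource = c(
    "<a href=\"http://twitter.com/download/android\">Twitter for Android</a>",
    "<a href=\"http://twitter.com/download/iphone\">Twitter for iPhone</a>"
  )
)

flagged <- toy %>%
  # TRUE when the source string mentions Android, FALSE otherwise
  mutate(android = grepl("Android", statusSource, fixed = TRUE)) %>%
  # The raw HTML snippet is no longer needed once the flag exists
  select(-statusSource)

flagged
```

Dropping `statusSource` afterwards matches the output above, where `android` appears and `statusSource` is gone.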
Let's now score the words using the AFINN sentiment lexicon:
## Joining, by = "word"
words
## # A tibble: 5,093 x 8
## id retweetCount favoriteCount created
## <chr> <dbl> <dbl> <time>
## 1 830047626414477312 19779 85576 2017-02-10 13:35:50
## 2 830047626414477312 19779 85576 2017-02-10 13:35:50
## 3 830042498806460417 15069 71312 2017-02-10 13:15:27
## 4 829721019720015872 10082 64348 2017-02-09 15:58:01
## 5 829721019720015872 10082 64348 2017-02-09 15:58:01
## 6 829689436279603206 14185 84125 2017-02-09 13:52:31
## 7 829689436279603206 14185 84125 2017-02-09 13:52:31
## 8 829689436279603206 14185 84125 2017-02-09 13:52:31
## 9 829689436279603206 14185 84125 2017-02-09 13:52:31
## 10 829689436279603206 14185 84125 2017-02-09 13:52:31
## # ... with 5,083 more rows, and 4 more variables: isRetweet <lgl>,
## # word <chr>, android <lgl>, score <int>
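The join that produced this output is not shown; it was presumably an `inner_join()` against `get_sentiments("afinn")`. The sketch below uses a tiny hand-made lexicon instead, so it runs without downloading anything. The lexicon and its scores are illustrative stand-ins, not the real AFINN data:

```r
library(dplyr)

# Hypothetical three-word lexicon standing in for get_sentiments("afinn")
toy_lexicon <- tibble(
  word  = c("great", "sad", "fake"),
  score = c(3L, -2L, -3L)
)

toy_words <- tibble(
  id   = c("1", "1", "1", "1", "2", "2", "2"),
  word = c("make", "america", "great", "again", "sad", "fake", "news")
)

# inner_join() keeps only the words that appear in the lexicon
scored <- toy_words %>%
  inner_join(toy_lexicon, by = "word")

# Mean score per tweet, the quantity plotted in the rest of the post
per_tweet <- scored %>%
  group_by(id) %>%
  summarize(sentiment = mean(score))
```

Because the join is an inner join, words without a score drop out entirely, which explains the fall from 57,239 rows to 5,093. As far as I know, recent tidytext releases name the AFINN column `value` rather than `score`, so the code in this post may need that rename to run today.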
And now let's see how the typical sentiment of those tweets has varied from April 2016 (the midst of
the Republican primaries) to the present:
# Average the word scores within each tweet, then smooth over time
words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%
  ggplot(aes(x = created, y = sentiment)) +
  geom_smooth() +
  # Vertical markers: inauguration, election day, presumptive nomination
  geom_vline(xintercept = as.numeric(as.POSIXct("2017-01-20"))) +
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-11-08"))) +
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-05-03"))) +
  labs(x = "Date", y = "Mean Afinn Sentiment Score")
The vertical lines denote the date he became the presumptive Republican nominee (May 3rd 2016), the
date of the election (Nov 8th 2016) and inauguration day (Jan 20th 2017). Things aren't looking up
for Trump. He seems to be more angry/sad/negative now than at any prior point during the past year.
# Same plot, split by whether the tweet came from an Android device
words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created, android) %>%
  summarize(sentiment = mean(score)) %>%
  ggplot(aes(x = created, y = sentiment, color = android)) +
  geom_smooth() +
  geom_vline(xintercept = as.numeric(as.POSIXct("2017-01-20"))) +
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-11-08"))) +
  geom_vline(xintercept = as.numeric(as.POSIXct("2016-05-03"))) +
  labs(x = "Date", y = "Mean Afinn Sentiment Score")
We see the general trend that David Robinson identified: the Android tweets tended to be more
negative than those from other platforms. It is interesting that, right before the election, they were
more positive than the tweets presumed to be by staff. We can also see that the non-Android tweets
were more positive during the transition than the Android tweets, which clearly became more negative.
Perhaps the limits of Presidential powers are stricter than he expected. It is interesting that the
Android tweets are now more negative than positive, the first time this has occurred.
# Per-tweet sentiment against retweet count (log scale)
words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created, android) %>%
  summarize(sentiment = mean(score)) %>%
  # Re-attach the per-tweet counts that summarize() dropped
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  ggplot(aes(x = sentiment, y = retweetCount, color = android)) +
  geom_smooth() +
  geom_point() +
  scale_y_log10() +
  labs(x = "Mean Afinn Sentiment Score", y = "Number of Retweets")
## Joining, by = "id"
# Per-tweet sentiment against favorite count (log scale)
words %>%
  filter(isRetweet == FALSE) %>%
  group_by(id, created, android) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  ggplot(aes(x = sentiment, y = favoriteCount, color = android)) +
  geom_smooth() +
  geom_point() +
  scale_y_log10() +
  labs(x = "Mean Afinn Sentiment Score", y = "Number of Favorites")
## Joining, by = "id"
that a tweet gets.
Regression analysis of the Android tweets suggests that a tweet with negative mean sentiment gets
significantly more retweets, but also that this effect wears off (very, very slightly) over time:
# Android tweets only: log retweets on time, a negative-sentiment flag,
# and their interaction
words %>%
  filter(isRetweet == FALSE, android) %>%
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  lm(log(retweetCount) ~ created * (sentiment < 0), data = .) %>%
  summary()
## Joining, by = "id"
##
## Call:
## lm(formula = log(retweetCount) ~ created * (sentiment < 0), data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7744 -0.3806 0.0005 0.3576 3.2661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.012e+02 3.942e+00 -25.679 < 2e-16 ***
## created 7.488e-08 2.680e-09 27.939 < 2e-16 ***
## sentiment < 0TRUE 1.959e+01 6.086e+00 3.219 0.00132 **
## created:sentiment < 0TRUE -1.313e-08 4.135e-09 -3.175 0.00154 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5923 on 1198 degrees of freedom
## Multiple R-squared: 0.5195, Adjusted R-squared: 0.5183
## F-statistic: 431.7 on 3 and 1198 DF, p-value: < 2.2e-16
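Since `created` enters the model as seconds since 1970, the coefficients are hard to read directly. The sketch below does the arithmetic with the rounded estimates printed above, so the results are approximate. The variable and function names are mine, not from the original analysis:

```r
# Coefficients copied from the retweet model output above (rounded)
b_created     <- 7.488e-08   # change in log retweets per second
b_negative    <- 1.959e+01   # shift for tweets with mean sentiment < 0
b_interaction <- -1.313e-08  # change in that shift per second

seconds_per_day <- 60 * 60 * 24

# Approximate daily growth in retweets for non-negative tweets
daily_growth <- exp(b_created * seconds_per_day) - 1

# Estimated gap in log retweets between negative and non-negative
# tweets posted on a given date
negative_gap <- function(date) {
  t <- as.numeric(as.POSIXct(date, tz = "UTC"))
  b_negative + b_interaction * t
}

gap_nomination   <- negative_gap("2016-05-03")
gap_inauguration <- negative_gap("2017-01-20")
```

On these rounded numbers the baseline grows by roughly 0.65% per day, and the premium for negative tweets shrinks from roughly 0.4 log units in May 2016 to roughly 0.1 by inauguration day. The per-second interaction is tiny, but it adds up over nine months.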
# Same model with favorites as the outcome
words %>%
  filter(isRetweet == FALSE, android) %>%
  group_by(id, created) %>%
  summarize(sentiment = mean(score)) %>%
  inner_join(select(words, id, retweetCount, favoriteCount) %>%
               distinct()) %>%
  lm(log(favoriteCount) ~ created * (sentiment < 0), data = .) %>%
  summary()
## Joining, by = "id"
##
## Call:
## lm(formula = log(favoriteCount) ~ created * (sentiment < 0),
## data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.75782 -0.35691 -0.00795 0.33800 2.48914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.176e+02 3.452e+00 -34.068 < 2e-16 ***
## created 8.689e-08 2.347e-09 37.020 < 2e-16 ***
## sentiment < 0TRUE 1.435e+01 5.329e+00 2.692 0.00721 **
## created:sentiment < 0TRUE -9.648e-09 3.621e-09 -2.664 0.00781 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5187 on 1198 degrees of freedom
## Multiple R-squared: 0.6525, Adjusted R-squared: 0.6517
## F-statistic: 749.9 on 3 and 1198 DF, p-value: < 2.2e-16
The positive and negative words used in the Android postings varied across three periods: before the
election, during the transition, and after Trump was sworn in:
words %>%
  filter(android) %>%
  # Label each word by the period in which its tweet was posted
  mutate(phase = ifelse(as.POSIXct("2016-11-08") > created, "Before the election",
                 ifelse(as.POSIXct("2017-01-20") > created, "Transition",
                        "In the White House"))) %>%
  group_by(phase, pos_sentiment = score >= 0, word) %>%
  count() %>%
  group_by(phase, pos_sentiment) %>%
  filter(word != "no") %>%
  # Keep the three most frequent words per phase and sentiment sign
  top_n(3, wt = n) %>%
  arrange(pos_sentiment, phase, desc(n))
We have the "fake news" to thank for the debut of "fake" after he was sworn in. At least the election
was no longer "rigged" after he won it.