Decision Support Systems 55 (2013) 919–926

Contents lists available at ScienceDirect

Decision Support Systems


journal homepage: www.elsevier.com/locate/dss

The impact of social and conventional media on firm equity value: A sentiment analysis approach
Yang Yu a, Wenjing Duan b,*, Qing Cao a
a Texas Tech University, United States
b The George Washington University, United States

Article info

Available online 30 December 2012

Keywords: Sentiment analysis; Social media; Conventional media; Firm equity value

Abstract

This study aims to investigate the effect of social media and conventional media, their relative importance, and their interrelatedness on short-term firm stock market performance. We use a novel and large-scale dataset that features daily media content across various conventional media and social media outlets for 824 publicly traded firms across 6 industries. Social media outlets include blogs, forums, and Twitter. Conventional media includes major newspapers, television broadcasting companies, and business magazines. We apply an advanced sentiment analysis technique that goes beyond the number of mentions (counts) to analyze the overall sentiment of each media resource toward a specific company on a daily basis. We use stock return and risk as the indicators of companies' short-term performances. Our findings suggest that overall social media has a stronger relationship with firm stock performance than conventional media, while social and conventional media have a strong interaction effect on stock performance. More interestingly, we find that the impact of different types of social media varies significantly. Different types of social media also interrelate with conventional media to influence stock movement in various directions and degrees. Our study is among the first to examine the effect of multiple sources of social media along with the effect of conventional media and to investigate their relative importance and their interrelatedness. Our findings suggest the importance for firms to differentiate and leverage the unique impact of various sources of media outlets in implementing their social media marketing strategies.

© 2012 Elsevier B.V. All rights reserved.

1. Introduction

The Internet has enabled an increasing amount of user-generated content (UGC) that potentially becomes the primary source of information for both consumers and businesses. The past decade has witnessed a dramatic change of the media landscape, with digital social media channels (e.g., blogs, online forums, and social networking sites) for word-of-mouth (WOM) supplementing traditional media channels (e.g., newspapers, television, and magazines). The rise of UGC on the Internet has fueled a fast-growing market in personal opinions [1]. More and more businesses and top executives are recognizing social media as an incredibly rich vein for gaining a better understanding of the online discussions and market opportunities, and for gaining feedback and evaluations of their own and their competitors' products and performances, the market structure, and the overall competitive landscape [27,40].

With the increasing availability of social media data sources, recent years have seen an emergence of academic and industrial research that taps into these data sources. However, the utilization of these data sources remains at an early stage and outcomes are often mixed [17,27]. The major challenges are the inherent difficulties of tracking and quantifying the overwhelmingly large and unstructured set of data. A large body of extant research uses the quantitative summaries of UGC, such as overall valence and volume of user review ratings, to represent the users' opinions [6,8–10,16,24]. However, recent research suggests that it is important to extract the multifaceted textual content in UGC, which highlights the need to delve deeper into the content of the online discussions [1,13,35]. In addition, the vast majority of previous studies focus purely on the effect of online UGC and social media, without considering their interactions with conventional media sources [37].

In this study, we aim to investigate the effect of social media and conventional media, their relative importance, and their interrelatedness on firm performances. Our choice of stock market performance as the outcome variable has the following benefits. First, stock market performance measures the shareholder value, which is the ultimate concern of the company, and it has been increasingly used in Marketing and IS studies [5,25,36]. Second, in contrast to sales and profits data, which are not easily available at a daily level, stock market performance is readily available at this level, allowing for more granular analysis. Third, social media content is updated rapidly and spreads virally, which can provide real-time, first-hand information to the investor. Thus social media can provide a timely evaluation of firms' performance, which allows investors not only to follow consumers' sentiment but also to predict their future business values [25].

* Corresponding author.
E-mail addresses: yang.yu@ttu.edu (Y. Yu), wduan@gwu.edu (W. Duan), qing.cao@ttu.edu (Q. Cao).

0167-9236/$ – see front matter © 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.dss.2012.12.028
We use a novel and large-scale dataset that features daily media content across various conventional media and social media outlets for 824 publicly traded firms across 6 industries. To properly capture both longitudinal and cross-sectional properties of our data set, we apply fixed-effects panel data estimators. There are several important differences between the current paper and previous studies. First, unlike most of the previous studies, our research focuses not on product reviews and ratings, but on less structured media content. Second, we apply a more advanced sentiment analysis technique that goes beyond the number of mentions (counts) and the simple and discrete classification of positive and negative for media discussion. Third, we extract and analyze the overall sentiment toward a specific firm on a daily basis for a large range of firms across various industries, whereas most of the earlier studies focus either on one product, one firm, or one site. Fourth, we are among the first to examine the effect of multiple sources of social media along with the effect of conventional media, and to investigate their relative importance and their interrelatedness. Fifth, the panel data we use in this study are particularly beneficial in media research as they not only allow us to study the inter-temporal behavior and performance of firms, but also enable us to control for the unobserved firm-level heterogeneity and seasonal factors [20].

The paper proceeds as follows. Related work is reviewed and discussed in Section 2 to provide the theoretical background and foundation for our study. We then describe the data and measurement, and provide a detailed discussion of sentiment analysis procedures, in Section 3. In Section 4, we formulate the econometric models and present the estimation results. Finally, we conclude the paper by discussing study implications and suggesting future research directions.

2. Literature review

While the number and types of information resources continue to grow exponentially, human beings start to face the difficulty of transforming the wealth of information into knowledge for more effective use [3]. In particular, social media has exploded as a category of online discourse in which people create, share, and discuss content in communication networks. Social media is changing people's way of life dramatically because of its high speed connections, ease of use and great credibility. From a business and marketing perspective, we notice that the media landscape has dramatically changed in recent years, with traditional media (e.g., newspapers, magazines, and television) now supplemented or replaced by social media (e.g., blogs, microblogs, and online forums). In contrast to content provided by traditional media sources, social media content tends to be more human being oriented. For example, in a blog post, an author argues against the traditional source news from her perspective and readers can join the discussion and propose their own views freely. Despite author or social group bias, such content is still often considered to be more credible and trustworthy by people than traditional sources of information [2].

With an increasing amount of user-generated content (UGC) on social media, more and more businesses and top executives are recognizing social media as an incredibly rich vein for gaining a better understanding of the online discussion and market opportunities, and for gaining feedback and evaluations of their own and their competitors' products and performances, the market structure and the overall competitive landscape [27,40].

The object of an information system is bridging the gap between the continuously growing information and effectively converting information into knowledge. As for information processing capabilities, the challenges of a current information system are two-fold: sufficient accuracy and high efficiency. Narrowing this down to UGC research, the major challenges are the difficulties of tracking and quantifying the overwhelmingly large and unstructured set of data. Recent years have seen an emergence of academic and industrial research that taps into UGC. A large body of extant research uses the quantitative summaries of UGC, such as overall valence and volume of user review ratings, to represent the users' opinions [6,8–10,16]. However, the utilization of these data sources still remains at an early stage and outcomes are often mixed [17,27]. To the best of our knowledge, most previous studies have only used numeric data such as count or number of stars, without incorporating semantic information contained in the text. Recent research suggests that it is important to extract the multifaceted textual content in UGC, which highlights the need to delve deeper into the content of the online discussions [1,13,35]. In addition, the relative impacts of different media types on marketing performance (e.g., sales), and how marketing performance influences word of mouth (WOM), are still not well understood. Furthermore, the vast majority of previous studies focus purely on the effect of online UGC and social media, without considering their interactions with conventional media sources [37].

New research applying text-mining (TM) and natural language processing (NLP) techniques was developed to help people find business intelligence from free-form data; however, these methods lack strength in detecting people's opinion [15]. In the past decades, both industry and academia have been trying to find effective methods and tools to extract opinion-oriented information automatically from unstructured data [31]. Sentiment analysis (SA) has evolved from TM and NLP, but aims to determine the sentiment of a speaker or a writer with respect to some specific topics [23]. More recently, SA has greatly assisted decision makers in extracting opinions from unstructured human-authored documents [31], which can be applied in various areas. It reduces the need for reading huge amounts of documents to extract business opinions on a variety of topics. There are three main reasons to choose SA as a research approach. 1) It converts large unstructured content into a form that allows for specific predictions about particular outcomes, without instituting market mechanisms. 2) It builds models to aggregate the opinions of the collective population and gains useful insights into group behavior to predict future trends. 3) It applies gathered information on how people react to particular objects to design marketing and advertising campaigns.

Existing SA approaches are based either on linguistic resources or on machine learning. SA based on linguistic resources is centered on predetermined lists of positive and negative words. The polarity of language depends on the frequency of different types of words appearing in the document. However, this approach involves a number of linguistic techniques that are not always robust and are often quite labor intensive [34]. The machine learning based approach relies on a computer's ability to automatically learn the language used for expressing sentiment regardless of how good or normal the language is. However, the computer needs to have some information to learn from (called a training corpus or documents), and the more documents the computer learns from the better. In the case of SA, the training corpus is always a set of example documents annotated by humans. Once the computer has learned from the examples, it can apply the acquired knowledge to new documents (a holdout corpus) and then classify them into sentiment categories.

There are two main streams of research in the SA domain. One stream has focused on sentiment polarity and the other has focused on feature detection. Most previous SA research focuses on UGC such as product reviews or online comments. Also, we notice that most of the prior works prefer to explore the performance of a single product only. Gruhl et al. [18] show how to predict spikes in book sales by analyzing the correlation between blog and review mentions and performance. Joshi et al. [19] mine text and metadata features to predict the earnings of movies. In addition, significant progress has been made in sentiment tracking techniques that extract indicators of public mood directly from social media content such as blog content and in particular large-scale Twitter feeds [14,24,26,29]. In this research, we extract sentiment signals from both conventional and social media and calculate the sentiment polarity of each document for different firms based on SA techniques.
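As a rough illustration of the linguistic-resource approach reviewed in this section, the sketch below counts hits against predetermined positive and negative word lists and labels a document by whichever count is larger. The word lists, the tokenizer, and the function name `lexicon_polarity` are hypothetical choices made only for this example; they are not taken from the study.

```python
import re

# Hypothetical, tiny word lists for illustration only; real lexicons are far larger.
POSITIVE_WORDS = {"good", "great", "strong", "gain", "profit"}
NEGATIVE_WORDS = {"bad", "weak", "loss", "decline", "lawsuit"}


def lexicon_polarity(text):
    """Return (positive_count, negative_count, label) for one document."""
    # Lowercase the text and keep alphabetic tokens (an assumed, simplistic tokenizer).
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for tok in tokens if tok in POSITIVE_WORDS)
    neg = sum(1 for tok in tokens if tok in NEGATIVE_WORDS)
    if pos > neg:
        label = "positive"
    elif neg > pos:
        label = "negative"
    else:
        label = "undecided"
    return pos, neg, label


result = lexicon_polarity("Strong quarter with great profit, despite one lawsuit.")
```

Because the outcome depends entirely on which words appear in the lists, a sketch like this also shows why the approach can be brittle and labor intensive to maintain.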
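The machine-learning workflow reviewed in Section 2 (learn from a human-annotated training corpus, then label a holdout corpus) can be sketched with a small word-count classifier. This is a minimal sketch under stated assumptions: the toy documents, the whitespace tokenizer, the add-one smoothing, and the use of document shares as category priors are all illustrative choices, not the study's actual pipeline.

```python
import math
from collections import Counter


def train(labeled_docs):
    """Count words per category from (text, label) pairs annotated by humans."""
    word_counts = {}        # label -> Counter of word frequencies
    doc_counts = Counter()  # label -> number of training documents
    for text, label in labeled_docs:
        word_counts.setdefault(label, Counter()).update(text.lower().split())
        doc_counts[label] += 1
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the category with the highest log P(C) + sum of log P(w|C)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            # Add-one smoothing (an assumption) avoids log(0) for unseen words.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy annotated training corpus (hypothetical examples).
training_corpus = [
    ("profits rise on strong sales", "positive"),
    ("great product and strong growth", "positive"),
    ("shares fall after weak earnings", "negative"),
    ("lawsuit and weak sales hurt shares", "negative"),
]
model = train(training_corpus)

# Holdout documents that were not seen during training.
holdout_corpus = ["strong growth in profits", "weak earnings hurt sales"]
predictions = [classify(doc, model) for doc in holdout_corpus]
```

Working in log space keeps the product of many small word probabilities from underflowing, which matters once documents are longer than a few words.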

3. Data, measures, and sentiment analysis

3.1. Conventional media and social media data

We randomly select 824 companies and create a unique dataset. As shown in Table 1, this dataset covers six industries: pharmaceutical, retail, software, savings institutions, health care, and hotel. Table 1 shows the summary of the 824 companies. We obtain the financial-statement and financial-market data for the 824 companies, recorded from July 1st to September 30th, 2011, from COMPUSTAT and the Center for Research in Security Prices (CRSP). We use these data to construct measures of abnormal returns and cumulative abnormal returns (please see more discussion in the next section). Subsequently, we obtained a collection of blog, forum, news and micro blog (e.g., Twitter) content for those three months (from 2011/07/01 to 2011/09/30) related to the 824 companies (Table 2). A web crawler was created to download blogs, forums, and news web pages automatically. Due to the large variety of data sources, the web pages have different layouts, different formatting markups, and different hidden advertisements. This variety is the main challenge for automatic text extraction, so a customized HTML parser based on Python was designed and used as a noise filter to remove irrelevant information such as sidebars, advertisements, headers, and footers, and then to identify useful and clean text paragraphs from large chunks of HTML code. The underlying mechanism of this filter is rather simple: it uses information about the density of text vs. HTML tags to figure out if a line of text is worth outputting. Unlike a common HTML parser, the main advantage of this filter is that it can be applied to arbitrary HTML code regardless of the page layout or the noise tags used. For each blog post, forum post, and news article, we obtained the title, date, author, source domain, and the main content. For each Tweet, we obtain the Twitter username, the datetime of the submission (GMT+0), the submission type (Tweet or Retweet), and the text content of the Tweet, which is by design limited to 140 characters. In order to avoid spam messages and other advertising tweets, we filter tweets that include URLs only. Table 2 summarizes the four media data resources.

We employ an automated SA technique to explore the sentence-level sentiment polarity and to obtain a sentiment measure for each company in a given day. The detail of such SA analysis will be discussed in Section 3.2. The sentiment matrix is then derived to show sentiment from each media source (a score from −1 to 1); a score of 1 (−1) means that this media source has the most positive (negative) view of the company. For example, in a given day, a company may have a score of 0.8 from the conventional news and a score of 0.3 from the social media, which would imply that traditional media has more positive sentiment towards the company than social media does.

For one document d, either a blog or a forum post, the overall sentiment score is calculated by the following formula:

$$S_d = \frac{N_{pd} - N_{nd}}{N_{pd} + N_{nd}} \tag{1}$$

where $N_{pd}$ denotes the number of positive sentences in document d and $N_{nd}$ denotes the number of negative sentences in document d. Previous studies use ternary classification to represent sentiment polarity: positive, negative, and neutral [30]. We only use positive and negative labels in this study for two main reasons. The first reason is that a sentence that includes subjective expressions always implies either positive or negative feelings. Neutral has a fairly vague range, which is much less accurate to identify. The second consideration is from the methodological (e.g., machine learning) perspective. There are no mature sentiment repertoires yet available to efficiently and accurately identify neutral sentiment.

One routine step in SA is to train the machine, which allows computers to evolve behaviors based on empirical data such as an external knowledge repository. In our procedure, we train the sentiment classification system on the Cornell movie-review dataset (footnote 1). We then compute the accuracy of the classifier on the test set and use the F-measure to evaluate the performance based on precision and recall.

Footnote 1: Polarity dataset v2.0, URL: http://www.cs.cornell.edu/people/pabo/movie-review-data/.

The F-measure is calculated as:

$$F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{2}$$

where

$$\mathrm{Precision} = \frac{TP}{TP + FP} \quad \text{and} \quad \mathrm{Recall} = \frac{TP}{TP + FN}. \tag{3}$$

In Eq. (3), TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.

The proposed positive–negative classification algorithm can automatically classify polarity with 79% accuracy and 0.86 F-measure on the test set. On average, accuracy of binary sentiment classification is around 80% [4].

Table 1
Summary of company characteristics.

| Industry | N |
| --- | --- |
| Pharmaceutical preparation manufacturing | 156 |
| Retail trade | 190 |
| Software publishers | 155 |
| Savings institutions | 146 |
| Accommodation and food services, travel arrangement and reservation services and tour operators | 82 |
| Health care and social assistance, direct health and medical insurance carriers | 95 |
| Total | 824 |

3.2. Sentiment analysis

We employ an automated sentiment analysis technique to explore the document-level polarity and to gain a sentiment score for the company in a given day. Sentiment analysis is the computational detection and study of opinions, sentiments, emotions, and subjectivities in text [22,23,30].

To accomplish the goal of mining opinions, sentiment analysis involves two consecutive tasks: detecting which text segments (e.g., sentences) contain sentiment signals, and determining the polarity and even the strength of that sentiment [30]. Thus, the main purpose of SA is to determine the sentiment of a speaker or a writer on specific topics. The use of sentiment analysis and related approaches has gained great popularity in the past decade due to several factors, including the advance of machine learning methods in natural language processing and information retrieval, the availability of large and rich datasets for machine learning algorithms to be trained on, and the development of many commercial intelligence applications [7,30,38,39].

In this study, we apply the Naïve Bayes (NB) algorithm, based on the open-source Natural Language Toolkit (NLTK), to conduct sentiment analysis. NB is a simple but effective classifier that has been used in numerous information processing tasks such as image recognition, NLP, and information retrieval [11,21,28,32].

The underlying theorem for Naïve Bayesian text classification is the Bayes Rule:

$$P(A \mid B) = \frac{P(B \mid A) \times P(A)}{P(B)}. \tag{4}$$

The Bayes Rule enables the calculation of the likelihood of event A given that B has happened. This is used in text classification to determine
the probability that a document B is of type A just by looking at the frequencies of words in the document. In our classification task, we use the Bayes Rule in updating the probability of event A (frequencies of words or terms) happening given that we've observed B (positive or negative sentiment).

For the purposes of text classification, the Bayes Rule is used to determine the category a document falls into by determining the most probable category. That is, given this document with these words in it, which category does it fall into? A category is represented by a collection of words and their frequencies, where the frequency is the number of times that each word has been seen in the documents used to train the classifier. Suppose there are n categories $C_0$ to $C_{n-1}$ (in our case, there are only two categories, positive or negative). Determining which category a document D is most associated with means calculating the probability that document D is in category $C_i$, written $P(C_i \mid D)$, for each category $C_i$.

Then we can calculate $P(C_i \mid D)$ by computing:

$$P(C_i \mid D) = \frac{P(D \mid C_i) \times P(C_i)}{P(D)}. \tag{5}$$

$P(C_i \mid D)$ is the probability that document D is in category $C_i$; that is, the probability that given the set of words in D, they appear in category $C_i$. $P(D \mid C_i)$ is the probability that for a given category $C_i$, the words in D appear in that category. $P(C_i)$ is the probability of a given category; that is, the probability of a document being in category $C_i$ without considering its contents. $P(D)$ is the probability of that specific document occurring. To calculate which category D should go in, we need to calculate $P(C_i \mid D)$ for each of the categories and find the largest probability. Because each of those calculations involves the unknown but fixed value $P(D)$, we just ignore it and calculate:

$$P(C_i \mid D) \propto P(D \mid C_i) \times P(C_i). \tag{6}$$

$P(D)$ can also be safely ignored because we are interested in the relative, not absolute, values of $P(C_i \mid D)$, and $P(D)$ simply acts as a scaling factor on $P(C_i \mid D)$. D is split into the set of words in the document, called $W_0$ through $W_{m-1}$. To calculate $P(D \mid C_i)$, we first need to know the likelihood that each word appears in $C_i$. Assume that words appear independently from other words (which is clearly not true for most languages); then $P(D \mid C_i)$ is the simple product of the probabilities for each word:

$$P(D \mid C_i) = P(W_0 \mid C_i) \times P(W_1 \mid C_i) \times \cdots \times P(W_{m-1} \mid C_i). \tag{7}$$

For any category, $P(W_j \mid C_i)$ is calculated as the number of times $W_j$ appears in $C_i$ divided by the total number of words in $C_i$. $P(C_i)$ is calculated as the total number of words in $C_i$ divided by the total number of words in all the categories put together. Hence, $P(C_i \mid D)$ is:

$$P(C_i \mid D) \propto P(W_0 \mid C_i) \times P(W_1 \mid C_i) \times \cdots \times P(W_{m-1} \mid C_i) \times P(C_i). \tag{8}$$

Then we pick the highest probability category as the label of document D.

A common criticism of Naïve Bayesian text classifiers is that they make the naïve assumption that words are independent of each other and are, therefore, less accurate than a more complex model. There are many more complex text classification techniques, such as Support Vector Machines, K-nearest Neighbor, and so on. In practice, Naïve Bayesian classifiers often perform well, and the current state of sentiment analysis indicates that they work very well for sentiment polarity classification [4,31].

Table 2
Four types of media data.

| Content category | Data source | Data source description | # of content |
| --- | --- | --- | --- |
| Blog | Google Blogs | Google Blog Search provides fresh, relevant search results from millions of feed-enabled blogs. Users can search for blogs or blog posts, and can narrow their searches by dates and more. | 11,369 |
| Forum | BoardReader | BoardReader is developed to address the shortcomings of current search engine technology to accurately find and display information contained on the Web's forums and message boards. It uses proprietary software that allows users to search multiple message boards simultaneously, allowing users to share information in a truly global sense. | 13,091 |
| Micro blog | Twitter | Twitter, a micro blogging service, has emerged as a new medium in the spotlight through recent events, such as the death of Steve Jobs and the Libyan uprising. Twitter users follow others or are followed. Unlike most online social networking sites, such as Facebook or MySpace, the relationship of following and being followed requires no reciprocation. A user can follow any other user, and the user being followed need not follow back. A common practice of responding to a tweet has evolved into a well-defined markup culture: RT stands for retweet, @ followed by a user identifier addresses the user, and # followed by a word represents a hashtag. This well-defined markup vocabulary combined with a strict limit of 140 characters per posting conveniences users with brevity in expression. | 24,505 |
| Conventional News | Google News | Google News is a computer-generated news site that aggregates headlines from news sources worldwide. In this research, we choose 10 big news sources as conventional media sources. These are ABC News, New York Times, Reuters, USA Today, Fox News, Wall Street Journal, Washington Post, CNN, The Economist and Forbes. | 3,782 |

3.3. Data and measures for firm financial value

We obtain financial-statement and financial-market data from COMPUSTAT and the Center for Research in Security Prices (CRSP). We use these data to construct measures of abnormal returns and cumulative abnormal returns. Fama and French [12] present a time-series model of the evolution of excess security returns (relative to a risk-free rate) as a function of excess market returns, a high-minus-low market-to-book ratio factor, and a small-minus-big market capitalization factor. The Fama and French four factor model is therefore often used as the benchmark model to generate normal returns. This model extends the market model with the returns on a size portfolio (SMB), a value portfolio (HML) and a momentum portfolio (UMD). The Fama–French-momentum four-factor model is:

$$R_{jt} = \alpha_j + \beta_j R_{mt} + s_j SMB_t + h_j HML_t + u_j UMD_t + \varepsilon_{jt} \tag{9}$$

where $R_{jt}$ is the rate of return of the common stock of the jth firm on day t; $R_{mt}$ is the rate of return of a market index on day t; $SMB_t$ is the average return on small market-capitalization portfolios minus the average return on three large market-capitalization portfolios; $HML_t$ is the average return on two high book-to-market equity portfolios minus the average return on two low book-to-market equity portfolios; $UMD_t$ is the average return on two high prior return portfolios minus the average return on two low prior return portfolios. $\varepsilon_{jt}$ is a random variable that, by construction, must have an expected value of zero, and is assumed to be uncorrelated with $R_{mt}$, uncorrelated with $R_{kt}$ for $k \neq j$, not autocorrelated, and homoskedastic. $\beta_j$ is a parameter that measures the sensitivity of $R_{jt}$ to the excess return on the market index; $s_j$ measures the sensitivity of $R_{jt}$ to the difference between small and large capitalization stock returns; $h_j$ measures the sensitivity of $R_{jt}$ to the difference between value and
growth stock returns; and $u_j$ measures the sensitivity of $R_{jt}$ to the difference between high prior return stock returns and low prior return stock returns. Thus, we define the abnormal return ($AR_{jt}$) (or prediction error) from the common stock of the jth firm on day t as:

$$AR_{jt} = R_{jt} - \left(\hat{\alpha}_j + \hat{\beta}_j R_{mt} + \hat{s}_j SMB_t + \hat{h}_j HML_t + \hat{u}_j UMD_t\right) \tag{10}$$

where the coefficients $\hat{\alpha}_j$, $\hat{\beta}_j$, $\hat{s}_j$, $\hat{h}_j$ and $\hat{u}_j$ are ordinary least squares estimates of $\alpha_j$, $\beta_j$, $s_j$, $h_j$ and $u_j$. The idiosyncratic risk ($IR_{jt}$) is the standard deviation of the model residuals.

4. Econometric modeling and estimation results

We estimate two equations, where the endogenous variables are firm equity value (return and risk), using the fixed-effects panel data estimation technique. To control for any company idiosyncratic factors that could influence stock return and risk, such as company size, industry characteristics, and others, we include company fixed effects in the model by adding company-specific dummy variables. The company-specific fixed effects capture the idiosyncratic and time-constant unobserved characteristics associated with each company in our data. The advantage of fixed effects estimation is that it controls for intrinsic company characteristics, which inherently affect stock movement. In addition, fixed effects estimation also allows the error term to be arbitrarily correlated with other explanatory variables, thus making the estimation results more robust. Table 3 describes the variable names and measures. Table 4 shows the descriptive statistics.

Table 3
Description of key variables.

| Variable | Description and measure |
| --- | --- |
| AR_it | The abnormal return of the stock price for company i at day t. |
| IR_it | The idiosyncratic risk of the stock price for company i at day t. |
| BLOG_POS_NUM_it | The number of positive sentiment blogs for company i at day t. |
| BLOG_NEG_NUM_it | The number of negative sentiment blogs for company i at day t. |
| BLOG_NUM_it | Total number of mentions in blogs for company i at day t. |
| BLOG_SENTI_it | Overall sentiment in blogs for company i at day t. |
| FORUM_POS_NUM_it | The number of positive sentiment forums for company i at day t. |
| FORUM_NEG_NUM_it | The number of negative sentiment forums for company i at day t. |
| FORUM_NUM_it | Total number of mentions in forums for company i at day t. |
| FORUM_SENTI_it | Overall sentiment in forums for company i at day t. |
| TWEET_POS_NUM_it | The number of positive sentiment Tweets for company i at day t. |
| TWEET_NEG_NUM_it | The number of negative sentiment Tweets for company i at day t. |
| TWEET_NUM_it | Total number of mentions in Tweets for company i at day t. |
| TWEET_SENTI_it | Overall sentiment in Tweets for company i at day t. |
| NEWS_POS_NUM_it | The number of positive sentiment news for company i at day t. |
| NEWS_NEG_NUM_it | The number of negative sentiment news for company i at day t. |
| NEWS_NUM_it | Total number of mentions in news for company i at day t. |
| NEWS_SENTI_it | Overall sentiment in news for company i at day t. |
| MEDIA_NUM_it | Total number of mentions in social media for company i at day t. |
| MEDIA_SENTI_it | Overall sentiment in social media for company i at day t. |

The two equations are specified as follows:

$$AR_{it} = \beta_r^{m} X^{m}_{i,t-1} + \beta_r^{n} X^{n}_{i,t-1} + \alpha_r + \mu_i + \varepsilon_{it} \tag{11}$$

$$IR_{it} = \beta_s^{m} X^{m}_{i,t-1} + \beta_s^{n} X^{n}_{i,t-1} + \alpha_s + \nu_i + \varepsilon_{it}. \tag{12}$$

Eq. (11) uses the abnormal return ($AR_{it}$) as the dependent variable, and Eq. (12) uses the idiosyncratic risk ($IR_{it}$) as the dependent variable. Let $i = 1, \ldots, N$ index the companies and $t = 1, \ldots, T$ index the time (day). $X^{m}_{i,t-1}$ is a vector of one-day lagged independent variables including all three social media (blog, forum, and Twitter) metrics, and $X^{n}_{i,t-1}$ is a vector of one-day lagged independent variables of conventional news media metrics. $\mu_i$ and $\nu_i$ denote the company-specific fixed effects that capture the idiosyncratic characteristics associated with each company.

Table 5 shows the fixed-effect estimation results using the total volume or sentiment of social and conventional media as the independent variables. In Model (a1), only the total number of social media counts has a significant positive relationship with risk, but not with return. In Model (a2), the interaction term of the social and conventional media counts is added. It is shown that the interaction term has a marginally negative relationship with return, but a highly significant negative relationship with risk. Considering that the social media volume is significantly larger

short term span of the dataset and one-day lagged independent variable setting. Previous research indicates that the stock market may need some time to respond to social media information [25].

We then extend our analysis in Table 5 to examine the impact of each individual media metric. Table 6 shows the results of using the number of mentions (count) of each media metric. Model (b1) shows that the number of blog mentions has a marginal negative effect on return but a significantly positive effect on risk. The number of forum mentions has a negative effect on return. The number of tweets has a significantly positive effect on risk. Model (b2) adds the interaction terms between social and conventional media mentions; only the interaction terms of forum and news mentions, and of Twitter and news mentions, have a marginally negative effect on return. Risk, nevertheless, seems to be significantly influenced by most variables. Besides the number of blog mentions and tweets, the number of conventional news mentions is also found to be positively and significantly correlated with risk. The interaction terms of forum and news mentions, and of Twitter and news mentions, have a negative relationship with risk. Consistent with the results in Table 5, the results in Table 6 suggest that the sheer volume of social media may help conventional media to reduce the risk, though the social media volume itself may increase the risk. Again, the return of the stock price seems to be only marginally affected.

Table 4
Summary statistics of the daily data.

| Variable | N | Mean | Median | Std. Dev. | Min. | Max. |
| --- | --- | --- | --- | --- | --- | --- |
| AR | 50,611 | 0.0002 | 0.0005 | 0.04 | −0.61 | 1.36 |
| IR | 50,611 | 0.03 | 0.02 | 0.02 | 0.01 | 0.14 |
| BLOG_POS_NUM | 50,611 | 0.25 | 0.00 | 0.97 | 0.00 | 42.00 |
than conventional media, this result suggests that the volume of social
BLOG_NEG_NUM 50,611 0.01 0.00 0.08 0.00 4.00
and conventional media complements with each other to reduce the BLOG_NUM 50,611 0.44 0.00 1.64 0.00 76.00
uncertainty associated with the stock prices. In Model (a3) and (a4), BLOG_SENTI 50,611 0.24 0.00 0.96 2.00 42.00
we nd that the social media sentiment has a strong positive relation- FORUM_POS_NUM 50,611 0.56 0.00 2.00 0.00 64.00
ship with stock risk, indicating that the overall sentiment of social FORUM_NEG_NUM 50,611 0.00 0.00 0.06 0.00 3.00
FORUM_NUM 50,611 1.23 0.00 3.80 0.00 109.00
media channels may increase the uctuation of the stock market. The in-
FORUM_SENTI 50,611 0.55 0.00 2.00 1.00 64.00
teraction term of the social and conventional media sentiment does not TWEET_POS_NUM 50,611 0.11 0.00 0.82 0.00 77.00
show a signicant relationship with either return or risk. Results in TWEET_NEG_NUM 50,611 0.40 0.00 1.97 0.00 227.00
Table 5 demonstrate that overall social media metrics has a strong rela- TWEET_NUM 50,611 3.17 0.00 9.37 0.00 573.00
TWEET_SENTI 50,611 0.29 0.00 1.76 150.00 52.00
tionship with risk, which indicates that the information instilled from
NEWS_POS_NUM 50,611 0.05 0.00 0.24 0.00 5.00
various social media channels may contribute to the uncertainty of the NEWS_NEG_NUM 50,611 0.00 0.00 0.07 0.00 2.00
market. In addition, we notice that the impact of count and sentiment NEWS_NUM 50,611 0.08 0.00 0.33 0.00 6.00
may have different directions, which suggest the importance of delving NEWS_SENTI 50,611 0.04 0.00 0.24 2.00 5.00
into the textual content mentioned in the media. Furthermore, it seems MEDIA_NUM 50,611 4.84 0.00 10.99 0.00 573.00
NEWS_SENTI 50,611 0.51 0.00 2.80 150.00 67.00
the impact is more salient with risk than return. This may be due to the

Table 5
Fixed effects estimation results for overall social media and news.
Each cell reports Coefficient (Std. Err.).

Variable                            Model (a1)            Model (a2)            Model (a3)            Model (a4)

Return equation: with abnormal return ARit as dependent variable
Constant                            .0002 (.0002)         .0003 (.0002)         .0002 (.0002)         .0003 (.0002)
NEWS_NUMi,t-1                       .0002 (.0005)         .001 (.001)
MEDIA_NUMi,t-1                      9.43e-06 (.00002)     .00003 (.00002)
NEWS_SENTIi,t-1                                                                 .00003 (.0006)        .001 (.001)
MEDIA_SENTIi,t-1                                                                .00002 (.00002)       .00003 (.00002)
NEWS_NUMi,t-1 × MEDIA_NUMi,t-1                            .00004 (.00002)
NEWS_SENTIi,t-1 × MEDIA_SENTIi,t-1

Risk equation: with idiosyncratic risk IRit as dependent variable
Constant                            .03 (9.97e-06)        .03 (.00001)          .03 (9.75e-06)        .03 (9.88e-06)
NEWS_NUMi,t-1                       .00004 (.00003)       .0001 (.00004)
MEDIA_NUMi,t-1                      8.36e-06 (9.4e-07)    1.00e-05 (1.06e-06)
NEWS_SENTIi,t-1                                                                 .00003 (.00003)       .00005 (.00004)
MEDIA_SENTIi,t-1                                                                7.92e-06 (1.35e-06)   8.50e-06 (1.44e-06)
NEWS_NUMi,t-1 × MEDIA_NUMi,t-1                            4.46e-06 (1.31e-06)
NEWS_SENTIi,t-1 × MEDIA_SENTIi,t-1                                                                    2.55e-06 (2.21e-06)
N = 49,807   Group = 824

Note: standard errors in parentheses.
Company dummies (fixed effects for each of the 824 companies) used in estimating the model are not reported.
p < .01.
p < .05.
p < .10.
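
The regressors in Table 5 are one-day lags of the media metrics plus, in Models (a2) and (a4), the product of a news metric and a social media metric. A minimal sketch of that data preparation follows, under stated assumptions: the field names (`news_num`, `media_num`, `ir`), the integer trading-day index `day`, and the toy panel are all hypothetical and are not taken from the authors' data or code.

```python
from collections import defaultdict


def build_lagged_rows(panel):
    """Pair each company's day-t outcome with its day t-1 regressors.

    `panel` is a list of dicts with keys company, day, news_num, media_num, ir.
    `day` is assumed to be an integer trading-day index (hypothetical layout).
    """
    by_company = defaultdict(list)
    for row in panel:
        by_company[row["company"]].append(row)

    out = []
    for company, rows in by_company.items():
        rows.sort(key=lambda r: r["day"])
        for prev, cur in zip(rows, rows[1:]):
            # Skip pairs that are not consecutive days: that is not a one-day lag.
            if cur["day"] - prev["day"] != 1:
                continue
            out.append({
                "company": company,
                "day": cur["day"],
                "ir": cur["ir"],
                "news_num_lag": prev["news_num"],
                "media_num_lag": prev["media_num"],
                # Interaction term: product of the two lagged counts.
                "news_x_media_lag": prev["news_num"] * prev["media_num"],
            })
    return out


# Toy panel: company B has a gap between day 1 and day 3, so it yields no rows.
panel = [
    {"company": "A", "day": 1, "news_num": 1, "media_num": 4, "ir": 0.02},
    {"company": "A", "day": 2, "news_num": 0, "media_num": 7, "ir": 0.03},
    {"company": "A", "day": 3, "news_num": 2, "media_num": 5, "ir": 0.04},
    {"company": "B", "day": 1, "news_num": 3, "media_num": 2, "ir": 0.01},
    {"company": "B", "day": 3, "news_num": 1, "media_num": 1, "ir": 0.02},
]
rows = build_lagged_rows(panel)
for r in rows:
    print(r)
```

Lagging within each company, rather than across the stacked panel, keeps one company's last day from being paired with another company's first day.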

Table 7 shows the estimation results using sentiment measures for each individual media. Results in Model (b3) are consistent with those in Model (b1): blog sentiment has a positive impact but forum sentiment has a negative impact on return. Both blog and Twitter sentiment are also found to have a positive effect on risk. Interestingly, as shown in Model (b4), it seems the interaction term between Twitter and news sentiment has a significant negative effect on returns, but none of the interaction terms has a significant effect on risk. Considering results in Tables 6 and 7 together, social media metrics seem to have a much stronger impact than conventional news metrics, yet social media and conventional news media do have a joint effect on the market. The effect also seems to be stronger on risk than return, and the influence of volume is more salient than that of sentiment. Moreover, each social media metric, as well as its interaction with conventional media metrics, has a varied impact on risk and return.

Lastly, we examine the effect of the counts of positive and negative media mentions on stock performance to get more insights on the effect at a more granular level. Table 8 shows the results. Positive blog posts have a strong positive impact on return, and negative forum posts have a strong negative impact on return. The results provide a better explanation for results shown in Tables 6 and 7, which suggest that the majority of blog posts may be positive comments on companies and products, while forum posts may have more negative discussions. For the risk equation, we find positive blog posts also contribute to the fluctuation of the market. Interestingly, both positive and negative tweets have positive relationships with the risk. This is also consistent with the results in

Table 6
Fixed effects estimation results for volume of individual social media and news.
Each cell reports Coefficient (Std. Err.).

Variable                            Model (b1)            Model (b2)

Return equation: with abnormal return ARit as dependent variable
Constant                            .0002 (.0002)         .0003 (.0001)
BLOG_NUMi,t-1                       .0002 (.0001)         .0002 (.0001)
FORUM_NUMi,t-1                      .0001 (.00004)        .0001 (.00005)
TWEET_NUMi,t-1                      .00002 (.00002)       .00004 (.00002)
NEWS_NUMi,t-1                       .0002 (.0005)         .001 (.001)
BLOG_NUMi,t-1 × NEWS_NUMi,t-1                             .00003 (.0002)
FORUM_NUMi,t-1 × NEWS_NUMi,t-1                            .0002 (.0001)
TWEET_NUMi,t-1 × NEWS_NUMi,t-1                            .00004 (.00003)

Risk equation: with idiosyncratic risk IRit as dependent variable
Constant                            .03 (.00001)          .03 (.00001)
BLOG_NUMi,t-1                       .00002 (6.56e-06)     .00002 (6.93e-06)
FORUM_NUMi,t-1                      2.48e-06 (2.44e-06)   8.36e-07 (2.60e-06)
TWEET_NUMi,t-1                      9.4e-06 (1.06e-06)    .00001 (1.21e-06)
NEWS_NUMi,t-1                       .00003 (.00003)       .0001 (.00004)
BLOG_NUMi,t-1 × NEWS_NUMi,t-1                             7.66e-06 (.00001)
FORUM_NUMi,t-1 × NEWS_NUMi,t-1                            8.80e-06 (5.15e-06)
TWEET_NUMi,t-1 × NEWS_NUMi,t-1                            4.59e-06 (1.41e-06)
N = 49,807   Group = 824

Note: standard errors in parentheses.
Company dummies (fixed effects for each of the 824 companies) used in estimating the model are not reported.
p < .01.
p < .05.
p < .10.
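
The company fixed effects behind Tables 5 to 8 can be estimated without building 824 dummy variables: demeaning every variable within each company and running OLS on the demeaned data gives the same slope coefficients as the dummy-variable regression. The sketch below illustrates that within transformation in plain Python. It is an illustration under stated assumptions, not the authors' estimation code: the data are synthetic and noise-free, the function names are hypothetical, and standard errors are not computed.

```python
from collections import defaultdict


def demean_by_group(groups, values):
    """Subtract each group's mean from its observations (within transformation)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for g, v in zip(groups, values):
        sums[g] += v
        counts[g] += 1
    return [v - sums[g] / counts[g] for g, v in zip(groups, values)]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def within_estimator(groups, y, columns):
    """Fixed-effects slopes: OLS of demeaned y on demeaned regressors."""
    yd = demean_by_group(groups, y)
    xd = [demean_by_group(groups, col) for col in columns]
    k = len(xd)
    xtx = [[sum(p * q for p, q in zip(xd[i], xd[j])) for j in range(k)] for i in range(k)]
    xty = [sum(p * q for p, q in zip(xd[i], yd)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic check: y = company effect + 2*x1 - 1*x2, with no noise.
groups = ["A", "A", "A", "B", "B", "B"]
x1 = [1.0, 2.0, 4.0, 0.0, 3.0, 5.0]
x2 = [2.0, 1.0, 3.0, 4.0, 4.0, 1.0]
effect = {"A": 10.0, "B": -5.0}
y = [effect[g] + 2.0 * a - 1.0 * b for g, a, b in zip(groups, x1, x2)]

beta = within_estimator(groups, y, [x1, x2])
print(beta)
```

Because the company effects differ sharply between the two groups, a pooled regression without demeaning would be biased on this toy data, while the within estimator recovers the slopes used to generate it.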

Table 7
Fixed effects estimation results for sentiment of individual social media and news.
Each cell reports Coefficient (Std. Err.).

Variable                            Model (b3)            Model (b4)

Return equation: with abnormal return ARit as dependent variable
Constant                            .0002 (.0002)         .0003 (.0002)
BLOG_SENTIi,t-1                     .0002 (.0001)         .0002 (.0001)
FORUM_SENTIi,t-1                    .00007 (.00004)       .00004 (.00005)
TWEET_SENTIi,t-1                    .00004 (.00003)       .00005 (.00003)
NEWS_SENTIi,t-1                     .0001 (.0005)         .001 (.0006)
BLOG_SENTIi,t-1 × NEWS_SENTIi,t-1                         .00004 (.0002)
FORUM_SENTIi,t-1 × NEWS_SENTIi,t-1                        .00007 (.00005)
TWEET_SENTIi,t-1 × NEWS_SENTIi,t-1                        .0002 (.0001)

Risk equation: with idiosyncratic risk IRit as dependent variable
Constant                            .03 (.001)            .03 (.001)
BLOG_SENTIi,t-1                     .00003 (6.68e-06)     .00003 (7.03e-06)
FORUM_SENTIi,t-1                    2.97e-06 (2.51e-06)   1.57e-06 (2.65e-06)
TWEET_SENTIi,t-1                    .00001 (1.70e-06)     .00001 (1.83e-06)
NEWS_SENTIi,t-1                     .00002 (.00003)       .0001 (.00004)
BLOG_SENTIi,t-1 × NEWS_SENTIi,t-1                         .00001 (.00001)
FORUM_SENTIi,t-1 × NEWS_SENTIi,t-1                        1.15e-06 (2.57e-06)
TWEET_SENTIi,t-1 × NEWS_SENTIi,t-1                        8.85e-06 (5.41e-06)
N = 49,807   Group = 824

Note: standard errors in parentheses.
Company dummies (fixed effects for each of the 824 companies) used in estimating the model are not reported.
p < .01.
p < .05.
p < .10.
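
One way to read the interaction rows in Models (a2), (a4), (b2) and (b4) is as a moderation effect. The following is a sketch for a single social media metric $S$ and a single news metric $W$; the coefficient labels $b_0,\dots,b_3$ are notation assumed here for illustration and do not appear in the tables.

```latex
% Risk equation with one social metric S, one news metric W, and their product
IR_{it} = b_0 + b_1 S_{i,t-1} + b_2 W_{i,t-1}
          + b_3 \left( S_{i,t-1} \times W_{i,t-1} \right) + \nu_i + \varepsilon_{it}

% Marginal effect of the news metric depends on the level of the social metric
\frac{\partial IR_{it}}{\partial W_{i,t-1}} = b_2 + b_3 \, S_{i,t-1}
```

Under this reading, a negative $b_3$ means that the association between news activity and risk becomes weaker (or more negative) as social media activity rises, even when $b_1$ and $b_2$ are individually positive.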

Tables 6 and 7 that both volume and sentiment of Twitter posts are significantly positively related to risk. Only negative news mentions are found to have a marginally significant positive relationship with risk.

Table 8
Fixed effects estimation results for sentiment of individual social media and news.
Each cell reports Coefficient (Std. Err.).

Variable                  Model (c1)

Return equation: with abnormal return ARit as dependent variable
Constant                  .0003 (.0002)
BLOG_POS_NUMi,t-1         .0002 (.0001)
BLOG_NEG_NUMi,t-1         .00004 (.002)
FORUM_POS_NUMi,t-1        .00004 (.00004)
FORUM_NEG_NUMi,t-1        .003 (.001)
TWEET_POS_NUMi,t-1        .00003 (.00003)
TWEET_NEG_NUMi,t-1        .00003 (.00005)
NEWS_POS_NUMi,t-1         .0002 (.0005)
NEWS_NEG_NUMi,t-1         .003 (.002)

Risk equation: with idiosyncratic risk IRit as dependent variable
Constant                  .03 (.001)
BLOG_POS_NUMi,t-1         .00002 (6.75e-06)
BLOG_NEG_NUMi,t-1         .0001 (.0001)
FORUM_POS_NUMi,t-1        2.71e-06 (2.56e-06)
FORUM_NEG_NUMi,t-1        .00002 (.00005)
TWEET_POS_NUMi,t-1        .00001 (1.47e-06)
TWEET_NEG_NUMi,t-1        9.62e-06 (2.86e-06)
NEWS_POS_NUMi,t-1         .00002 (.00003)
NEWS_NEG_NUMi,t-1         .0002 (.0001)
N = 49,807   Group = 824

Note: standard errors in parentheses.
Company dummies (fixed effects for each of the 824 companies) used in estimating the model are not reported.
p < .01.
p < .05.
p < .10.

5. Discussion and conclusion

The effect of social media and conventional media, their relative importance, and their interrelatedness on short term firm stock market performances (e.g., return and risk) is of interest to academics, standard-setters and firms. In this study, we use both social media and conventional media data to empirically evaluate the effect. With a sample of 52,746 messages from 824 firms from various social media and conventional media sources, we cover six industries including pharmaceutical, retail, software, savings institutions, health care, and hotel. Our findings add to the literature on media's impact on firm stock performances. Specifically, we show that overall social media sentiment has a stronger impact on firm stock performance than conventional media, while social and conventional media have a strong interaction effect on stock performance. These results highlight the importance of social media and (to a lesser degree) conventional media for firm stock performance and uncover the moderating relationship between these two types of media sources. Next, we examine whether the effect of social media on firm stock performance varies depending on social media type (e.g., blogs, Twitter, and forums). Specifically, using sentiment measures for each individual media, we find that blog sentiment has a positive impact while forum sentiment has a negative impact on return. Additionally, both blog and Twitter sentiment are found to have a positive effect on risk. Further, we find that the interaction effect between Twitter and news sentiment has a significant negative effect on returns but not a significant effect on risk. Lastly, we examine the effect of the counts of positive and negative media messages on stock performance to gather insights on the effect at a more detailed level. We document that positive blog posts have a strong positive impact on return while negative forum posts have a strong negative impact on return. We conjecture that blog messages contain more positive content while forum messages are more negatively oriented. Thus, better social media research may be associated with the quality of information availability and information processing.

In summary, our results do not suggest that textual analysis of various media sources will resolve, to paraphrase Roll [33], our profession's modest ability to explain stock returns. Our results, however, suggest that textual analysis can contribute to our ability to understand the impact of information on stock returns, and even if sentiment in media sometimes does not directly cause returns it might be an efficient way for analysts to capture other sources of information. Another limitation is that we examine overall media sentiment rather than sentiment specific to a business domain (e.g., accounting or finance). We suggest that financial and business intelligence researchers be cautious when relying on word classification schemes derived outside the domain of business
usage. Applying non-business word lists to accounting and finance topics can lead to a high misclassification rate and spurious correlations. Nevertheless, the study opens up avenues for future research that could examine media effects on firm stock performance applying specific accounting or finance domain knowledge. Another area for future study is to explore the tone of the messages in various media sources and the extent of sentiment among the general public as compared to the sentiment among more sophisticated media practitioners such as analysts.

References

[1] N. Archak, A. Ghose, P.G. Ipeirotis, Deriving the pricing power of product features by mining consumer reviews, Management Science 57 (8) (2011) 1485–1509.
[2] B. Bickart, R.M. Schindler, Internet forums as influential sources of consumer information, Journal of Interactive Marketing 15 (3) (2001) 31–40.
[3] Q. Cao, W. Duan, Q. Gan, Exploring determinants of voting for the helpfulness of online user reviews: a text mining approach, Decision Support Systems 50 (2) (2011) 511–521.
[4] Q. Cao, M.A. Thompson, Y. Yu, Sentiment analysis in decision sciences research: an illustration to IT governance, Decision Support Systems 54 (2) (2013) 1010–1015.
[5] H. Chen, P. De, Y.J. Hu, B.H. Hwang, Customers as Advisors: The Role of Social Media in Financial Markets, 2012, Available at SSRN 1807265.
[6] J.A. Chevalier, D. Mayzlin, The effect of word of mouth on sales: online book reviews, National Bureau of Economic Research, 2003.
[7] S.R. Das, M.Y. Chen, Yahoo! for Amazon: sentiment extraction from small talk on the web, Management Science 53 (9) (2007) 1375–1388.
[8] W. Duan, B. Gu, A. Whinston, Informational cascades and software adoption on the Internet: an empirical investigation, MIS Quarterly 33 (1) (2009) 23–48.
[9] W. Duan, B. Gu, A.B. Whinston, Do online reviews matter? An empirical investigation of panel data, Decision Support Systems 45 (4) (2008) 1007–1016.
[10] W. Duan, B. Gu, A.B. Whinston, The dynamics of online word-of-mouth and product sales: an empirical investigation of the movie industry, Journal of Retailing 84 (2) (2008) 233–242.
[11] G. Escudero, L. Marquez, G. Rigau, Naive Bayes and Exemplar-based Approaches to Word Sense Disambiguation Revisited, 2000, arXiv preprint cs/0007011.
[12] E.F. Fama, K.R. French, Common risk factors in the returns on stocks and bonds, Journal of Financial Economics 33 (1) (1993) 3–56.
[13] A. Ghose, S.P. Han, An empirical analysis of user content generation and usage behavior on the mobile Internet, Management Science 57 (9) (2011) 1671–1691.
[14] E. Gilbert, K. Karahalios, C. Sandvig, The network in the garden: designing social media for rural life, American Behavioral Scientist 53 (9) (2010) 1367–1388.
[15] N. Godbole, M. Srinivasaiah, S. Skiena, Large-scale sentiment analysis for news and blogs, Proceedings of the International Conference on Weblogs and Social Media (ICWSM), 2007.
[16] D. Godes, D. Mayzlin, Using online conversations to study word-of-mouth communication, Marketing Science 23 (4) (2004) 545–560.
[17] D. Godes, D. Mayzlin, Y. Chen, S. Das, C. Dellarocas, B. Pfeiffer, B. Libai, S. Sen, M. Shi, P. Verlegh, The firm's management of social interactions, Marketing Letters 16 (3) (2005) 415–428.
[18] D. Gruhl, R. Guha, R. Kumar, J. Novak, A. Tomkins, The predictive power of online chatter, Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, ACM, 2005, pp. 78–87.
[19] M. Joshi, D. Das, K. Gimpel, N.A. Smith, Movie reviews and revenues: an experiment in text regression, Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, 2010, pp. 293–296.
[20] E. Kyriazidou, Estimation of dynamic panel data sample selection models, Review of Economic Studies 68 (3) (2001) 543–572.
[21] D. Lewis, Naive (Bayes) at forty: the independence assumption in information retrieval, Machine Learning (1998) 4–15, ECML-98.
[22] N. Li, D.D. Wu, Using text mining and sentiment analysis for online forums hotspot detection and forecast, Decision Support Systems 48 (2) (2010) 354–368.
[23] B. Liu, Sentiment analysis and subjectivity, Handbook of Natural Language Processing (2010) 627–666.
[24] Y. Liu, X. Huang, A. An, X. Yu, ARSA: a sentiment-aware model for predicting sales performance using blogs, Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2007, pp. 607–614.
[25] X. Luo, J. Zhang, W. Duan, Social media and firm equity value, Information Systems Research (forthcoming).
[26] G. Mishne, N. Glance, Leave a reply: an analysis of weblog comments, Third Annual Workshop on the Weblogging Ecosystem, 2006.
[27] O. Netzer, R. Feldman, J. Goldenberg, M. Fresko, Mine your own business: market-structure surveillance through text mining, Marketing Science 31 (3) (2012) 521–543.
[28] K. Nigam, R. Ghani, Analyzing the effectiveness and applicability of co-training, Proceedings of the Ninth International Conference on Information and Knowledge Management, ACM, 2000, pp. 86–93.
[29] A. Pak, P. Paroubek, Twitter based system: using Twitter for disambiguating sentiment ambiguous adjectives, Proceedings of the 5th International Workshop on Semantic Evaluation, Association for Computational Linguistics, 2010, pp. 436–439.
[30] B. Pang, L. Lee, A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts, Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2004, p. 271.
[31] B. Pang, L. Lee, Opinion Mining and Sentiment Analysis, 2008, (Now Pub).
[32] T. Pedersen, A simple approach to building ensembles of Naive Bayesian classifiers for word sense disambiguation, Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, Association for Computational Linguistics, 2000, pp. 63–69.
[33] R. Roll, R², Journal of Finance 43 (2) (1988) 541–566.
[34] J.C. Short, T.B. Palmer, The application of DICTION to content analysis research in strategic management, Organizational Research Methods 11 (4) (2008) 727–752.
[35] G.P. Sonnier, L. McAlister, O.J. Rutz, A dynamic model of the effect of online communications on firm sales, Marketing Science 30 (4) (2011) 702–716.
[36] S. Srinivasan, D. Hanssens, Marketing and firm value: metrics, methods, findings, and future directions, Boston U. School of Management Research Paper No. 2009-6, 2008.
[37] A. Stephen, J. Galak, The Complementary Roles of Traditional and Social Media Publicity in Driving Marketing Performance, 2010.
[38] S. Tong, D. Koller, Support vector machine active learning with applications to text classification, Journal of Machine Learning Research 2 (2002) 45–66.
[39] P.D. Turney, Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews, Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, 2002, pp. 417–424.
[40] A. Wright, Mining the Web for feelings, not facts, New York Times 24 (2009).

Yang Yu is currently an MIS Ph.D. candidate in the Rawls College of Business at Texas Tech University. Yang also holds a Management Science and Engineering Ph.D. from the School of Economics and Management at Beijing University of Aeronautics & Astronautics. Yang's research interests include social media, business intelligence, IT security and supply chain information systems. He has published papers in journals such as Decision Support Systems, Information Systems and E-Business Management. He has been awarded the Best Interdisciplinary Research Award at the Decision Sciences Institute Conference (DSI), 2012.

Wenjing Duan, Associate Professor of Information Systems & Technology Management, received her Ph.D. in Information Systems from University of Texas at Austin in 2006. Wenjing's research interests lie at the intersections of Information Systems, Economics, and Marketing. Among her primary research interests are the social and economic impact of online consumer-generated content and social media, online communities and online social network, information systems and marketing, and healthcare and IT. Wenjing has published in MIS Quarterly, Information Systems Research, Communications of ACM, Journal of Retailing, Decision Support Systems, among others. She is also the recipient of the Emerald Management Reviews Citations of Excellence Awards, NET Institute Research Grant, and serves on the Editorial Board of the Decision Support Systems. For more details, see http://home.gwu.edu/~wduan/.

Dr. Qing Cao is the Jerry Rawls Professor of Management Information Systems at the Rawls College of Business, Texas Tech University. He holds a Ph.D. from the College of Business Administration at the University of Nebraska (2001). His research interests include IT governance, supply chain information management, strategic alignment, and business intelligence. Dr. Cao was the recipient of the University of Missouri-Kansas City Trustee's Faculty Research Award (2005). Dr. Cao has received the 2012 Chancellor's Council Distinguished Research Award at Texas Tech University. He is also a recipient of the Best Interdisciplinary Research Award at the 43rd Annual Decision Sciences Institute (DSI) Conference, 2012. Dr. Cao has published more than 42 research papers in top business journals such as Journal of Operations Management, Decision Sciences, Decision Support Systems, Communications of ACM, International Journal of Production Research, European Journal of Operational Research, among many others. Dr. Cao also served as the Associate Program Chair at the Decision Sciences Institute (DSI) Annual Meeting in 2008.