
From Clickbait to Fake News Detection:

An Approach based on Detecting the Stance of Headlines to Articles

Peter Bourgonje, Julian Moreno Schneider, Georg Rehm


DFKI GmbH, Language Technology Lab
Alt-Moabit 91c, 10559 Berlin, Germany
{peter.bourgonje,julian.moreno_schneider,georg.rehm}@dfki.de

Abstract

We present a system for the detection of the stance of headlines with regard to their corresponding article bodies. The approach can be applied in fake news and especially clickbait detection scenarios. The component is part of a larger platform for the curation of digital content; we consider veracity and relevancy an increasingly important part of curating online information. We want to contribute to the debate on how to deal with fake news and related online phenomena with technological means, by providing means to separate related from unrelated headlines and to further classify the related headlines. On a publicly available data set annotated for the stance of headlines with regard to their corresponding article bodies, we achieve a (weighted) accuracy score of 89.59.

1 Introduction

With the advent of social media and its increasingly important role as a provider and amplifier of news, basically anyone, anywhere, can produce and help circulate content for other people to read. Traditional barriers to publishing content (like a press to print newspapers or broadcasting time for radio or television) have disappeared, and with them at least part of the traditional quality control procedures has disappeared as well. Basic journalistic principles like source verification, fact checking and accountability can easily be bypassed or simply ignored by individuals or organisations publishing content on Twitter, Facebook or other social networks. The impact of this situation is illustrated by the predominance of terms like “trolls”, “fake news”, “post-truth media” and “alternative facts”. There is evidence that these developments and their effects are not harmless but can have a significant impact on real-world events, as illustrated by a description of the role of social media in the 2016 US presidential election (Allcott and Gentzkow, 2017) and by a study on the effectiveness and debunking of rumours surrounding the Affordable Care Act (Berinsky, 2017).

While the cause of this situation may have its roots in many different aspects of modern society, and hence needs to be approached from several different angles, we aim to make a contribution from the angle of Language Technology and Natural Language Processing. We consider fully automated procedures for fact-checking, clickbait detection or fake news classification not feasible at this point (Rehm, 2017), but aim to support the community by providing means of detecting articles or pieces of news that need to be approached with caution, where a human has to make the final decisions (on credibility, legitimacy, etc.) but is aided by a set of tools. The approach described in this paper can serve as the back-end of such a smart set of tooling around fact-checking and can augment news coming from both traditional and non-traditional (social media) sources. We envision the resulting set of tools as a collection of expert tools for specific job profiles (like a journalist or a news editor), or in the shape of a simple browser plug-in that flags unverified or dubious content to the end user.

The work presented in this paper was carried out under the umbrella of a two-year research and technology transfer project, in which a research centre collaborates with four SME partners that face the challenge of having to process, analyse and make sense of large amounts of digital content. The companies cover four different use cases and sectors (Rehm and Sasaki, 2015), including journalism.

For these partners we develop a platform that provides access to language and knowledge technologies (Bourgonje et al., 2016a,b). The services are integrated by the SME partners into their own in-house systems or into those of clients.

In this paper, we aim to contribute to a first step in battling fake news, often referred to as stance detection, where the challenge is to detect the stance of a claim with regard to another piece of content. Our experiments are based on the setup of the first Fake News Challenge (FNC1, http://www.fakenewschallenge.org). In FNC1, the claim comes in the form of a headline, and the other piece of content is an article body. This step may seem, and, in fact, is, a long way from automatically checking the veracity of a piece of content with regard to some kind of ground truth. But the problem lies exactly in the definition of the truth, and in the fact that it is sensitive to bias. Additionally, and partly because of this, annotated corpora allowing training and experimental evaluation are hard to come by and also often (in the case of fact checker archives) not freely available. We argue that detecting whether a piece of content is related or not related to another piece of content (e.g., headline vs. article body) is an important first step, which would perhaps best be described as clickbait detection (i.e., a headline not related to the actual article is more likely to be clickbait). Following the FNC1 setup, the further classification of related pieces of content into more fine-grained classes provides valuable information once the “truth” (in the form of a collection of facts) has been established, so that particular pieces of content can be classified as “fake” or, rather, “false”. Since this definitive, resolving collection of facts is usually hard to come by, the outcome of stance detection can instead be combined with credibility or reputation scores of news outlets, where several high-credibility outlets disagreeing with a particular piece of content points towards a false claim. Stance detection can also prove relevant for detecting political bias: if authors on the same end of the political spectrum are more likely to agree with each other, the (political) preference of one author can be induced once the preference of the other author is known. Additionally, the stances of utterances towards a specific piece of content can provide hints on its veracity. (Mendoza et al., 2010) show that the propagation of tweets regarding crisis situations (like natural disasters) differs based on their content: tweets spreading news are affirmed by related tweets, whereas tweets spreading rumours are mostly questioned or denied. In this paper we propose a solution that involves a human in the loop; we think that our approach can be a valuable part of solving the problem described above. The rest of this paper is divided into five sections: Section 2 reviews related work, Section 3 describes the data set used, Section 4 explains our approach in detail and Section 5 provides an evaluation. Our conclusions are presented in Section 6.

2 Related Work

The suggestion of using Language Technologies (NLP, NLU, etc.) to design solutions for modern online media phenomena such as “fake news”, “hate speech” or “abusive language” is receiving rapidly growing interest in the form of shared tasks, workshops and conferences. The awareness that LT can contribute to solutions related to these topics is present. Yet, at the same time, it is acknowledged that the problem is much more complex than anything that can be solved by exploiting current state-of-the-art techniques alone. The effect known as “belief perseverance” or the “continued influence effect” (Wilkes and Leatherbarrow, 1988) and its influence on modern media and politics is described by (Nyhan and Reifler, 2015), who state that reasoning based on facts that have been shown to be false remains in place until an alternative line of reasoning has been offered. The credibility of a politician stepping down due to bribery accusations is not restored by merely rejecting this explanation (by a letter from the prosecutors); in addition, an alternative explanation (like being named president of a university, but not being able to disclose this until the predecessor has stepped down) has to be provided. Another socio-psychological contribution on the topic of “fake news” and its consumption is presented by (Marchi, 2012), who reports on a survey among teenagers and their news consumption habits. Although the study uses a slightly different definition of “fake news” than the one we use in this paper, it presents a relevant overview of the consumption of news and of the aspects that matter to different social groups. The authors claim that “authenticity” is highly valued among teenagers consuming news, hence their explained preference for blogs, satirical shows, or basically anything other than traditional media outlets, which they consider “identical”, lacking contextual information and any authenticity.

The acknowledgment that teenagers increasingly rely on news coming from non-traditional news sources underlines the need for new ways of dealing with the challenges related to these alternative sources. (Conroy et al., 2015) present a useful overview of recent approaches towards “fake news” detection using NLP and network analyses. The authors include several state-of-the-art figures and acknowledge the fact that these numbers are domain-dependent, which is why it is difficult to arrive at a state-of-the-art figure independent of a specific use case and data set. From an NLP perspective, the challenge of dealing with this problem is further exemplified by the fact that annotated data is hard to find and, if present, exhibits rather low inter-annotator agreement. Approaching the “abusive language” and “hate speech” problem from an NLP angle (Bourgonje et al., 2017), (Ross et al., 2016) introduce a German corpus of tweets annotated for hate speech, resulting in figures for Krippendorff’s α between 0.18 and 0.29; (Waseem, 2016) compares amateur (CrowdFlower) annotations and expert annotations on an English corpus of tweets and reports a Cohen’s Kappa of 0.14; (Van Hee et al., 2015) use a Dutch corpus annotated for cyberbullying and report Kappa scores between 0.19 and 0.69; and (Kwok and Wang, 2013) investigate English racist tweets and report an overall inter-annotator agreement of only 33%.

An approach similar to ours is described by (Ferreira and Vlachos, 2016), who introduce a data set and a three-class classification scheme (“for”, “against”, “observing”). In addition to a logistic regression classifier, the authors exploit dependency parse graphs, a paraphrase database (Pavlick et al., 2015) and several other features to arrive at an accuracy of 73%. Another related approach is described by (Augenstein et al., 2016), who apply stance detection methods to the SemEval 2016 Task 6 data set; their focus is on learning stances towards a topic in an unsupervised and weakly supervised way using a neural network architecture. (Babakar and Moy, 2016) present a useful and recent overview of fact checking approaches.
3 Data Set

Our experiments are conducted on the data set released by the organisers of the first Fake News Challenge (FNC1) on stance detection. The data set is based on the work of (Ferreira and Vlachos, 2016) and can be downloaded from the corresponding GitHub page (https://github.com/FakeNewsChallenge/fnc-1), along with a baseline implementation for this task, which achieves a score of 79.53. The data consists of a set of headlines and articles that are combined with each other (multiple times, in different combinations) and annotated with one of four classes (“unrelated”, “agree”, “disagree” or “discuss”), indicating the stance of the headline towards the content of the article (see Table 1).

Unique headlines     1,648
Unique articles      1,668
Annotated pairs     49,972   100%
Class: unrelated    36,545    73%
Class: discuss       8,909    18%
Class: agree         3,678     7%
Class: disagree        840     2%

Table 1: Key figures of the FNC-1 data set

The FNC1 scoring method consists of first verifying whether a particular combination of headline and article has been correctly classified as “unrelated” (the corresponding class) or “related” (one of the classes “agree”, “disagree” or “discuss”). Getting this binary classification correct amounts to 25% of the final, weighted score. The remaining 75% of the score consists of correctly classifying headline/article pairs into the three remaining classes. The setup of our system adheres to this scoring method and hence applies several classifiers sequentially, as explained in Section 4.
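To make this weighting concrete, the following is a minimal Python sketch of the scoring scheme; the per-pair point values (0.25 for the binary decision, a further 0.75 for the exact class of a related pair) and the normalisation reflect our reading of the challenge rules, and all names are illustrative rather than taken from the official scorer.

```python
RELATED = {"agree", "disagree", "discuss"}

def fnc1_score(gold, predicted):
    """Weighted FNC1-style score, normalised to 0..100:
    0.25 points for a correct related/unrelated decision and a
    further 0.75 points for the exact class of a related pair
    (our reading of the challenge rules, not the official code)."""
    earned, possible = 0.0, 0.0
    for g, p in zip(gold, predicted):
        g_related = g in RELATED
        possible += 1.0 if g_related else 0.25
        if g_related == (p in RELATED):
            earned += 0.25          # binary decision correct
        if g_related and g == p:
            earned += 0.75          # fine-grained class also correct
    return 100.0 * earned / possible

# fnc1_score(["unrelated", "agree"], ["unrelated", "discuss"]) -> 40.0:
# both binary decisions are right, but the fine-grained class is wrong.
```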
4 Approach and Methods

In line with the scoring system of the challenge, we first apply a procedure to decide whether a particular headline/article combination is related or unrelated. This is done based on n-gram matching of the lemmatised input (headline or article), using the CoreNLP lemmatiser (Manning et al., 2014). The number of matching n-grams (where n = 1..6) in the headline and article is multiplied by the length and the IDF value of the matching n-gram (n-grams containing only stop words or punctuation are not considered), then divided by the total number of n-grams. If the resulting score is above some threshold (we established 0.0096 as the optimal value), the pair is taken to be related.

A formal definition is provided in Equation 1. Consider a headline and an article represented by two arrays (H and A) of all possible lemmatised n-grams with n ∈ [1, 6], h(i) and a(i) being the i-th elements of H and A, len(·) a function that computes the length in tokens of an n-gram, TF_T^k the frequency of appearance of term k in array T, and IDF^k the inverse document frequency of term k in all the articles:

    sc = \frac{\sum_{i=1}^{len(H)} TF_{h(i)} \cdot IDF^{h(i)}}{len(H) + len(A)}    (1)

where

    TF_{h(i)} = (TF_H^{h(i)} + TF_A^{h(i)}) \cdot len(h(i))    (2)
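The following minimal Python sketch illustrates Equations 1 and 2. It assumes the headline and article have already been lemmatised into token lists (e.g., with the CoreNLP lemmatiser) and that the IDF values over all articles and a stop-word list are precomputed elsewhere; following the prose above, only n-grams that actually occur in both headline and article contribute to the sum.

```python
def ngrams(tokens, max_n=6):
    """All n-grams with n = 1..max_n of a lemmatised token list."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def is_related(headline, article, idf, stopwords, threshold=0.0096):
    """Sketch of Equations 1 and 2. headline/article: lists of
    lemmas; idf: dict mapping an n-gram tuple to its inverse
    document frequency over all articles (assumed precomputed)."""
    H, A = ngrams(headline), ngrams(article)
    A_set = set(A)
    score = 0.0
    for g in H:                                    # i = 1 .. len(H)
        if g not in A_set or all(t in stopwords for t in g):
            continue                               # non-matching or stop-word-only
        tf = (H.count(g) + A.count(g)) * len(g)    # Equation 2
        score += tf * idf.get(g, 0.0)              # Equation 1 numerator
    return score / (len(H) + len(A)) >= threshold  # Equation 1, thresholded
```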

As shown in Table 1, the majority of “related” instances are of the class “discuss”, and simply assigning this class to all “related” instances already leads to an accuracy of 61.51 (for this portion of the data set), as shown in the “Majority vote” column of Table 2. To improve upon this baseline and to further classify the related pairs into “agree”, “disagree” or “discuss”, we use Mallet’s Logistic Regression classifier implementation (McCallum, 2002), trained on headlines only (without lemmatisation or stop word removal), using the three classes. This resulted in a weighted score of 79.82 (column “3-class classifier”). In subsequent experiments, we introduced a (relative) confidence threshold: if the distance between the best-scoring class and the second-best-scoring class is above some threshold (we established 0.7 as the optimal value), the best-scoring class is assigned to the pair. If the difference was below the threshold, we used three binary classifiers to decide between the best-scoring and the second-best-scoring class (i.e., one binary classifier for “agree”-“disagree”, one for “agree”-“discuss” and one for “discuss”-“disagree”). These classifiers are trained on both the headline and the article (joined together, without lemmatisation or stop word removal). The results are shown in the column “Combined classifiers” in Table 2.

                    Majority vote   3-class classifier   Combined classifiers
Relatedness score       93.27             93.26                93.29
Three-class score       61.51             75.34                88.36
Weighted score          69.45             79.82                89.59

Table 2: Results of 50-fold cross-validation

This setup leads to the best results on the data set. In other experiments we used more linguistically motivated features, some of them inspired by the work of (Ferreira and Vlachos, 2016), ranging from rather basic ones (like a question mark at the end of a headline to detect “disagree” instances) to more sophisticated ones (like extracting a dependency graph, looking for negation-type typed dependencies, calculating their normalised distance to the root node of the graph, and comparing this value for headline and article), but these did not improve upon the final weighted score reported in Table 2.
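For illustration, the decision logic of this cascade can be sketched as follows; three_way and binary are placeholders for the trained models, and the names and calling conventions are ours, not Mallet’s API.

```python
def classify_related(pair, three_way, binary, margin=0.7):
    """Cascade over a headline/article pair already judged related.
    three_way(pair) is assumed to return a dict of class scores over
    "agree", "disagree" and "discuss"; binary maps a frozenset of two
    labels to a classifier trained on headline plus article text."""
    ranked = sorted(three_way(pair).items(),
                    key=lambda kv: kv[1], reverse=True)
    (best, p1), (second, p2) = ranked[0], ranked[1]
    if p1 - p2 >= margin:
        return best            # confident enough: keep the 3-class decision
    # otherwise fall back to the binary classifier for the two top classes
    return binary[frozenset((best, second))](pair)
```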

5 Evaluation

The first step of deciding whether a headline/article pair is related or not is done based on the matching of lemmatised n-grams. This procedure is rule-based and only relies on finding an optimal value for the threshold, based on the data. To arrive at an optimal value, we used all data and did not separate it into training and test sets. Since the subsequent classification methods are based on machine learning, the following evaluation figures are the result of 50-fold cross-validation, with a 90-10 division of training and test data, respectively.
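We read “50-fold cross-validation with a 90-10 division” as repeated random subsampling, i.e., 50 independent random 90-10 splits (classic k-fold with 50 folds would imply a 98-2 division); the following is a minimal sketch of that protocol under this assumption.

```python
import random

def repeated_splits(pairs, runs=50, train_fraction=0.9, seed=13):
    """Yield 50 independent random 90-10 train/test splits
    (repeated random subsampling; the seed is arbitrary)."""
    rng = random.Random(seed)
    for _ in range(runs):
        data = list(pairs)
        rng.shuffle(data)          # fresh random order for each run
        cut = int(len(data) * train_fraction)
        yield data[:cut], data[cut:]
```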
Considering that the combination of headlines and article bodies was performed randomly, with many obviously unrelated combinations, the relatedness score of 93.27 can be considered relatively low (the variation for this row in Table 2 is due to different runs on different, random splits of the data). Upon manual investigation of the cases classified as “unrelated” (but that were in fact of the “agree”, “disagree” or “discuss” class), we found that the vast majority had headlines with different wordings that did not match after lemmatisation. One concrete example with the headline “Small Meteorite Hits Managua” mentions in its article body “the Nicaraguan capital” but not “Managua”, and “a chunk of an Earth-passing asteroid” instead of “small meteorite”. To improve the approach for cases such as this one, we propose to include more sophisticated techniques that capture word relatedness in a knowledge-rich way as an important part of future work. The other way round, cases classified as related that were in fact annotated as “unrelated” contained words in the headline that were frequently mentioned in the article body. One example with the headline “SHOCK CLAIM: PGA Golfer Says Tiger Woods Is Suspended For Failed Drug Test” was combined with an article body about the divorce of Tiger Woods and Elin Nordegren. Here, we suggest, as part of future work, to include event detection, to move away from entity-based representations and to put more focus on the event actually reported.

After deciding on relatedness, we are left with (on average) 1,320 “related” instances. For the three-class classification of this set, we obtained (on average) 686 cases that scored above the scoring difference threshold and were assigned their class by the three-class Logistic Regression classifier. Of these, 642 were correct, resulting in an accuracy of 93.64 for this portion of the data set. The cases where the scoring difference was below the threshold (634 on average) were classified using the three binary classifiers. This resulted in 544 correctly classified instances, and a score of 85.83 for this section of the data set. Putting these scores together, the weighted score and its individual components are shown in Table 2, i.e., the relatedness score for the binary decision “related” or “unrelated” (25% of the weighted score) and the three-class score for the classification of “related” instances into “agree”, “disagree” or “discuss” (75% of the weighted score); for the combined classifiers, this gives 0.25 × 93.29 + 0.75 × 88.36 ≈ 89.59. To get an idea of the effect of the first stage’s error rate on the second stage of processing, we re-ran the experiments taking the “related” vs. “unrelated” information directly from the annotations. This resulted in a three-class score of 89.82, i.e., a 1.46 point drop in accuracy due to classification errors in the first stage.

While these numbers look promising as initial steps towards tackling the challenge that fake news poses globally, we acknowledge that at least the 25% of the score (the relatedness score of 93.27) is not directly applicable in a real-world scenario, since the data set was artificially boosted by randomly combining headlines and article bodies: a headline such as “Isis claims to behead US journalist” is combined with an article on who is going to be the main actor in a biopic on Steve Jobs. Although this headline/article pair was (obviously) tagged as “unrelated”, such a pair is not something that is usually encountered in a real-world scenario. For the more fine-grained classification of articles that have been classified as “related”, the three-way classification is a relevant first step, but other classes may need to be added to the set, or a more detailed division may need to be made in order to take the next steps in tackling the fake news challenge. Additionally, we see the integration of known facts and general discourse knowledge (possibly through Linked Data) and the incorporation of source credibility information as important and promising directions for future research.

6 Conclusions

We present a system for stance detection of headlines with regard to their corresponding article bodies. Our system is based on simple, lemmatisation-based n-gram matching for the binary classification of “related” vs. “unrelated” headline/article pairs. The best results were obtained using a setup where the more fine-grained classification of the “related” pairs (into “agree”, “disagree”, “discuss”) is carried out using a Logistic Regression classifier first, and then three binary classifiers with slightly different training procedures for the cases where the first classifier lacked confidence (i.e., where the difference between the best and second-best scoring class was below a threshold). We improve on the accuracy baseline set by the organisers of the FNC1 by over 10 points and scored 9th place (out of 50 participants) in the actual challenge. As described in Section 1, the approach explained in this article can be part of the set of services needed by a fact-checking tool (Rehm, 2017). The first, binary classification of “related” vs. “unrelated” can be exploited for clickbait detection. The more fine-grained classification of related headlines can specifically support the detection of political bias and rumour veracity (Srivastava et al., 2017).

Acknowledgments

The authors wish to thank the anonymous reviewers for their helpful feedback. The project “Digitale Kuratierungstechnologien” is supported by the German Federal Ministry of Education and Research, Wachstumskern-Potenzial (no. 03WKP45). http://www.digitale-kuratierung.de.

References

Hunt Allcott and Matthew Gentzkow. 2017. Social media and fake news in the 2016 election. Working Paper 23089, National Bureau of Economic Research.

Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, and Kalina Bontcheva. 2016. Stance detection with bidirectional conditional encoding. CoRR, abs/1606.05464.

Mevan Babakar and Will Moy. 2016. The State of Automated Factchecking.

Adam J. Berinsky. 2017. Rumors and health care reform: Experiments in political misinformation. British Journal of Political Science, 47(2):241–262.

Peter Bourgonje, Julian Moreno-Schneider, Jan Nehring, Georg Rehm, Felix Sasaki, and Ankit Srivastava. 2016a. Towards a Platform for Curation Technologies: Enriching Text Collections with a Semantic-Web Layer. In The Semantic Web, number 9989 in LNCS, pages 65–68. Springer.

Peter Bourgonje, Julian Moreno Schneider, and Georg Rehm. 2017. Automatic Classification of Abusive Language and Personal Attacks in Various Forms of Online Communication. In Language Technologies for the Challenges of the Digital Age: Proc. of GSCL 2017, Berlin.

Peter Bourgonje, Julian Moreno Schneider, Georg Rehm, and Felix Sasaki. 2016b. Processing Document Collections to Automatically Extract Linked Data: Semantic Storytelling Technologies for Smart Curation Workflows. In Proc. of the 2nd Int. Workshop on NLG and the Semantic Web (WebNLG 2016), pages 13–16, Edinburgh, UK.

Niall J. Conroy, Victoria L. Rubin, and Yimin Chen. 2015. Automatic deception detection: Methods for finding fake news. In Proc. of the 78th ASIS&T Annual Meeting, ASIST ’15, pages 82–82.

W. Ferreira and A. Vlachos. 2016. Emergent: a novel data-set for stance classification. In The 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. ACL.

Irene Kwok and Yuzhou Wang. 2013. Locate the hate: Detecting tweets against blacks. In Proc. of the 27th AAAI Conf. on AI, pages 1621–1622.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP Toolkit. In ACL, pages 55–60.

R. Marchi. 2012. With facebook, blogs, and fake news, teens reject journalistic objectivity. Journal of Communication Inquiry, 36(3):246–262.

Andrew Kachites McCallum. 2002. Mallet: A machine learning for language toolkit.

Marcelo Mendoza, Barbara Poblete, and Carlos Castillo. 2010. Twitter Under Crisis: Can We Trust What We RT? In Proc. of the First Workshop on Social Media Analytics, pages 71–79.

Brendan Nyhan and Jason Reifler. 2015. Displacing misinformation about events: An experimental test of causal corrections. Journal of Exp. Political Science, 2(01):81–93.

Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2015. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proc. of the 53rd Annual Meeting of the ACL, Beijing, China, Volume 2: Short Papers, pages 425–430.

Georg Rehm. 2017. An Infrastructure for Empowering Internet Users to handle Fake News and other Online Media Phenomena. In Language Technologies for the Challenges of the Digital Age: Proc. of GSCL 2017, Berlin.

Georg Rehm and Felix Sasaki. 2015. Digitale Kuratierungstechnologien – Verfahren für die effiziente Verarbeitung, Erstellung und Verteilung qualitativ hochwertiger Medieninhalte. In Proc. of GSCL 2015, pages 138–139.

Björn Ross, Michael Rist, Guillermo Carbonell, Benjamin Cabrera, Nils Kurowsky, and Michael Wojatzki. 2016. Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis. In Proc. of NLP4CMC III: 3rd Workshop on NLP for CMC, pages 6–9.

Ankit Srivastava, Georg Rehm, and Julian Moreno Schneider. 2017. DFKI-DKT at SemEval-2017 Task 8: Rumour Detection and Classification Using Cascading Heuristics. In Proc. of SemEval-2017, pages 477–481, Vancouver, Canada. ACL.

Cynthia Van Hee, Els Lefever, Ben Verhoeven, Julie Mennes, Bart Desmet, Guy De Pauw, Walter Daelemans, and Veronique Hoste. 2015. Detection and fine-grained classification of cyberbullying events. In Proc. of the Int. Conf. Recent Advances in NLP, pages 672–680.

Zeerak Waseem. 2016. Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter. In Proc. of the 1st Workshop on NLP and Computational Social Science, pages 138–142.

A. L. Wilkes and M. Leatherbarrow. 1988. Editing episodic memory following the identification of error. The Quarterly Journal of Exp. Psychology, 40(2):361–387.
