
Frontiers of

Computational Journalism
Columbia Journalism School
Week 9: Knowledge Representation
November 9, 2015

This class

Structured Journalism
Ontologies and Linked Data
Relations from Text
Knowledge in Classical AI

Structured Journalism

Unstructured data

Structured data

Everyblock.com circa 2009

Connected China. Reuters, 2013

Article Metadata
headline

photo
photo credit
photo caption
byline
publication date
dateline
article body
related articles

Schema.org news markup


Overall type of the object on this page, in HTML head

Headline, dateline, date as additions to div/span properties

Byline expressed as nested object (using itemscope) of type schema.org/Person
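
The three pieces above (page type, headline/dateline/date properties, nested Person byline) can be sketched as a data structure. The slide describes HTML microdata; this sketch uses JSON-LD instead, an alternative syntax that schema.org also supports, only because it is easier to show in Python. All article values are placeholders, and `to_json_ld_script` is a helper name made up for this sketch.

```python
import json

# Placeholder values for illustration only; not taken from a real article.
article = {
    "@context": "http://schema.org",
    "@type": "NewsArticle",          # overall type of the object on the page
    "headline": "Example headline",
    "dateline": "BEIRUT, Lebanon",
    "datePublished": "2012-11-04",
    "author": {                      # byline as a nested object of type Person
        "@type": "Person",
        "name": "Example Reporter",
    },
}

def to_json_ld_script(obj):
    """Wrap a dict in the <script> tag used to embed JSON-LD in an HTML head."""
    return '<script type="application/ld+json">\n%s\n</script>' % json.dumps(obj, indent=2)

markup = to_json_ld_script(article)
print(markup)
```

The nesting is the point: the byline is not a string but an object with its own type, which is what lets a search engine treat the author as an entity.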

Driving application: rich snippets

Schema.org covers not just news but music, restaurants, people, organizations, reviews, offers...
Snippets, and better searchability generally, are the motivation for Google, Yahoo, and Bing to push schema.org.

Additional metadata from indexing team

In database, but doesn't necessarily make it to HTML.

Application: content navigation

Articles about Syria on NYT topic page
More reliable than simple
text search (because the
relevance algorithm knows a
story is "about" Syria.)

Application: automatic stories


Wall Street is high on Molson Coors Brewing (TAP), expecting it to report earnings
that are up 17.5% from a year ago when it reports its third quarter earnings on
Wednesday, November 7, 2012. The consensus estimate is $1.34 per share, up
from earnings of $1.14 per share a year ago.
The consensus estimate has dipped over the past month, from $1.35, but it's still
up from the consensus estimate of $1.19 three months ago. For the fiscal year,
analysts are expecting earnings of $3.89 per share. Revenue is projected to
eclipse the year-earlier total of $954.4 million by 31%, finishing at $1.25 billion for
the quarter. For the year, revenue is projected to roll in at $4.04 billion.
The company's net income has declined in the last two quarters. The company
posted profit falling by 52.8% in the second quarter. This is after it reported a profit
decline in the first quarter by 4.1%.

Automatic story generation, by Narrative Science

Ontologies and Linked Data

What objects and relations are available?

Often represented as class hierarchy.


Arrows = is_a relation

(Part of) a real ontology, from Cyc

Every big news org has their own big ontology :(

topics, people, organizations, places...

Yaaay Linked Data!


Triples of (subject relation object), each a URL or literal
<urn:x-states:New%20York>
<http://purl.org/dc/terms/alternative>
"NY"
<http://dbpedia.org/resource/Columbia_University>
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://schema.org/CollegeOrUniversity>

Abbreviations possible with many formats...

<http://dbpedia.org/resource/Columbia_University>
rdf:type
ns6:CollegeOrUniversity
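
A minimal sketch of how the two example triples and the prefix abbreviation could be handled in plain Python, with no RDF library. The prefix `rdf` is the conventional one; `schema` and `dcterms` are names chosen here (the slide's `ns6` looks like a tool-generated prefix), and `abbreviate` is a hypothetical helper.

```python
# The two triples from the slide; each term is a URL or a literal.
triples = [
    ("urn:x-states:New%20York",
     "http://purl.org/dc/terms/alternative",
     '"NY"'),
    ("http://dbpedia.org/resource/Columbia_University",
     "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
     "http://schema.org/CollegeOrUniversity"),
]

# Prefix table. Only "rdf" is a fixed convention; the others are assumptions.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "schema": "http://schema.org/",
    "dcterms": "http://purl.org/dc/terms/",
}

def abbreviate(term):
    """Return prefix:local if a known namespace matches, else the full term."""
    if term.startswith('"'):      # literals are left untouched
        return term
    for prefix, namespace in PREFIXES.items():
        if term.startswith(namespace):
            return "%s:%s" % (prefix, term[len(namespace):])
    return "<%s>" % term          # unknown namespace: keep the full URL

for s, p, o in triples:
    print(abbreviate(s), abbreviate(p), abbreviate(o), ".")
```

The abbreviation changes only how a triple is written, not what it says: `rdf:type` and the full URL name the same relation.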

NYT ontology available as LOD

owl:sameAs makes this interoperable

NYT API can return linked data


{
"title": "Syria's Rebels Open Talks on Forging United Political Front",
"body": "BEIRUT, Lebanon - Syria's fractious opposition groups began
negotiations in Doha, Qatar, on Sunday to forge a more unified front to reshape
the political landscape in a bloody conflict that claims more than 100 lives
virtually every day. Given the scant prospects that any attempt to restructure
the opposition will succeed the",
"dbpedia_resource_url": [
"http://dbpedia.org/resource/Hillary_Rodham_Clinton",
"http://dbpedia.org/resource/Bashar_al-Assad"],
"facet_terms": "CLINTON, HILLARY RODHAM ASSAD, BASHAR AL- SYRIA DOHA
(QATAR) SYRIAN NATIONAL COUNCIL STATE DEPARTMENT WAR AND REVOLUTION DEFENSE AND
MILITARY FORCES"
}
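
One way to use a response like this is to turn its entity links into triples. The sketch below copies the `title` and `dbpedia_resource_url` fields from the slide (body omitted); the relation name `mentions` and the function `article_triples` are assumptions made for illustration, not part of the NYT API.

```python
# Fields copied from the response on the slide (body omitted for brevity).
response = {
    "title": "Syria's Rebels Open Talks on Forging United Political Front",
    "dbpedia_resource_url": [
        "http://dbpedia.org/resource/Hillary_Rodham_Clinton",
        "http://dbpedia.org/resource/Bashar_al-Assad",
    ],
}

def article_triples(article):
    """Turn one article's linked entities into (subject, relation, object) triples.
    "mentions" is a hypothetical relation name chosen for this sketch."""
    return [(article["title"], "mentions", url)
            for url in article.get("dbpedia_resource_url", [])]

for triple in article_triples(response):
    print(triple)
```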

Relations from Text

Objects and relations in text?

names, dates, places, verbs.

Named Entity Recognition


Extract subjects and objects from text.
Also, resolve pronouns if possible.
"Gov. Andrew M. Cuomo on Wednesday gave a sea
wall the nod. Because of the recent history of
powerful storms hitting the area, he said, elected
officials have a responsibility to consider new and
innovative plans to prevent similar damage in the
future."

NER state of the art


Commercial: Google Knowledge Graph
Academic: Stanford NER library

Next level of understanding: verbs


The water that made rivers of Avenues C and D
receded on Tuesday, and the East Village was a
mixture of disaster and nonchalance. A group of
young men in pajama pants and shorts threw a
football on East 12th Street, while workers pumped the
basement of CHP Hardware on Avenue C and Eighth
Street.

subject verb object

Knowledge Representation in Classic AI

KR in GOFAI
Classic "symbolic" paradigm represents knowledge as
statements in mathematical logic.
Many variations. Most are subsets or modifications of
standard first order logic (FOL).
Mathematical representation of human knowledge is a
very old dream! (Greeks, Leibniz, Esperanto)

Leibniz, 1685
The only way to rectify our reasonings is to make them
as tangible as those of the Mathematicians, so that
we can find our error at a glance, and when there are
disputes among persons, we can simply say: Let us
calculate, without further ado, to see who is right.

Predicates and Relations


Predicate: asserts that object belongs to a class
vehicle(schoolbus)
bird(tweety)
straight_gangsta(emily_bell)

Relation: asserts relationship between objects


is_a(car, vehicle)
higher_rank(general, colonel)
capital(paris, france)

Inference
General rules
a & (a => b) => b
p | !p

Domain specific inferences


is_a(car, vehicle)
can_move(vehicle)
=> can_move(car)

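The domain-specific inference above can be sketched as a small forward-chaining loop. The rule it applies, "is_a(x, y) and p(y) implies p(x)", is an assumption about what the slide intends; `inherit_properties` is a name made up for this sketch.

```python
# Facts from the slide, stored as tuples: (predicate, arg1[, arg2]).
facts = {
    ("is_a", "car", "vehicle"),
    ("can_move", "vehicle"),
}

def inherit_properties(facts):
    """Apply one rule until nothing new appears:
    is_a(x, y) and p(y)  =>  p(x), for any one-argument predicate p."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        is_a_pairs = [(f[1], f[2]) for f in facts if f[0] == "is_a"]
        unary = [f for f in facts if len(f) == 2]
        for child, parent in is_a_pairs:
            for pred, arg in unary:
                new_fact = (pred, child)
                if arg == parent and new_fact not in facts:
                    facts.add(new_fact)
                    changed = True
    return facts

derived = inherit_properties(facts)
print(("can_move", "car") in derived)
```

The loop repeats because one derived fact can enable another, for example when is_a relations form a chain.
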
News as relations between entities


Alice attended the wedding

attended(alice, wedding)
IBM was founded in 1911.

founded(IBM, 1911)
Hurricane Sandy hit New York

hit(hurricane_sandy, New_York)

Encode facts as relation(subject,object)


also written (subject relation object)

Things we could do with this


Question answering
The granddaughter of which actor starred in E.T.?
(?x acted-in E.T.)(?y is-a actor)(?x granddaughter-of ?y)

Inference
(bob brother-of alice)
(alice mother-of lucy) =>
(bob uncle-of lucy)

Answer questions using inference


how many executives of publicly-traded Canadian companies died
in car crashes?
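
A minimal sketch of both ideas, question answering by pattern matching and inference by rule, over a tiny in-memory triple store. The Barrymore facts are added here so that the E.T. question has an answer; the bob/alice/lucy facts come from the slide. The functions `match` and `query` are hypothetical helpers, not a real query engine.

```python
# Facts as (subject, relation, object).
facts = {
    ("drew_barrymore", "acted-in", "E.T."),
    ("john_barrymore", "is-a", "actor"),
    ("drew_barrymore", "granddaughter-of", "john_barrymore"),
    ("bob", "brother-of", "alice"),
    ("alice", "mother-of", "lucy"),
}

def is_var(term):
    return term.startswith("?")

def match(pattern, fact, bindings):
    """Try to unify one pattern with one fact; return extended bindings or None."""
    bindings = dict(bindings)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if bindings.setdefault(p, f) != f:
                return None       # variable already bound to something else
        elif p != f:
            return None           # constant does not match
    return bindings

def query(patterns, facts, bindings=None):
    """Yield every variable binding that satisfies all patterns (a conjunction)."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    for fact in facts:
        b = match(patterns[0], fact, bindings)
        if b is not None:
            yield from query(patterns[1:], facts, b)

# The question from the slide, pattern for pattern.
question = [("?x", "acted-in", "E.T."),
            ("?y", "is-a", "actor"),
            ("?x", "granddaughter-of", "?y")]
answers = [b["?y"] for b in query(question, facts)]
print(answers)

# The inference from the slide: brother-of + mother-of => uncle-of.
uncle_rule = [("?a", "brother-of", "?b"), ("?b", "mother-of", "?c")]
inferred = {(b["?a"], "uncle-of", b["?c"]) for b in query(uncle_rule, facts)}
print(inferred)
```

Answering the car-crash question would need the same machinery plus many more facts and rules, which is where the problems on the following slides begin.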

Problems
Not all subjects are simple.
Over a hundred guests attended the wedding

attended(num_guests, wedding)
greater_than(num_guests,100)

Some relations have multiple parts.


Hurricane Sandy hit New York on Monday
hit(sandy, New_York, monday)
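
One common workaround, sketched here as an assumption rather than as the lecture's method, is to introduce a node for the event itself and hang each part of the relation off it as an ordinary triple. The role names (`agent`, `target`, `time`) and the `reify` helper are illustrative only.

```python
import itertools

_event_ids = itertools.count(1)

def reify(relation, **roles):
    """Represent an n-ary relation as binary triples about a new event node.
    Role names such as 'agent', 'target', 'time' are illustrative, not a standard."""
    event = "event_%d" % next(_event_ids)
    triples = [(event, "type", relation)]
    triples += [(event, role, value) for role, value in roles.items()]
    return triples

for t in reify("hit", agent="sandy", target="New_York", time="monday"):
    print(t)
```

The cost is that a single fact now takes several triples, and every query about it has to go through the event node.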

Standard inference doesn't allow defaults


All birds fly
bird(tweety)
bird(?x) => flies(?x)
=> flies(tweety)

But, penguins and chickens don't fly


bird(?x) & !penguin(?x) & !chicken(?x)=> flies(?x)

Now we can't guess that tweety flies


bird(tweety) => flies(tweety) ?
we don't know!
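
The difference between the strict rule and a default can be sketched in a few lines. The strict version needs proof that tweety is not a penguin and not a chicken; the default version assumes a bird flies unless an exception is known (negation as failure). The bird `opus` and the `not_` prefix for explicit negative facts are inventions of this sketch.

```python
facts = {("bird", "tweety"), ("bird", "opus"), ("penguin", "opus")}

# Classes whose members are known not to fly (the exceptions from the slide).
FLIGHTLESS = {"penguin", "chicken"}

def holds(pred, x):
    return (pred, x) in facts

def flies_strict(x):
    """Classical reading: !penguin(x) and !chicken(x) must each be proven.
    With no explicit negative facts stored, nothing can be concluded,
    so False here means "cannot prove", not "does not fly"."""
    proven_not_exception = all(holds("not_" + c, x) for c in FLIGHTLESS)
    return holds("bird", x) and proven_not_exception

def flies_default(x):
    """Default reading: assume a bird flies unless an exception is known.
    This treats "not known" as "false" (negation as failure)."""
    return holds("bird", x) and not any(holds(c, x) for c in FLIGHTLESS)

print(flies_strict("tweety"), flies_default("tweety"), flies_default("opus"))
```

Default reasoning is non-monotonic: adding the fact penguin(tweety) later would retract a conclusion that was already drawn, which standard first order logic never does.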

Standard mathematical logic doesn't deal well with exceptions

Some people don't have a last name.
Sometimes an election isn't decided on election day.
Is a trash can used as a flower pot still a trash can?
Is a broken car still a vehicle if it can't move?

Relations from sentence parsing


The water that made rivers of Avenues C and D
receded on Tuesday, and the East Village was a
mixture of disaster and nonchalance. A group of
young men in pajama pants and shorts threw a
football on East 12th Street, while workers pumped the
basement of CHP Hardware on Avenue C and Eighth
Street.

subject verb object

Relation extraction systems


Commercial: IBM's DeepQA (Watson)
Academic: Open IE project

Ontology explosions
(water made rivers of Avenues C and D)
(East Village was a mixture of disaster and nonchalance)
(group of young men in pajama pants and shorts threw
football)
(workers pumped the basement of CHP Hardware )

Do we have all of these in the ontology?

General Question Answering

Precision/recall tradeoff. State of the art is IBM's DeepQA
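
A rough sketch of the tradeoff for a question answering system that may decline to answer. It assumes each question has one candidate answer with a confidence score, and it uses "fraction of questions answered" as the stand-in for recall. The scores are made up for illustration and are not DeepQA results.

```python
def precision_and_answered(candidates, threshold):
    """candidates: list of (confidence, is_correct) pairs, one per question.
    The system answers only when confidence >= threshold."""
    answered = [correct for conf, correct in candidates if conf >= threshold]
    pct_answered = len(answered) / len(candidates)
    precision = sum(answered) / len(answered) if answered else 0.0
    return precision, pct_answered

# Made-up scores for illustration; not DeepQA results.
candidates = [(0.95, True), (0.90, True), (0.70, True), (0.60, False),
              (0.40, True), (0.30, False), (0.20, False), (0.10, False)]

for threshold in (0.8, 0.5, 0.0):
    print(threshold, precision_and_answered(candidates, threshold))
```

Lowering the threshold answers more questions but lets in more wrong answers, so precision falls as coverage rises.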

DeepQA use of structured data


Watson can also use detected relations to query a triple
store and directly generate candidate answers. Due to the
breadth of relations in the Jeopardy domain and the variety
of ways in which they are expressed, however, Watson's
current ability to effectively use curated databases to simply
look up the answers is limited to fewer than 2 percent of
the clues.
- Ferrucci et al., Building Watson
