
TopStory

COMS6998: Cloud Computing & Big Data


Fall 2015

Aaron Zakem
Khetthai Laksanakorn
Dhruv Kuchhal
Venciya George

Team
Aaron Zakem
MSc. in Computer Science
Machine Learning Track

Khetthai Laksanakorn
B.Sc. in Computer Engineering

Dhruv Kuchhal
MSc. in Electrical Engineering

Venciya George
MSc. in Computer Science
Machine Learning Track

Project Summary
Collect current news stories and extract subjects.
Monitor Tweet stream and store tweets and hashtags.
Determine trending news subjects and tweet subjects by taking the top 5 most frequently
occurring subjects and hashtags.
Display trending subjects and example stories and tweets on web page.

News Stories
Current news stories are collected from the New York Times, the Guardian, and a
variety of blogs and news sources provided by the Alchemy DataNews API.
Subjects are extracted from the New York Times, Guardian, and Alchemy DataNews
articles using the Alchemy Concept Tagging API. For each article, the 3 concepts
with the highest relevance scores are stored as the subjects of the article.
The top five most frequently occurring subjects are stored as the trending subjects.
A separate subject count is performed for only the New York Times and Guardian
articles, and the trending subjects are stored for this subset of articles alone.
The article aggregator is deployed on Elastic Beanstalk and refreshes the article
collection and top subjects every 4 hours.
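
The subject selection and counting described above can be sketched as follows. This is a minimal illustration in Python, not the project's Node.js code. The dictionary fields (`source`, `concepts`, `text`, `relevance`) and the source labels are hypothetical stand-ins for whatever the aggregator actually stores.

```python
from collections import Counter

def article_subjects(concepts, per_article=3):
    """Return the text of the highest-relevance concepts for one article."""
    # Relevance is converted with float() in case it arrives as a string.
    ranked = sorted(concepts, key=lambda c: float(c["relevance"]), reverse=True)
    return [c["text"] for c in ranked[:per_article]]

def trending_subjects(articles, top_n=5, sources=None):
    """Count subjects across articles and return the top_n (subject, count) pairs.

    If `sources` is given, only articles from those sources are counted,
    e.g. the New York Times and Guardian subset.
    """
    counts = Counter()
    for article in articles:
        if sources is not None and article["source"] not in sources:
            continue
        counts.update(article_subjects(article["concepts"]))
    return counts.most_common(top_n)
```

Calling `trending_subjects(articles)` gives the overall trending subjects, while `trending_subjects(articles, sources={"nytimes", "guardian"})` gives the separate count for the two newspapers.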

Tweets
The tweet stream is monitored, and tweet text, hashtags, and other identifying information
are stored in two databases.
One database stores unfiltered tweets sampled from the tweet stream; the other database
stores tweets filtered by a string with the keywords news, report, world, politics, economy,
business, sports, international.
Up to 20,000 of the most recent tweets from the past 24 hours are pulled from each database
when the webpage is loaded. The hashtags are taken as subjects. The top subjects in
both the filtered tweet stream and unfiltered stream are determined by frequency of
occurrence.
Many tweets in the unfiltered stream relate to celebrity news and various celebrity-related
contests (e.g., MTV Stars). The database for the filtered stream is intended to show
whether more significant subjects occur at a higher rate in keyword-filtered tweets.
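
The tweet handling described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's Node.js code. The tweet fields (`text`, `hashtags`, `created_at`) are hypothetical. The keyword check is a plain case-insensitive substring match, which may differ from how the Twitter streaming filter matches terms. Lowercasing hashtags before counting is also an assumption.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

KEYWORDS = ("news", "report", "world", "politics", "economy",
            "business", "sports", "international")

def matches_keywords(text, keywords=KEYWORDS):
    """True if any keyword appears in the tweet text (case-insensitive substring)."""
    lowered = text.lower()
    return any(k in lowered for k in keywords)

def recent_tweets(tweets, now, window=timedelta(hours=24), limit=20000):
    """Return up to `limit` tweets created within `window` of `now`, newest first."""
    cutoff = now - window
    fresh = [t for t in tweets if t["created_at"] >= cutoff]
    fresh.sort(key=lambda t: t["created_at"], reverse=True)
    return fresh[:limit]

def top_hashtags(tweets, top_n=5):
    """Count hashtags (lowercased, an assumption) and return the top_n pairs."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tag.lower() for tag in tweet["hashtags"])
    return counts.most_common(top_n)
```

Running `top_hashtags(recent_tweets(tweets, now))` once on the unfiltered tweets and once on the keyword-filtered tweets gives the two rankings that the page compares.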

Architecture

Architecture and APIs


The story aggregator is a Node.js application deployed on Elastic Beanstalk. It makes use of the
NYTimes, Guardian, Alchemy DataNews, Alchemy Concept Tagging, and MySQL APIs. Data for
the aggregated articles is stored in a MySQL RDS instance on AWS.
The tweet aggregator consists of two Node.js applications which make use of the Twitter and
MySQL APIs. Data for the aggregated tweets is stored in two separate MySQL RDS instances
on AWS.
The story server, which functions as the back-end server, is a Node.js application deployed on
Elastic Beanstalk that makes use of the MySQL and Express APIs. The front-end server is also
a Node.js application deployed on Elastic Beanstalk that makes use of the MySQL and Express
APIs.
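
The split between an aggregator that writes to the database and a server that reads from it can be sketched in Python as follows. This is only an illustration: it uses an in-memory SQLite database as a stand-in for the MySQL RDS instance, and the table name, column names, and function names are all hypothetical rather than taken from the project.

```python
import sqlite3

def open_store():
    """Create an in-memory database with a hypothetical article_subjects table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE article_subjects (
               article_url TEXT NOT NULL,
               source      TEXT NOT NULL,
               subject     TEXT NOT NULL
           )"""
    )
    return conn

def save_subjects(conn, article_url, source, subjects):
    """Aggregator side: store one row per (article, subject) pair."""
    conn.executemany(
        "INSERT INTO article_subjects (article_url, source, subject) "
        "VALUES (?, ?, ?)",
        [(article_url, source, s) for s in subjects],
    )
    conn.commit()

def top_subjects(conn, limit=5):
    """Server side: return the most frequent subjects as (subject, count) rows."""
    rows = conn.execute(
        "SELECT subject, COUNT(*) AS n FROM article_subjects "
        "GROUP BY subject ORDER BY n DESC, subject ASC LIMIT ?",
        (limit,),
    )
    return rows.fetchall()
```

Keeping the write path and the read path behind separate functions mirrors the separation between the aggregators and the servers. The same query shape would work against MySQL with a different driver and placeholder syntax.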

Results