6 Useful Databases to Dig for Data (and 100 more)
August 4, 2011 by Ai Ching

[List updated on 31 October 2014]


You already know that data is the bread and butter of reports and presentations. Data makes
your presentation solid. It backs up the ideas you are selling. It gives people reasons to listen to
you.
However, data digging is a struggle. It's a struggle to find reputable, legitimate sources,
especially in this digital age.
To make life easier, we have put together a list of useful databases that you can bookmark.
Here are eight useful databases for you to dig for data (and a couple of hundred more).

1. Freebase
Freebase is an open platform for data sharing. It covers a wide range of topics, from fictional
characters to Modest Mouse. You can even explore your data with the data plotting feature, which
plots your datasets on a timeline or a map.

2. UN Data
This database contains large datasets covering virtually all the public data collected by the
United Nations. To access the API you have to sign up (it will only take a couple of minutes).

3. WorldBank
Where else to look for financial data of the world but the WorldBank? You can get virtually any
country's financial and economic standing here. Some other topics included are:

Agriculture & Rural Development
Aid Effectiveness
Economic Policy and External Debt
Education
Energy & Mining
Environment
Financial Sector
Health
Infrastructure
Labor & Social Protection
Poverty
Private Sector
Public Sector
Science & Technology
Social Development
Urban Development

4. Data.gov
Data.gov is leading the way in democratizing public sector data and driving innovation. This
movement has spread throughout cities, states, and countries. Here are 5 of its 50+ categories:
Agriculture
Arts, Recreation, and Travel
Banking, Finance, and Insurance
Births, Deaths, Marriages, and Divorces
Business

5. Infochimps
Infochimps contains paid and free datasets on just about anything. What's cool about Infochimps is
that you can download datasets in CSV format. What's more, you can fiddle with the API
to extract the data specific to your needs. Try Twitter as your search metric and you will see
what I mean.

6. Google Public Data
The Google Public Data Explorer makes large datasets easy to explore, visualize and
communicate.

7. Google Scholar
Google Scholar is a free search engine that covers all kinds of academic literature. Citing
journal publishers, university research papers, and other scholarly materials does not just make
your content look smarter; it also makes it more trustworthy.

8. Data Market
Data Market contains in-house and third-party datasets. It's a good place to explore data related
to economics, healthcare, food and agriculture, and the automotive industry.
And here's a random collection of datasets.
Torrent downloads and uploads on Pirate Bay
Social media & networks from Stanford Uni
Human Emotions by We Feel Fine: to allow other artists to more easily make pieces that
explore these human emotions
LittleSis profiles who's who in the biggest organisations in the world
NY Times bestseller
Trending Topics: Trending Topics serves Hot Wikipedia Topics daily. It gets you the top hits
on Wikipedia by search query.
Google Flu Trends
NY Times People: User data for com, including the user profiles, activities, news feeds, and
networks.
CrunchBase: Plenty of information about startups and large tech companies
Google Analytics
Social networks: Facebook/ Twitter/ Pinterest/ LinkedIn
Project management tools: Basecamp
Sales management tools: Salesforce
Survey tools: SurveyMonkey
Photo sharing tools: Flickr
Email marketing: MailChimp

You can also get a huge number of datasets and related resources from Datamob.
DataWrangling is a place with a large volume of datasets from a wide range of fields. To make it
easier for you, we have reproduced its list below. However, do note that the list may not be up
to date, as it was last updated in 2009. Even so, it's still a good place to start digging for data.
Tips on using this list: Each link comes with tags. You can search by keyword to find the
appropriate database for your use.
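If you would rather search the tags with a script than by eye, here is a minimal Python sketch of the idea. It assumes you have saved the entries as text with one entry per line, in the `Title (tags: a, b, c)` shape used in the list below. The function names `parse_entry` and `search_by_tag` are made up for this example and are not part of any tool mentioned in this post.

```python
import re

# Assumed entry shape: "Title (tags: tag1, tag2, ...)", one entry per line.
ENTRY_PATTERN = re.compile(r"^(?P<title>.*?)\s*\(tags:\s*(?P<tags>[^)]*)\)\s*$")


def parse_entry(line):
    """Split one entry into (title, set of tags); return None if it has no tag list."""
    match = ENTRY_PATTERN.match(line.strip())
    if match is None:
        return None
    tags = {tag.strip().lower() for tag in match.group("tags").split(",") if tag.strip()}
    return match.group("title"), tags


def search_by_tag(lines, keyword):
    """Return the titles of all entries whose tag list contains the keyword."""
    keyword = keyword.lower()
    titles = []
    for line in lines:
        parsed = parse_entry(line)
        if parsed is not None and keyword in parsed[1]:
            titles.append(parsed[0])
    return titles


# Three entries copied from the list, each joined onto a single line.
sample = [
    "UN General Assembly Voting Data (tags: un, voting, statistics, government)",
    "Enron Email Dataset (tags: enron, corpus, email, text, social, network)",
    "Jester Data download page (tags: collaborative, filtering, jokes)",
]

print(search_by_tag(sample, "corpus"))
```

The match is on whole tags rather than substrings, so searching for `corpus` will not pick up a tag such as `corpora`. Loosen the comparison if you want fuzzier results.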
Happy data digging, people!
Announcing the Article Search API Open Blog NYTimes.com (tags: article, api, nytimes,
text, corpus, newspaper)
Twitter API Wiki / REST API Documentation: Social Graph Methods (tags: graph, network,
api, social, twitter)
Information Extraction: The RISE Repository of Information Sources (tags: information,
textmining, extraction, reviews, jobs)
Using the Wikipedia link dataset Henry Haselgrove (tags: graph, network, link, wikipedia,
pagerank)
Visualizing the Growth of Target, 1962-2008 | FlowingData (tags: visualization, retail,
finance, gis, map, location, store, via:magnetbox, target)
The Economy According To Mint (tags: finance, commercial, consumer, mint, spending)
Repositories (tags: links, textmining, books, rdf, ocr, documents)
Subsidyscope.com (tags: government, banking, csv, tarp, bailout)
Best Buy Remix Welcome to the Best Buy Remix Developer Network (tags: retail, data, api,
product, bestbuy)
twibs : find the businesses on twitter (tags: directory, businesses, twitter, companies)
True Marble Imagery Free Download (tags: gis, geo, map, mapping, images, satellite)
Massive Scrape of Twitter's Friend Graph blog.infochimps.org Organizing Huge
Information Sources (tags: textmining, twitter, network, socialnetwork, pagerank, graph,
queryminer)
Twitter Scrape (rough draft) get.theinfo | Google Groups (tags: twitter, socialnetwork,
graph)
API Documentation BackType (tags: api, blog, comments, textmining, stream, trends,
backtype, queryminer)

dbpedia.org : Downloads 32 (tags: wikipedia, named_entity, rdf, ontology)


CinC Challenge 2000 datasets (tags: timeseries, machinelearning, ecg, health, medical, sleep,
apnea)
Free book usage data from the University of Huddersfield Self-plagiarism is style (tags:
books, library, borrowing, recommender, isbn, recommendation, collaborative, filtering,
opendata)
UC Berkeley. Sheldon Margen Public Health Library. Statistical/Data Resources (tags: health,
links, resources, publichealth, berkeley)
ICWSM 2009 International AAAI Conference on Weblogs and Social Media (tags: blog,
crawl, corpus, network, web, link)
BART For Developers (tags: urban, transportation, feeds, public, sanfrancisco, bart, api)
Tim Davis: UF Sparse Matrix Collection : sparse matrices from a wide range of applications
(tags: spare, matrix)
Others Online Behavioral Targeting, Analytics and Advertising Service for Publishers, Ad
Networks, Widgets, WiFi Networks (tags: analytics, audience, segmentation, toolbar,
commercial, sem, search, advertising)
HumanScan : BioID : Downloads : BioID Face Database (tags: face, detection, image)
Face Detection (tags: facerecognition, opencv, face, links)
Building a (fast) Wikipedia offline reader (tags: django, wikipedia, compressed, textmining,
howto)
gov: The Obama-Biden Transition Team | Join the Discussion: Healthcare (tags: textmining,
opinion, comment, topic, government, queryminer)
UN General Assembly Voting Data (tags: un, voting, statistics, government)
NORB Object Recognition Dataset, Fu Jie Huang, Yann LeCun, New York University (tags:
image, 3d)
Reddit's Secret API (tags: reddit, api, json)
Amazon Web Services Public Datasets Data Wrangling Blog (tags: amazon, ebs, ec2, s3,
publicdata, hadoop)
Amazon Web Services (AWS) Hosted Public Datasets (tags: amazon, ebs, publicdata)
Executive PayWatch Database (tags: ceo, compensation, pay, economics, business, labor)
Research Datasets :: CID Data :: Center for International Development at Harvard University
(CID) (tags: economics, international, development)
NACDA: Search Holdings (tags: aging, statistics, studies)

LIFE photo archive hosted by Google (tags: images, photo, pictures, search)
Main Task QA Data (tags: question, answering, trec, nlp, machinelearning)
ADL Gazetteer Development (tags: named_entity, location, placenames, geo, nlp)
The New York Times Annotated Corpus YooName named entity recognition (tags:
named_entity, nytimes, corpus, people, organizations, locations)
downloading flossmole Google Code How to get FLOSSmole data for your own use
(tags: opensource, project, activity, mysql, dump)
Google Flu Trends | How does this work? (tags: google, health, trends, search, prediction,
epidemiology, biodefence, queries, queryminer)
Multi-Domain Sentiment Dataset (tags: sentiment, review, product, amazon)
Chris Pound's Name Generation Page (tags: bizzare, scifi, phrase, name, word, generators,
random, perl)
TradingSolutions Data Sources (tags: trading, finance, s, api, list)
Announcing the New York Times Campaign Finance API Open Code New York Times
Blog (tags: nyt, api, campaign, donations, fec)
Beautiful Data WikiContent (tags: book, data, wiki, via:jhammerb)
public domain sounds | free sound library (tags: sound, publicdomain, audio)
Netflix API Welcome to the Netflix Developer Network (tags: netflix, api, movie, mashup,
netflixprize, ratings)
Data Catalog (tags: dc, government, feeds, transparency, opendata)
Open beats Closed: Best Buy’s new APIs O'Reilly Radar (tags: retail, bestbuy, api)
Voter registration data; or, HERE IS YOUR HOPE, YOU FOOLS! The Edge of the American
West (tags: voter, registration, politics, 2008)
Tickermine (tags: custom, research, retail, finance, market, service, analyst)
Linked Movie Data Base (tags: rdf, movies, movie, api)
Big Huge Thesaurus API: Access 145,000 Words and Phrases (tags: webservice, api,
thesaurus, textmining, nlp, rest)
import/parse/fec.py at master from aaronsw's watchdog GitHub (tags: fec, python, parser,
government, campaign)
The Watchdog Project: volunteer (tags: government, transparency, parsing, election, python)
Dataset of the day: Where are the Obamacans? | Off the Map Official Blog of FortiusOne
(tags: obama, goverment, mashup, gis, geo, map, campaign, donations)

Activity Recognition: Datasets, Bibliography and others (tags: activity, recognition, intent)
Normalized Campaign Contribution Data (tags: cmu, politics, campaign, donations, fec,
via:jhammerb, government)
YouTube Dataset (tags: youtube, research, crawl, socialnetwork, network, graph, web)
CRAWDAD (tags: wireless, RF, radio, signal, dartmouth, network)
API Documentation Twitter Development Talk | Google Groups (tags: twitter, text, api)
Web FAQ collection | ILPS (tags: faq, question_answering, questions, web, crawl, corpus, xml,
textmining)
Yahoo! Music API YDN (tags: api, yahoo, music, artists)
Search Query Performance report Google AdWords Help Center (tags: adwords, ppc,
search, metrics, webanalytics, sem, query, queryminer)
Wordze Keyword Research Tool (tags: queryminer, keyword, tool, research, commercial,
search, adwords)
Frontal Face Databases (tags: facerecognition, face, image, recognition)
Searchable Catalogs of Data (tags: links, catalogs, social)
Download Database baseball1.com (tags: baseball, database, publicdata, statistics, sports)
radiohead Google Code (tags: lidar, visualization, radiohead, google, video)
80 Million Tiny Images (tags: images, words, english, search, visualization, imagemap)
Time Series Center | Harvard University (tags: timeseries, anomaly, detection, astronomical,
physics)
OpenVisuals Open Source Visualization Framework (tags: visualization, community, design,
processing)
BGN: Domestic Names State and Topical Gazetteer Download Files (tags: gis, usgs)
NGA: Country Files (tags: country, cities, geo)
Datasets (tags: benchmark, clustering, regression, machinelearning, list, statistics,
mathematics)
Isomap Datasets (tags: nonlinear, dimensionality, reduction, faces, digits, images, manifold)
Yahoo! Search Blog: BOSS The Next Step in our Open Search Ecosystem (tags: api, open,
search, yahoo, BOSS, queryminer)
Download the Database IP Address Lookup Community Geotarget IP Project (tags:
geocoding, geoip, internet, ip, ipaddress, mysql)
Airline Data Project (tags: airline, statistics, finance, revenue, location, travel)

Reddit.com: Ask Reddit: Where to download a DB dump of Reddit? (tags: reddit,
socialnetwork, news, web)
Show Us a Better Way: What public data is already available? (tags: statistics, census, uk,
school, news, publicdata)
Collaborative filtering dataset dating agency (tags: collaborative, filtering, dating, rating,
profiles, czech)
About Us Predictify (tags: predictionmarket, tool, finance, buzz, advertising, marketing,
startup, mmds, david_kellogg)
VGChartz.com | Video Games, Charts, News, Forums, Reviews, Wii, PS3, Xbox360, DS, PSP
(tags: sales, ranking, videogames, retail)
Store Level Information (tags: retail, finance, sales, store)
Code for querying and downloading Flickr images (tags: image, python, code, flickr, matlab,
recognition)
Image Parsing Datasets (tags: image, recognition)
TAGora Data (tags: tag, tagging)
TAGora Data (tags: netflixprize, imdb, sparql)
OHPI Traffic Volume Trends (tags: government, traffic, statistics, trends, transportation)
PigTutorial Pig Wiki (tags: search, log, query, web, excite, queries, hadoop, pig, tutorial,
mapreduce, parallel, queryminer)
Quality of Life Grand Challenge Dataset: Kitchen Capture (tags: machinelearning, motion,
capture, sensor)
Summize Twitter Search API (tags: api, buzz, opinion, trends, text, twitter, summize, search)
2008 IEEE InfoVis Contest Dataset (tags: visualization, contest, scalability, motion, tracking,
pedestrian, sensor)
IMDb Pro : Scary Movie 4: Box office (tags: movie, revenue, sales, box_office, imdb,
commercial, movie_study)
Spider-Man 2 (2004) Daily Box Office Results (tags: movie, revenue, box_office)
Live Search : xRank Celebrity check out who's hot and who's not! (tags: search, query,
volume, trends, celebrity, prediction, buzz, named_entity)
IMDbPro.com Free Trial Signup (tags: movie, revenue, timeseries, imdb, commercial,
subsription)
Free time-series and micro-data to download (tags: economics, links)
PyGTrends: Python API for Google Trends Data (tags: google, trends, search, web, analytics,
api, code, python, hack, keyword, query, forecasting, indicator, finance)


Official Google Blog: A new flavor of Google Trends (tags: google, trends, search, query, api,
csv, keyword, timeseries)
Open Research the Data: Lastfm-ArtistTags2007 Duke Listens! (tags:fm, music, tagging,
artists, tags, collaborative, filtering)
i2b2: Informatics for Integrating Biology & the Bedside (tags: medical, obesity)
Tiger Data Set Lecture (tags: tiger, gis, lectures)
Google To Launch Large Scale Geo-Services (tags: geo, google, gps, location, geolocation, cell,
wifi, api, gis)
fms Playground (tags: celebrity, misspelling, spelling, names)
ImportGenius.com : U.S. Customs Database and Competitive Intelligence Tools (tags:
commercial, shipping, imports, exports, finance, datamining)
Directory Listing of Betfair price files (tags: betting, prediction, betfair, price, csv,
predictionmarket)
Reuters Spotlight Article and Media API (tags: news, text, articles, api, content, media, xml,
images, publicdata)
DataSets Scikits Trac (tags: scipy, python, machinelearning, statistics, resource)
[Wikitech-l] page counters (tags: wikipedia, pageviews, trends, textmining, seo, topic)
Wikipedia article traffic statistics (tags: via:chl, wikipedia, web, analytics, seo, topic,
textmining, traffic)
Yahoo! Internet Location Platform YDN (tags: yahoo, geo, geocoding, location, landmarks,
gis)
How to find images on the internet Random knowledge (tags: images, links, lists, archive)
Yahoo offers geographic data to Web sites | Tech news blog CNET News.com (tags: gis,
webservice, yahoo, api, location, landmark)
Instructions for Obtaining Search Engine Transaction Logs (tags: query, search, log, excite,
altavista, alltheweb, transaction)
TechTC Technion Repository of Text Categorization Datasets (tags: datamining,
textmining, categorization, classification, odp, directory, text)


The TechTC-100 Test Collection for Text Categorization (tags: textmining, classification,
category, odp, directory)
FEC Election Contributions: Download Detailed Files by Election Cycle (tags: individual,
donations, government, election, publicdata, fec)

Juiced Google Analytics Python API: Juice Analytics (tags: search, statistics, keywords,
analytics, api, python, web, seo, google, google_analytics, juice)
Country Name and ISO 3166 Code MySQL Import File (tags: mysql, states, countries,
isocode)
Semantic Search the US Library of Congress (tags: via:inkdroid, libraries, mashup, rdf,
semantic, search, semanticweb, books, api, webservice)
geocoded Hotels GeoNames Blog (tags: hotels, geonames)
GeoNames webservice and data download (tags: locations, cities, countries, gis)
Index of /download/worldcities (tags: cities, gis)
ualberta dependency based thesaurus and word count data (tags: corpus, text, similarity,
terms)
CommonCrawl About (tags: web, crawler, bot)
Datasets and corpus / corpora for biological literature and text mining, information
extraction and information retrieval and document classification (tags: bioinformatics, text,
corpora, domainspecific, genomics, corpus)
Office of Defects Investigation (ODI), Flat File Downloads (tags: defect, recall, automobile,
fightclub, nhtsa, saefty)
p2psim kingdata : DNS server latency network distance matrices (tags: distance, matrix,
network, p2p, dns, latency, nmf, queryminer)
Sep Kamvar / Personalization / (tags: pagerank, web, matrix, matlab)
opentick.com (tags: opentick, trading, beta, feeds, finance)
WikiXMLDB: Querying Wikipedia with XQuery (tags: wikipedia, xml, ec2)
kiwitobes.com Blog Archive Walmart Growth Video (tags: walmart, visualization, video,
freebase, store, retail, locations, opening)
Open Cell Id dataset phone geolocation from GSM cellids (tags: gis, mobile, geolocation)
The Cornell Web Lab The Cornell Web Lab (tags: cornell, web, archive, hadoop, crawl)
im2gps: estimating geographic information from a single image (tags: imagerecognition,
via:csantos, gis, cmu, gps, imageprocessing, paper, hack, freaking_awesome)
Datasets: MUSCLE WP2 Evaluation, Integration and Standards (tags: image, video, audio,
currency, sports, imagerecognition)
Open Economics Store Index (tags: economics, list)
welcome @ omdb (tags: free, movie, database, netflixprize)

Cogblog Blog Archive Cogmap APIs (tags: api, cogmap, person, name, organization,
record_linkage)
Wal-Mart : Freebase The World's Database (tags: retail, locations, stores)
Cogmap: The Org Chart Wiki (tags: record_linkage, identity, name, organization, orgchart,
marketing)
German English Parallel Corpus de-news, Daily News 1996-2000 (tags: german,
translation, corpus, english, text, via:maxme)


Welcome to the CRCNS data sharing activity website CRCNS (tags: neuroscience, patch,
clamp, recordings, neuron, timeseries, patchclamp, data, neural, cortex, visual)
org: Free Redistributable Rich Datasets (tags: aggregator, links)
Frequent Itemset Mining Dataset Repository (tags: retail, clickstream, traffic, web, links,
sales)
Dolores Labs Blog Blog Archive Our color names data set is online (tags: colormap, color,
mechanicalturk)
TeradataUniversityNetwork.com -> Registration (tags: teradata, retail, transactional,
database)
Pascal Learning Challenge Large Datasets (tags: large, competition, challenge, svm,
machinelearning, scalability)
ECIS 2007 The 15th European Conference on Information Systems (tags: retail, dillards,
sams_club)
Alexa Web Search (tags: alexa, aws, web, search, api)
developerWorks Interviews: Massive data mining and the resurgent mainframe (tags: price,
retail, transaction, sams_club, dillards)
University of Arkansas Daily Headlines (tags: retail, dillards, uark)
Crime data bonanza!!! (tags: timeseries, crime, statistics, publicdata)
State and Federal Case Law (tags: creativecommons, court, legal, law, via:inkdroid)
Wikipedia:Lists of common misspellings/For machines Wikipedia, the free encyclopedia
(tags: spelling, mispelling, wikipedia)
Copyright Free and Public Domain Media (tags: images, audio, publicdata, maps, video, free)
Access to Web Research Collections VLC2/WT10g/WT2g (tags: blog, web, text)
Databases you can use for benchmarking (tags: image, vision, recognition)
Lyricsfly Lyrics API, database access to search for music artist and song title, protocol REST
with XML document (tags: song, lyrics, database, api)

2007 IEEE AVSS Detection and Tracking Algorithm Datasets (tags: tracking, video, detection,
image, recognition, vehicle, pedestrian)
Eigenvector Research, Inc. : Datasets Available to Download (tags: NIR, spectra, chemistry,
semiconductor, pharmaceutical, matlab)
OTCBVS (tags: image, recognition, detection, pedestrian, thermal, tracking, facerecognition,
illumination)
99 Wikipedia Sources Aiding the Semantic Web AI3:::Adaptive Information (tags: links,
directory, record_linkage, extraction, wikipeida, named_entity, recognition, textmining,
semanticweb, paper)
UNdata (tags: UN, publicdata, government, statistics)
AudioScrobbler Data (tags: audioscrobbler, recommendation, collaborative, filtering, music)
The Linking Open Data dataset cloud (tags: directory, rdf, semantic, data, soup, graph)
Free Economic Data | Economic, Financial, and Demographic Data (tags: finance, economics,
portal, links)
::MLSP 2008::: MLSP competition (tags: machinelearning, trading, competition, backtest,
matlab, code, finance, via:DeliciousRob)
Computer Vision Test Images (tags: computer, vision, image, ray, trace, fingerprint, stereo,
detection, via:chl)
The Dataverse Network Project | The Dataverse Network Project (tags: statistics, repository,
harvard)
DVN Home (tags: harvard, repository, social, science, research, portal, links)
Ohio voter registration data (tags: voter, voting, politics, government, name, address,
registration)
Voter List Data Files Election Department, Clark County, Nevada (tags: voting, voter,
registration, name, address, data, election, politics, government, nevada)
Temperature data (HadCRUT3 and CRUTEM3) (tags: climate, temperature, netcdf)
MNIST handwritten digit database, Yann LeCun and Corinna Cortes (tags: handwriting,
mnist, image, recognition)
LFW : Labelled Faces in the Wild (tags: facerecognition, face, recognition, umass, image)
Making random contacts (37signals) (tags: generator, names)
Test (Sample) Data Generators (tags: generator, tools, list, via:jd)
Compete Compete Developer Resources (tags: compete, api, web, statistics, traffic,
analytics, mashup)

Machine Learning (Theory) The Peekaboom Dataset (tags: peekaboom, vision, image, large,
human, computation, machinelearning, recognition)
Ocean Processes and Modeling: Ocean Data (tags: links, oceanography, satellite)
BlogoCenter datasets (tags: blog, ucla)
Tagged datasets for named entity recognition tasks (tags: nlp, corpus, tagged, named_entity,
recognition, list)
icio.us stats deli.ckoma (tags: del.icio.us)
The Financial Data Finder A G (tags: finance, links)
Freebase Wikipedia Extraction (WEX) (tags: wikipedia, xml, structured, corpus)
The arXiv.org API (tags: arxiv, api, open, paper, academic)
England Football Results Betting Odds | Premiership Results & Betting Odds (tags: gambling,
soccer, football, excel, statistics)
HughesData Main Hughes Lab (tags: rna, bioinformatics, microarray, expression, gene,
machinelearning)
Stanford MicroArray Database (tags: bioinformatics, microarray, expression, gene,
machinelearning, stanford)
ArrayExpress Home (tags: bioinformatics, microarray, expression, gene, machinelearning)
Gene Expression Omnibus (GEO) Main page (tags: bioinformatics, microarray, expression,
gene, machinelearning)
Index of /courts.gov (tags: corpus, text, legal, law, court, ruling, opensource, publicdata)
Welcome to Openvest (tags: python, finance, edgar, pylons, matplotlib, sec, webservice,
via:jolby)
Statistical Science Web: Datasets (tags: links, statistics)
Data Mining: Text Mining, Visualization and Social Media: TailRank, Spinn3r, TechMeme and
TechCrunch: New Attention (tags: crawler, blog, corpus)
Aleix Face Database (tags: facerecognition, machinelearning, face, image)
Data Repository Evaluation (tags: umd, links, statistics, government, sports, via:rickladd)
PMC FTP Service (tags: biology, medicine, articles, text, journal, authors)
uspop2002 data set (tags: music, similarity, machinelearning)
Internet Archive: Details: Amazon ASIN listing and similarity graph (tags: ASIN, amazon,
recommendation, collaborative, filtering, via:keyvowel)
European Climate Assessment Daily Weather Data (tags: weather, europe, ascii, netcdf)

Poverty Datasets General Information (tags: poverty, statistics)


StatLibDatasets Archive (tags: machinelearning, datamining, cmu, link, collection)
National Household Travel Survey (NHTS) Data (tags: driving, transportation, publicdata)
RealClearPolitics Election 2008 Democratic Presidential Nomination (tags: polls, politics)
Nielsen BookScan USA (tags: books, sales, commercial)
Pew Internet & American Life Project (tags: internet, demographics, online, web)
Home Numbrary (tags: finance, data)
About Numbrary (tags: searchengine, search, tagging, aggregator, numeric, extraction,
tables, collaboration, web2.0, interface, billpoint)
Main Page OpenTextMining (tags: textmining, open, nature, standards, search)
Metafilter Infodump (tags: metafilter, comments, network, via:chl)
WEBSPAM-UK2007 | Datasets | Web Spam Detection (tags: web, search, spam, crawler,
yahoo)
Google to Host Terabytes of Open-Source Science Data | Wired Science from Wired.com
(tags: google, article, openaccess)
Zillow Labs Neighborhood Boundaries (tags: neighborhoods, geo, gis, maps)
Trust network datasets TrustLet (tags: socialnetwork, trustnetwork, trust)
Crime in the United States 2006 (tags: crime, fbi)
TaskForces/CommunityProjects/LinkingOpenData/DataSets ESW Wiki (tags: opendata,
semantic, rdf, collaboration)
Some Datasets Available on the Web Data Wrangling Blog (tags: publicdata, links)
XML.com: GovTrack.us, Public Data, and the Semantic Web (tags: semanticweb, rdf,
congress, politics, government)
CiteULike: Available datasets (tags: networks, research, graph, tags, paper, record_linkage)
Archive-It.org (tags: archive, internet, web, index)
Challenge: Synopsis Causality Workbench (tags: competition, machinelearning, forecasting,
contest)
Natural Language Processing (tags: microsoft, text, paraphrase, corpus)
LDC Linguistic Data Consortium Obtaining Data Resources (tags: nlp, text, corpus, ngram,
google, commercial, license)
1990 Census Name Files (tags: census, names, identity, frequency, record_linkage)
Given Name Frequency Project: Analysis of Given Name Popularity (tags: name,
record_linkage, text, identity, code)


Email Datasets (tags: enron, names, identity, text, record_linkage)
ZoomInfo Welcome to the ZoomInfo Developer API (tags: api, identity, people, webservice,
record_linkage)
Ted Pedersen Name Discrimination Data / Name Disambiguation Data / Name Ambiguity
Data / Named Entity Resolution / Named Entity Disambiguation (tags: record_linkage, corpus,
nlp, names)
Developers Area eBay Market Data Documentation eBay Market Data Documentation
(tags: ebay, api, retail, price, code)
New SwetoDblp RDF dataset released with 11M triples (tags: name, authorship, rdf,
record_linkage)
LSDIS : SwetoDblp (tags: bibliography, rdf, ontology, duplicate, name, record_linkage)
StrikeIron Super Data Pack Web Service 1.0 StrikeIron Marketplace (tags: webservice,
publicdata, datacleaning)
Vaccines: IIS/Tech/Deduplication Test Cases (tags: duplicate)
Duplicate Detection, Record Linkage, and Identity Uncertainty: Datasets (tags: duplicate,
detection, record_linkage, datacleaning, text)
INFO 747 Social and Economic Data (tags: datacleaning, record_linkage, video, lectures,
course, cornell, economics, finance, publicdata)
Overstock.com Affiliate Program (tags: retail, overstock, sales, api, product, price,
forecasting)
Amazon Web Services Developer Connection : Can Alexa WS provide detailed (tags:
finance, alexa, amazon, tech)
Market Data eBay Developers Program (tags: ebay, retail, pricing, sales, api, product)
Health Data Tools and Statistics (tags: health, information, public, publicdata)
It's a Pitch-by-Pitch Scouting Report, Minus the Scout New York Times (tags: baseball,
gameday)
opentick :: market data (tags: opentick, nasdaq, finance, stock)
Daily Kos: Obama helps us track $1,000,000,000,000 of federal spending (tags: corruption,
government, politics, finance)
Welcome to USAspending.gov (tags: government, money, politics)
Campaign Finance Reports and Data (tags: campaign, politics, elections)
Machine Learning and Data Mining Datasets (tags: face, image)

GIS for Schools (tags: epidemiology, gis, health)


Cardiac MRI dataset York University (tags: mri, cardiac)
Google Trends API coming soon | Tech news blog CNET News.com (tags: google, trends,
api)
MIT Media Lab: Reality Mining (tags: social, activity, location, cell, gis)
RL Competition 2008 Home (tags: machinelearning, reinforcement, agent, competition)
Vehicle Routing Datasets (tags: optimization, vehicle, routing)
EIA Petroleum Data, Reports, Analysis, Surveys (tags: oil, energy, statistics, economics,
petroleum)
DMOZ100k06 Michael G. Noll (tags: search, pagerank, text, tags, content)
Grading (tags: machinelearning, CMU, course, projects, graphicalmodel, code, paper)
Carnegie Mellon University CMU Graphics Lab motion capture library (tags: gait,
pedestrian, walk, motion)
Financial Forecast Centers Historical Economic and Market Data (tags: exchangerate, dollar,
economics)
Bureau of Labor Statistics Data (tags: economics, lumber, building, materials, homedepot)
Browse Business Cycle Indicators Data (tags: economics, indicators, time, series)
The Numbers Guy : Aspiring to Be the Wikipedia of Numbers (tags: finance, numberpedia,
mechanicalturk, textmining, statistics)
Social characteristics of the Marvel Universe (tags: socialnetwork, graphs, comicbooks)
net: Word Lists Collection (tags: dictionary, words)
ERS/USDA Data International Macroeconomic Data Set (tags: usda, economics, population,
cpi, gdp, income)
State Agency Databases GODORT (tags: government, directory, links, wiki, states)
The 2000 U.S. Census: 1 Billion RDF Triples (tags: gis, census, rdf, semantic, sparql)
See Who's Editing Wikipedia Diebold, the CIA, a Campaign (tags: wikipedia, authorship)
Dataset Generator Perfect data for an imperfect world. (tags: tools, generator)
National Bureau of Economic Research: Data (tags: economics, links)
Entree Chicago Recommendation Data (tags: recommender, collaborative, restaurant)
community resource guide: i've been here before show me the links (tags: demographics,
maps, gis, statistics, links)
Social Science Data on the Net (tags: economics, social, government, health, labor, links)

NBI ASCII Files Bridge FHWA (tags: government, bridges, safety)


List of films: A Wikipedia, the free encyclopedia (tags: netflix, netflixprize, movie, index,
wikipedia)
The arXiv on your harddrive (tags: paper, corpus, arXiv)
Insanely Useful Websites | Sunlight Foundation (tags: links, transparency, government,
politics, congress, reference)
Technophilia: Where to find public records online Lifehacker (tags: public, records, links)
Junk email project (tags: corpus, email, spam, textmining)
Enron Email Dataset (tags: enron, corpus, email, text, social, network)
ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt (tags: finance, cpi, inflation, data)
GOS Geospatial One Stop (tags: health, gis, epidemiology, links)
CIA Factbook Grep in Python (tags: cia, population, python, code, grep)
Miller Center of Public Affairs Richard Nixon Oval Office Recordings (tags: nixon, speech,
tapes, audio, mp3, wav, flac)
Deborah Jeane Palfrey Legal Defense Fund (tags: phone, politics)
UC San Diego Data Mining Competition 2007 Datasets (tags: housing, refinance,
mortgage)
package MoinMaster
Retail Industry Financial Ratios & Benchmarks (tags: retail, finance, sales, sqft)
stores | POI Factory (tags: retail, location, poi)
GpsPasSion Forums ** INDEX OF POI COLLECTIONS ** (tags: retail, poi, location, gis, gps)
GPS POI US : Home > Retail Stores (tags: retail, location, gis)
Collective Dynamics Group (tags: smallworld, networking, socialnetwork, graph)
Jester Data download page (tags: collaborative, filtering, jokes)
TricTrac: Video Dataset (tags: video)
Premium Business Information Databases AlacraWiki (tags: links, finance, commercial)
Index of /edgar (tags: finance, xml, edgar, sec, code, perl)
Mail Index (tags: EDGAR, sec, mail, text)
metafy / AnthraciteIdioms (tags: finance, SEC, scrape, parse, commercial)
Advance Monthly Sales for Retail and Food Services Time Series Data/Seasonal Factors
1992 to Present (tags: retail, sales, census)
TDT (tags: categorization, textmining, detection, tools)
Volume of retail sales: Social Trends 33 (tags: retail, sales, uk)
generatedata.com (tags: tools, generator, random)
U.S. Company Filings and Annual Reports (tags: finance, links, sec)
FTP Information EDGAR Database (tags: edgar, finance, sec, filing, ftp, instructions)
Data Mining For Investing (tags: investing, finance, datamining, announcement, sec, filing,
links)
Melissa DATA Lookups (tags: consumer, data, database, api)
FactSet: Data Maven Kiplinger.com (tags: factset, finance)
IBES (Demo) (tags: finance, ibes, analyst, forecast, wharton)
Thomson Financial I/B/E/S Data (tags: finance)
Historical Quotes Yahoo! Finance (tags: yahoo, finance, stock, price)
Network data (tags: network, links)
Bureau of Labor Statistics Home Page (tags: statistics, labor, government, consumer)
NAR: Research: EHS Data (tags: housing, sales, finance)
RFA The Industry Industry Statistics (tags: ethanol)
Chain Store Guide Retail Locations (tags: retail, finance, store, locations, gis)
Press Releases Directions Magazine (tags: retail, gis, store, locations)
Energy Information Administration EIA Official Energy Statistics from the U.S.
Government (tags: finance, government, energy, historical, forecasts, fuel, oil)
Databases you can use for benchmarking (tags: links)
UPC Database: Downloads (tags: product, upc, database)
Web Crawling / Crawl Datasets at Tobias Escher at the OII (tags: crawler, benchmark, search,
web, links)
TechTC Technion Repository of Text Categorization Datasets (tags: corpus, text)
TMC data archive download site (tags: traffic, data)
http://www.volvis.org/ (tags: volumerendering)
Computational Vision: Archive (tags: vision, caltech, imagerecognition)
DC Pedestrian Classification Benchmark (tags: pedestrian, image, classification, detection)
opentick :: home (tags: finance, economics, feed, free, stock, trading, opentick, opensource)
Web as Corpus (tags: textmining, corpus, concordance, wordlist, n-gram)
.:[ packet storm ]:. http://packetstormsecurity.org/ (tags: dictionary, hack, security, wordlist,
password)
Enron Dataset (tags: data, mysql, email, energy, text, socialnetwork)
Splog Blog Dataset (tags: blog, corpus, spam)
Home Page for 20 Newsgroups Data Set (tags: corpus, text, newsgroup)
White Glove Tracking (tags: crowdsourcing, image, processing, algorithm, collaborative,
distributed, web2.0, code, opensource)
NOAA Paleoclimatology Program Coral and Sclerosponge Data (tags: paleoclimatology,
climate, oceanography, coral, sponge, biology)
NAICS North American Industry Classification System (tags: finance, economics, naics,
industry, classifications)
Saving Democracy With Web 2.0 (tags: democracy, web2.0, mashup, government, funding,
article)
Congresspedia (tags: collaborative, wiki, government, congress, politics,
elections, web2.0, directory)
Population Estimates Datasets (tags: census, data, population, statistics)
CRAN Task View: Machine Learning & Statistical Learning (tags: statisticallearning,
machinelearning, code, R, libraries, cran)
Data for Data Mining (tags: linkd, datamining, timeseries, text, extraction, socialnetwork)
PAIDA Pure Python scientific analysis package (tags: python, visualization, library)
SUBDUE Graph Based Knowledge Discovery (tags: machinelearning, network, graph)
AOL search data mirrors (tags: aol, search)
Python Cheese Shop : shakespeare 0.4 (tags: python, text)
AG's corpus of news articles (tags: corpus, nlp, machinelearning, textmining)
Sampling Techniques for Massive Data Google Video (tags: video, machinelearning,
statistics, matrix, sampling, large, sparse, algorithm, experiment_design, towatch)
metachronistic Mirror the Wikipedia (tags: wikipedia, laptop, install, dump)
LETOR: Benchmark Datasets for Learning to Rank (tags: ranking, search)
CN710: Comparative Analysis of Learning Systems (Spring 2006) Class Project (tags:
machinelearning, algorithm, ogi, bu, greyhound, finance)
UrbanSim Home (tags: python, urban, software, simulation, opensource, GIS, census)
System One Wikipedia (tags: wikipedia, rdf)
System One Labs (tags: wikipedia, rdf, tools)
Face Recognition Homepage Databases (tags: face, algorithm, facerecognition, data, image)
CBCL SOFTWARE Face data set (tags: face, seung, algorithm, recognition, image)
Text Analytics Solutions from ClearForest (tags: extraction, finance, semantic, semanticweb,
text)
23C3 Mining Search Queries Google Video (tags: aol, search, video, talk, algorithm,
informationretrieval, datamining, machinelearning)
Digital History Hacks: Keywords and Clues (tags: aol, search, query, analysis)
Digital History Hacks: Searching for History (tags: aol, search, query, analysis)
The Tom Kyte Blog: An interesting data set (tags: aol, search, oracle, database, code)
KDD 2005 KDD Cup 2005: Aug 21-24, Chicago, IL. USA (tags: query, categorization,
algorithm, google)
Statistical NLP / corpus-based computational linguistics resources (tags: corpus,
machinelearning, text)
Ph.d.-student Rasmus Elsborg Madsen (tags: text, machinelearning, context, matlab)
Intelligent Web Search and Mining: Tools & Resources (tags: machinelearning, code, links)
PageRank Datasets and Code (tags: pagerank, code, algorithm)
Official Google Research Blog: All Our N-gram are Belong to You (tags: linguistics, google,
ngram, nlp, record_linkage)
Hyper-threaded Java Java World (tags: clustering, algorithm, java, parallel)
Statistical Modeling, Causal Inference, and Social Science (tags: blog, econometrics, finance,
machinelearning, math, statistics)
Structural Analysis of Discrete Data and Econometric Applications, by Charles F. Manski and
Daniel L. McFadden, MIT Press, 1981. (tags: books, econometrics, economics, finance, ebook)
Kris Brower Archives Google Onpage Search Results Analysis (tags: google, ranking, aol,
search, analytics)
CSE 250B Fall 2006 (tags: netflixprize, machinelearning, course)
Matrix Market (tags: matrixmarket, matrix)
Analysis of incomplete datasets: Estimation of mean values and covariance matrices and
imputation of missing values (tags: imputation, matlab, missing, EM, machinelearning)
Face Detection (tags: face, image)
CSE 250B Project 4, Fall 2006 (tags: subset, netflixprize, dimensionality, reduction)
G3DATA (tags: extract, from, graphs, hack, google, trends)
cwm a general purpose data processor for the semantic web (tags: python, processor,
semantic, web, rdf)
WebBase Project (tags: link, analysis, sturcture, web, crawler, stanford)
sam roweis : data (tags: machine, learning, matlab, python, hackers, image)
Index of /data/sequence/mnist (tags: mnist, xml, format)
MNIST handwritten digit database (tags: mnist)
Book-Crossing Dataset (tags: data, set, collaborative, filtering, datamining, books, movie)
allmovie (tags: movie, netflixprize, source)
Submissions Guidelines for the Collectorz.com Online Movie Database (tags: movie, source)
Cinema.com (tags: plot, synopsis, movie, netflixprize, prize)
LUMIERE (tags: netflixprize, prize, european, movie, revenue)
Data dumps Meta (tags: mediawiki, wikipedia, import, mysql, sql)
phone *** address * e-mail intitle:curriculum vitae Google Search (tags: resume,
google)

Ai Ching
Ching is the Chief Email Officer and dedicates her time to finding ninja growth-hacking ways. Formerly at P&G and
an experimental psychologist, Ching's addictions include supporting new projects on Kickstarter and travelling.

2015 Piktochart Infographics. Terms of Use | Privacy Policy
