Daniel Burseth

Co-president, MIT Big Data Explorers


dburseth@mit.edu
@dmbnyc
Github: dburseth

Acronyms abound
Tremendous complexity
Use building blocks, not code
This is easy
EPPM of 10 requires 500 professionals
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?emc=eta1&_r=0
Data preparation and cleansing:
Missing values
Duplicate records
Conventions (dates, times, geographies)
Spacing
Can we measure data cleanliness?
What's our Pareto point?

AWS -> EC2
Launch instance: ami-c6b61fae (US-EAST)
Instance type m3.medium
Connect
You should see some software on the desktop



Scrape all of Craigslist's Boston apartment listings using WebHarvy
Examine, clean, and prepare the data set using OpenRefine
Map our data and apply filters using Tableau

all without writing a single line of code.
A hyper-intelligent utility to scrape website data.
SysNucleus, makers of USBTrace
Heavy-duty alternatives: Scrapy (scrapy.org), Beautiful Soup
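Those heavier alternatives do require code. For the curious, here is a minimal sketch of what WebHarvy's point-and-click capture does under the hood: pull listing titles and URLs out of a results page. The markup, class names, and paths below are invented for illustration — a real Craigslist page differs — and Beautiful Soup would make this shorter; the standard library's html.parser is used so the sketch is self-contained.

```python
from html.parser import HTMLParser

# Hypothetical snippet of a results page (real Craigslist markup differs).
PAGE = """
<ul>
  <li><a class="result-title" href="/apa/1.html">2BR near Kendall Square - $2400</a></li>
  <li><a class="result-title" href="/apa/2.html">Studio in Allston - $1500</a></li>
</ul>
"""

class ListingParser(HTMLParser):
    """Collect (title, url) pairs from anchors tagged result-title."""
    def __init__(self):
        super().__init__()
        self.listings = []
        self._in_title = False
        self._href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "result-title":
            self._in_title = True
            self._href = attrs.get("href")

    def handle_data(self, data):
        # The first text node inside a result-title anchor is the title.
        if self._in_title:
            self.listings.append((data, self._href))
            self._in_title = False

parser = ListingParser()
parser.feed(PAGE)
print(parser.listings)
```

This is exactly the "capture text" / "capture URL" pairing from the WebHarvy steps, expressed as code.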


HTTP://SHOUTKEY.COM/WIRE
1. Start Config
2. Click on Hungry Mother, capture text
3. Click on Hungry Mother, capture URL
4. Click on Kendall Square/MIT, capture text
5. Click on the last review, capture text
CLEAR

1. Mine -> Scrape a list of similar links
2. Click on Hungry Mother

Let's start collecting information in the first sub-page.

Edit Clear
Navigate into a sub-page
Start Config
Set as Next Page Link


Scheduler
Input keywords
Pause Inject (word of caution: scraping often violates a site's TOS — potentially not viable for apps or commercial purposes!)
TRY VISITING CRAIGSLIST IN AWS BTW!!
Proxy
Database export
Download Craigslist Boston from http://shoutkey.com/glorify
Look at our data: open Boston Dirty.csv (20k rows of mess!)
Time to CLEAN: Launch GOOGLE-REFINE.EXE
Within MOZILLA, navigate to http://127.0.0.1:3333/
Create Project -> This Computer -> Browse
Parse by tab
Create Project

1. First, sort your column.
2. Then invoke "Re-order rows permanently" in the "Sort" dropdown menu that appears at the top of the data table.
3. Then invoke Edit cells > Blank down on the Title column.
4. Then on that column, invoke Facet > Custom facets > Facet by blank.
5. Select true in that facet, and invoke Remove matching rows in the leftmost "All" dropdown menu.
6. Remove the facet.

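The six-step blank-down recipe above boils down to "sort, then keep only the first row of each run of identical titles." For the curious, a sketch in Python — the rows are dicts with a Title key, and the sample data is invented:

```python
# Sample rows standing in for Boston Dirty.csv (data is illustrative).
rows = [
    {"Title": "2BR near Kendall", "Price": "$2400"},
    {"Title": "Studio in Allston", "Price": "$1500"},
    {"Title": "2BR near Kendall", "Price": "$2400"},  # duplicate listing
]

# Steps 1-2: sort and make the order permanent.
rows.sort(key=lambda r: r["Title"])

# Steps 3-5: blank-down + facet-by-blank + remove matching rows,
# collapsed into one pass that keeps the first row of each run.
deduped, prev = [], object()
for row in rows:
    if row["Title"] != prev:
        deduped.append(row)
    prev = row["Title"]

print(len(deduped))  # 2 — the duplicate is gone
```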

Then run the To Number transform again
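The To Number transform can be approximated in a few lines: strip currency symbols and commas, and leave the cell blank (None) when it isn't numeric. A sketch — the function name is ours, not Refine's:

```python
def to_number(cell):
    """Rough analogue of Refine's To Number transform for price cells:
    strip currency symbols and commas, return None for unparseable values."""
    cleaned = cell.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None

print(to_number("$2,400"))   # 2400.0
print(to_number("call me"))  # None — stays blank, caught by a facet later
```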



Increment the radius to 7 and make judgment calls along the way.
Change the Distance Function and do the same thing.
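Refine's nearest-neighbor clustering (the method with the radius and Distance Function knobs) compares strings by edit distance; its simpler key-collision sibling normalizes each value to a "fingerprint" and merges values whose fingerprints collide. A sketch of the fingerprint idea in Python — the sample city strings are invented:

```python
import string

def fingerprint(value):
    """OpenRefine-style fingerprint key: lowercase, strip punctuation,
    split into tokens, deduplicate, and sort them."""
    table = str.maketrans("", "", string.punctuation)
    tokens = value.lower().translate(table).split()
    return " ".join(sorted(set(tokens)))

cities = ["Cambridge, MA", "cambridge ma", "MA Cambridge", "Boston"]

# Group values by fingerprint; each multi-member group is a merge candidate.
clusters = {}
for c in cities:
    clusters.setdefault(fingerprint(c), []).append(c)

print(clusters)  # the three Cambridge spellings collide on one key
```

Key collision catches reorderings and punctuation noise; the radius-based method you used in Refine additionally catches typos like "Cambrdge".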


Looks like we have SOME really expensive real estate. Data errors?
Boston Clean.csv
Load Boston Clean.csv
Go to Worksheet
Great semantic example: Tableau understands that this text translates to a lat/long pair
Look on the map in the lower right corner
Let's Filter Data

Under Measures, drag Price onto Size in Marks
Change SUM(Price) to AVG(Price)
Drag Price into Filters, change to MAX(Price), and select At Most
Right click on the filter and show Quick Filter
Drag City onto Label
Menu Map -> Map Options
Click on a node for info and drill down potential
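For readers curious what Tableau is computing in the steps above, the AVG(Price)-per-city view with an "at most" quick filter can be sketched in a few lines of Python — the listings and the cap are illustrative:

```python
# Sample (city, price) pairs; one row is a plausible data error.
listings = [
    ("Cambridge", 2400), ("Cambridge", 2600),
    ("Allston", 1500), ("Allston", 9_999_999),
]

AT_MOST = 10_000  # the "At Most" quick filter: drop errant prices

kept = [(city, price) for city, price in listings if price <= AT_MOST]

# AVG(Price) grouped by City, as in the map view.
totals = {}
for city, price in kept:
    s, n = totals.get(city, (0, 0))
    totals[city] = (s + price, n + 1)

avg_price = {city: s / n for city, (s, n) in totals.items()}
print(avg_price)  # {'Cambridge': 2500.0, 'Allston': 1500.0}
```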
1. Explored various webpage structures and scraped them
2. Exported the data to Refine
3. Parsed columns to extract critical price and location information
4. Used clustering algorithms to merge related geographies
5. Applied filters to identify errant prices
6. Exported the data to Tableau
7. Completed a quick, cursory mapping visualization
Please come talk to me