
S1: Introduction to the Course

Shawndra Hill, Spring 2013, TR 1:30-3pm and 3-4:30pm


Data

The amount of data created by each person doubles every 1.5-2 years (A. Weigend):
after five years: x10
after ten years: x100
after twenty years: x10,000

1 billion connected Flash players (A. Weigend)

40 billion RFID tags worldwide (A. Weigend)

Time Scales

Technology: ~1 year. Biology: ~100k years. (A. Weigend)

Social Data = Shared Data

10 billion pieces of content shared per month
1 billion videos watched per day

Shopping?

A process of creating and refining product-space awareness, only occasionally punctuated by purchases.

How do you know people's secret desires?

Instrument for Feedback: Data Sources (A. Weigend)

Attention: transactions, clicks
Intention: search
Situation: location, device

What is Data Mining?


What is Data Mining?

The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques (The Gartner Group).
The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules (Berry and Linoff).
The nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley, Piatetsky-Shapiro and Matheus).

What is a pattern? A relationship in the data.

Definition (Fayyad et al.): the non-trivial discovery of novel, valid, comprehensible and potentially useful patterns from data.

What is Data Mining? Example patterns:

On Thursday nights, people who buy diapers also tend to buy beer
People with good credit ratings are less likely to have accidents
Male consumers, 37+, income bracket $50K-$75K, spend between $25 and $50 per catalog order
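The diapers-and-beer rule is the classic association-rule example. A minimal sketch of how such a rule is scored by support and confidence, on a made-up set of transactions (the data is illustrative, not from the slides):

```python
# Hypothetical market-basket transactions (illustrative only).
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"diapers", "beer", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"diapers", "beer"}, transactions))       # 0.6
print(confidence({"diapers"}, {"beer"}, transactions))  # ~0.75
```

Algorithms such as Apriori search for all itemsets whose support and confidence clear chosen thresholds.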

Historical Differences Between Statistics and DM

Statistics: confirmative; small data sets / file-based; small number of variables; deductive; numeric data; clean data.
Data Mining: explorative; large data sets / databases; large number of variables; inductive; numeric and non-numeric data (including text, networks); data cleaning required.

Data Mining vs. Statistics

Statistics is known for:
well-defined hypotheses, used to learn about a specifically chosen population, studied using carefully collected data, providing inferences with well-known properties.

Data mining isn't that careful. It is:
data-driven discovery of models and patterns from massive, observational data sets.

Data Mining v. Statistics

Traditional statistics:
first hypothesize, then collect data, then analyze
often model-oriented (strong parametric models)

Data mining:
few if any a priori hypotheses
data is usually already collected a priori
analysis is typically data-driven, not hypothesis-driven
often algorithm-oriented rather than model-oriented

Different? Yes, in terms of culture and motivation. However, statistical ideas are very useful in data mining, e.g., in validating whether discovered knowledge is useful. There is increasing overlap at the boundary of statistics and DM, e.g., exploratory data analysis (based on the pioneering work of John Tukey in the 1960s).

Data Mining Enablers

Explosion of data
Fast and cheap computation and storage:
Moore's Law: processing doubles every 19 months
Disk storage doubles every 9 months
Database technology
Competitive pressure in business: data has value!
New, successful models: SVM, boosting
Commercial products: SAS, SPSS, Insightful, IBM, Oracle
Open-source products: Weka, R

[Chart: disk TB shipped per year, 1988-2000, log scale from 1E+3 to 1E+7 TB (1E+6 TB = 1 exabyte); disk TB growth 112%/yr vs. Moore's Law 58.7%/yr. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]

Data-Driven Discovery

Observational data: cheap relative to experimental data. Examples:
Transaction data archives for retail stores, airlines, etc.
Web logs for Amazon, Google, etc.
The human/mouse/rat genome
Etc., etc.

It makes sense to leverage available data; useful (?) information may be hidden in vast archives of data.

What are the perils of observational data?

Data Mining: Confluence of Multiple Disciplines

Database technology, statistics, machine learning, visualization, information science, and other disciplines all feed into data mining.

Different fields have different views of what data mining is (also different terminology!)

Induction vs. Deduction

The problem of deduction: how to demonstrate that an abstract idea applies to nature?
The problem of induction: how to go beyond a collection of facts to new concepts?

Decision Support Systems (DSSs)

Assist managers in making decisions or choices. Types of DSSs:
Model-driven: spreadsheets and other optimization-based methods from Operations Management and Finance.
Communication-driven: groupware (e.g., voting/rating), Computer-Supported Collaborative Work (CSCW), document sharing, teleconferencing.
Data-driven: collect, store, and analyze large data volumes; a.k.a. Business Intelligence (BI) systems, warehouses, OLAP.
Knowledge-driven: e.g., expert systems that capture expertise by applying rules elicited from experts. Traditional uses: medical diagnosis (e.g., MYCIN), computer configuration (e.g., XCON), personalization. Knowledge elicitation and knowledge representation are the hard problems.

This course deals mainly with data-driven DSSs (Part 1) and knowledge-driven DSSs (Part 2). We will touch briefly on model-driven DSSs in Part 2 (but see OPIM101 for more on that).

The Course


Course Objectives

Approach business problems data-analytically. Think carefully & systematically about whether & how data can improve business performance.
Be able to interact competently on the topic of data mining for business intelligence. Know the basics of data mining processes, algorithms, & systems well enough to interact with CTOs, expert data miners, and business analysts.
Be able to envision data-mining opportunities.
Get hands-on experience mining data. Be prepared to follow up on ideas or opportunities that present themselves, e.g., by performing pilot studies.

Our Goals

Understand the basics of the major Data Mining / Machine Learning techniques:
What they do: the problems they can solve
Who uses them
Where they are used
When and how to use them
How they work (at a high level only)
Limitations
Apply the techniques and evaluate the models built

Course Outline

Introduction to Modeling & Data Mining
Fundamental concepts and terminology
Data Mining methods
Classification, decision trees, association rules, clustering and segmentation, collaborative filtering, genetic algorithms, etc.
Inner workings
Strengths and weaknesses
Evaluation
How to evaluate the results of a data mining solution
Applications
Real-world business problems DM can be applied to

Course Information

Teaching style: lecture / lab / guest speakers (AT&T, IBM, Yahoo!)
Student participation/attendance is important
Lab sessions: Weka, Gephi, Python
Textbook: various publicly available readings

Course Tools

SQL (Microsoft Access)
Weka
Gephi
Python (Version 2.7)

Start installing now

Course Information

Canvas
WordPress class site: http://opim672.wordpress.com
Facebook/Twitter
Office hours: M 6-7pm, F 2-5pm, or by appointment
Email: shawndra@wharton.upenn.edu
TAs: Krishna Choksi (krishnac@seas.upenn.edu), Adrian Benton

Course Information

Read material before and after class
8 homework assignments (35 points), in groups of 2
Data mining project (50 points), in groups of 4-6, 10 groups per class:
Final report
Mid-semester update
End-of-semester presentation
Project reviews
Class participation (15 points)
Data set competition (optional, for extra credit)

Warning: 1. This is a hands-on class. 2. A significant portion of the deliverables are at the end of the semester.

What is a DSS?

Decision Support Systems aim at allowing business users to make better decisions faster, and to take action more easily and more profitably based on this information. This is achieved through:
Prediction
Description
Data dissemination
Prescription

Prediction

Induction: from specific examples (instances) to general rules.

Instances:
ID       Swims  Color  Type
Animal1  yes    gray   dolphin
Animal2  yes    black  dolphin
Animal3  no     gray   elephant

Rules: IF swims=yes THEN class=dolphin
(antecedent / assumption: rule body; consequent / conclusion: rule head)

Prediction = determining the class or attribute value for a new item with some known attributes.
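A minimal sketch of that induction step, fitting a depth-1 decision tree (a one-rule learner) to the three instances above; scikit-learn and the hand-built encoding are my additions, not something the slide prescribes:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# The slide's instances, encoded by hand as [swims, color_black].
X = [[1, 0],   # Animal1: swims=yes, color=gray
     [1, 1],   # Animal2: swims=yes, color=black
     [0, 0]]   # Animal3: swims=no,  color=gray
y = ["dolphin", "dolphin", "elephant"]

# A depth-1 tree induces a single rule: IF swims=yes THEN dolphin.
clf = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(export_text(clf, feature_names=["swims", "color_black"]))

# Prediction = applying the induced rule to a new item.
print(clf.predict([[1, 0]]))  # a new gray swimmer -> ['dolphin']
```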

Text Mining


Prediction: Examples from Industry?

Classifying dolphins and flowers is dull (toy problems often cited in the data mining literature). Questions: How do we use data mining / machine learning to generate revenues or reduce costs? How do we monetize DM?!

Examples

Mining medical discussion board data
Mining Motley Fool CAPS
Social-network-based marketing
Social-network-based fraud detection
Social TV examples
Profit-maximizing recommendation engine

Prediction: Examples from Industry?

Wachovia: Can I predict if someone will default on their loan?
Visa: Can I identify fraudulent credit card transactions?
Linens 'n Things
Monster.com
The World Bank

Prediction: Examples from Industry?

Wachovia: Can I predict if someone will default on their loan?
Visa: Can I identify fraudulent credit card transactions?
Linens 'n Things: Can I predict response to online recommendations?
Monster.com: Can I predict if a company's stock value will go up based on employee attrition?
The World Bank: Can I predict if a country/organization will default?

Prediction: Examples from Industry?

ACNielsen
PepsiCo

Prediction: Examples from Industry?

ACNielsen: association rules for market baskets?
PepsiCo: identify business opportunities?

Data Mining as a Core Competency


Examples of Data Mining Successes

Google is a company built on data mining:
PageRank mined the web to build better search
Google as spell checker
Google as ad placer
Google as news aggregator
Google as face recognizer
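To make the PageRank point concrete, here is a minimal power-iteration sketch on a hypothetical four-page link graph (the graph and the damping factor 0.85 are illustrative, not Google's):

```python
import numpy as np

# Hypothetical link graph: links[i] = pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n, d = 4, 0.85  # number of pages, damping factor

# Column-stochastic matrix: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

# Power iteration: r <- d*M*r + (1-d)/n, repeated until r stops changing.
r = np.full(n, 1.0 / n)
for _ in range(100):
    r_next = d * M @ r + (1 - d) / n
    if np.abs(r_next - r).sum() < 1e-10:
        break
    r = r_next
print(r)  # steady-state importance score of each page
```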


Data Data Data

It's all about the data. Where does it come from?
WWW
NASA
Business processes/transactions
Telecommunications and networking
Medical imagery
Government, census, demographics (data.gov!)
Sensor networks, RFID tags
Sports

Types of Data: Flat File or Vector Data

Rows = objects; columns = measurements on objects.
Represent each row as a p-dimensional vector, where p is the dimensionality; in effect, this embeds our objects in a p-dimensional vector space. Often useful, but not always appropriate.
Both n and p can be very large in data mining, and the matrix can be quite sparse.

[Figure: example n x p data matrix with rows such as (2.3, -1.5, -1.3) and (1.1, 0.1, -0.1)]

Types of Data: Sparse Matrix (Text) Data

[Figure: sparse document-term matrix for a text corpus; rows = documents (1-500), columns = word IDs (1-200); dots mark nonzero counts]
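A minimal sketch of building such a sparse document-term matrix; scikit-learn's CountVectorizer is my choice of tool here, not one the slide mandates:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus; real collections have thousands of documents.
docs = [
    "data mining finds patterns in data",
    "statistics studies carefully collected data",
    "mining large databases for patterns",
]

# Rows = documents, columns = word IDs; stored sparsely since most counts are 0.
vec = CountVectorizer()
X = vec.fit_transform(docs)          # scipy.sparse matrix
print(X.shape)                       # (3, number of distinct words)
print(vec.get_feature_names_out())   # the column (word) labels
print(X.toarray())                   # dense view, fine only for toy data
```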

Sequence (Web) Data

128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 365, 414, 200, 0, POST, /spt/main.html, -,

Sometimes another representation is more useful, e.g., recoding the log as per-user sequences of page IDs.

[Table: per-user page-ID sequences for Users 1-5]
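A minimal sketch of parsing one such record and grouping URLs into per-user sequences; the field names are my reading of the IIS-style columns above, so treat them as assumptions:

```python
from collections import defaultdict

FIELDS = ["client_ip", "user", "date", "time", "service", "server",
          "server_ip", "time_taken", "bytes_sent", "bytes_recv",
          "status", "win32_status", "method", "url"]

def parse(line):
    """Split one comma-separated log record into a dict of named fields."""
    values = [v.strip() for v in line.strip().rstrip(",").split(",")]
    return dict(zip(FIELDS, values))

# Group requests into per-client sequences of visited URLs.
sequences = defaultdict(list)
record = parse("128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, "
               "128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,")
sequences[record["client_ip"]].append(record["url"])
print(dict(sequences))  # {'128.195.36.195': ['/top.html']}
```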

Types of Data: Relational Data

Web log table (requests keyed by client IP), e.g.:
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,

Customer table (keyed by IP), e.g.:
128.195.36.195, Doe, John, 12 Main St, 973-462-3421, Madison, NJ, 07932
114.12.12.25, Trank, Jill, 11 Elm St, 998-555-5675, Chester, NJ, 07911

Zip-code table (zip, city, state, population, latitude, longitude), e.g.:
07911, Chester, NJ, 34000, 40.65, -74.12
07932, Madison, NJ, 56000, 40.642, -74.132

Most large data sets are stored in relational databases: Oracle, MSFT, IBM. Good open-source alternatives: MySQL, PostgreSQL.
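A minimal sketch of how the three tables above link up relationally, using Python's built-in sqlite3; the schema and column names are my assumptions based on the slide:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE hits(client_ip TEXT, url TEXT);
    CREATE TABLE customers(client_ip TEXT, last TEXT, first TEXT, zip TEXT);
    CREATE TABLE zips(zip TEXT, city TEXT, state TEXT, population INTEGER);
""")
con.execute("INSERT INTO hits VALUES ('128.195.36.195', '/top.html')")
con.execute("INSERT INTO customers VALUES ('128.195.36.195', 'Doe', 'John', '07932')")
con.execute("INSERT INTO zips VALUES ('07932', 'Madison', 'NJ', 56000)")

# Join log hits to customer records and zip-code demographics.
rows = con.execute("""
    SELECT h.url, c.last, z.city, z.population
    FROM hits h
    JOIN customers c ON h.client_ip = c.client_ip
    JOIN zips z ON c.zip = z.zip
""").fetchall()
print(rows)  # [('/top.html', 'Doe', 'Madison', 56000)]
```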

Types of Data: Time Series Data

Often many time series, long time series, or multivariate time series.

[Figure: trajectories of centroids of a moving hand in video streams; x position vs. time]

Types of Data: Image Data

Spatio-Temporal Data

http://senseable.mit.edu/nyte/movies/nyte-globe-encounters.mov

Network Data

Algorithms for estimating relative importance in networks. S. White and P. Smyth, ACM SIGKDD, 2003.
HP Labs email network: 500 people, 20k relationships.
Also, temporal networks.


Major Application Areas

Marketing: customer loyalty/attrition; market basket analysis ("On Thursdays shoppers who buy diapers also buy beer"); direct marketing; personalization; market segmentation
Fraud detection (telecommunications, credit, securities)
Credit risk
Health care / insurance ("People with good credit ratings have fewer accidents")
Text mining: email, documents, and Web analysis
Stock selection
Non-business applications: military, bioinformatics, etc.

Examples of Data Mining Successes

Market basket analysis (WalMart)
Recommender systems (Amazon.com)
Fraud detection in telecommunications (AT&T)
Target marketing / CRM
Financial markets
DNA microarray analysis
Biometrics (fingerprinting, handwriting)
Web traffic / blog analysis

Why Data Mining Now?

Better and cheaper computing power
Mature data mining technology
Improved data collection, access & storage

Accuracy is king

"Only 15% of mergers and acquisitions succeed" (Stephen Denning, The Leader's Guide to Storytelling, p. xiv)

Profit is King (or: It pays to be wrong sometimes)

Failure rate of new ventures invested in: 8 out of 10
Profit on Google investment: $4 billion (on $25 million)

Source: http://www.financialnews-us.com/?contentid=534017

Sometimes it pays to be wrong almost all the time

Customer lifetime value: $2,700
Cost per flyer: 7 cents
Required (break-even) hit rate = 7 / 270,000 = 1 in 38,571
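A quick sketch of that break-even arithmetic, wrapped in a function so the numbers can be varied (the dollar figures are the slide's):

```python
def breakeven_hit_rate(cost_per_contact, value_per_hit):
    """Smallest response rate at which a campaign pays for itself."""
    return cost_per_contact / value_per_hit

# Slide's numbers: 7-cent flyer, $2,700 customer lifetime value.
rate = breakeven_hit_rate(0.07, 2700.0)
print(rate)             # ~2.59e-05
print(round(1 / rate))  # 38571, i.e., 1 hit in ~38,571 flyers breaks even
```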

Case: Verizon Wireless (Plain Vanilla DM)

About Verizon Wireless:
Largest wireless provider in the US
Customer base: 30.3 million
Covering 90% of the US population

Challenges:
High customer turnover rate (churn) of 2% per month (600,000 customers disconnect per month)
Associated replacement cost in the hundreds of millions per year
Average cost of new customer acquisition: $320

Case: Verizon Wireless

Possible solution: offer incentives to every customer before contracts expire.
Expensive, and no learning.

Data Mining Solution: Prediction

Build a predictive model: before contracts expire, use a predictive model to predict which customers are likely to leave (i.e., estimate the probability).
Then: offer benefits, such as a new phone, only to the customers most likely to disconnect; develop new plans to fit customer needs.
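A minimal sketch of such a churn model using logistic regression on synthetic data (scikit-learn; nothing here reflects Verizon's actual attributes or model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features: [months_to_contract_end, support_calls, monthly_usage].
X = rng.normal(size=(1000, 3))
# Synthetic churn labels, loosely tied to the features for illustration.
y = (X @ np.array([-1.0, 1.5, -0.5]) + rng.normal(size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

# Estimated churn probabilities: aim the retention offer at the top scorers.
p_churn = model.predict_proba(X_te)[:, 1]
top_risk = np.argsort(p_churn)[::-1][:25]  # the 25 highest-risk customers
print(p_churn[top_risk][:5])
```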

Phases in the DM Process: CRISP-DM

Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

www.crisp-dm.org

CRoss Industry Standard Process for DM

Business Understanding: understand project objectives and identify the data mining problem
Data Understanding: capture, understand, and explore your data for quality issues
Data Preparation: data cleaning, merging data, deriving attributes, etc.
Modeling: select the data mining techniques, build the model
Evaluation: evaluate the results and the approved models
Deployment: put models into practice; monitoring and maintenance

Case: Verizon Wireless

Understanding the Business Problem and Data

IT brought the idea to the Marketing team and presented it as a partnership.
Marketing learned the modeling process as well as the capabilities and weaknesses of modeling.
IT learned the business processes and direct marketing strategies.
Marketing recommended additions to the attributes used in building the model.

Case: Verizon Wireless

Modeling

Data selection/preparation: included hundreds of basic attributes; derived and ratio fields were added to enrich the model.
Use a predictive modeling technique to refine the relationship between the predictors and the output of interest.
Test the model: how will it perform in real life?
Select the best models (accuracy, profitability, etc.).

Case: Verizon Wireless

Results: Marketing Campaigns Using Predictive Modeling

Began with one campaign:
40-60K pieces per month
Very personalized, unique offer
Approximately 15% take rate

Currently four main campaign types:
400,000 pieces/month
Up to 35% take rate among high-churn-risk customers

Case: Verizon Wireless

Deployment

Direct mail and telemarketing: customized one-to-one mailings
Customer care application:
Customer flagged by offer
Used by: customer service, retail channels
To catch customers that reps were unable to contact, or who call to disconnect

Case: Verizon Wireless

Benefits

Cost reduction:
Customers saved: up to 80% more takes
Direct mail budget for the same churner mailing reduced by 60%
Revenue increase:
Average monthly revenue per bill increased
Monthly usage increased
Switched customers from analog to digital
Contract renewals increased

Descriptive vs. Predictive Data Mining

Descriptive DM is used to learn about and understand the data. Example: identify and describe groups of customers with common buying behavior (clustering).

Example of Descriptive (Visualization) DM Using Customer Data

Find groups of customers with similar buying patterns.
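A minimal clustering sketch in that spirit: grouping customers by two made-up spending features with k-means (scikit-learn; the data and the choice of k=3 are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Made-up customer features: [orders_per_year, average_spend].
customers = np.vstack([
    rng.normal([2, 30], [1, 5], size=(50, 2)),    # occasional small spenders
    rng.normal([12, 35], [2, 5], size=(50, 2)),   # frequent small spenders
    rng.normal([6, 120], [2, 15], size=(50, 2)),  # big-ticket buyers
])

# k-means finds k groups of customers with similar buying patterns.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(km.cluster_centers_)      # one "typical customer" per group
print(np.bincount(km.labels_))  # group sizes
```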

Descriptive vs. Predictive Data Mining

Predictive DM aims to build models in order to predict unknown values of interest. Examples:

A model that, given a customer's characteristics, predicts how much the customer will spend on the next catalog order. E.g., for the profile
35 years old, professional, $95K annual income, 2 children, 2 credit cards, 3 orders last year, last purchase 8 months ago, average spending $30, last purchase $40
the model predicts: next order $40-$50.

Most predictive models are also descriptive, e.g.:
Amount spent on catalog purchase = 0.001*(Annual_Income) + 0.3*(Num_Cards) + (1/Num_Orders)

A model that classifies credit applicants to determine whether or not an applicant will default on a loan.
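Applying the slide's toy formula to the example profile, as a sketch (the coefficients are the slide's; note they are illustrative and not calibrated to the $40-$50 prediction shown above):

```python
def predicted_catalog_spend(annual_income, num_cards, num_orders):
    """The slide's toy linear model for spend on the next catalog order."""
    return 0.001 * annual_income + 0.3 * num_cards + 1.0 / num_orders

# The example customer: $95K income, 2 credit cards, 3 orders last year.
print(predicted_catalog_spend(95_000, 2, 3))  # ~95.93
```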

What Data Mining Can and Cannot Do

Not a magic wand.
No automatic solutions: data mining offers a set of tools and methodologies; you need to know how to utilize them.
Like any other powerful tool, it can be very dangerous if not used properly.
Team work: it cannot (always) replace skilled business analysts; it needs guidance and validation of output.

What Can Go Wrong

Problem formulation: need to understand the business well; a good formulation of the problem
Inappropriate use of methods
(And/or) lack of sufficient, high-quality data
Computational issues
Evaluation: need domain experts throughout the process to provide indispensable input and validate results

What Can Go Wrong?

Inability to act upon a pattern because of political or ethical reasons:
Securities trading models
Data mining in clinical evaluation
Privacy (insurance & credit, DoubleClick Inc.)
Admission interviews

Data Mining v. Privacy

There is often tension between data mining and personal privacy: http://www.aclu.org/pizza/images/screen.swf

Risk v. Reward in Data Mining


More data about more people in fewer places

The risks of research

My own personal story: or, how a paper published in JCGS led me to be connected to FBI wiretapping.

2006: (Chris V.) published papers on "Communities of Interest" using social networks, and "guilt by association" to catch fraud
9 September 2007: NYT lead story "F.B.I. Data Mining Reached Beyond Initial Targets" discusses the FBI techniques COI and GBA
23 October 2007: the blogosphere erupts: "How AT&T Provides the FBI with Terror Suspect Leads"

The risks of research

Another story: Wikileaks visualizations

[slides: Wikileaks visualizations]

The Good, the Bad, and the Maybe

The question remains: how do we effectively leverage sensitive personal data for research purposes? Three case studies can give insight:
Netflix Prize
AOL search dataset
Barabasi mobile study

Case Study 1: AOL Search Data

August 4, 2006: AOL releases 20M search terms by anonymized users "for research purposes." Why?
Within hours, uproar on the blogs:
"The utter stupidity of this is staggering" (TechCrunch)
August 7: AOL removes the data, issues an apology:
"this was a screw-up, and we are angry" ... "an innocent enough attempt to reach out to the research community"
August 9: NYT front-page story identifies Thelma Arnold, a 62-year-old widow

Case Study 1: AOL Search Data

What's the big deal? Ego searches make it easy to figure out who you are; combined with porn or illegal queries, this can make for serious privacy violations.

What went wrong:
Not well thought out: risk >> reward
Poor internal controls on public data release
Lack of understanding of the subject matter
Lack of understanding of anonymizing data

Fallout:
CTO + at least two others fired
Data still out in the public (is it ethical to study?)
Inspiration for bad drama:
"purple lilac," "happy bunny pictures," "square dancing steps," "cut into your trachea," "pee fetish," "Simpsons incest."

Case Study 2: Netflix Prize

October 2006: Netflix releases anonymized movie ratings from its customer base:
100M ratings, 500K customers (<10% of all data)
Random integer as user ID
"Some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates"

Case Study 2: Netflix Prize

Narayanan and Shmatikov (2008): "The adversary with a small amount of background knowledge about an individual can identify with high probability that individual's record in the data and learn sensitive attributes."
Claim that the Netflix data sanitization is not relevant
Accuse Netflix of violating the Video Privacy Protection Act of 1988
Details: with auxiliary info on 8 movies, where 2 can be wrong, and dates known to within 14 days: 99% de-anonymization
Auxiliary info can be gotten via web sites, water coolers, etc.
People might be willing to give away some ratings, but not others

Case Study 2: Netflix Prize

Much ado about nothing:
Although the paper is technically correct, dates are key
Without dates, you must know 8 movies, all outside the top 500, to get over an 80% chance of de-anonymization
Auxiliary data is very hard to come by
No known cases discovered

Netflix did it right:
Consulted with top machine learning experts
0 < risk << reward
Investment in quality data and expertise mitigated the risk

Case Study 3: Barabasi Mobile Study

Gonzalez, Hidalgo and Barabasi (2008): article in Nature outlines a study on human mobility patterns.
100,000 individuals selected randomly from a dataset of 6 million
Unidentified country (unclear if the researchers knew)
Cell tower location at the start of each call
206 individuals were pinged every two hours for a week

Findings:
Humans follow simple, reproducible patterns
Sample finding: nearly three-quarters of those studied mainly stayed within a 20-mile-wide circle for half a year
"Results could impact all phenomena driven by human mobility, from epidemic prevention to emergency response and urban planning."

Case Study 3: Barabasi Mobile Study

Uproar ensued over the secret tracking of cell phone users:
Blowback of negative feedback to Nature and the scientists
The study would be illegal in the US
Approval came from the ONR review board and the Northeastern review board; Barabasi did not check with an ethics panel

Response:
Hidalgo: "the data could be misused, but we were not trying to do evil things. We are trying to make the world a little better."
Northeastern and Nature backed the research
Continues to be referenced as an example of dangerous research
Risk and reward both very high

How do we guarantee that data is private?

Research Concepts: Privacy

Quasi-identifiers: combinations of attributes within the data that can be used to identify individuals. E.g., 87% of the population of the United States can be uniquely identified by gender, date of birth, and 5-digit zip code.
Datasets are k-anonymous when, for any given quasi-identifier, a record is indistinguishable from k-1 others.

But, one step further: maybe all k share a given sensitive attribute! The distribution of target values within a group is referred to as l-diversity.

Ways to fuzz data to increase anonymity and diversity:
Generalize / summarize the data: bin sizes, aggregate counts
Suppress or delete data
Perturb data
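A minimal sketch of measuring k-anonymity and l-diversity over a chosen quasi-identifier (pandas; the toy records and the quasi-identifier columns are illustrative):

```python
import pandas as pd

# Toy records; (gender, birth_year, zip) is the quasi-identifier here.
df = pd.DataFrame({
    "gender":     ["F", "F", "M", "M", "F"],
    "birth_year": [1950, 1950, 1980, 1980, 1950],
    "zip":        ["07932", "07932", "07911", "07911", "07932"],
    "diagnosis":  ["flu", "flu", "asthma", "flu", "flu"],  # sensitive attribute
})

qid = ["gender", "birth_year", "zip"]

# k-anonymity: size of the smallest quasi-identifier group.
k = df.groupby(qid).size().min()
print(f"k = {k}")  # 2: every record blends in with at least one other

# l-diversity: distinct sensitive values within each group.
l = df.groupby(qid)["diagnosis"].nunique().min()
print(f"l = {l}")  # 1: one group is all 'flu', so membership leaks the diagnosis
```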

Data Mining Software

Can use any software you like:
Preferred: Weka
Also: R, SAS, SPSS, Systat, Enterprise Miner, Matlab, SQL Server
Maybe: Excel

What is R?

Open-source statistical software, grown out of S/S-PLUS
www.r-project.org; packages at CRAN
R tutorials available online (see website and CRAN)
Great graphics (with a bit of a learning curve)


Resources

Data mining is a new field and, as such, does not have authoritative texts (yet). This class draws from many sources; the best are:
Data Mining Techniques: For Marketing, Sales, and Customer Support, by Michael J.A. Berry and Gordon Linoff, John Wiley & Sons, Inc.
The Elements of Statistical Learning, Hastie, Tibshirani, and Friedman
Principles of Data Mining, Hand, Mannila and Smyth
Interactive and Dynamic Graphics for Data Analysis, Cook and Swayne

Also, good class notes are available from other classes:
David Madigan, Columbia
Di Cook, Iowa State
Padhraic Smyth, UC Irvine
Jiawei Han, Simon Fraser
(See the class web site for pointers to these notes, or just Google them!)

Assignment 1

Due by Monday (01/16/2013) midnight on Canvas:
Confirm access to Canvas!
Required readings
Profiles will be posted on Canvas to facilitate group selection ASAP
Generate 3 potential classification (prediction) problems/ideas as part of Assignment 1 (start exploring publicly available data sets; projects from last year are available)

Projects From Prior Years


Sources: Andreas Weigend, Chris Volinsky

