
S1: Introduction to the Course

Shawndra Hill, Spring 2013, TR 1:30-3pm and 3-4:30pm


Data

The amount of data created by each person doubles every 1.5-2 years (A. Weigend):
after five years: x10
after ten years: x100
after twenty years: x10,000

1 billion connected Flash players (A. Weigend)

40 billion RFID tags worldwide (A. Weigend)

Time Scales

Technology: ~1 year. Biology: ~100k years. (A. Weigend)

Social Data = Shared Data

10 billion pieces of content shared per month
1 billion videos watched per day

Shopping?

A process of creating and refining product-space awareness, only occasionally punctuated by purchases.

How do you know people's secret desires?

Instrument for Feedback: Data Sources (A. Weigend)

Attention: transactions, clicks
Intention: search
Situation: location, device

What is Data Mining?


What is Data Mining?

The process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques (The Gartner Group).
The exploration and analysis of large quantities of data in order to discover meaningful patterns and rules (Berry and Linoff).
The nontrivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley, Piatetsky-Shapiro and Matheus).

What is a pattern? A relationship in the data.

Definition (Fayyad et al.): the non-trivial discovery of novel, valid, comprehensible and potentially useful patterns from data.

What is Data Mining? Example patterns:

On Thursday nights, people who buy diapers also tend to buy beer
People with good credit ratings are less likely to have accidents
Male consumers, 37+, income bracket $50K-$75K, spend between $25 and $50 per catalog order
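The diapers-and-beer rule is the classic association-rule example. A minimal sketch of how such a rule is scored by support and confidence, on a made-up set of transactions (the data is illustrative, not from the slides):

```python
# Hypothetical market-basket transactions (illustrative only).
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"diapers", "beer", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"diapers", "beer"}, transactions))       # 0.6
print(confidence({"diapers"}, {"beer"}, transactions))  # ~0.75
```

Algorithms such as Apriori search for all itemsets whose support and confidence clear chosen thresholds.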

Historical Differences Between Statistics and DM

Statistics: confirmative; small data sets / file-based; small number of variables; deductive; numeric data; clean data.
Data Mining: explorative; large data sets / databases; large number of variables; inductive; numeric and non-numeric data (including text, networks); data cleaning required.

Data Mining vs. Statistics

Statistics is known for:
well-defined hypotheses, used to learn about a specifically chosen population, studied using carefully collected data, providing inferences with well-known properties.

Data mining isn't that careful. It is:
data-driven discovery of models and patterns from massive, observational data sets.

Data Mining v. Statistics

Traditional statistics:
first hypothesize, then collect data, then analyze
often model-oriented (strong parametric models)

Data mining:
few if any a priori hypotheses
data is usually already collected a priori
analysis is typically data-driven, not hypothesis-driven
often algorithm-oriented rather than model-oriented

Different? Yes, in terms of culture and motivation. However, statistical ideas are very useful in data mining, e.g., in validating whether discovered knowledge is useful. There is increasing overlap at the boundary of statistics and DM, e.g., exploratory data analysis (based on the pioneering work of John Tukey in the 1960s).

Data Mining Enablers

Explosion of data
Fast and cheap computation and storage:
Moore's Law: processing doubles every 19 months
Disk storage doubles every 9 months
Database technology
Competitive pressure in business: data has value!
New, successful models: SVM, boosting
Commercial products: SAS, SPSS, Insightful, IBM, Oracle
Open-source products: Weka, R

[Chart: disk TB shipped per year, 1988-2000, log scale from 1E+3 to 1E+7 TB (1E+6 TB = 1 exabyte); disk TB growth 112%/yr vs. Moore's Law 58.7%/yr. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]

Data-Driven Discovery

Observational data: cheap relative to experimental data. Examples:
Transaction data archives for retail stores, airlines, etc.
Web logs for Amazon, Google, etc.
The human/mouse/rat genome
Etc., etc.

It makes sense to leverage available data; useful (?) information may be hidden in vast archives of data.

What are the perils of observational data?

Data Mining: Confluence of Multiple Disciplines

Database technology, statistics, machine learning, visualization, information science, and other disciplines all feed into data mining.

Different fields have different views of what data mining is (also different terminology!)

Induction vs. Deduction

The problem of deduction: how to demonstrate that an abstract idea applies to nature?
The problem of induction: how to go beyond a collection of facts to new concepts?

Decision Support Systems (DSSs)

Assist managers in making decisions or choices. Types of DSSs:
Model-driven: spreadsheets and other optimization-based methods from Operations Management and Finance.
Communication-driven: groupware (e.g., voting/rating), Computer-Supported Collaborative Work (CSCW), document sharing, teleconferencing.
Data-driven: collect, store, and analyze large data volumes; a.k.a. Business Intelligence (BI) systems, warehouses, OLAP.
Knowledge-driven: e.g., expert systems that capture expertise by applying rules elicited from experts. Traditional uses: medical diagnosis (e.g., MYCIN), computer configuration (e.g., XCON), personalization. Knowledge elicitation and knowledge representation are the hard problems.

This course deals mainly with data-driven DSSs (Part 1) and knowledge-driven DSSs (Part 2). We will touch briefly on model-driven DSSs in Part 2 (but see OPIM101 for more on that).

The Course


Course Objectives

Approach business problems data-analytically. Think carefully & systematically about whether & how data can improve business performance.
Be able to interact competently on the topic of data mining for business intelligence. Know the basics of data mining processes, algorithms, & systems well enough to interact with CTOs, expert data miners, and business analysts.
Be able to envision data-mining opportunities.
Get hands-on experience mining data. Be prepared to follow up on ideas or opportunities that present themselves, e.g., by performing pilot studies.

Our Goals

Understand the basics of the major Data Mining / Machine Learning techniques:
What they do: the problems they can solve
Who uses them
Where they are used
When and how to use them
How they work (at a high level only)
Limitations
Apply the techniques and evaluate the models built

Course Outline

Introduction to Modeling & Data Mining
Fundamental concepts and terminology
Data Mining methods
Classification, decision trees, association rules, clustering and segmentation, collaborative filtering, genetic algorithms, etc.
Inner workings
Strengths and weaknesses
Evaluation
How to evaluate the results of a data mining solution
Applications
Real-world business problems DM can be applied to

Course Information

Teaching style: lecture / lab / guest speakers (AT&T, IBM, Yahoo!)
Student participation/attendance is important
Lab sessions: Weka, Gephi, Python
Textbook: various publicly available readings

Course Tools

SQL (Microsoft Access)
Weka
Gephi
Python (Version 2.7)

Start installing now

Course Information

Canvas
WordPress class site: http://opim672.wordpress.com
Facebook/Twitter
Office hours: M 6-7pm, F 2-5pm, or by appointment
Email: shawndra@wharton.upenn.edu
TAs: Krishna Choksi (krishnac@seas.upenn.edu), Adrian Benton

Course Information

Read material before and after class
8 homework assignments (35 points), in groups of 2
Data mining project (50 points), in groups of 4-6, 10 groups per class:
Final report
Mid-semester update
End-of-semester presentation
Project reviews
Class participation (15 points)
Data set competition (optional, for extra credit)

Warning: 1. This is a hands-on class. 2. A significant portion of the deliverables are at the end of the semester.

What is a DSS?

Decision Support Systems aim at allowing business users to make better decisions faster, and to take action more easily and more profitably based on this information. This is achieved through:
Prediction
Description
Data dissemination
Prescription

Prediction

Induction: from specific examples (instances) to general rules.

Instances:
ID       Swims  Color  Type
Animal1  yes    gray   dolphin
Animal2  yes    black  dolphin
Animal3  no     gray   elephant

Rules: IF swims=yes THEN class=dolphin
(antecedent / assumption: rule body; consequent / conclusion: rule head)

Prediction = determining the class or attribute value for a new item with some known attributes.
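A minimal sketch of that induction step, fitting a depth-1 decision tree (a one-rule learner) to the three instances above; scikit-learn and the hand-built encoding are my additions, not something the slide prescribes:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# The slide's instances, encoded by hand as [swims, color_black].
X = [[1, 0],   # Animal1: swims=yes, color=gray
     [1, 1],   # Animal2: swims=yes, color=black
     [0, 0]]   # Animal3: swims=no,  color=gray
y = ["dolphin", "dolphin", "elephant"]

# A depth-1 tree induces a single rule: IF swims=yes THEN dolphin.
clf = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(export_text(clf, feature_names=["swims", "color_black"]))

# Prediction = applying the induced rule to a new item.
print(clf.predict([[1, 0]]))  # a new gray swimmer -> ['dolphin']
```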

Text Mining


Prediction: Examples from Industry?

Classifying dolphins and flowers is dull (toy problems often cited in the data mining literature). Questions: How do we use data mining / machine learning to generate revenues or reduce costs? How do we monetize DM?!

Examples

Mining medical discussion board data
Mining Motley Fool CAPS
Social-network-based marketing
Social-network-based fraud detection
Social TV examples
Profit-maximizing recommendation engine

Prediction: Examples from Industry?

Wachovia: Can I predict if someone will default on their loan?
Visa: Can I identify fraudulent credit card transactions?
Linens 'n Things
Monster.com
The World Bank

Prediction: Examples from Industry?

Wachovia: Can I predict if someone will default on their loan?
Visa: Can I identify fraudulent credit card transactions?
Linens 'n Things: Can I predict response to online recommendations?
Monster.com: Can I predict if a company's stock value will go up based on employee attrition?
The World Bank: Can I predict if a country/organization will default?

Prediction: Examples from Industry?

ACNielsen
PepsiCo

Prediction: Examples from Industry?

ACNielsen: association rules for market baskets?
PepsiCo: identify business opportunities?

Data Mining as a Core Competency


Examples of Data Mining Successes

Google is a company built on data mining:
PageRank mined the web to build better search
Google as spell checker
Google as ad placer
Google as news aggregator
Google as face recognizer
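To make the PageRank point concrete, here is a minimal power-iteration sketch on a hypothetical four-page link graph (the graph and the damping factor 0.85 are illustrative, not Google's):

```python
import numpy as np

# Hypothetical link graph: links[i] = pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n, d = 4, 0.85  # number of pages, damping factor

# Column-stochastic matrix: M[j, i] = 1/outdegree(i) if i links to j.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

# Power iteration: r <- d*M*r + (1-d)/n, repeated until r stops changing.
r = np.full(n, 1.0 / n)
for _ in range(100):
    r_next = d * M @ r + (1 - d) / n
    if np.abs(r_next - r).sum() < 1e-10:
        break
    r = r_next
print(r)  # steady-state importance score of each page
```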


Data Data Data

It's all about the data. Where does it come from?
WWW
NASA
Business processes/transactions
Telecommunications and networking
Medical imagery
Government, census, demographics (data.gov!)
Sensor networks, RFID tags
Sports

Types of Data: Flat File or Vector Data

Rows = objects; columns = measurements on objects.
Represent each row as a p-dimensional vector, where p is the dimensionality; in effect, this embeds our objects in a p-dimensional vector space. Often useful, but not always appropriate.
Both n and p can be very large in data mining, and the matrix can be quite sparse.

[Figure: example n x p data matrix with rows such as (2.3, -1.5, -1.3) and (1.1, 0.1, -0.1)]

Types of Data: Sparse Matrix (Text) Data

[Figure: sparse document-term matrix for a text corpus; rows = documents (1-500), columns = word IDs (1-200); dots mark nonzero counts]
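A minimal sketch of building such a sparse document-term matrix; scikit-learn's CountVectorizer is my choice of tool here, not one the slide mandates:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus; real collections have thousands of documents.
docs = [
    "data mining finds patterns in data",
    "statistics studies carefully collected data",
    "mining large databases for patterns",
]

# Rows = documents, columns = word IDs; stored sparsely since most counts are 0.
vec = CountVectorizer()
X = vec.fit_transform(docs)          # scipy.sparse matrix
print(X.shape)                       # (3, number of distinct words)
print(vec.get_feature_names_out())   # the column (word) labels
print(X.toarray())                   # dense view, fine only for toy data
```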

Sequence (Web) Data

128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 365, 414, 200, 0, POST, /spt/main.html, -,

Sometimes another representation is more useful, e.g., recoding the log as per-user sequences of page IDs.

[Table: per-user page-ID sequences for Users 1-5]
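A minimal sketch of parsing one such record and grouping URLs into per-user sequences; the field names are my reading of the IIS-style columns above, so treat them as assumptions:

```python
from collections import defaultdict

FIELDS = ["client_ip", "user", "date", "time", "service", "server",
          "server_ip", "time_taken", "bytes_sent", "bytes_recv",
          "status", "win32_status", "method", "url"]

def parse(line):
    """Split one comma-separated log record into a dict of named fields."""
    values = [v.strip() for v in line.strip().rstrip(",").split(",")]
    return dict(zip(FIELDS, values))

# Group requests into per-client sequences of visited URLs.
sequences = defaultdict(list)
record = parse("128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, "
               "128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,")
sequences[record["client_ip"]].append(record["url"])
print(dict(sequences))  # {'128.195.36.195': ['/top.html']}
```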

Types of Data: Relational Data

Web log table (requests keyed by client IP), e.g.:
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,

Customer table (keyed by IP), e.g.:
128.195.36.195, Doe, John, 12 Main St, 973-462-3421, Madison, NJ, 07932
114.12.12.25, Trank, Jill, 11 Elm St, 998-555-5675, Chester, NJ, 07911

Zip-code table (zip, city, state, population, latitude, longitude), e.g.:
07911, Chester, NJ, 34000, 40.65, -74.12
07932, Madison, NJ, 56000, 40.642, -74.132

Most large data sets are stored in relational databases: Oracle, MSFT, IBM. Good open-source alternatives: MySQL, PostgreSQL.
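A minimal sketch of how the three tables above link up relationally, using Python's built-in sqlite3; the schema and column names are my assumptions based on the slide:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE hits(client_ip TEXT, url TEXT);
    CREATE TABLE customers(client_ip TEXT, last TEXT, first TEXT, zip TEXT);
    CREATE TABLE zips(zip TEXT, city TEXT, state TEXT, population INTEGER);
""")
con.execute("INSERT INTO hits VALUES ('128.195.36.195', '/top.html')")
con.execute("INSERT INTO customers VALUES ('128.195.36.195', 'Doe', 'John', '07932')")
con.execute("INSERT INTO zips VALUES ('07932', 'Madison', 'NJ', 56000)")

# Join log hits to customer records and zip-code demographics.
rows = con.execute("""
    SELECT h.url, c.last, z.city, z.population
    FROM hits h
    JOIN customers c ON h.client_ip = c.client_ip
    JOIN zips z ON c.zip = z.zip
""").fetchall()
print(rows)  # [('/top.html', 'Doe', 'Madison', 56000)]
```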

Types of Data: Time Series Data

Often many time series, long time series, or multivariate time series.

[Figure: trajectories of centroids of a moving hand in video streams; x position vs. time]

Types of Data: Image Data

Spatio-Temporal Data

http://senseable.mit.edu/nyte/movies/nyte-globe-encounters.mov

Network Data

Algorithms for estimating relative importance in networks. S. White and P. Smyth, ACM SIGKDD, 2003.
HP Labs email network: 500 people, 20k relationships.
Also, temporal networks.


Major Application Areas

Marketing: customer loyalty/attrition; market basket analysis ("On Thursdays shoppers who buy diapers also buy beer"); direct marketing; personalization; market segmentation
Fraud detection (telecommunications, credit, securities)
Credit risk
Health care / insurance ("People with good credit ratings have fewer accidents")
Text mining: email, documents, and Web analysis
Stock selection
Non-business applications: military, bioinformatics, etc.

Examples of Data Mining Successes

Market basket analysis (WalMart)
Recommender systems (Amazon.com)
Fraud detection in telecommunications (AT&T)
Target marketing / CRM
Financial markets
DNA microarray analysis
Biometrics (fingerprinting, handwriting)
Web traffic / blog analysis

Why Data Mining Now?

Better and cheaper computing power
Mature data mining technology
Improved data collection, access & storage

Accuracy is king

"Only 15% of mergers and acquisitions succeed" (Stephen Denning, The Leader's Guide to Storytelling, p. xiv)

Profit is King (or: It pays to be wrong sometimes)

Failure rate of new ventures invested in: 8 out of 10
Profit on Google investment: $4 billion (on $25 million)

Source: http://www.financialnews-us.com/?contentid=534017

Sometimes it pays to be wrong almost all the time

Customer lifetime value: $2,700
Cost per flyer: 7 cents
Required (break-even) hit rate = 7 / 270,000 = 1 in 38,571
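A quick sketch of that break-even arithmetic, wrapped in a function so the numbers can be varied (the dollar figures are the slide's):

```python
def breakeven_hit_rate(cost_per_contact, value_per_hit):
    """Smallest response rate at which a campaign pays for itself."""
    return cost_per_contact / value_per_hit

# Slide's numbers: 7-cent flyer, $2,700 customer lifetime value.
rate = breakeven_hit_rate(0.07, 2700.0)
print(rate)             # ~2.59e-05
print(round(1 / rate))  # 38571, i.e., 1 hit in ~38,571 flyers breaks even
```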

Case: Verizon Wireless (Plain Vanilla DM)

About Verizon Wireless:
Largest wireless provider in the US
Customer base: 30.3 million
Covering 90% of the US population

Challenges:
High customer turnover rate (churn) of 2% per month (600,000 customers disconnect per month)
Associated replacement cost in the hundreds of millions per year
Average cost of new customer acquisition: $320

Case: Verizon Wireless

Possible solution: offer incentives to every customer before contracts expire.
Expensive, and no learning.

Data Mining Solution: Prediction

Build a predictive model: before contracts expire, use a predictive model to predict which customers are likely to leave (i.e., estimate the probability).
Then: offer benefits, such as a new phone, only to the customers most likely to disconnect; develop new plans to fit customer needs.
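A minimal sketch of such a churn model using logistic regression on synthetic data (scikit-learn; nothing here reflects Verizon's actual attributes or model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features: [months_to_contract_end, support_calls, monthly_usage].
X = rng.normal(size=(1000, 3))
# Synthetic churn labels, loosely tied to the features for illustration.
y = (X @ np.array([-1.0, 1.5, -0.5]) + rng.normal(size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)

# Estimated churn probabilities: aim the retention offer at the top scorers.
p_churn = model.predict_proba(X_te)[:, 1]
top_risk = np.argsort(p_churn)[::-1][:25]  # the 25 highest-risk customers
print(p_churn[top_risk][:5])
```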

Phases in the DM Process: CRISP-DM

Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment

www.crisp-dm.org

CRoss Industry Standard Process for DM

Business Understanding: understand project objectives and identify the data mining problem
Data Understanding: capture, understand, and explore your data for quality issues
Data Preparation: data cleaning, merging data, deriving attributes, etc.
Modeling: select the data mining techniques, build the model
Evaluation: evaluate the results and the approved models
Deployment: put models into practice; monitoring and maintenance

Case: Verizon Wireless

Understanding the Business Problem and Data

IT brought the idea to the Marketing team and presented it as a partnership.
Marketing learned the modeling process as well as the capabilities and weaknesses of modeling.
IT learned the business processes and direct marketing strategies.
Marketing recommended additions to the attributes used in building the model.

Case: Verizon Wireless

Modeling

Data selection/preparation: included hundreds of basic attributes; derived and ratio fields were added to enrich the model.
Use a predictive modeling technique to refine the relationship between the predictors and the output of interest.
Test the model: how will it perform in real life?
Select the best models (accuracy, profitability, etc.).

Case: Verizon Wireless

Results: Marketing Campaigns Using Predictive Modeling

Began with one campaign:
40-60K pieces per month
Very personalized, unique offer
Approximately 15% take rate

Currently four main campaign types:
400,000 pieces/month
Up to 35% take rate among high-churn-risk customers

Case: Verizon Wireless

Deployment

Direct mail and telemarketing: customized one-to-one mailings
Customer care application:
Customer flagged by offer
Used by: customer service, retail channels
To catch customers that reps were unable to contact, or who call to disconnect

Case: Verizon Wireless

Benefits

Cost reduction:
Customers saved: up to 80% more takes
Direct mail budget for the same churner mailing reduced by 60%
Revenue increase:
Average monthly revenue per bill increased
Monthly usage increased
Switched customers from analog to digital
Contract renewals increased

Descriptive vs. Predictive Data Mining

Descriptive DM is used to learn about and understand the data. Example: identify and describe groups of customers with common buying behavior (clustering).

Example of Descriptive (Visualization) DM Using Customer Data

Find groups of customers with similar buying patterns.
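A minimal clustering sketch in that spirit: grouping customers by two made-up spending features with k-means (scikit-learn; the data and the choice of k=3 are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Made-up customer features: [orders_per_year, average_spend].
customers = np.vstack([
    rng.normal([2, 30], [1, 5], size=(50, 2)),    # occasional small spenders
    rng.normal([12, 35], [2, 5], size=(50, 2)),   # frequent small spenders
    rng.normal([6, 120], [2, 15], size=(50, 2)),  # big-ticket buyers
])

# k-means finds k groups of customers with similar buying patterns.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(km.cluster_centers_)      # one "typical customer" per group
print(np.bincount(km.labels_))  # group sizes
```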

Descriptive vs. Predictive Data Mining

Predictive DM aims to build models in order to predict unknown values of interest. Examples:

A model that, given a customer's characteristics, predicts how much the customer will spend on the next catalog order. E.g., for the profile
35 years old, professional, $95K annual income, 2 children, 2 credit cards, 3 orders last year, last purchase 8 months ago, average spending $30, last purchase $40
the model predicts: next order $40-$50.

Most predictive models are also descriptive, e.g.:
Amount spent on catalog purchase = 0.001*(Annual_Income) + 0.3*(Num_Cards) + (1/Num_Orders)

A model that classifies credit applicants to determine whether or not an applicant will default on a loan.
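Applying the slide's toy formula to the example profile, as a sketch (the coefficients are the slide's; note they are illustrative and not calibrated to the $40-$50 prediction shown above):

```python
def predicted_catalog_spend(annual_income, num_cards, num_orders):
    """The slide's toy linear model for spend on the next catalog order."""
    return 0.001 * annual_income + 0.3 * num_cards + 1.0 / num_orders

# The example customer: $95K income, 2 credit cards, 3 orders last year.
print(predicted_catalog_spend(95_000, 2, 3))  # ~95.93
```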

What Data Mining Can and Cannot Do

Not a magic wand.
No automatic solutions: data mining offers a set of tools and methodologies; you need to know how to utilize them.
Like any other powerful tool, it can be very dangerous if not used properly.
Team work: it cannot (always) replace skilled business analysts; it needs guidance and validation of output.

What Can Go Wrong

Problem formulation: need to understand the business well; a good formulation of the problem
Inappropriate use of methods
(And/or) lack of sufficient, high-quality data
Computational issues
Evaluation: need domain experts throughout the process to provide indispensable input and validate results

What Can Go Wrong?

Inability to act upon a pattern because of political or ethical reasons:
Securities trading models
Data mining in clinical evaluation
Privacy (insurance & credit, DoubleClick Inc.)
Admission interviews

Data Mining v. Privacy

There is often tension between data mining and personal privacy: http://www.aclu.org/pizza/images/screen.swf

Risk v. Reward in Data Mining


More data about more people in fewer places

The risks of research

My own personal story: or, how a paper published in JCGS led me to be connected to FBI wiretapping.

2006: (Chris V.) published papers on "Communities of Interest" using social networks, and "guilt by association" to catch fraud
9 September 2007: NYT lead story "F.B.I. Data Mining Reached Beyond Initial Targets" discusses the FBI techniques COI and GBA
23 October 2007: the blogosphere erupts: "How AT&T Provides the FBI with Terror Suspect Leads"

The risks of research

Another story: Wikileaks visualizations

[slides: Wikileaks visualizations]

The Good, the Bad, and the Maybe

The question remains: how do we effectively leverage sensitive personal data for research purposes? Three case studies can give insight:
Netflix Prize
AOL search dataset
Barabasi mobile study

Case Study 1: AOL Search Data

August 4, 2006: AOL releases 20M search terms by anonymized users "for research purposes." Why?
Within hours, uproar on the blogs:
"The utter stupidity of this is staggering" (TechCrunch)
August 7: AOL removes the data, issues an apology:
"this was a screw-up, and we are angry" ... "an innocent enough attempt to reach out to the research community"
August 9: NYT front-page story identifies Thelma Arnold, a 62-year-old widow

Case Study 1: AOL Search Data

What's the big deal? Ego searches make it easy to figure out who you are; combined with porn or illegal queries, this can make for serious privacy violations.

What went wrong:
Not well thought out: risk >> reward
Poor internal controls on public data release
Lack of understanding of the subject matter
Lack of understanding of anonymizing data

Fallout:
CTO + at least two others fired
Data still out in the public (is it ethical to study?)
Inspiration for bad drama:
"purple lilac," "happy bunny pictures," "square dancing steps," "cut into your trachea," "pee fetish," "Simpsons incest."

Case Study 2: Netflix Prize

October 2006: Netflix releases anonymized movie ratings from its customer base:
100M ratings, 500K customers (<10% of all data)
Random integer as user ID
"Some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates"

Case Study 2: Netflix Prize

Narayanan and Shmatikov (2008): "The adversary with a small amount of background knowledge about an individual can identify with high probability that individual's record in the data and learn sensitive attributes."
Claim that the Netflix data sanitization is not relevant
Accuse Netflix of violating the Video Privacy Protection Act of 1988
Details: with auxiliary info on 8 movies, where 2 can be wrong, and dates known to within 14 days: 99% de-anonymization
Auxiliary info can be gotten via web sites, water coolers, etc.
People might be willing to give away some ratings, but not others

Case Study 2: Netflix Prize

Much ado about nothing:
Although the paper is technically correct, dates are key
Without dates, you must know 8 movies, all outside the top 500, to get over an 80% chance of de-anonymization
Auxiliary data is very hard to come by
No known cases discovered

Netflix did it right:
Consulted with top machine learning experts
0 < risk << reward
Investment in quality data and expertise mitigated the risk

Case Study 3: Barabasi Mobile Study

Gonzalez, Hidalgo and Barabasi (2008): article in Nature outlines a study on human mobility patterns.
100,000 individuals selected randomly from a dataset of 6 million
Unidentified country (unclear if the researchers knew)
Cell tower location at the start of each call
206 individuals were pinged every two hours for a week

Findings:
Humans follow simple, reproducible patterns
Sample finding: nearly three-quarters of those studied mainly stayed within a 20-mile-wide circle for half a year
"Results could impact all phenomena driven by human mobility, from epidemic prevention to emergency response and urban planning."

Case Study 3: Barabasi Mobile Study

Uproar ensued over the secret tracking of cell phone users:
Blowback of negative feedback to Nature and the scientists
The study would be illegal in the US
Approval came from the ONR review board and the Northeastern review board; Barabasi did not check with an ethics panel

Response:
Hidalgo: "the data could be misused, but we were not trying to do evil things. We are trying to make the world a little better."
Northeastern and Nature backed the research
Continues to be referenced as an example of dangerous research
Risk and reward both very high

How do we guarantee that data is private?

Research Concepts: Privacy

Quasi-identifiers: combinations of attributes within the data that can be used to identify individuals. E.g., 87% of the population of the United States can be uniquely identified by gender, date of birth, and 5-digit zip code.
Datasets are k-anonymous when, for any given quasi-identifier, a record is indistinguishable from k-1 others.

But, one step further: maybe all k share a given sensitive attribute! The distribution of target values within a group is referred to as l-diversity.

Ways to fuzz data to increase anonymity and diversity:
Generalize / summarize the data: bin sizes, aggregate counts
Suppress or delete data
Perturb data
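A minimal sketch of measuring k-anonymity and l-diversity over a chosen quasi-identifier (pandas; the toy records and the quasi-identifier columns are illustrative):

```python
import pandas as pd

# Toy records; (gender, birth_year, zip) is the quasi-identifier here.
df = pd.DataFrame({
    "gender":     ["F", "F", "M", "M", "F"],
    "birth_year": [1950, 1950, 1980, 1980, 1950],
    "zip":        ["07932", "07932", "07911", "07911", "07932"],
    "diagnosis":  ["flu", "flu", "asthma", "flu", "flu"],  # sensitive attribute
})

qid = ["gender", "birth_year", "zip"]

# k-anonymity: size of the smallest quasi-identifier group.
k = df.groupby(qid).size().min()
print(f"k = {k}")  # 2: every record blends in with at least one other

# l-diversity: distinct sensitive values within each group.
l = df.groupby(qid)["diagnosis"].nunique().min()
print(f"l = {l}")  # 1: one group is all 'flu', so membership leaks the diagnosis
```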

Data Mining Software

Can use any software you like:
Preferred: Weka
Also: R, SAS, SPSS, Systat, Enterprise Miner, Matlab, SQL Server
Maybe: Excel

What is R?

Open-source statistical software, grown out of S/S-PLUS
www.r-project.org; packages at CRAN
R tutorials available online (see website and CRAN)
Great graphics (with a bit of a learning curve)


Resources

Data mining is a new field and, as such, does not have authoritative texts (yet). This class draws from many sources; the best are:
Data Mining Techniques: For Marketing, Sales, and Customer Support, by Michael J.A. Berry and Gordon Linoff, John Wiley & Sons, Inc.
The Elements of Statistical Learning, Hastie, Tibshirani, and Friedman
Principles of Data Mining, Hand, Mannila and Smyth
Interactive and Dynamic Graphics for Data Analysis, Cook and Swayne

Also, good class notes are available from other classes:
David Madigan, Columbia
Di Cook, Iowa State
Padhraic Smyth, UC Irvine
Jiawei Han, Simon Fraser
(See the class web site for pointers to these notes, or just Google them!)

Assignment 1

Due by Monday (01/16/2013) midnight on Canvas:
Confirm access to Canvas!
Required readings
Profiles will be posted on Canvas to facilitate group selection ASAP
Generate 3 potential classification (prediction) problems/ideas as part of Assignment 1 (start exploring publicly available data sets; projects from last year are available)

Projects From Prior Years


Sources: Andreas Weigend, Chris Volinsky

