INTRODUCTION
Week 1
Today
Data Mining
Purchase information
Web site browsing habits
Social network data
Goals: customer profiling, targeted marketing, fraud detection
Questions an analyst will try to answer with data mining:
Who
Target Example
Links:
http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
http://www.kdnuggets.com/2014/05/target-predict-teen-pregnancy-inside-story.html
The process of automatically discovering useful information in large data repositories
To find novel and useful patterns that might otherwise remain unknown
Looking up records in a MySQL database (database)
Finding relevant web pages based on a Google search query (information retrieval)
[Figure: the data mining process. Input Data (MySQL, .csv) -> Data Mining (Decision Trees, Support Vector Machines, Linear Regression) -> Postprocessing (Visualization, Pattern Interpretation) -> Reporting to Boss, closing the loop]
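The flow outlined above (input data, data mining, postprocessing, reporting) can be sketched end to end in a few lines. This is only a minimal illustration using the standard library: the CSV content, the column names x and y, and the helper names load, preprocess, fit_line and report are all made up for the example, and ordinary least squares stands in for the "Linear Regression" step.

```python
import csv
import io

# Hypothetical input: a tiny CSV standing in for data exported from MySQL or a .csv file.
RAW_CSV = """x,y
1,2.1
2,3.9
3,6.2
4,7.8
,9.0
5,10.1
"""

def load(text):
    """Input data: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))

def preprocess(rows):
    """Preprocessing: drop rows with missing values and convert fields to floats."""
    return [(float(r["x"]), float(r["y"])) for r in rows if r["x"] and r["y"]]

def fit_line(points):
    """Data mining: ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def report(slope, intercept):
    """Postprocessing: turn the fitted model into a readable one-line summary."""
    return f"y = {slope:.2f} * x + {intercept:.2f}"

points = preprocess(load(RAW_CSV))
slope, intercept = fit_line(points)
summary = report(slope, intercept)
print(summary)
```

"Closing the loop" would mean feeding what the summary reveals back into the choice of input data or preprocessing, which the sketch leaves out.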
Input Data
Preprocessing
Data Mining
Linear Regression
Support Vector Machines
Decision Trees
Clustering
Postprocessing
Performing:
Visualization
Statistical significance tests, confidence intervals, and hypothesis testing to eliminate spurious data mining results (yikes, math!)
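As one possible illustration of screening out a spurious result, the sketch below runs an exact permutation test on the loan table used later in these slides. The choice of a permutation test, the helper name mean_gap, and the 0.05 threshold mentioned afterwards are assumptions made for this example, not something the slides prescribe. The sixth borrower's income is assumed to be 60K.

```python
from itertools import combinations

# Annual income (in thousands) and default flag, one entry per borrower.
# Assumption: the sixth borrower's income is 60K.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
defaulted = [False, False, False, False, True, False, False, True, False, True]

def mean_gap(group_idx):
    """Absolute difference in mean income between one group and everyone else."""
    inside = [incomes[i] for i in group_idx]
    outside = [incomes[i] for i in range(len(incomes)) if i not in group_idx]
    return abs(sum(inside) / len(inside) - sum(outside) / len(outside))

# Candidate "pattern": defaulters and non-defaulters differ in mean income.
defaulter_idx = tuple(i for i, d in enumerate(defaulted) if d)
observed = mean_gap(defaulter_idx)

# Null hypothesis: default labels are unrelated to income, so any 3 of the
# 10 borrowers could equally well have been the defaulters. Enumerate them all.
gaps = [mean_gap(idx) for idx in combinations(range(len(incomes)), len(defaulter_idx))]

# p-value: share of relabelings with a gap at least as large as the observed one
# (small tolerance guards against floating-point noise).
p_value = sum(g >= observed - 1e-9 for g in gaps) / len(gaps)

print(f"observed gap: {observed:.1f}K, p-value: {p_value:.3f}")
```

With only ten rows the 20K gap does not reach the usual 0.05 level, so reporting it as a discovered pattern would be premature.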
Scalability
High Dimensionality
Data Ownership
private
Based on a hypothesize-and-test paradigm
1. Hypothesis proposed
2. Experiment designed to gather data
3. Data analyzed with respect to hypothesis
4. Hypothesis accepted or rejected
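The four steps above can be walked through on a toy case. Everything specific here is hypothetical: the page-layout scenario, the counts (100 customers, 65 choosing layout A), the helper name binomial_two_sided_p, and the 0.05 significance level. The exact binomial test is just one simple way to carry out step 3.

```python
from math import comb

def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided binomial p-value: total probability, under the null,
    of all outcomes that are no more likely than the observed one."""
    def pmf(k):
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
    observed = pmf(successes)
    return sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))

# 1. Hypothesis proposed: customers have no preference between layouts A and B.
# 2. Experiment designed to gather data: 100 customers choose, 65 pick A
#    (made-up numbers for illustration).
trials, successes = 100, 65

# 3. Data analyzed with respect to the hypothesis.
p_value = binomial_two_sided_p(successes, trials)

# 4. Hypothesis accepted or rejected at a chosen significance level.
alpha = 0.05
decision = "reject" if p_value < alpha else "fail to reject"
print(f"p-value = {p_value:.4f}: {decision} the no-preference hypothesis")
```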
Hypothesis-and-test pattern
Data collection
Laborious process
Generation and evaluation of thousands of hypotheses
Usually on relatively smaller datasets
Data Mining
Vocabulary
id   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1    Yes          Single           125K            No
2    No           Married          100K            No
3    No           Single           70K             No
4    Yes          Married          120K            No
5    No           Divorced         95K             Yes
6    No           Married          60K             No
7    Yes          Divorced         220K            No
8    No           Single           85K             Yes
9    No           Married          75K             No
10   No           Single           90K             Yes
Column: attribute, feature, field, dimension, variable
Row: instance, record, observation, sample
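To make the vocabulary concrete, here is a small sketch that stores the loan table as a list of records and then pulls out one row (an instance) and one column (an attribute). The snake_case field names and the helper name column are chosen for this example. Two things are assumed about the data: ids run from 1 to 10, and the sixth row's income is 60K.

```python
# Column names (attributes / features / fields); names are illustrative.
COLUMNS = ("id", "home_owner", "marital_status", "annual_income_k", "defaulted")

# One tuple per row (instance / record / observation / sample).
# Assumptions: sequential ids 1-10; row 6 income is 60K.
ROWS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

table = [dict(zip(COLUMNS, row)) for row in ROWS]

def column(name):
    """Return every value of one attribute, in row order."""
    return [record[name] for record in table]

instance = table[4]                    # a single row: the fifth record
attribute = column("marital_status")   # a single column across all rows

print(f"{len(table)} instances x {len(COLUMNS)} attributes")
print("instance:", instance)
print("attribute:", attribute)
```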
1. Objective: predict the value of a particular attribute based on the values of other attributes
Defaulted Borrower? is the target (or dependent variable)
Attributes/features used for making the prediction are known as independent (or explanatory) variables
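A predictive task on the loan table can be sketched by separating the attributes used for prediction from the target and applying a rule to each row. The rule below is a hand-written, decision-tree-style guess, not a model learned by any algorithm, and the 80K income threshold is an assumption picked for illustration. The sixth row's income is assumed to be 60K.

```python
# (home_owner, marital_status, annual_income_k, defaulted)
# Assumption: row 6 income is 60K.
RECORDS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

def predict(home_owner, marital_status, income_k):
    """Hand-written decision-tree-style rule (illustrative, not learned)."""
    if home_owner == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # Hypothetical threshold: non-owning, unmarried borrowers at or above 80K.
    return "Yes" if income_k >= 80 else "No"

features = [r[:3] for r in RECORDS]   # attributes used for the prediction
target = [r[3] for r in RECORDS]      # the Defaulted Borrower column

predictions = [predict(*f) for f in features]
accuracy = sum(p == t for p, t in zip(predictions, target)) / len(target)
print(f"accuracy on the ten known rows: {accuracy:.0%}")

# Applying the rule to a new, hypothetical applicant.
new_prediction = predict("No", "Single", 82)
```

Accuracy measured on the same rows that inspired the rule is optimistic; a real evaluation would use records held out from rule construction.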
2. Objective: derive patterns (correlations, clusters) that summarize underlying relationships in data
Often more exploratory and requires an explanation of found results
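A descriptive task has no target to predict; it summarizes relationships already present in the data. As a minimal sketch, the code below computes the share of defaulters within each group of the loan table. Grouping by a single attribute is only one simple kind of summary, and the helper name default_rate_by is made up for this example.

```python
from collections import defaultdict

# (home_owner, marital_status, defaulted) for the ten borrowers in the loan table.
RECORDS = [
    ("Yes", "Single", "No"),
    ("No", "Married", "No"),
    ("No", "Single", "No"),
    ("Yes", "Married", "No"),
    ("No", "Divorced", "Yes"),
    ("No", "Married", "No"),
    ("Yes", "Divorced", "No"),
    ("No", "Single", "Yes"),
    ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

def default_rate_by(field_index):
    """Share of defaulters within each distinct value of one attribute."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for record in RECORDS:
        key = record[field_index]
        totals[key] += 1
        defaults[key] += record[2] == "Yes"
    return {key: defaults[key] / totals[key] for key in totals}

by_home_owner = default_rate_by(0)
by_marital_status = default_rate_by(1)

print("default rate by home ownership:", by_home_owner)
print("default rate by marital status:", by_marital_status)
```

In this table no home owner and no married borrower defaulted. That is the kind of pattern that still needs an explanation, and on ten rows a significance check, before it is reported.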
Available Datasets
References