
COURSE INTRODUCTION
Week 1

CSC 600: Data Mining

Today

What is Data Mining?
Syllabus / Course Webpage
Types of Data
Data Quality
Exploring Data: Summary Statistics, Visualization
Introduction to R

What is Data Mining?

Data Mining and Business Analytics deal with collecting and analyzing data for better decision making.
Goal: solving business problems

Data collection (more and more data is being collected)
Warehousing of data (readily available for analysis; data from numerous sources already integrated)
Computer storage and computing power get cheaper every day
Good software for performing analysis

Data Mining

Blends traditional data analysis (mathematical + statistical) with sophisticated machine learning algorithms

[Venn diagram: Math, CS, Business]
Math
CS: programming ability to process big data
Business: interested in decision making
Art of data mining

Data Mining Applications

Businesses collect lots of data
Purchase information
Web site browsing habits
Social network data
Goals: customer profiling, targeted marketing, fraud detection
Questions that an analyst will try to answer by data mining:
Who are the most profitable customers?
What products can be cross-sold?
What is the revenue outlook for the company next year?

Many variables are collected; few turn out to be useful.

Target Example

2010 project to predict customer pregnancy (pregnancy scores)
Tremendous sales opportunity when a family prepares for a newborn
Send specific marketing material (baby coupon book)
Awareness of false positives; camouflaged activities

Links:

http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
http://www.kdnuggets.com/2014/05/target-predict-teen-pregnancy-inside-story.html

Data Mining Applications

Medicine, Science, and Engineering are collecting lots of data
NASA / weather observations (collecting land surface, ocean, atmosphere readings)
Molecular Biology data (large amounts of genomic data being gathered to better understand the function of genes)
Medical data (outcomes of procedures)
Questions that a scientist will try to answer by data mining:
How are land surface precipitation and temperature affected by ocean surface temperature?
How well can we predict the beginning and end of the growing season for a region?

What we will do in this Course

Learn Basic-to-Intermediate Data Mining Techniques
Apply them on Datasets
Program using R Statistical Framework
Read, Understand, Discuss, Critique Scientific Papers
Perform Significant Individual Data

Syllabus / Course Webpage

Accessing Course Webpage Resources

What is Data Mining?

The process of automatically discovering useful information in large data repositories, to find novel and useful patterns that might otherwise remain unknown

What is NOT Data Mining?

Looking up records in a MySQL database (database)
Finding relevant web pages based on a Google search query (information retrieval)

Data Mining and Knowledge Discovery

Process of converting raw data into useful information

[Process diagram, in order of stages]
Input Data: MySQL, .csv
Data Preprocessing: Feature Selection, Dimensionality Reduction, Normalization
Data Mining: Decision Trees, Support Vector Machines, Linear Regression
Postprocessing: Visualization, Pattern Interpretation
Reporting to Boss: closing the loop

Input Data

Data is available in a variety of formats:
Big Data / Data Warehouse
Flat files (.csv or .txt)
Spreadsheets (Excel .xls is tougher to deal with)
Relational tables (MySQL)
Data spread out over multiple locations
CS programming ability is often necessary
Sometimes an enormous amount of effort is needed
Digitizing hand-written notes

Preprocessing

To transform raw input data into an appropriate format for subsequent analysis
Fusing data from multiple sources
Cleaning data to remove noise
Duplicate observations
"Garbage in, garbage out" also applies to data mining
Selecting records and features that are relevant to the data mining task at hand
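
The three preprocessing steps above (fusing, cleaning, selecting) can be sketched in a few lines. This is a minimal illustration in plain Python rather than the course's R framework; the two record lists, their field names, and the helper functions are all hypothetical, and treating a missing value as noise to be dropped is an assumption made for the example.

```python
# Hypothetical records from two sources, keyed by a shared customer id.
purchases = [
    {"id": 1, "total_spent": 120.0},
    {"id": 2, "total_spent": 35.5},
    {"id": 2, "total_spent": 35.5},   # duplicate observation
    {"id": 3, "total_spent": None},   # missing value, treated as noise here
]
web_visits = [
    {"id": 1, "visits": 14, "browser": "Firefox"},
    {"id": 2, "visits": 3, "browser": "Chrome"},
    {"id": 3, "visits": 8, "browser": "Safari"},
]


def fuse(left, right, key):
    """Fuse two sources: join records that share the same key value."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


def drop_duplicates(rows):
    """Keep only the first copy of each identical record."""
    seen = set()
    unique = []
    for row in rows:
        fingerprint = tuple(sorted(row.items()))
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(row)
    return unique


def drop_incomplete(rows):
    """Remove records that contain a missing (None) value."""
    return [row for row in rows if all(v is not None for v in row.values())]


def select_features(rows, features):
    """Keep only the features relevant to the task at hand."""
    return [{name: row[name] for name in features} for row in rows]


fused = fuse(purchases, web_visits, key="id")
clean = drop_incomplete(drop_duplicates(fused))
prepared = select_features(clean, ["id", "total_spent", "visits"])
print(prepared)
```

Only two records survive: the duplicate for id 2 is collapsed, the incomplete record for id 3 is dropped, and the `browser` field is left out as irrelevant.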

Data Mining

Applying Appropriate Data Mining Task on Data

Linear Regression
Support Vector Machines
Decision Trees
Clustering

Postprocessing

Performing:

Visualization
Statistical significance tests, confidence intervals, hypothesis testing to eliminate spurious data mining results
(yikes, math!)
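
To make the confidence-interval idea concrete, here is a minimal sketch that computes an approximate 95% interval for a mean. The measurements are hypothetical (imagine the improvement of a model over a baseline on eight test splits), and the interval uses the normal approximation with the constant 1.96; for a sample this small a t-based interval would be wider, so treat this only as an illustration.

```python
import math
import statistics

# Hypothetical measurements: improvement over a baseline on eight test splits.
improvements = [0.8, 1.2, 0.9, 1.1, 1.0, 1.3, 0.7, 1.0]

mean = statistics.mean(improvements)
std_dev = statistics.stdev(improvements)           # sample standard deviation
std_error = std_dev / math.sqrt(len(improvements))

# Normal-approximation 95% confidence interval (an assumption; see lead-in).
half_width = 1.96 * std_error
low, high = mean - half_width, mean + half_width

print(f"mean = {mean:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
# If the interval excludes 0, the improvement is less likely to be spurious.
print("interval excludes zero:", low > 0 or high < 0)
```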

Challenges of Data Mining

Scalability

Gigabytes, terabytes, petabytes, exabytes of data
Storage, processing
Are data mining algorithms scalable?
Limits of R statistical framework

Challenges of Data Mining

High Dimensionality

Datasets with hundreds or thousands of attributes
Some traditional data analysis techniques were developed for low-dimensional data, and may not work well with high-dimensional data
Many variables are collected; few turn out to be useful.

Challenges of Data Mining

Heterogeneous and Complex Data

Traditional data analysis often deals with data sets containing attributes of the same type (e.g. all continuous, all categorical)
Non-traditional data: collection of web pages (w/ semi-structured text and hyperlinks)

Challenges of Data Mining

Data Ownership

Good data is often geographically distributed and owned by more than one organization (e.g. medical records)
Access to good data
Facebook and Google keep their collected data private

Traditional Data Analysis

Based on a hypothesize-and-test paradigm
1. Hypothesis proposed
2. Experiment designed to gather data
3. Data analyzed w/ respect to hypothesis
4. Hypothesis accepted or rejected

Traditional Data Analysis

Hypothesize-and-test pattern
Data collection
Laborious process
Generation and evaluation of thousands of hypotheses
Usually on relatively smaller datasets

Data Mining

Datasets analyzed are typically not the result of a carefully designed experiment
Opportunistic samples of data
Datasets of terabyte (TB) size
Because of data quantity, the role of traditional statistical concepts (confidence intervals, statistical significance tests) is reduced
With large data sets, almost any small difference becomes significant
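
The last point can be checked with a short calculation. The sketch below assumes two groups of equal size n whose means differ by a fixed, tiny amount (0.01 standard deviations) and computes a two-sample z statistic and its two-sided p-value; the function names and the chosen numbers are hypothetical. The difference never changes, yet the p-value collapses as n grows.

```python
import math


def two_sample_z(delta, sd, n):
    """z statistic for a mean difference `delta` between two groups of size n,
    each with standard deviation `sd`."""
    standard_error = sd * math.sqrt(2.0 / n)
    return delta / standard_error


def two_sided_p_value(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


delta, sd = 0.01, 1.0   # a practically negligible difference
for n in (100, 10_000, 1_000_000):
    z = two_sample_z(delta, sd, n)
    print(f"n={n:>9,}  z={z:6.3f}  p={two_sided_p_value(z):.3g}")
```

With 100 observations per group the difference is nowhere near significant at the usual 0.05 level; with a million per group it is overwhelmingly so, even though its practical size is unchanged.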

Vocabulary

id  Home Owner  Marital Status  Annual Income  Defaulted Borrower
1   Yes         Single          125K           No
2   No          Married         100K           No
3   No          Single          70K            No
4   Yes         Married         120K           No
5   No          Divorced        95K            Yes
6   No          Married         60K            No
7   Yes         Divorced        220K           No
8   No          Single          85K            Yes
9   No          Married         75K            No
10  No          Single          90K            Yes

Column: attribute, feature, field, dimension, variable
Row: instance, record, observation, sample
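
The vocabulary maps directly onto how such a table is held in a program. The sketch below stores the borrower table as a list of records (rows) and pulls out one attribute (column). The column names are hypothetical identifiers, incomes are in thousands, and two details are assumptions made while reading the slide: the ids run 1 to 10, and the stray 60K belongs to record 6.

```python
# Column names: each one is an attribute / feature / field / variable.
columns = ["id", "home_owner", "marital_status", "annual_income_k", "defaulted"]

# Each tuple is one row: an instance / record / observation / sample.
rows = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),    # income assumed from the stray 60K
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]

# One dictionary per record makes attributes accessible by name.
records = [dict(zip(columns, row)) for row in rows]

# A single column is the same attribute taken across all records.
incomes = [record["annual_income_k"] for record in records]

print(len(records), "rows x", len(columns), "columns")
print("record 5:", records[4])
print("annual_income_k column:", incomes)
```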

Data Mining Tasks


1. Predictive Tasks

Objective: predict the value of a particular attribute, based on the values of other attributes
"Defaulted Borrower?" is the target (or dependent variable)
Attributes/features used for making the prediction are known as explanatory (or independent) variables
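
A minimal sketch of how a predictive task is set up: separate the target from the remaining attributes, then make a prediction for each record. The predictor here is deliberately the simplest possible baseline (always predict the most common class), not one of the techniques covered in the course, and the field names are hypothetical. The records follow the borrower table; income is left out because it is not needed for this baseline.

```python
from collections import Counter

records = [
    {"home_owner": "Yes", "marital_status": "Single", "defaulted": "No"},
    {"home_owner": "No", "marital_status": "Married", "defaulted": "No"},
    {"home_owner": "No", "marital_status": "Single", "defaulted": "No"},
    {"home_owner": "Yes", "marital_status": "Married", "defaulted": "No"},
    {"home_owner": "No", "marital_status": "Divorced", "defaulted": "Yes"},
    {"home_owner": "No", "marital_status": "Married", "defaulted": "No"},
    {"home_owner": "Yes", "marital_status": "Divorced", "defaulted": "No"},
    {"home_owner": "No", "marital_status": "Single", "defaulted": "Yes"},
    {"home_owner": "No", "marital_status": "Married", "defaulted": "No"},
    {"home_owner": "No", "marital_status": "Single", "defaulted": "Yes"},
]

TARGET = "defaulted"

# X holds the attributes used for prediction, y holds the target.
X = [{k: v for k, v in record.items() if k != TARGET} for record in records]
y = [record[TARGET] for record in records]

# Baseline predictor: ignore X and always answer with the majority class.
majority_class = Counter(y).most_common(1)[0][0]
predictions = [majority_class for _ in X]

accuracy = sum(p == t for p, t in zip(predictions, y)) / len(y)
print("majority class:", majority_class)
print("baseline accuracy:", accuracy)
```

Any real predictive model should beat this baseline; here it is already right for 7 of the 10 records because most borrowers did not default.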


Data Mining Tasks


2. Descriptive Tasks

Objective: derive patterns (correlations, clusters) that summarize underlying relationships in data
Often more exploratory and requires an explanation of found results
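
A descriptive task has no target to predict; it summarizes what is in the data. As a minimal sketch, the code below groups the borrower records by marital status and reports the share of defaults in each group. The pairs are taken from the borrower table, the variable names are hypothetical, and a summary over ten records is only an illustration of the idea, not evidence of a real pattern.

```python
from collections import defaultdict

# (marital status, defaulted?) pairs for the ten borrower records.
observations = [
    ("Single", "No"), ("Married", "No"), ("Single", "No"), ("Married", "No"),
    ("Divorced", "Yes"), ("Married", "No"), ("Divorced", "No"),
    ("Single", "Yes"), ("Married", "No"), ("Single", "Yes"),
]

totals = defaultdict(int)
defaults = defaultdict(int)
for status, defaulted in observations:
    totals[status] += 1
    if defaulted == "Yes":
        defaults[status] += 1

# Share of defaulted borrowers within each marital-status group.
default_rate = {status: defaults[status] / totals[status] for status in totals}

for status, rate in sorted(default_rate.items()):
    print(f"{status:<9} n={totals[status]}  default rate={rate:.2f}")
```

The output is a pattern that still needs explaining: in this tiny table no married borrower defaulted, which is the kind of finding an analyst would have to interpret rather than simply report.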


Available Datasets

References

Introduction to Data Mining, 1st edition, Tan et al.
Data Mining and Business Analytics in R, 1st edition, Ledolter
