Week1 Intro

COURSE
INTRODUCTION
Week 1
CSC 600: Data Mining
Today
What is Data Mining?

Syllabus / Course Webpage
Types of Data
Data Quality
Exploring Data: Summary Statistics, Visualization
Introduction to R
Data Mining and Business Analytics deal with collecting and analyzing data for better decision making.
Goal: solving business problems
Data collection (more and more data is being collected)

Warehousing of data (readily available for analysis; data from numerous sources already integrated)
Computer storage and computing power get cheaper every day
Good software for performing analysis
Data Mining
Blends traditional data analysis (mathematical + statistical) with sophisticated machine learning algorithms
Math
CS: programming ability to process big data
Business: interested in decision making
Art of data mining
Data Mining Applications
Businesses collect lots of data
Purchase information
Web site browsing habits
Social network data
Goals: customer profiling, targeted marketing, fraud detection
Questions that an analyst will try to answer by data mining:
Who are the most profitable customers?

What products can be cross-sold?
What is the revenue outlook for the company next year?
Many variables are collected; few turn out to be useful.
Target Example
2010 project to predict customer pregnancy (pregnancy scores)
Tremendous sales opportunity when a family prepares for a newborn
Send specific marketing material (baby coupon book)
Awareness of false positives; camouflaged activities
Links:
http://www.forbes.com/sites/kashmirhill/2012/02/16/how-targetfigured-out-a-teen-girl-was-pregnant-before-her-father-did/
http://www.kdnuggets.com/2014/05/target-predict-teen-pregnancy-insidestory.html
Data Mining Applications
Medicine, science, and engineering are collecting lots of data
NASA / weather observations (collecting land surface, ocean, atmosphere readings)
Molecular biology data (large amounts of genomic data being gathered to better understand the function of genes)
Medical data (outcomes of procedures)
Questions that a scientist will try to answer by data mining:
How are land surface precipitation and temperature affected by ocean surface temperature?
How well can we predict the beginning and end of the growing season for a region?
What we will do in this Course
Learn Basic-to-Intermediate Data Mining Techniques
Apply them on Datasets
Program using R Statistical Framework
Read, Understand, Discuss, Critique Scientific Papers
Perform Significant Individual Data
Syllabus / Course Webpage
Accessing Course Webpage Resources
What is Data Mining?
The process of automatically discovering useful information in large data repositories
Used to find novel and useful patterns that might otherwise remain unknown
What is NOT Data Mining?
Looking up records in a MySQL database (database)
Finding relevant web pages based on a Google search query (information retrieval)
Data Mining and Knowledge Discovery
Process of converting raw data into useful information
Input Data (MySQL, .csv)
-> Data Preprocessing (Feature Selection, Dimensionality Reduction, Normalization)
-> Data Mining (Decision Trees, Support Vector Machines, Linear Regression)
-> Postprocessing (Visualization, Pattern Interpretation)
-> Reporting to Boss (closing the loop)
Input Data
Data is available in a variety of formats:
Big Data / Data Warehouse
Flat files (.csv or .txt)
Spreadsheets (Excel .xls is tougher to deal with)
Relational tables (MySQL)
Data spread out over multiple locations
CS programming ability often necessary
Sometimes an enormous amount of effort
Digitizing hand-written notes
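As a rough illustration of reading a flat file, here is a minimal Python sketch (the course itself uses R). The file contents, the column names, and the helper `load_records` are assumptions made up for this example; only the three income values echo the loan table used later in the deck.

```python
import csv
import io

# Hypothetical contents of a small flat file. In practice the handle would
# come from open("some_file.csv", newline="") rather than io.StringIO.
RAW_CSV = """id,home_owner,marital_status,annual_income
1,Yes,Single,125000
2,No,Married,100000
3,No,Single,70000
"""


def load_records(handle):
    """Parse CSV text into a list of dicts, one dict per row (record)."""
    records = []
    for row in csv.DictReader(handle):
        # csv returns every field as a string, so numeric columns
        # have to be converted explicitly.
        row["id"] = int(row["id"])
        row["annual_income"] = int(row["annual_income"])
        records.append(row)
    return records


records = load_records(io.StringIO(RAW_CSV))
print(len(records), "records; columns:", list(records[0].keys()))
```

Spreadsheets and relational tables need extra tooling (a spreadsheet reader or a database driver), which is part of why flat files are the easier starting point.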
Preprocessing
To transform raw input data into an appropriate format for subsequent analysis
Fusing data from multiple sources
Cleaning data to remove noise
Duplicate observations
"Garbage in, garbage out" also applies to data mining
Selecting records and features that are relevant to the data mining task at hand
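The preprocessing steps listed above (fusing sources, removing duplicate observations, selecting relevant features) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two "branch" sources, the `notes` field, and the helper names are hypothetical and not part of the course material.

```python
def fuse(*sources):
    """Concatenate records from several sources into one list."""
    fused = []
    for source in sources:
        fused.extend(source)
    return fused


def drop_duplicates(records, key_fields):
    """Keep only the first record seen for each combination of key fields."""
    seen = set()
    unique = []
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def select_features(records, fields):
    """Keep only the attributes relevant to the task at hand."""
    return [{f: rec[f] for f in fields} for rec in records]


# Two made-up sources that overlap on the record with id 2.
branch_a = [
    {"id": 1, "marital_status": "Single", "annual_income_k": 125, "notes": "n/a"},
    {"id": 2, "marital_status": "Married", "annual_income_k": 100, "notes": ""},
]
branch_b = [
    {"id": 2, "marital_status": "Married", "annual_income_k": 100, "notes": "dup"},
    {"id": 3, "marital_status": "Single", "annual_income_k": 70, "notes": ""},
]

clean = select_features(
    drop_duplicates(fuse(branch_a, branch_b), key_fields=["id"]),
    fields=["id", "marital_status", "annual_income_k"],
)
print(clean)
```

Treating `id` as the duplicate key is itself an assumption; real sources often disagree on identifiers, which is where much of the preprocessing effort goes.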
Data Mining
Applying the appropriate data mining task to the data
Linear Regression
Support Vector Machines
Decision Trees
Clustering
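To make the first technique in the list concrete, here is a minimal Python sketch of simple (one-variable) linear regression fitted by ordinary least squares. The data points are made up so that they lie exactly on y = 2x + 1, and `fit_line` is a hypothetical helper; the course would use R's own tooling for this.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up points that lie exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(xs, ys)
print("slope:", slope, "intercept:", intercept)
```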
Postprocessing
Performing:
Visualization
Statistical significance tests, confidence intervals, hypothesis testing to eliminate spurious data mining results
(yikes, math!)
Challenges of Data Mining
Scalability
Gigabytes, terabytes, petabytes, exabytes of data
Storage, processing
Are data mining algorithms scalable?
Limits of R statistical framework
High Dimensionality
Datasets with hundreds or thousands of attributes
Some traditional data analysis techniques were developed for low-dimensional data, and may not work well with high-dimensional data
Many variables are collected; few turn out to be useful.
Heterogeneous and Complex Data
Traditional data analysis often deals with data sets containing attributes of the same type (e.g. all continuous, all categorical)
Non-traditional data: collection of web pages (w/ semi-structured text and hyperlinks)
Data Ownership
Good data is often geographically distributed and owned by more than one organization (e.g. medical records)
Access to good data
Facebook and Google keep their collected data private
Traditional Data Analysis
Based on a hypothesize-and-test paradigm
1. Hypothesis proposed
2. Experiment designed to gather data
3. Data analyzed w/ respect to hypothesis
4. Hypothesis accepted or rejected
Traditional Data Analysis
Hypothesis-and-test pattern
Data collection
Laborious process
Generation and evaluation of thousands of hypotheses
Usually on relatively small datasets
Data Mining
Datasets analyzed are typically not the result of a carefully designed experiment
Opportunistic samples of data
Datasets of terabyte (TB) size
Because of data quantity, the role of traditional statistical concepts (confidence intervals, statistical significance tests) is reduced
With large data sets, almost any small difference becomes significant
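The last point can be checked numerically. The Python sketch below uses a two-sample z-test with a known, shared standard deviation, which is an assumption chosen to keep the arithmetic short; the difference of 0.01 standard deviations and the sample sizes are made-up illustration values, and `two_sided_p_value` is a hypothetical helper.

```python
import math


def two_sided_p_value(mean_diff, std_dev, n_per_group):
    """Two-sided p-value of a two-sample z-test.

    Assumes two groups of equal size with a known, shared standard deviation.
    """
    standard_error = std_dev * math.sqrt(2.0 / n_per_group)
    z = abs(mean_diff) / standard_error
    # For a standard normal Z: P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# The same tiny difference, tested at increasing sample sizes.
for n in (100, 10_000, 1_000_000):
    p = two_sided_p_value(mean_diff=0.01, std_dev=1.0, n_per_group=n)
    print(f"n per group = {n:>9,}: p-value = {p:.3g}")
```

The difference never changes, yet the p-value shrinks as n grows. With terabyte-scale data, "statistically significant" therefore says little about whether a pattern is practically useful.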
Vocabulary
id   Home Owner   Marital Status   Annual Income   Defaulted Borrower
1    Yes          Single           125K            No
2    No           Married          100K            No
3    No           Single           70K             No
4    Yes          Married          120K            No
5    No           Divorced         95K             Yes
6    No           Married          60K             No
7    Yes          Divorced         220K            No
8    No           Single           85K             Yes
9    No           Married          75K             No
10   No           Single           90K             Yes
Column: attribute, feature, field, dimension, variable
Row: instance, record, observation, sample
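The vocabulary maps directly onto code. In this Python sketch the loan table from this slide is stored as a list of rows. The snake_case column names and the two helpers are hypothetical, and row 6's income is taken from the stray 60K value in the slide's table.

```python
# Column names (attributes / features / fields / variables).
COLUMNS = ["id", "home_owner", "marital_status", "annual_income_k", "defaulted"]

# Each tuple is one row (instance / record / observation / sample).
ROWS = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),   # income assumed from the stray 60K
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]


def get_record(record_id):
    """Return one row as a dict keyed by column name."""
    for row in ROWS:
        if row[0] == record_id:
            return dict(zip(COLUMNS, row))
    raise KeyError(record_id)


def get_column(name):
    """Return all values of one attribute, in row order."""
    index = COLUMNS.index(name)
    return [row[index] for row in ROWS]


print("rows:", len(ROWS), "columns:", len(COLUMNS))
print("record 5:", get_record(5))
print("marital_status column:", get_column("marital_status"))
```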
Data Mining Tasks
1. Predictive Tasks
Objective: predict the value of a particular attribute based on the values of other attributes
Defaulted Borrower? is the target (or dependent variable)
Attributes/features used for making the prediction are known as explanatory (or independent) variables
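A predictive task on the loan table might look like the Python sketch below. It uses a k-nearest-neighbor vote on annual income alone, a technique picked here for brevity rather than one the slides prescribe. The query incomes of 92K and 115K are made up, and row 6's income is assumed from the stray 60K value in the slide's table.

```python
from collections import Counter

# (annual income in thousands, Defaulted Borrower?) for the ten records.
TRAINING = [
    (125, "No"), (100, "No"), (70, "No"), (120, "No"), (95, "Yes"),
    (60, "No"),   # income assumed from the stray 60K in the slide's table
    (220, "No"), (85, "Yes"), (75, "No"), (90, "Yes"),
]


def predict_default(income_k, k=3):
    """Predict the target by majority vote of the k records closest in income."""
    nearest = sorted(TRAINING, key=lambda rec: abs(rec[0] - income_k))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Annual income is the only explanatory variable here; the target is the label.
for income in (92, 115):
    print(f"income {income}K -> predicted default: {predict_default(income)}")
```

Ten records are far too few to trust such a model; the point is only the roles of the explanatory variable (income) and the target (default).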
Data Mining Tasks
2. Descriptive Tasks
Objective: derive patterns (correlations, clusters) that summarize underlying relationships in data
Often more exploratory and requires an explanation of found results
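A descriptive task summarizes the data instead of predicting a target. As a small Python sketch on the loan table, the code below groups records by default status and by marital status. The grouping choices and helper names are assumptions for illustration, and row 6's income is assumed from the stray 60K value in the slide's table.

```python
from collections import defaultdict

# (marital status, annual income in thousands, Defaulted Borrower?)
RECORDS = [
    ("Single", 125, "No"), ("Married", 100, "No"), ("Single", 70, "No"),
    ("Married", 120, "No"), ("Divorced", 95, "Yes"),
    ("Married", 60, "No"),   # income assumed from the stray 60K
    ("Divorced", 220, "No"), ("Single", 85, "Yes"),
    ("Married", 75, "No"), ("Single", 90, "Yes"),
]


def mean_income_by_default(records):
    """Average annual income (in thousands) for each default status."""
    incomes = defaultdict(list)
    for _, income, defaulted in records:
        incomes[defaulted].append(income)
    return {label: sum(vals) / len(vals) for label, vals in incomes.items()}


def default_counts_by_status(records):
    """For each marital status: (number of defaults, number of records)."""
    counts = defaultdict(lambda: [0, 0])
    for status, _, defaulted in records:
        counts[status][1] += 1
        if defaulted == "Yes":
            counts[status][0] += 1
    return {status: tuple(pair) for status, pair in counts.items()}


means = mean_income_by_default(RECORDS)
by_status = default_counts_by_status(RECORDS)
print("mean income (K) by default status:", means)
print("(defaults, records) by marital status:", by_status)
```

Summaries like these still need interpretation: no married borrower defaults in this table, but with only four married records that proves little.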
Available Datasets
References
Tan et al., Introduction to Data Mining, 1st edition
Ledolter, Data Mining and Business Analytics in R, 1st edition
