Knowledge
Information
Data
Data Mining Applications
Lots of data being collected
and warehoused
Web data, e-commerce
Social Networks
Purchases at department/grocery stores
Bank/Credit Card
transactions
Government agencies
Descriptive Tasks
Find general properties that describe the data
Data Mining Tasks
Classification [Predictive]
Regression [Predictive]
Visualization [Descriptive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Graph Mining / Social Networks [Descriptive]
Classification: Example
Direct Marketing
Goal: Reduce cost of mailing by targeting a set of consumers
likely to buy a new cell-phone product.
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided
otherwise. This {buy, not buy} binary decision forms the class
attribute.
Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
Use this information as input attributes to learn a classifier model.
Use the model to predict the class attribute value of new customers
from their known input attributes.
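A minimal sketch of the approach above, assuming a k-nearest-neighbour classifier (the slides do not name a model, so this choice is only illustrative). The two input attributes (age, number of past purchases), the training rows, and the `predict` helper are all hypothetical.

```python
from collections import Counter
import math

# Hypothetical history for a similar earlier product:
# (age, past_purchases) -> class attribute {buy, not buy}
train = [
    ((25, 5), "buy"),
    ((30, 4), "buy"),
    ((28, 6), "buy"),
    ((55, 0), "not buy"),
    ((60, 1), "not buy"),
    ((52, 0), "not buy"),
]

def predict(customer, k=3):
    """Return the majority class among the k nearest training customers."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], customer))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# New customers whose input attributes are known but whose class is not.
new_customers = [(27, 5), (58, 1)]
predictions = [predict(c) for c in new_customers]
```

Only customers predicted as "buy" would then be included in the mailing, which is where the cost reduction comes from.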
Classification: Example
Customer Churn/Attrition:
Goal: To predict whether a customer is likely to be lost to a
competitor.
Approach:
Use detailed records of transactions with each past and present
customer to find attributes.
How often the customer calls, where he calls, what time-of-the-day he calls
most, his financial status, marital status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.
Regression/Prediction: Example
Predict the value of a given continuous-valued variable
based on the values of other variables, assuming a linear
or nonlinear model of dependency.
Extensively studied in statistics, econometrics, and the neural
network field.
Examples:
Predicting sales amounts of new product based on advertising
expenditure.
Predicting wind velocities as a function of temperature,
humidity, air pressure, etc.
Time series prediction of stock market indices (forecasting).
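To illustrate the linear case, the sketch below fits sales as a straight-line function of advertising expenditure using ordinary least squares. The figures are invented for the example and `fit_line` is a hypothetical helper, not something taken from the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure and resulting sales (arbitrary units).
advertising = [1, 2, 3, 4]
sales = [3, 5, 7, 9]

slope, intercept = fit_line(advertising, sales)

# Predicted sales for a new product with advertising expenditure of 5 units.
predicted_sales = slope * 5 + intercept
```

A nonlinear model of dependency would replace the straight line with another functional form; the fitting idea stays the same.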
Clustering: Example
Market Segmentation:
Goal: subdivide a market into distinct subsets of customers
where any subset may conceivably be selected as a market
target to be reached with a distinct marketing mix.
Approach:
Collect different attributes of customers based on their geographical
and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying patterns of
customers in the same cluster vs. those from different clusters.
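The "find clusters of similar customers" step can be sketched with k-means, which is an assumed choice here since the slides do not name a clustering algorithm. The customer attributes, the starting centroids, and the `kmeans` helper are hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # New centroid = per-attribute mean of the members; keep the old one if a cluster is empty.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers described by two numeric attributes.
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]

# Fixed starting centroids keep the sketch deterministic.
centroids, clusters = kmeans(customers, centroids=[(1, 1), (8, 8)])
```

Each resulting cluster is a candidate market segment; its quality would then be judged against buying patterns, as described above.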
Clustering: Example
Document Clustering:
Goal: To find groups of documents that are similar to each
other based on the important terms appearing in them.
Approach: Identify frequently occurring terms in each
document. Form a similarity measure based on the frequencies
of different terms. Use it to cluster.
Gain: Information Retrieval can utilize the clusters to relate a
new document or search terms to clustered documents.
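One possible similarity measure based on term frequencies is cosine similarity over raw term counts. This is an assumption for illustration; the slides do not say which measure is intended. The documents and the `cosine_similarity` helper are hypothetical, and tokenization is a plain whitespace split.

```python
from collections import Counter
import math

def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-frequency vectors of two documents."""
    tf_a = Counter(doc_a.lower().split())
    tf_b = Counter(doc_b.lower().split())
    dot = sum(tf_a[term] * tf_b[term] for term in tf_a)
    norm_a = math.sqrt(sum(count * count for count in tf_a.values()))
    norm_b = math.sqrt(sum(count * count for count in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = "data mining finds patterns in data"
doc2 = "mining data for patterns"
doc3 = "stock market prices fell today"

sim_related = cosine_similarity(doc1, doc2)    # documents sharing several terms
sim_unrelated = cosine_similarity(doc1, doc3)  # documents sharing no terms
```

A clustering algorithm can then group documents whose pairwise similarity is high, and a new document or query can be compared against the clusters in the same way.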
Association Rule Mining: Example
Given a set of records, each of which contains some
number of items from a given collection:
Produce dependency rules that predict the occurrence of an
item based on the occurrence of other items.
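A dependency rule is usually scored with support and confidence. These are standard measures that the slides do not name, so treat their use here as an assumption. The transactions and helper functions below are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Hypothetical market-basket records.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]

# Candidate rule: {milk, diaper} -> {beer}
rule_support = support({"milk", "diaper", "beer"}, transactions)
rule_confidence = confidence({"milk", "diaper"}, {"beer"}, transactions)
```

Here the rule holds in 2 of 5 records (support 0.4), and in 2 of the 3 records that contain both milk and diaper (confidence 2/3).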
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
CRISP-DM
Cross Industry Standard Process for Data Mining
Steps in Data Mining
1. Develop an understanding of the purpose of the data mining project
2. Obtain the data set to be used in the analysis
Draw a random sample of records from the large database
Although data mining deals with very large databases,
the analysis usually requires only thousands or tens of thousands of records
3. Explore, clean, and preprocess the data
This involves verifying that the data are in reasonable condition
How should missing data be handled?
Are the values in a reasonable range, given what you would expect for each variable?
Are there obvious outliers?
The data are reviewed graphically - for example, with a matrix of scatter plots showing the
relationship of each variable with every other variable
4. Reduce the data, if necessary
Where supervised training is involved,
separate the data into training, validation, and test data sets
Eliminate unneeded variables
Transform variables
Create new variables
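Steps 2 to 4 can be sketched in a few lines of standard-library Python. Everything here is an assumption made for illustration: the synthetic `database`, the single `age` variable, the 0 to 120 valid range, the choice to drop bad records (only one of several ways to handle missing data), and the 60/20/20 split.

```python
import random

def draw_sample(records, n, seed=0):
    """Step 2: random sample of n records from a large table."""
    return random.Random(seed).sample(records, n)

def quality_report(records, field, low, high):
    """Step 3: count missing values and values outside the expected range."""
    missing = sum(1 for r in records if r[field] is None)
    out_of_range = sum(
        1 for r in records if r[field] is not None and not low <= r[field] <= high
    )
    return {"missing": missing, "out_of_range": out_of_range}

def partition(records, train=0.6, validation=0.2, seed=0):
    """Step 4: shuffle, then split into training, validation, and test sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

# Hypothetical large database with one missing and one implausible value.
database = [{"id": i, "age": 20 + i % 50} for i in range(100_000)]
database[0]["age"] = None
database[1]["age"] = 230

sample = draw_sample(database, 10_000)
report = quality_report(sample, "age", low=0, high=120)

# Assumed handling: drop records that are missing or out of range.
clean = [r for r in sample if r["age"] is not None and 0 <= r["age"] <= 120]
train_set, validation_set, test_set = partition(clean)
```

Fixing the seed makes the sample and the partition repeatable, which helps when comparing different techniques on the same data.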
Steps in Data Mining cont.
5. Determine the data mining task
classification, prediction, clustering, etc.
6. Choose the data mining techniques to be used
Decision trees, Naive Bayes, Hierarchical Clustering, etc.
7. Use algorithms to perform the task
This is typically an iterative process
Choosing different variables or settings within the algorithm
8. Interpret the results of the algorithms
Each algorithm may also be tested on the validation data for tuning purposes
The validation data then becomes part of the fitting process!
The validation error is therefore likely to underestimate the error of the finally
chosen model in deployment
9. Deploy the model in the real world
For example, the model might be applied to a purchased list of possible
customers
The action might be "include in the mailing" if the predicted amount of purchase is
> $10
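The deployment rule in the example amounts to a simple threshold on the model's output. In the sketch below the customer identifiers and predicted amounts are invented; in practice they would come from whichever model was chosen in the earlier steps.

```python
THRESHOLD = 10.0  # dollars; the cut-off used in the example rule

# Hypothetical predicted purchase amounts for a purchased list of possible customers.
predicted_amount = {
    "cust_a": 4.50,
    "cust_b": 12.00,
    "cust_c": 10.00,
    "cust_d": 37.25,
}

# Action: include in the mailing only if the predicted amount is strictly above the threshold.
mailing_list = [cust for cust, amount in predicted_amount.items() if amount > THRESHOLD]
```

Note that a customer predicted at exactly $10 is excluded, because the rule is a strict inequality.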
Review Questions
What is the difference between data and intelligence?