Data Mining

Another Look at Data Mining
Why do we mine? What do we mine? How do we mine?
What is Data Mining

Data mining discovers meaningful new correlations, hidden patterns, and relationships in your data
Conceptual descendant of statistics
Combines machine learning, statistics, and databases
Knowledge discovery: the process of building and implementing a data mining solution
CS753 Dr. Mary Ann Robbert
Data Mining Overview

Knowledge Discovery in Databases (KDD)
No single data mining approach
  Each tool is viewed logically as a client application
  Can reside on a separate machine or in a separate process and access the data warehouse
RDBMS or proprietary OLAP engines embed data mining capabilities deeply within the engine to improve efficiency and add extensions
Requires a good foundation in terms of a data warehouse
Data Mining Overview (cont)
Common algorithmic approaches

  association, affinity grouping
  predicting, sequence-based analysis
  clustering
  classification
  estimation
Steps are: data selection, data transformation, data mining, result interpretation.
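
As one illustration of the association (affinity grouping) approach, the sketch below scores item pairs by support (how often the pair occurs together) and confidence (how often the second item appears given the first). It is a minimal sketch, not a production algorithm: the function name pair_rules, the min_support threshold, and the sample baskets are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent item pairs."""
    n = len(baskets)
    # Count how many baskets contain each item and each unordered pair of items.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(set(basket)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            # Confidence of "a implies b" is P(a and b) / P(a), and likewise for b.
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return rules

# Hypothetical market baskets, invented for this example.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
for antecedent, consequent, support, confidence in pair_rules(baskets):
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

In terms of the four steps, choosing the baskets is data selection, turning them into item sets is data transformation, counting pairs is the mining step, and reading the printed rules is result interpretation.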

Strategic Benefit of Data Mining

Direct marketing
Trend analysis
Fraud detection
Forecasting in financial markets
Why Data Mining Now?

Economics
  Unprecedented affordability of processing power (MIPS) and storage (MB)
Parallel computing
  Enormous amounts of data can be processed
Popularity of data warehouses, data marts
  Relatively clean data available
Data Mining compared to Traditional Analysis
Traditional analysis
  Did sales of product X increase in Nov.?
  Do sales of product X decrease when there is a promotion on product Y?
Data mining is result oriented
  What are the factors that determine sales of product X?
Data Mining compared to Traditional Analysis (cont)
Traditional analysis is incremental
  Does billing level affect turnover?
  Does location affect turnover?
  Analyst builds the model step by step
Data mining is result oriented
  Identify the factors and predict turnover
Steps in Data Mining
Data manipulation - can be 70-80% of data mining effort
  data cleaning
  missing values
  data derivation
  merging data
Defining a study
  Supervised - articulating the goal, choosing the dependent variable or output, and specifying data fields
  Unsupervised - grouping similar types of data or identifying exceptions
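
The two kinds of study can be contrasted in a short sketch. In the supervised case the analyst names an output and the input fields; in the unsupervised case no output is named and the data is simply searched for exceptions. The record layout, the field names, and the two-standard-deviation threshold are assumptions chosen for illustration.

```python
from statistics import mean, pstdev

def supervised_view(records, target, fields):
    """Supervised study: split records into chosen input fields and a named output."""
    inputs = [[record[field] for field in fields] for record in records]
    outputs = [record[target] for record in records]
    return inputs, outputs

def exceptions(values, threshold=2.0):
    """Unsupervised study: flag values far from the rest; no output variable needed."""
    center = mean(values)
    spread = pstdev(values)
    if spread == 0:
        return []  # all values identical, so nothing stands out
    return [v for v in values if abs(v - center) / spread > threshold]

# Hypothetical customer records, invented for this example.
records = [
    {"billing": 10, "region": "east", "churned": 0},
    {"billing": 11, "region": "west", "churned": 0},
    {"billing": 50, "region": "east", "churned": 1},
]
inputs, outputs = supervised_view(records, target="churned", fields=["billing", "region"])
```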

Steps in Data Mining (cont)
Reading the data and building the model
  The model summarizes large amounts of data by accumulating indicators (frequencies, weights, conjunctions, differentiation)
Understanding the model
  Know the particular model
Prediction
  Choose the best outcome based on historical data
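
A very small frequency-based model shows how accumulating indicators and predicting from history fit together. This is only a sketch under assumed names: build_model, predict, and the (condition, outcome) history format are illustrative and do not come from any particular tool.

```python
from collections import Counter, defaultdict

def build_model(history):
    """Summarize (condition, outcome) pairs as outcome frequencies per condition."""
    model = defaultdict(Counter)
    for condition, outcome in history:
        model[condition][outcome] += 1
    return model

def predict(model, condition):
    """Return the historically most frequent outcome, or None for an unseen condition."""
    if condition not in model:
        return None
    return model[condition].most_common(1)[0][0]

# Hypothetical sales history, invented for this example.
history = [
    ("promotion", "sales_up"),
    ("promotion", "sales_up"),
    ("promotion", "sales_flat"),
    ("no_promotion", "sales_flat"),
]
model = build_model(history)
print(predict(model, "promotion"))
```

Understanding the model here means knowing that it only counts: it cannot predict for a condition it has never seen, which is why predict returns None in that case.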
Models
Genetic Algorithms
Neural Nets
Agents
Statistics
Visualization
Genetic Algorithms
An artificial intelligence system that mimics the evolutionary, survival-of-the-fittest process to generate increasingly better solutions to a problem.
Genetic algorithms produce several generations of solutions, choosing the best of the current set for each new generation.
Examples
  Generating human faces based on a few known features.
  Generating solutions to routing problems.
  Generating stock portfolios.
EVOLUTION IN GENETIC ALGORITHMS

SELECTION - or survival of the fittest. The key is to give preference to better outcomes.
CROSSOVER - combining portions of good outcomes in the hope of creating an even better outcome.
MUTATION - randomly trying combinations and evaluating the success (or failure) of the outcome.
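
The three operators can be sketched on a toy problem: evolving a bit string toward all ones. Everything specific here is an assumption made for illustration, including the fitness function, tournament selection, single-point crossover, the mutation rate, and the choice to carry the best candidate into each new generation.

```python
import random

def fitness(bits):
    """Toy objective: the number of 1-bits; higher is better."""
    return sum(bits)

def select(population, rng, k=3):
    """SELECTION: pick k candidates at random and prefer the fittest."""
    return max(rng.sample(population, k), key=fitness)

def crossover(a, b, rng):
    """CROSSOVER: join the head of one parent to the tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(bits, rng, rate=0.02):
    """MUTATION: flip each bit with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in bits]

def evolve(n_bits=20, pop_size=30, generations=40, seed=0):
    """Run the loop and return the best candidate plus the best fitness per generation."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = [max(fitness(c) for c in population)]
    for _ in range(generations):
        best = max(population, key=fitness)  # carried over unchanged
        children = []
        for _ in range(pop_size - 1):
            child = crossover(select(population, rng), select(population, rng), rng)
            children.append(mutate(child, rng))
        population = [best] + children
        history.append(max(fitness(c) for c in population))
    return max(population, key=fitness), history

best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the best candidate is always kept, the best fitness never drops from one generation to the next; without that choice, mutation could occasionally lose a good solution.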
Neural Nets

Mathematical model of the way a brain functions
Machine learning approach by which historical data can be examined for pattern recognition
A neural network simulates the human ability to classify things based on the experience of seeing many examples.

Pros - Numerical data
Cons - Opaque, art or science
Example
  Distinguishing different chemical compounds
  Detecting anomalies in human tissue that may signify disease
  Reading handwriting
  Detecting fraud in credit card use
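
Learning to classify from examples can be shown with the simplest possible network, a single perceptron. This is a deliberately small sketch, far simpler than the networks used for the applications above; the training data (the logical AND of two inputs), the integer learning step, and the function names are assumptions for illustration.

```python
def predict(weights, bias, x):
    """Fire (1) when the weighted sum of the inputs is positive, otherwise 0."""
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total > 0 else 0

def train_perceptron(examples, epochs=20):
    """Nudge the weights toward each misclassified example until none remain."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):
        mistakes = 0
        for x, target in examples:
            error = target - predict(weights, bias, x)
            if error:
                mistakes += 1
                weights = [w + error * xi for w, xi in zip(weights, x)]
                bias += error
        if mistakes == 0:
            break  # every example is classified correctly
    return weights, bias

# Examples of the pattern to learn: output 1 only when both inputs are 1.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_EXAMPLES)
print([predict(weights, bias, x) for x, _ in AND_EXAMPLES])
```

The learned weights are just numbers with no explanation attached, which hints at why larger networks are described as opaque.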
Intelligent Agents
Software entities that carry out some set of operations on behalf of a user or program with some degree of autonomy, and that employ some knowledge or representation of the user's goals and desires.
Some common characteristics
  ability to communicate, cooperate, and coordinate with other agents
  ability to act autonomously to achieve the collective goal of the system
Intelligent Agents (cont)
Tasks
  automating repetitive tasks
  finding and filtering information
  summarizing complex data
Capability to learn and make recommendations
Black box approach hides complexity and allows for design of a scalable system
Comparison
AI System           Problem Type                                Based On                   Starting Information
Expert Systems      Diagnostic or prescriptive                  Strategies of experts      Experts' know-how
Neural Networks     Identification, classification, prediction  The human brain            Acceptable patterns
Genetic Algorithms  Optimal solution                            Biological evolution       Set of possible solutions
Intelligent Agents  Specific and repetitive tasks               One or more AI techniques  Your preferences
Statistics
SAS, SPSS
Pros - Established technology
Cons - Needs assumptions, nominal variable handling, management acceptance?
Visualization
Data visualization refers to technologies that support visualization of information
Includes digital images, GIS, multiple dimensions, 3-D presentations, animations
http://www.almaden.ibm.com/cs/quest/demo/assoc/general.html
Data Mining is Not a Silver Bullet
It does not:
  Find answers to questions you don't ask
  Eliminate the need for domain experience
  Remove the need for data analysis skills
Data Mining Software

http://www.kdnuggets.com/software/
http://www.attar.com/ - download
http://www.cs.bham.ac.uk/~anp/software.html - software listing
Six Rules of Data Quality

by Ken Orr
1. Data that is not used cannot be correct for very long.
2. Data quality in an information system is a function of its use, not its collection.
3. Data quality will ultimately be no better than its most stringent use.
4. Data quality problems tend to become worse with the age of the system.
5. The less likely it is that some data element will change, the more traumatic it will be when it finally does change.
6. Information overload affects data quality.
Data Quality Software
http://www.rulequest.com/gritbot-info.html
General DW Data transformation

Resolve inconsistent legacy formats
Strip out unwanted fields
Interpret codes into text
Combine data from multiple sources under a common key
Find fields used for multiple purposes and interpret the field's value based on context
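
Several of these transformations can be sketched together. The example assumes two hypothetical source systems, a billing extract and a CRM extract, that spell the customer key differently; the field names, the code table, and the function name are all invented for illustration.

```python
# Hypothetical code table used to interpret status codes as text.
STATUS_CODES = {"A": "active", "C": "closed"}

def transform(billing_rows, crm_rows):
    """Merge two sources into one record per customer under a common key."""
    merged = {}
    for row in billing_rows:
        # Resolve inconsistent key formats: trim whitespace and upper-case.
        customer = merged.setdefault(row["cust_id"].strip().upper(), {})
        # Interpret codes into text; unknown codes are labelled, not dropped.
        customer["status"] = STATUS_CODES.get(row["status"], "unknown")
        customer["balance"] = float(row["balance"])
        # Any other billing field is not copied, which strips unwanted fields.
    for row in crm_rows:
        customer = merged.setdefault(row["CustomerID"].strip().upper(), {})
        customer["region"] = row["region"]
    return merged

billing = [{"cust_id": " c1 ", "status": "A", "balance": "10.50", "fax": "n/a"}]
crm = [{"CustomerID": "C1", "region": "East"}]
print(transform(billing, crm))
```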
Data transformation for Data Mining

Flag normal, abnormal, out of bounds, or impossible facts
Recognize random or noise values from context and mask out
Apply uniform treatment to NULL values
Flag fact records with changed status
Classify individual record by one of its aggregates
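
Flagging facts and treating NULL values uniformly might look like the sketch below. The list of spellings treated as missing, the flag names, and the example ranges are assumptions; a real study would take them from the domain.

```python
# Assumed spellings of "missing" in the source data.
NULL_TOKENS = {"", "NULL", "NONE", "N/A", "NAN", "?"}

def normalize_null(raw):
    """Map every assumed spelling of missing, and unparseable noise, to None."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text.upper() in NULL_TOKENS:
        return None
    try:
        return float(text)
    except ValueError:
        return None  # noise that cannot be read as a number is masked out

def flag_fact(value, normal, bounds):
    """Label a value as null, out_of_bounds, abnormal, or normal."""
    if value is None:
        return "null"
    low, high = bounds
    if not low <= value <= high:
        return "out_of_bounds"
    normal_low, normal_high = normal
    return "normal" if normal_low <= value <= normal_high else "abnormal"

# Hypothetical age field: 0-120 is possible, 0-90 is typical.
for raw in ["42", " n/a ", "100", "-5", "abc"]:
    print(repr(raw), flag_fact(normalize_null(raw), normal=(0, 90), bounds=(0, 120)))
```

Keeping the flag alongside the value, rather than silently deleting bad rows, lets the mining step decide whether to exclude or study the exceptions.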
Conclusion
For successful data mining:

  data analysis and mining goals must be identified and formulated
  appropriate data must be selected, cleaned, and prepared for queries and business analysis
http://www.rulequest.com/cubistexamples.html#BOSTON
http://www.almaden.ibm.com/cs/quest/
