Introduction
Davide Anguita
SmartLab
DIBE – Facoltà di Ingegneria
Università degli Studi di Genova

• Motivation: Why data mining?
• What is data mining?
• Data Mining: On what kind of data?
• Data mining functionality
• Are all the patterns interesting?
• Classification of data mining systems
• Major issues in data mining

Motivation: “Necessity is the Mother of Invention”
• Data explosion problem
  • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
• We are drowning in data, but starving for knowledge!
• Solution: Data warehousing and data mining
  • Data warehousing and on-line analytical processing
  • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Evolution of Database Technology
• 1960s:
  • Data collection, database creation, IMS and network DBMS
• 1970s:
  • Relational data model, relational DBMS implementation
• 1980s:
  • RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s-2000s:
  • Data mining and data warehousing, multimedia databases, and Web databases
J.Han & M.Kamber “Data Mining: Concepts and Techniques”

Market Analysis and Management
• Where are the data sources for analysis?
  • Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
  • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
  • Conversion of single to a joint bank account: marriage, etc.
• Cross-market analysis
  • Associations/co-relations between product sales
  • Prediction based on the association information

Market Analysis and Management
• Customer profiling
  • data mining can tell you what types of customers buy what products (clustering or classification)
• Identifying customer requirements
  • identifying the best products for different customers
  • use prediction to find what factors will attract new customers
• Provides summary information
  • various multidimensional summary reports
  • statistical summary information (data central tendency and variation)

Corporate Analysis and Risk Management
• Finance planning
  • cash flow analysis and prediction
  • cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
• Resource planning:
  • summarize and compare the resources and spending
• Competition:
  • monitor competitors and market directions
  • group customers into classes and apply a class-based pricing procedure
  • set pricing strategy in a highly competitive market

Fraud Detection and Management
• Applications
  • widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
• Approach
  • use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
• Examples
  • auto insurance: detect a group of people who stage accidents to collect on insurance
  • money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  • medical insurance: detect professional patients and rings of doctors

Fraud Detection and Management
• Detecting inappropriate medical treatment
  • Australian Health Insurance Commission identifies that in many cases blanket screening tests were requested (save Australian $1m/yr).
• Detecting telephone fraud
  • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
• Retail
  • Analysts estimate that 38% of retail shrink is due to dishonest employees.

Other Applications
• Sports
  • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
• Astronomy
  • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
• Internet Web Surf-Aid
  • IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.
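
The telephone call model under "Detecting telephone fraud" looks for patterns that deviate from an expected norm. A minimal sketch of that idea for a single attribute (call duration) is given here, assuming the norm is summarized by the mean and standard deviation of a caller's past calls. The function name `deviating_calls`, the threshold of three standard deviations, and the sample durations are illustrative assumptions, not part of the slides.

```python
from statistics import mean, stdev

def deviating_calls(history, new_calls, threshold=3.0):
    """Return the new calls whose duration deviates from the caller's norm.

    history:   past call durations (minutes) that define the expected norm
    new_calls: durations to check against that norm
    threshold: hypothetical cut-off, in standard deviations
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # All past calls were identical: anything different deviates.
        return [d for d in new_calls if d != mu]
    return [d for d in new_calls if abs(d - mu) / sigma > threshold]

history = [3, 5, 4, 6, 5, 4, 3, 5]   # made-up durations in minutes
flagged = deviating_calls(history, [4, 5, 45])
print(flagged)
```

A real call model would combine several attributes (destination, duration, time of day or week) rather than one; the single-attribute version only illustrates the "deviation from a norm" step.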

Data Mining: A KDD Process
• Data mining: the core of knowledge discovery process.
[Figure: KDD pipeline, bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation]

Steps of a KDD Process
• Learning the application domain:
  • relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation:
  • Find useful features, dimensionality/variable reduction, invariant representation.
• Choosing functions of data mining
  • summarization, classification, regression, association, clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
  • visualization, transformation, removing redundant patterns, etc.
• Use of discovered knowledge
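
The KDD steps can be read as a chain of functions, each consuming the output of the previous one. The sketch below is one possible toy rendering of that chain, not a prescribed implementation: the record fields (`age`, `product`, `store`), the decade-band discretization, and the co-occurrence count used as the "mining" step are all assumptions made for illustration.

```python
from collections import Counter

def clean(records):
    """Data cleaning: drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Transformation: discretize age into decade bands such as '20s'."""
    return [{**r, "age": f"{r['age'] // 10 * 10}s"} for r in records]

def mine(records):
    """Data mining: count (age band, product) co-occurrences."""
    return Counter((r["age"], r["product"]) for r in records)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}

raw = [   # made-up records
    {"age": 23, "product": "PC", "store": "A"},
    {"age": 27, "product": "PC", "store": "B"},
    {"age": 45, "product": "TV", "store": "A"},
    {"age": None, "product": "PC", "store": "A"},
]
patterns = evaluate(mine(transform(select(clean(raw), ["age", "product"]))), 2)
print(patterns)
```

Keeping each step as a separate function mirrors the slide: any one stage (for example the mining algorithm) can be swapped without touching the others.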

Data Mining and Business Intelligence
[Figure: layered pyramid; the potential to support business decisions increases toward the top. Layers, top to bottom:
  Making Decisions
  Data Presentation: Visualization Techniques
  Data Mining: Information Discovery
  Data Exploration: Statistical Analysis, Querying and Reporting
  Data Warehouses / Data Marts: OLAP, MDA
  Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown beside the layers, top to bottom: End User, Business Analyst, Data Analyst, DBA]

Architecture of a Typical Data Mining System
[Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Databases and Data Warehouse. Also labeled: Knowledge-base; Data cleaning & data integration; Filtering]

Data Mining: On What Kind of Data?
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
  • Object-oriented and object-relational databases
  • Spatial databases
  • Time-series data and temporal data
  • Text databases and multimedia databases
  • Heterogeneous and legacy databases
  • WWW

Data Mining Functionalities (1)
• Concept description: Characterization and discrimination
  • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Association (correlation and causality)
  • Multi-dimensional vs. single-dimensional association
  • age(X, “20..29”) ^ income(X, “20..29K”) -> buys(X, “PC”) [support = 2%, confidence = 60%]
  • contains(T, “computer”) -> contains(T, “software”) [1%, 75%]
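
The bracketed numbers on the association rules are support and confidence. As a small sketch of how the two are usually computed, assume support is the fraction of transactions containing all items of the rule and confidence is the fraction of transactions containing the antecedent that also contain the consequent. The helper names and the four toy transactions below are made up, so the resulting numbers differ from the 1% and 75% on the slide.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the transactions.

    Assumes the antecedent occurs at least once (otherwise this divides by zero).
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

transactions = [   # made-up market baskets
    {"computer", "software"},
    {"computer", "software", "printer"},
    {"computer"},
    {"printer"},
]
s = support(transactions, {"computer", "software"})
c = confidence(transactions, {"computer"}, {"software"})
print(f"support={s:.2f}, confidence={c:.2f}")
```

Here the rule contains(T, “computer”) -> contains(T, “software”) holds in 2 of 4 transactions, and in 2 of the 3 transactions that contain “computer”.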

Data Mining Functionalities (2)
• Classification and Prediction
  • Finding models (functions) that describe and distinguish classes or concepts for future prediction
  • E.g., classify countries based on climate, or classify cars based on gas mileage
  • Presentation: decision-tree, classification rule, neural network
  • Prediction: Predict some unknown or missing numerical values
• Cluster analysis
  • Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
  • Clustering based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity

Data Mining Functionalities (3)
• Outlier analysis
  • Outlier: a data object that does not comply with the general behavior of the data
  • It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis
• Trend and evolution analysis
  • Trend and deviation: regression analysis
  • Sequential pattern mining, periodicity analysis
  • Similarity-based analysis
• Other pattern-directed or statistical analyses
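
The clustering principle (high intra-class similarity, low interclass similarity) can be checked numerically for a given grouping. The sketch below assumes Euclidean distance as the inverse of similarity, so a good clustering should have a small average distance within clusters and a large one between clusters. The function names and the four 2-D points are illustrative assumptions.

```python
from itertools import combinations
from math import dist

def mean_intra_distance(clusters):
    """Average distance between pairs of points in the same cluster."""
    d = [dist(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(d) / len(d)

def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    d = [dist(a, b)
         for c1, c2 in combinations(clusters, 2)
         for a in c1 for b in c2]
    return sum(d) / len(d)

# Two groupings of the same four made-up points.
good = [[(0, 0), (0, 1)], [(5, 5), (5, 6)]]
bad = [[(0, 0), (5, 5)], [(0, 1), (5, 6)]]

for name, clusters in (("good", good), ("bad", bad)):
    print(name, mean_intra_distance(clusters), mean_inter_distance(clusters))
```

For the "good" grouping the within-cluster distance is much smaller than the between-cluster distance; for the "bad" grouping the relation is reversed, which is what the principle is meant to rule out.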

Are All the “Discovered” Patterns Interesting?
• A data mining system/query may generate thousands of patterns; not all of them are interesting.
  • Suggested approach: Human-centered, query-based, focused mining
• Interestingness measures: A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
• Objective vs. subjective interestingness measures:
  • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  • Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Can We Find All and Only Interesting Patterns?
• Find all the interesting patterns: Completeness
  • Can a data mining system find all the interesting patterns?
  • Association vs. classification vs. clustering
• Search for only interesting patterns: Optimization
  • Can a data mining system find only the interesting patterns?
  • Approaches
    • First generate all the patterns and then filter out the uninteresting ones.
    • Generate only the interesting patterns: mining query optimization
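
The first approach, generate every pattern and then filter out the uninteresting ones, can be sketched with the objective measures named on the previous slide. The example assumes rules of the form a -> b over single items and treats a rule as interesting when it meets both a minimum support and a minimum confidence; the thresholds, function names, and transactions are made up for illustration.

```python
from itertools import permutations

def rule_stats(transactions, a, b):
    """Return (support, confidence) of the single-item rule a -> b."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if a in t)
    n_ab = sum(1 for t in transactions if a in t and b in t)
    return n_ab / n, (n_ab / n_a if n_a else 0.0)

def interesting_rules(transactions, min_support, min_confidence):
    """Generate every rule a -> b, then filter by the two thresholds."""
    items = sorted(set().union(*transactions))
    kept = []
    for a, b in permutations(items, 2):
        s, c = rule_stats(transactions, a, b)
        if s >= min_support and c >= min_confidence:
            kept.append((a, b, s, c))
    return kept

transactions = [   # made-up market baskets
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk", "eggs"},
]
rules = interesting_rules(transactions, min_support=0.5, min_confidence=0.9)
print(rules)
```

Of the six candidate rules only eggs -> milk survives the filter. Enumerating all candidates grows quickly with the number of items, which is the motivation for the second approach of generating only the interesting patterns.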

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]

Data Mining: Classification Schemes
• General functionality
  • Descriptive data mining
  • Predictive data mining
• Different views, different classifications
  • Kinds of databases to be mined
  • Kinds of knowledge to be discovered
  • Kinds of techniques utilized
  • Kinds of applications adapted

A Multi-Dimensional View of Data Mining Classification
• Databases to be mined
  • Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
• Knowledge to be mined
  • Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
  • Multiple/integrated functions and mining at multiple levels
• Techniques utilized
  • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
• Applications adapted
  • Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

OLAP Mining: An Integration of Data Mining and Data Warehousing
• Data mining systems, DBMS, Data warehouse systems coupling
  • No coupling, loose-coupling, semi-tight-coupling, tight-coupling
• On-line analytical mining data
  • integration of mining and OLAP technologies
• Interactive mining multi-level knowledge
  • Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
• Integration of multiple mining functions
  • Characterized classification, first clustering and then association

Major Issues in Data Mining (1)
• Mining methodology and user interaction
  • Mining different kinds of knowledge in databases
  • Interactive mining of knowledge at multiple levels of abstraction
  • Incorporation of background knowledge
  • Data mining query languages and ad-hoc data mining
  • Expression and visualization of data mining results
  • Handling noise and incomplete data
  • Pattern evaluation: the interestingness problem
• Performance and scalability
  • Efficiency and scalability of data mining algorithms
  • Parallel, distributed and incremental mining methods

Major Issues in Data Mining (2)
• Issues relating to the diversity of data types
  • Handling relational and complex types of data
  • Mining information from heterogeneous databases and global information systems (WWW)
• Issues related to applications and social impacts
  • Application of discovered knowledge
    • Domain-specific data mining tools
    • Intelligent query answering
    • Process control and decision making
  • Integration of the discovered knowledge with existing knowledge: A knowledge fusion problem
  • Protection of data security, integrity, and privacy

Summary
• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of information repositories
• Data mining functionalities: characterization, association, classification, clustering, outlier and trend analysis, etc.
• Classification of data mining systems
• Major issues in data mining

A Brief History of Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
  • Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
  • Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
  • Journal of Data Mining and Knowledge Discovery (1997)
• 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
• More conferences on data mining
  • PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, etc.

Data Mining Applications
• Data mining is a young discipline with wide and diverse applications
  • There is still a nontrivial gap between general principles of data mining and domain-specific, effective data mining tools for particular applications
• Some application domains
  • Biomedical and DNA data analysis
  • Financial data analysis
  • Retail industry
  • Telecommunication industry

Biomedical Data Mining and DNA Analysis
• DNA sequences: 4 basic building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T).
• Gene: a sequence of hundreds of individual nucleotides arranged in a particular order
• Humans were estimated to have around 100,000 genes (later estimates are closer to 20,000 protein-coding genes)
• Tremendous number of ways that the nucleotides can be ordered and sequenced to form distinct genes
• Semantic integration of heterogeneous, distributed genome databases
  • Current: highly distributed, uncontrolled generation and use of a wide variety of DNA data
  • Data cleaning and data integration methods developed in data mining will help

DNA Analysis: Examples
• Similarity search and comparison among DNA sequences
  • Compare the frequently occurring patterns of each class (e.g., diseased and healthy)
  • Identify gene sequence patterns that play roles in various diseases
• Association analysis: identification of co-occurring gene sequences
  • Most diseases are not triggered by a single gene but by a combination of genes acting together
  • Association analysis may help determine the kinds of genes that are likely to co-occur in target samples
• Path analysis: linking genes to different disease development stages
  • Different genes may become active at different stages of the disease
  • Develop pharmaceutical interventions that target the different stages separately
• Visualization tools and genetic data analysis

Data Mining for Financial Data Analysis
• Financial data collected in banks and financial institutions are often relatively complete, reliable, and of high quality
• Design and construction of data warehouses for multidimensional data analysis and data mining
  • View the debt and revenue changes by month, by region, by sector, and by other factors
  • Access statistical information such as max, min, total, average, trend, etc.
• Loan payment prediction/consumer credit policy analysis
  • feature selection and attribute relevance ranking
  • Loan payment performance
  • Consumer credit rating
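
Viewing revenue "by month, by region" with max, min, total, and average amounts to grouping records along one dimension and aggregating a measure. The following is a rough sketch of that operation in plain Python, assuming records are dictionaries; the field names, the `summarize` helper, and the figures are invented for illustration and do not come from a real data warehouse.

```python
from collections import defaultdict

def summarize(rows, dimension, measure):
    """Group rows by one dimension and report max, min, total, average."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[dimension]].append(row[measure])
    return {
        key: {
            "max": max(vals),
            "min": min(vals),
            "total": sum(vals),
            "average": sum(vals) / len(vals),
        }
        for key, vals in groups.items()
    }

rows = [   # made-up revenue figures
    {"month": "Jan", "region": "North", "revenue": 100},
    {"month": "Jan", "region": "South", "revenue": 80},
    {"month": "Feb", "region": "North", "revenue": 120},
    {"month": "Feb", "region": "South", "revenue": 60},
]
by_month = summarize(rows, "month", "revenue")
by_region = summarize(rows, "region", "revenue")
print(by_month["Jan"])
print(by_region["North"])
```

Calling the same function with a different `dimension` gives a different view of the same data, which is the sense in which the analysis is multidimensional.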

Financial Data Mining
• Classification and clustering of customers for targeted marketing
  • multidimensional segmentation by nearest-neighbor, classification, decision trees, etc. to identify customer groups or associate a new customer to an appropriate customer group
• Detection of money laundering and other financial crimes
  • integration from multiple DBs (e.g., bank transactions, federal/state crime history DBs)
  • Tools: data visualization, linkage analysis, classification, clustering tools, outlier analysis, and sequential pattern analysis tools (find unusual access sequences)

Data Mining for Retail Industry
• Retail industry: huge amounts of data on sales, customer shopping history, etc.
• Applications of retail data mining
  • Identify customer buying behaviors
  • Discover customer shopping patterns and trends
  • Improve the quality of customer service
  • Achieve better customer retention and satisfaction
  • Enhance goods consumption ratios
  • Design more effective goods transportation and distribution policies
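
The Financial Data Mining slide mentions nearest-neighbor segmentation as one way to associate a new customer with an appropriate customer group. A minimal 1-nearest-neighbor sketch follows, assuming each customer is described by two numeric features (age, income in thousands) and that groups are already known for some customers. The group labels, feature choice, and data are hypothetical, and real use would normally rescale the features first.

```python
from math import dist

def nearest_group(new_customer, labeled_customers):
    """Return the group label of the closest known customer (1-NN)."""
    _, label = min(
        labeled_customers, key=lambda item: dist(item[0], new_customer)
    )
    return label

labeled = [   # ((age, income in thousands), group) - made-up data
    ((25, 30), "young-budget"),
    ((28, 35), "young-budget"),
    ((52, 90), "established-premium"),
    ((60, 85), "established-premium"),
]
group = nearest_group((55, 80), labeled)
print(group)
```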

Data Mining in Retail Industry: Examples
• Design and construction of data warehouses based on the benefits of data mining
  • Multidimensional analysis of sales, customers, products, time, and region
• Analysis of the effectiveness of sales campaigns
• Customer retention: Analysis of customer loyalty
  • Use customer loyalty card information to register sequences of purchases of particular customers
  • Use sequential pattern mining to investigate changes in customer consumption or loyalty
  • Suggest adjustments on the pricing and variety of goods
• Purchase recommendation and cross-reference of items

Data Mining for Telecomm. Industry (1)
• A rapidly expanding and highly competitive industry and a great demand for data mining
  • Understand the business involved
  • Identify telecommunication patterns
  • Catch fraudulent activities
  • Make better use of resources
  • Improve the quality of service
• Multidimensional analysis of telecommunication data
  • Intrinsically multidimensional: calling-time, duration, location of caller, location of callee, type of call, etc.
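
The retail examples describe registering purchase sequences from loyalty cards and mining them for sequential patterns. The sketch below is a deliberately simplified stand-in for sequential pattern mining: it only counts, per customer, ordered pairs "item a bought before item b" and keeps pairs that reach a minimum number of customers. The function name, the threshold, and the purchase histories are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_ordered_pairs(sequences, min_support):
    """Count customers whose history contains item a before item b."""
    counts = Counter()
    for seq in sequences:
        # combinations() preserves input order, so (a, b) means a came first.
        # A set is used so each customer counts at most once per pair.
        pairs = {(a, b) for a, b in combinations(seq, 2) if a != b}
        counts.update(pairs)
    return {p: n for p, n in counts.items() if n >= min_support}

sequences = [   # one made-up purchase history per customer
    ["phone", "case", "charger"],
    ["phone", "charger"],
    ["case", "phone", "charger"],
]
patterns = frequent_ordered_pairs(sequences, min_support=3)
print(patterns)
```

Full sequential pattern mining handles longer subsequences and time constraints; the pair count only shows what "order matters" means in this setting.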

Data Mining for Telecomm. Industry (2)
• Fraudulent pattern analysis and the identification of unusual patterns
  • Identify potentially fraudulent users and their atypical usage patterns
  • Detect attempts to gain fraudulent entry to customer accounts
  • Discover unusual patterns which may need special attention
• Multidimensional association and sequential pattern analysis
  • Find usage patterns for a set of communication services by customer group, by month, etc.
  • Promote the sales of specific services
  • Improve the availability of particular services in a region
• Use of visualization tools in telecommunication data analysis

Visual Data Mining
• Visualization: use of computer graphics to create visual images which aid in the understanding of complex, often massive representations of data
• Visual Data Mining: the process of discovering implicit but useful knowledge from large data sets using visualization techniques
• Purpose of Visualization
  • Gain insight into an information space by mapping data onto graphical primitives
  • Provide qualitative overview of large data sets
  • Search for patterns, trends, structure, irregularities, relationships among data.
  • Help find interesting regions and suitable parameters for further quantitative analysis.
  • Provide a visual proof of computer representations derived

Visual Data Mining & Data Visualization
• Integration of visualization and data mining
  • data visualization
  • data mining result visualization
  • data mining process visualization
  • interactive visual data mining
• Data visualization
  • Data in a database or data warehouse can be viewed
    • at different levels of granularity or abstraction
    • as different combinations of attributes or dimensions
  • Data can be presented in various visual forms

[Figure: Boxplots from Statsoft: multiple variable combinations]

Data Mining Result Visualization
• Presentation of the results or knowledge obtained from data mining in visual forms
• Examples
  • Scatter plots and boxplots (obtained from descriptive data mining)
  • Decision trees
  • Association rules
  • Clusters
  • Outliers
  • Generalized rules

[Figure: Visualization of data mining results in SAS Enterprise Miner: scatter plots]

[Figure: Visualization of association rules in MineSet 3.0]

[Figure: Visualization of a decision tree in MineSet 3.0]

[Figure: Visualization of cluster groupings in IBM Intelligent Miner]

Data Mining Process Visualization
• Presentation of the various processes of data mining in visual forms so that users can see
  • How the data are extracted
  • From which database or data warehouse they are extracted
  • How the selected data are cleaned, integrated, preprocessed, and mined
  • Which method is selected at data mining
  • Where the results are stored
  • How they may be viewed

[Figure: Visualization of Data Mining Processes by Clementine]

Interactive Visual Data Mining
• Using visualization tools in the data mining process to help users make smart data mining decisions
• Example
  • Display the data distribution in a set of attributes using colored sectors or columns (depending on whether the whole space is represented by either a circle or a set of columns)
  • Use the display to decide which sector should first be selected for classification and where a good split point for this sector may be

[Figure: Interactive Visual Mining by Perception-Based Classification (PBC)]

Audio Data Mining
• Uses audio signals to indicate the patterns of data or the features of data mining results
• An interesting alternative to visual mining
• An inverse task of mining audio (such as music) databases which is to find patterns from audio data
• Visual data mining may disclose interesting patterns using graphical displays, but requires users to concentrate on watching patterns
• Instead, transform patterns into sound and music and listen to pitches, rhythms, tune, and melody in order to identify anything interesting or unusual
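
One simple way to "transform patterns into sound" is to map each data value to a pitch, so that unusual values stand out as unusually high or low notes. The sketch below assumes a linear mapping onto MIDI note numbers and the standard equal-temperament conversion to frequency (A4 = MIDI note 69 = 440 Hz). The note range, function names, and data series are illustrative choices; actually playing the notes would need an audio library, which is left out here.

```python
def to_midi_notes(values, low_note=48, high_note=84):
    """Scale values linearly onto a range of MIDI note numbers."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant series: every value maps to the same note.
        return [low_note for _ in values]
    span = high_note - low_note
    return [round(low_note + (v - lo) / (hi - lo) * span) for v in values]

def note_frequency(note):
    """Frequency in Hz of a MIDI note (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)

values = [10, 11, 10, 12, 40, 11]   # made-up series with one spike
notes = to_midi_notes(values)
print(notes)
print([round(note_frequency(n), 1) for n in notes])
```

The spike at 40 maps to the top of the note range, three octaves above the baseline, so it would be heard as a clear jump in pitch.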

Is Data Mining a Hype or Will It Be Persistent?
• Data mining is a technology
• Technological life cycle
  • Innovators
  • Early adopters
  • Chasm
  • Early majority
  • Late majority
  • Laggards

Life Cycle of Technology Adoption
• Data mining is at Chasm!?
  • Existing data mining systems are too generic
  • Need business-specific data mining solutions and smooth integration of business logic with data mining functions

Data Mining: Merely Managers' Business or Everyone's?
• Data mining will surely be an important tool for managers’ decision making
  • Bill Gates: “Business @ the speed of thought”
• The amount of the available data is increasing, and data mining systems will be more affordable
• Multiple personal uses
  • Mine your family's medical history to identify genetically-related medical conditions
  • Mine the records of the companies you deal with
  • Mine data on stocks and company performance, etc.
• Invisible data mining
  • Build data mining functions into many intelligent tools

Social Impacts: Threat to Privacy and Data Security?
• Is data mining a threat to privacy and data security?
• “Big Brother”, “Big Banker”, and “Big Business” are carefully watching you
• Profiling information is collected every time
  • You use your credit card, debit card, supermarket loyalty card, or frequent flyer card, or apply for any of the above
  • You surf the Web, reply to an Internet newsgroup, subscribe to a magazine, rent a video, join a club, or fill out a contest entry form
  • You pay for prescription drugs, or present your medical care number when visiting the doctor
• Collection of personal data may be beneficial for companies and consumers, but there is also potential for misuse

Protect Privacy and Data Security
• Fair information practices
  • International guidelines for data privacy protection
  • Cover aspects relating to data collection, purpose, use, quality, openness, individual participation, and accountability
  • Purpose specification and use limitation
  • Openness: Individuals have the right to know what information is collected about them, who has access to the data, and how the data are being used
• Develop and use data security-enhancing techniques
  • Blind signatures
  • Biometric encryption
  • Anonymous databases

Trends in Data Mining (1)
• Application exploration
  • development of application-specific data mining system
  • Invisible data mining (mining as built-in function)
• Scalable data mining methods
  • Constraint-based mining: use of constraints to guide data mining systems in their search for interesting patterns
• Integration of data mining with database systems, data warehouse systems, and Web database systems

Trends in Data Mining (2)
• Standardization of data mining language
  • A standard will facilitate systematic development, improve interoperability, and promote the education and use of data mining systems in industry and society
• Visual data mining
• New methods for mining complex types of data
  • More research is required towards the integration of data mining methods with existing data analysis techniques for the complex types of data
• Web mining
• Privacy protection and information security in data mining