
Data Mining

Introduction
Data mining (Knowledge Discovery from Data)
Why Suddenly Data Mining ?
• Dynamic business environment
– Markets are, or seem to be, saturated
– Customers are aggressive and disloyal
– Speed is essential
– “The quick and the dead”
• Availability of data
– Vast amounts of data are generated and stored electronically, and are
waiting to be processed
• Availability of tools and techniques
– Mathematical tools are available to “process” this data
– This processing is significantly different from MIS / EDP style of data
processing
– Software tools are available to implement some of these mathematical
models
Drivers
• Market: From focus on product/service to focus on
customer
• IT: From focus on up-to-date balances to focus on
patterns in transactions - Data Warehouses - OLAP
• Dramatic drop in storage costs : Huge databases
– e.g. Walmart: 20 million transactions/day, 10-terabyte
database; Blockbuster: 36 million households
• Automatic Data Capture of Transactions
– e.g. Bar Codes , POS devices, Mouse clicks, Location data
(GPS, cell phones)
• Internet: Personalized interactions, longitudinal data
The Business Environment
• Customer Behaviour
– Customers have access to more channels
• Newer retail formats
• Online stores
– Customers have access to more suppliers
• Increased commoditisation
• Customer loyalty is not assured any more
• Market Saturation
– Multiple suppliers operating in each market
– Niche market based on demographics and preferences
• Competition has intensified
• Need for speed
– Product life cycles are getting shorter
– Time to market
New way of looking at customer
• Customer Relationship Management
– Intimacy, collaboration, one-to-one partnerships are necessary
• Need to ask ...
– What classes of customers do we have ? Are there
subclasses in terms of behaviour ?
– How can we sell more to existing customers ? What exactly
are they buying now ?
– Is there a pattern in the way our customers behave ?
– Who are my good customers ?
• To whom should we sell more ?
– Who are my bad customers ?
• Who is likely to default or defraud ?
Availability of vast amounts of data
• ERP and OLTP systems
– Their centralised RDBMSs hold huge pools of firm-wide
data that can overwhelm even the most dedicated
manager
• Data warehouse
– Data warehouse technology has resulted in equally huge pools of historical data
• Storage Capacity
– Inexpensive
– Ultra high capacity
Availability of vast amounts of data
• Cards
– Credit & Debit Cards
– Loyalty Card
– Result in capture of huge pools of data
• Transactional Data Capture
– Point of sales systems, Bar code readers
– Capture vast amount of transaction data at increasing
levels of granularity
• What was sold ? Product, SKU
• When was it sold ? Date, time
• How was it sold ? Discount, Promotions
– Beyond simple sales
• Telephone calls, frequent flyer data
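The granularity described above (what, when and how an item was sold) can be sketched as a record type. This is only an illustrative layout: the class name `SaleLine`, its field names and the sample values are assumptions, not a schema taken from any real point-of-sale system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class SaleLine:
    """One line of a point-of-sale transaction (hypothetical schema)."""
    product: str                      # what was sold
    sku: str                          # what was sold, at stock-keeping-unit level
    sold_at: datetime                 # when it was sold: date and time
    discount_pct: float = 0.0         # how it was sold: discount applied
    promotion: Optional[str] = None   # how it was sold: promotion, if any


# Made-up sample record for illustration only.
sale = SaleLine(
    product="Cola 1L",
    sku="COLA-1L-001",
    sold_at=datetime(2024, 3, 15, 18, 42),
    discount_pct=10.0,
    promotion="weekend-offer",
)

# Fine-grained timestamps allow later questions such as "at what hour do sales peak?"
hour_of_sale = sale.sold_at.hour
```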
Trends leading to Data Flood
• More data is generated:
– Web, text, images …
– Business transactions, calls, ...
– Scientific data: astronomy, biology, etc
• More data is captured:
– Storage technology faster and cheaper
– DBMSs can handle bigger databases
Data Growth
• The explosive growth of data: from terabytes
to petabytes
– Data collection and data availability
• Automated data collection tools, database systems,
Web, computerised society
• We are drowning in data, but starving for
knowledge.
• Data mining is also called knowledge discovery in
databases (KDD).
Data Mining
• Data mining
– Is the process of extracting unknown, valid and
actionable information from large databases and
then using this information to make crucial business
decisions.
• Data mining is the process of discovering meaningful
new correlations, patterns and trends by sifting through
large amounts of data stored in repositories, using
pattern recognition technologies as well as statistical
and mathematical techniques.
• Knowledge Discovery in Databases (KDD)
Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful) patterns
or knowledge from huge amounts of data.
• Combination of AI and statistical analysis to discover
information that is “hidden” in the data
– associations (e.g. linking purchase of pizza with drinks)
– sequences (e.g. tying events together: marriage and purchase of
furniture)
– classifications (e.g. recognizing patterns such as the attributes of
customers that are most likely to quit)
– forecasting (e.g. predicting buying habits of customers based on
past patterns)
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
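The "associations" item above (linking the purchase of pizza with drinks) is usually quantified with support and confidence. The following is a minimal sketch on made-up baskets; the data and the helper names `count`, `support` and `confidence` are illustrative assumptions, not part of the original slides.

```python
# Hypothetical toy baskets; the items are made up for illustration.
baskets = [
    {"pizza", "cola"},
    {"pizza", "cola", "salad"},
    {"pizza", "beer"},
    {"salad", "water"},
    {"pizza", "cola", "beer"},
]


def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets)


def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Share of baskets with the antecedent that also contain the consequent."""
    both = count(set(antecedent) | set(consequent), baskets)
    return both / count(antecedent, baskets)


# Rule under test: pizza -> cola
rule_support = support({"pizza", "cola"}, baskets)
rule_confidence = confidence({"pizza"}, {"cola"}, baskets)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Here three of the five baskets contain both items and four contain pizza, so the rule "pizza implies cola" has support 3/5 and confidence 3/4 on this toy data.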
Knowledge Discovery Definition
• Knowledge Discovery in Data is the non-trivial process of
identifying
– valid
– novel
– potentially useful
– and ultimately understandable patterns in data.
• Are these the same as data mining ?
– SQL queries against large databases
– Multidimensional database analysis
– Online analytical processing
– Sophisticated graphic visualisation
– “Classical” statistical analysis
• ANOVA ? Regression ? Correlation ?
• What is missing ?
– Discovery of information without a previously articulated or
formulated hypothesis
Sample Data Mining Applications
• Direct Marketing
– identify which prospects should be included in a mailing list
• Market segmentation
– identify common characteristics of customers who buy the same
products
• Customer rate of attrition
– Predict which customers are likely to leave your company for a
competitor
• Market Basket Analysis
– Identify what products are likely to be bought together
• Insurance Claims Analysis
– discover patterns of fraudulent transactions
– compare current transactions against those patterns
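Of the applications above, market basket analysis is the easiest to sketch in a few lines: count how often each pair of products appears in the same transaction, without deciding in advance which pairs to look at. The transactions and the function name `pair_counts` below are hypothetical; real systems typically use algorithms such as Apriori to avoid enumerating every pair.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; product names are made up for illustration.
transactions = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]


def pair_counts(transactions):
    """Count how many transactions contain each unordered product pair."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) removes duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


top_pair, top_count = pair_counts(transactions).most_common(1)[0]
print(top_pair, top_count)
```

On this toy data the most frequent pair is bread and butter, which appears together in three of the four transactions.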
Integration of Multiple Technologies
[Figure: Data Mining at the centre, drawing on Machine Learning, Pattern Recognition, Database Management, Statistics, Algorithms and Visualization]
• Statistics (adapted for 21st century data sizes and speed requirements).
Examples:
– Descriptive: Visualization
– Models: Regression, Cluster Analysis
• Machine Learning: e.g. Neural Nets
• Data Base Retrieval: e.g. Association Rules
• Parallel developments: e.g. Tree methods, k Nearest Neighbors, OLAP-EDA
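As one concrete instance of the methods listed above, k nearest neighbours classifies a new record by majority vote among the k closest labelled records. The sketch below uses only the Python standard library; the two features (monthly spend, support calls), the labels and the function name `knn_predict` are assumptions chosen to echo the customer-attrition example, not data from the slides.

```python
from collections import Counter
from math import dist  # Euclidean distance, available from Python 3.8

# Hypothetical training data: (monthly_spend, support_calls) -> label.
training = [
    ((20.0, 5.0), "leaves"),
    ((25.0, 4.0), "leaves"),
    ((22.0, 6.0), "leaves"),
    ((80.0, 1.0), "stays"),
    ((90.0, 0.0), "stays"),
    ((85.0, 2.0), "stays"),
]


def knn_predict(point, training, k=3):
    """Return the majority label among the k training points nearest to point."""
    nearest = sorted(training, key=lambda row: dist(point, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


prediction = knn_predict((24.0, 5.0), training)
print(prediction)
```

In practice the features would be scaled first, since an unscaled feature with a large range (here, spend) dominates the distance.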
Why not traditional data analysis?
• Tremendous amount of data
– Algorithms must be highly scalable to handle data sets as large as
terabytes
• High-dimensionality of data
• High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structure data, graphs, social networks and multi-linked data
– Spatial, wave, text, multimedia data
– Software programs, scientific simulations
• New applications
Data Mining: On what kinds of data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional
database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structure data, graphs, social networks and
multi-linked data
– Spatial, text, multimedia data
– The world wide web
