BITS Pilani
Pilani Campus
DATA MINING
IS ZC415 Second Semester 2013-14
2/15/2014
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Reference Books
Han J. & Kamber M., Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, Second Edition, 2006.
Dunham M.H. & Sridhar S., Data Mining: Introductory and Advanced Topics, Pearson Education, 2006.
Agenda
Motivation: Why data mining?
What is data mining?
Data mining: the KDD process
Data mining tasks
Major issues in data mining
Applications
Business question: "What was my total revenue in the last five years?"
Enabling technologies: computers, tapes, disks
Product providers: IBM, CDC

Business question: "What were unit sales in New England last March?"
Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC

Business question: "What were unit sales in New England last March? Drill down to Boston."
Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
Characteristics: retrospective, dynamic data delivery at multiple levels

Business question: "What's likely to happen to Boston unit sales next month? Why?"
Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
Software Agents
Expert Systems
Online Analytical Processing (OLAP)
Statistical Analysis Tool
Data Visualization
Data mining: extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases.
Input Data -> Data Preprocessing -> Data Mining -> Postprocessing -> Information
KDD process (figure): Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Data Mining
Creating a target data set: data selection
Data cleaning and preprocessing (may take 60% of effort!)
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
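The selection, cleaning and transformation steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the field names (`age`, `income`, `city`, `notes`), and the choice of mean imputation and min-max scaling are hypothetical and are not taken from the slides.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"age": 25, "income": 30000, "city": "Pilani", "notes": "n/a"},
    {"age": None, "income": 52000, "city": "Delhi", "notes": ""},
    {"age": 41, "income": None, "city": "Pilani", "notes": "vip"},
    {"age": 33, "income": 45000, "city": "Goa", "notes": ""},
]

def select(records, fields):
    """Data selection: keep only the attributes relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records, fields):
    """Data cleaning: replace missing numeric values with the column mean."""
    means = {f: mean(r[f] for r in records if r[f] is not None) for f in fields}
    return [{f: (r[f] if r[f] is not None else means[f]) for f in fields}
            for r in records]

def scale(records, fields):
    """Transformation: min-max scale each attribute to the range [0, 1]."""
    lo = {f: min(r[f] for r in records) for f in fields}
    hi = {f: max(r[f] for r in records) for f in fields}
    return [{f: (r[f] - lo[f]) / (hi[f] - lo[f]) if hi[f] > lo[f] else 0.0
             for f in fields} for r in records]

numeric = ["age", "income"]  # assumed to be the useful features
target = scale(clean(select(raw, numeric), numeric), numeric)
```

Dropping `city` and `notes` stands in for variable reduction. In practice the choice of features and of imputation strategy depends on the mining task.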
Layers of the figure, from top to bottom (roles shown: End User at the top, DBA at the bottom):
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts: OLAP, MDA
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Data Mining (figure, surrounded by related fields): Machine Learning, Visualization, Information Science, Other Disciplines
Data Mining:
Large data sets
  Efficiency of algorithms is important
  Scalability of algorithms is important
Real-world data
  Lots of missing values
  Pre-existing data, not user generated
  Data not static, prone to updates
Efficient methods for data retrieval available for use
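As a small illustration of these constraints, the sketch below keeps summary statistics in a single pass, skips missing values, and absorbs later updates without rescanning old data. The technique (Welford's one-pass method) and the sample values are assumptions chosen for the example; the slides do not prescribe a particular algorithm.

```python
class RunningStats:
    """One-pass mean and variance (Welford's method), ignoring missing values."""

    def __init__(self):
        self.n = 0          # number of non-missing values seen so far
        self.mean = 0.0
        self._m2 = 0.0      # running sum of squared deviations from the mean
        self.missing = 0    # number of missing values skipped

    def add(self, x):
        if x is None:
            self.missing += 1
            return
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def variance(self):
        """Sample variance; 0.0 when fewer than two values were seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for value in [70, None, 68, 65, None, 64]:   # hypothetical data stream
    stats.add(value)

stats.add(72)   # a later update is absorbed without rescanning old data
```

Memory use stays constant regardless of how many values arrive, which is the property that matters when the data set does not fit in memory.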
Data Warehouse: a centralized data repository which can be queried for business benefit.
Data warehousing makes it possible to:
  extract archived operational data
  overcome inconsistencies between different legacy data formats
  integrate data throughout an enterprise, regardless of location, format, or communication requirements
  incorporate additional or expert information
OLAP: On-line Analytical Processing
  Multi-dimensional data model (data cube)
  Operations: roll-up, drill-down, slice and dice, rotate
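The OLAP operations named above can be illustrated on a tiny in-memory fact table. This is a hedged sketch, not how an OLAP server is implemented: the fact table, the dimension names and the helper functions `aggregate`, `slice_` and `dice` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: (region, city, year, quarter, units_sold)
facts = [
    ("New England", "Boston", 2013, "Q1", 120),
    ("New England", "Boston", 2013, "Q2", 150),
    ("New England", "Hartford", 2013, "Q1", 80),
    ("New England", "Hartford", 2014, "Q1", 95),
    ("Midwest", "Chicago", 2013, "Q1", 200),
    ("Midwest", "Chicago", 2014, "Q1", 210),
]
DIMS = {"region": 0, "city": 1, "year": 2, "quarter": 3}

def aggregate(rows, dims):
    """Group by the listed dimensions and sum units over all the others."""
    cube = defaultdict(int)
    for row in rows:
        cube[tuple(row[DIMS[d]] for d in dims)] += row[4]
    return dict(cube)

def slice_(rows, dim, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[DIMS[dim]] == value]

def dice(rows, conditions):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [r for r in rows
            if all(r[DIMS[d]] in allowed for d, allowed in conditions.items())]

roll_up = aggregate(facts, ["region"])               # coarser: totals per region
drill_down = aggregate(facts, ["region", "city"])    # finer: totals per city
sliced = aggregate(slice_(facts, "year", 2013), ["region"])
diced = aggregate(dice(facts, {"region": {"New England"}, "quarter": {"Q1"}}),
                  ["city", "year"])
rotated = aggregate(facts, ["year", "region"])       # rotate: swap the axes
```

Roll-up and drill-down differ only in how many dimensions are kept in the group-by key, which is why one `aggregate` helper covers both.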
OLAP: summaries, trends and forecasts
  Type of result: Analysis
  Method: Multidimensional data modeling, aggregation, statistics
  Example question: What is the average income of mutual fund buyers by region by year?

Data Mining: knowledge discovery of hidden patterns and insights
  Type of result: Insight and prediction
  Method: Induction (build the model, apply it to new data, get the result)
  Example question: Who will buy a mutual fund in the next 6 months and why?
No.  Outlook   Temperature  Humidity  Windy  Play
4    rainy     70           96        false  yes
5    rainy     68           80        false  yes
6    rainy     65           70        true   no
7    overcast  64           65        true   yes
8    sunny     72           95        false  no
9    sunny     69           70        false  yes
10   rainy     75           80        false  yes
11   sunny     75           70        true   yes
12   overcast  72           90        true   yes
13   overcast  81           75        false  yes
14   rainy     71           91        true   no
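A first step in mining a table like this is to count how the class label is distributed for each attribute value. The sketch below does that for rows 4 to 14 as listed above. The column order (outlook, temperature, humidity, windy, play) follows the usual naming for this weather data set and is an assumption here, as is the helper name `class_counts`.

```python
from collections import Counter, defaultdict

# Rows 4-14 of the table: (outlook, temperature, humidity, windy, play)
rows = [
    ("rainy", 70, 96, False, "yes"),
    ("rainy", 68, 80, False, "yes"),
    ("rainy", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),
    ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),
    ("rainy", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),
    ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"),
    ("rainy", 71, 91, True, "no"),
]

def class_counts(data, column):
    """Count class labels (last field) for each value of one attribute."""
    table = defaultdict(Counter)
    for row in data:
        table[row[column]][row[-1]] += 1
    return table

by_outlook = class_counts(rows, 0)
by_windy = class_counts(rows, 3)

for value, counts in by_outlook.items():
    print(value, dict(counts))
```

Counts of this kind are the raw material for simple classifiers such as one-attribute rules or naive Bayes.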
Predictive Tasks
Objective: Predict the value of a specific attribute (the target or dependent variable) based on the values of other attributes (the explanatory variables).
Example 1: Judge whether a patient has a specific disease based on his/her medical test results.
Predictive Model
Example 2: Credit Card Company
Every purchase is placed in 1 of 4 classes:
  Authorize
  Ask for further identification before authorizing
  Do not authorize
  Do not authorize but contact police
Two functions of data mining:
  Examine historical data to determine how the data fit into the 4 classes
  Apply the model to each new purchase
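The two functions, building a model from historical data and applying it to each new purchase, can be sketched with a nearest-neighbour classifier. Everything specific here is an assumption made for illustration: the two features (purchase amount and distance from the home address), the historical records, and the choice of 1-nearest-neighbour as the model. A real card issuer would use far richer data and models.

```python
import math

CLASSES = (
    "authorize",
    "ask for further identification",
    "do not authorize",
    "do not authorize, contact police",
)

# Hypothetical history: ((amount in $1000s, km from home address), class index)
history = [
    ((0.05, 2), 0), ((0.20, 10), 0),
    ((1.50, 40), 1), ((2.00, 300), 1),
    ((6.00, 20), 2), ((8.00, 100), 2),
    ((9.00, 5000), 3), ((7.50, 8000), 3),
]

def fit(examples):
    """Step 1: examine historical data; here, learn per-feature scaling."""
    n_features = len(examples[0][0])
    maxima = [max(x[i] for x, _ in examples) or 1.0 for i in range(n_features)]
    return {"maxima": maxima, "examples": examples}

def predict(model, purchase):
    """Step 2: apply the model; label a purchase like its nearest neighbour."""
    def scaled(x):
        return [v / m for v, m in zip(x, model["maxima"])]
    target = scaled(purchase)
    _, label = min(
        (math.dist(scaled(x), target), y) for x, y in model["examples"]
    )
    return CLASSES[label]

model = fit(history)
decision = predict(model, (0.10, 5))   # a small purchase close to home
```

Scaling each feature by its maximum keeps the distance feature from dominating the amount feature. `math.dist` requires Python 3.8 or later.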
Descriptive Tasks
Objective: To derive patterns (correlations, trends, trajectories) that summarize the underlying relationships in the data.
Example: Identifying web pages that are accessed together.
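The web page example can be sketched by counting how often each pair of pages appears in the same session and keeping the pairs above a support threshold. The sessions, page paths, the 0.4 threshold and the function name `frequent_pairs` are hypothetical and serve only to illustrate the idea.

```python
from collections import Counter
from itertools import combinations

# Hypothetical web sessions: the set of pages each visitor opened.
sessions = [
    {"/home", "/courses", "/admissions"},
    {"/home", "/courses"},
    {"/courses", "/admissions", "/fees"},
    {"/home", "/admissions", "/fees"},
    {"/courses", "/admissions"},
]

def frequent_pairs(sessions, min_support):
    """Return page pairs that co-occur in at least min_support of sessions."""
    counts = Counter()
    for pages in sessions:
        # sorted() gives each pair one canonical order
        counts.update(combinations(sorted(pages), 2))
    total = len(sessions)
    return {pair: n / total for pair, n in counts.items()
            if n / total >= min_support}

pairs = frequent_pairs(sessions, min_support=0.4)
```

The support of a pair is the fraction of sessions that contain both pages. Association rule mining builds on the same counts but extends them to larger item sets and adds confidence measures.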
DATA MINING
Predictive: Regression, Time Series Analysis, Prediction
Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery
Figure taken from M. H. Dunham's book on Data Mining
Summary
KDD Conferences
ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining (ICDM)
Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
ACM SIGMOD
VLDB
(IEEE) ICDE
WWW, SIGIR
Journals
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002.
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000.
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003.
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001.
J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006.
D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
B. Liu. Web Data Mining. Springer, 2006.
T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998.
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005.
Web Resources
Textbook website:
http://wwwusers.cs.umn.edu/~kumar/dmbook/.
PowerPoint slides for the textbook are available via anonymous FTP at:
ftp://ftp.aw.com/cseng/authors/tan
Other Resources:
http://web.ccsu.edu/datamining/resources.html