Introduction to Clementine
Tutors: Cecia Chan & Gabriel Fung
Data mining is
A process of extracting previously unknown, valid and actionable knowledge from large databases
A rule of thumb:
If we know clearly the shape and likely content of what we are looking for, we are probably not dealing with data mining
Data collection
You can learn nothing without data
Modeling
The core part of data mining
Evaluation
See what you have learned!
Problem Statement
Situation:
You are a researcher compiling data for a medical study.
You have collected data about a set of patients, all of whom suffered from the same illness.
Each patient responded to one of five drug treatments.
Figure out which drug might be appropriate for a future patient with the same illness.
Here are the data collected:
Age
Sex (M or F)
BP (Blood pressure: High, normal, or low)
Weight (The weight of the patient)
Cholesterol (Blood cholesterol: Normal or high)
Na (Blood sodium concentration)
K (Blood potassium concentration)
Drug (Drug to which the patient responded)
Clementine is located in
Start > All Programs > Clementine 6.0.2
Work-Space
Models
Nodes
Nodes in the workspace represent different objects and actions. You connect the nodes to form streams, which, when executed, let you visualize relationships and draw conclusions.
Other details
Note: Connect nodes by clicking and dragging with the middle mouse button
Execution:
Replacing values:
Use Filler node:
Suppose we want to replace every weight with its log value (Note: we usually log-transform a variable only when it is highly skewed):
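Outside Clementine, the same replacement can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not of how the Filler node is configured; the sample weights and the helper name `log_transform` are made up for the example.

```python
import math

def log_transform(values):
    """Return the natural log of each value; every value must be positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log is undefined for non-positive values")
    return [math.log(v) for v in values]

# Hypothetical patient weights (illustrative only, not from the tutorial data).
weights = [48.0, 55.0, 61.0, 70.0, 150.0]
log_weights = log_transform(weights)

# On the log scale the one very large weight sits much closer to the rest.
print([round(w, 2) for w in log_weights])
```

The natural log is used here; any other base would only rescale the result by a constant factor.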
Unsupervised Learning:
Train Kohonen (Self-Organizing Map, SOM)
Train KMeans (K-means Clustering)
TwoStep (A kind of Hierarchical Clustering)
Others:
GRI (Association Rule mining)
Apriori (Association Rule mining)
Factor / PCA (Factor analysis, attribute selection technique)
Note:
There are many complex settings for each model.
In this tutorial, we use the default settings.
Fine-tuning a model requires solid experience in data mining.
It means NOTHING even if we have learned SOMETHING, until the knowledge that we have learned is ACTIONABLE and VALID.
Remember:
The training and testing data sets are ALWAYS different (why?)
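One way to see why: a model that merely memorizes its training records looks perfect when scored on those same records, and only unseen records reveal how well it generalizes. A minimal sketch, using a deliberately naive memorizing classifier and made-up records (the classifier, the feature values, and the drug labels are all illustrative assumptions, not part of Clementine or of the tutorial data):

```python
from collections import Counter

def fit_memorizer(records):
    """'Train' by memorizing every (features -> label) pair that was seen."""
    table = {features: label for features, label in records}
    majority = Counter(label for _, label in records).most_common(1)[0][0]
    return table, majority

def predict(model, features):
    table, majority = model
    # Unseen feature combinations fall back to the majority label.
    return table.get(features, majority)

def accuracy(model, records):
    hits = sum(predict(model, f) == y for f, y in records)
    return hits / len(records)

# Hypothetical records: (age band, blood pressure) -> drug.
train = [
    (("young", "HIGH"), "drugA"),
    (("young", "LOW"), "drugC"),
    (("old", "HIGH"), "drugB"),
    (("old", "NORMAL"), "drugX"),
    (("middle", "NORMAL"), "drugX"),
]
test = [
    (("young", "NORMAL"), "drugX"),
    (("middle", "HIGH"), "drugA"),
    (("old", "LOW"), "drugC"),
    (("middle", "LOW"), "drugC"),
]

model = fit_memorizer(train)
train_acc = accuracy(model, train)
test_acc = accuracy(model, test)
print(f"training accuracy: {train_acc:.2f}")
print(f"testing accuracy:  {test_acc:.2f}")
```

Scoring on the training set rewards memorization, so it says little about how the model will behave on future patients.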
Different results:
Different models can yield completely different results.
Choosing and tuning a good model is a difficult job.
In this tutorial, we introduce only the process of data mining.
Assignment 1
Situation:
You are a financial analyst at a bank.
You have to predict whether a customer is Good or Bad based on some demographic information.
Data Set:
A data set about your past customers has been collected.
Each customer is either Good or Bad.
Variable   Role     Type       Description
MARITAL    Input    Nominal    Marital status
PROPERTY   Input    Nominal    Type of property
AGE        Input    Interval   Age in years
OTHER      Input    Nominal    Type of other installment plan
HOUSING    Input    Nominal    Type of house
EXISTCR    Input    Interval   Number of existing credits
JOB        Input    Nominal    Job nature
FOREIGN    Input    Binary     Foreign worker or local worker
GOOD_BAD   Output   Binary     Good or bad credit rating
Data Collection
Please download the CreditRisk data set from http://www.se.cuhk.edu.hk/~ect7470/
Two data sets:
(i) creditRisk1.csv is for training
(ii) creditRisk2.csv is for testing
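For readers who also want to look at the files outside Clementine, a minimal Python sketch follows. It assumes each CSV has a header row whose column names match the variables of this assignment (MARITAL through GOOD_BAD); that layout is an assumption, so check the actual files first. The helper name `load_rows` and the two sample rows are hypothetical.

```python
import csv
import io

def load_rows(file_obj):
    """Read a CSV with a header row into a list of dicts, one per customer."""
    return list(csv.DictReader(file_obj))

# Stand-in for creditRisk1.csv so the sketch runs without the download.
# These two rows are invented; the real values and codes may differ.
sample = io.StringIO(
    "MARITAL,PROPERTY,AGE,OTHER,HOUSING,EXISTCR,JOB,FOREIGN,GOOD_BAD\n"
    "1,2,34,3,2,1,3,1,good\n"
    "2,1,52,3,1,2,2,1,bad\n"
)
rows = load_rows(sample)
print(len(rows), "rows; columns:", list(rows[0].keys()))

# With the real files (kept separate, one for training and one for testing):
# with open("creditRisk1.csv", newline="") as f:
#     train_rows = load_rows(f)
# with open("creditRisk2.csv", newline="") as f:
#     test_rows = load_rows(f)
```

Note that `csv.DictReader` returns every field as a string, so interval variables such as AGE need converting to numbers before any arithmetic.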
Data Preprocessing
Please explore the data and think critically about whether any data preprocessing is necessary
Hint: Two of the interval variables are highly skewed
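One rough way to check a variable for skew is the moment-based skewness coefficient, which is near 0 for symmetric data and positive when there is a long right tail. The sketch below computes it before and after a log transform. The sample values are invented for illustration, and the function name `skewness` is hypothetical; this is one common definition, not necessarily the statistic Clementine reports.

```python
import math

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed sample (illustrative values only).
ages = [21, 22, 23, 24, 25, 27, 30, 35, 48, 75]
raw_skew = skewness(ages)
log_skew = skewness([math.log(a) for a in ages])
print(f"skewness before log: {raw_skew:.2f}")
print(f"skewness after log:  {log_skew:.2f}")
```

A histogram of the variable is usually the quicker check; the coefficient is useful mainly for comparing variables or transformations.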
Modeling
Please build prediction models using the default settings:
C5.0 Decision Tree
Model Assessment
Please use the testing data set (creditRisk2.csv) to evaluate the performance of the prediction models
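What "performance" means can be made concrete with two simple measures: overall accuracy and a count of each (actual, predicted) pair. A minimal sketch with invented labels follows; the label strings, the six sample records, and the function names are assumptions for illustration and do not come from the CreditRisk files.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    """Fraction of records whose predicted label equals the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical labels for six test customers.
actual = ["good", "good", "bad", "good", "bad", "bad"]
predicted = ["good", "bad", "bad", "good", "good", "bad"]

counts = confusion_counts(actual, predicted)
acc = accuracy(actual, predicted)
print(f"accuracy: {acc:.2f}")
for (a, p), n in sorted(counts.items()):
    print(f"actual={a:<4} predicted={p:<4} count={n}")
```

The pair counts matter because the two kinds of error are not equally costly to a bank: accepting a Bad customer and rejecting a Good one are different mistakes.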
Assignment 1 Submission
Deadline:
4 April 2004
This is an individual assignment.
Note: We strongly encourage you to submit this assignment during the class!