Introduction To Clementine1701

Data Mining Tutorial
Introduction to Clementine
Tutors: Cecia Chan & Gabriel Fung
A Brief Review of Data Mining (I)
Data mining is
A process of extracting previously unknown, valid and actionable knowledge from large databases
A rule of thumb:
If we know clearly the shape and likely content of what we are looking for, we are probably not dealing with data mining
A Brief Review of Data Mining (II)
Therefore, data mining is not

SQL queries against any number of disparate databases or data warehouses
SQL queries in a parallel or massively parallel environment
Information retrieval, for example, through intelligent agents
Multidimensional database analysis (MDA)
OLAP
Exploratory data analysis (EDA)
Graphical visualization
Traditional statistical processing against a data warehouse
However, they are all related to data mining
Data Mining Process

1. Business objective(s) determination
   What is your goal?
2. Data collection
   You can learn nothing without data
3. Data preprocessing (or data preparation)
   Remove outliers / filter noise / modify fields / etc.
4. Modeling
   The core part of data mining
5. Evaluation
   See what you have learned!
Data Mining Software
Existing Data mining software:

Clementine from SPSS (we have this software)
Enterprise Miner from SAS (we have this software)
Intelligent Miner from IBM (we have this software)
MineSet from Silicon Graphics
K-wiz from Compression Sciences Ltd.
DBMiner from DBMiner Tech. Inc.
PolyAnalyst from Megaputer Intelligence
StatServer from Mathsoft
...
Problem Statement
Situation:
You are a researcher compiling data for a medical study
You have collected data about a set of patients, all of whom suffered from the same illness
Each patient responded to one of five drug treatments
Step 1: Business objective

Figure out which drug might be appropriate for a future patient with the same illness
Here are the data collected:
Age
Sex (M or F)
BP (Blood pressure: high, normal, or low)
Weight (The weight of the patient)
Cholesterol (Blood cholesterol: normal or high)
Na (Blood sodium concentration)
K (Blood potassium concentration)
Drug (Drug to which the patient responded)
Using Clementine (1)
Clementine is located in
Start > All Programs > Clementine 6.0.2
[Screenshot of the Clementine window, labeled: Work-Space, Models, Nodes]
Using Clementine (2)
Nodes in the workspace represent different objects and actions. You connect the nodes to form streams, which, when executed, let you visualize relationships and draw conclusions.
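The idea of a stream can be sketched outside Clementine as a chain of functions, where each node takes records in and passes its result to the next node. This is only an analogy under stated assumptions: make_stream, select_adults and count_records are hypothetical names, and the records are made-up values for illustration.

```python
from functools import reduce


def make_stream(*nodes):
    """Chain nodes left to right: the output of one node feeds the next."""
    def run(records):
        return reduce(lambda data, node: node(data), nodes, records)
    return run


# Hypothetical nodes: each one is a plain function applied to the records.
def select_adults(records):
    return [r for r in records if r["Age"] >= 18]


def count_records(records):
    return len(records)


# Made-up records, for illustration only.
patients = [{"Age": 23}, {"Age": 15}, {"Age": 47}]

stream = make_stream(select_adults, count_records)
adult_count = stream(patients)  # executing the stream runs every node in order
```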
Step 2: Data Collection (1)

Double Click
Nodes for inputting the collected data
Data Collection (2)

Location of your file
How many columns to use from the file
Whether the first row specifies the names of the fields
Other details
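For readers who want to see what these import settings amount to, here is a minimal Python sketch using the standard csv module. It is not how Clementine reads files; read_table, has_header and max_columns are hypothetical names chosen to mirror the options above, and the sample rows are made up.

```python
import csv
import io


def read_table(text_file, has_header=True, max_columns=None):
    """Read delimited text into a list of dicts, one dict per record."""
    rows = list(csv.reader(text_file))
    if not rows:
        return []
    if has_header:
        names, data = rows[0], rows[1:]
    else:
        # No header row: generate placeholder names field1, field2, ...
        names = [f"field{i + 1}" for i in range(len(rows[0]))]
        data = rows
    if max_columns is not None:
        names = names[:max_columns]
    # zip stops at the shorter sequence, so unused columns are dropped.
    return [dict(zip(names, row)) for row in data]


# Made-up sample standing in for the file on disk.
sample = io.StringIO("Age,Sex,BP\n23,F,HIGH\n47,M,LOW\n")
records = read_table(sample, has_header=True, max_columns=2)
```

All values are read as text here; converting Age to a number would be a separate step.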
Step 3: Data Preparation Explore the Data (1)
Nodes for exploration/visualization:

Table (in the Output panel)
Plot (in the Graphs panel)
Histogram (in the Graphs panel)
Distribution (in the Graphs panel)
Web (in the Graphs panel)
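What the Distribution and Histogram nodes summarize can be approximated in a few lines of Python: counts per category for a symbolic field, and counts per bin for a numeric field. This is a rough sketch, not Clementine's implementation; the function names and the patient values are made up for illustration.

```python
from collections import Counter


def distribution(records, field):
    """Count how often each value of a symbolic field occurs."""
    return Counter(r[field] for r in records)


def histogram(records, field, bin_width):
    """Count numeric values per bin; each key is the lower edge of a bin."""
    counts = Counter()
    for r in records:
        lower_edge = (r[field] // bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


# Made-up records, for illustration only.
patients = [
    {"Age": 23, "BP": "HIGH"},
    {"Age": 47, "BP": "LOW"},
    {"Age": 28, "BP": "HIGH"},
    {"Age": 61, "BP": "NORMAL"},
]

bp_counts = distribution(patients, "BP")
age_bins = histogram(patients, "Age", bin_width=20)
```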
Connect the nodes:

Double Click
Note: Connect the nodes by clicking and dragging with the middle mouse button
Execution:
Note: Right-click on the Table node to display this menu
Other nodes (Please try the other nodes yourself):

Histogram:
Step 3: Data Preparation Modify the Data (1)
Replacing values:
Use Filler node:
Suppose we want to replace every weight with its log value (Note: we usually log-transform a variable only when it is highly skewed):
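Outside Clementine, the same replacement can be sketched in Python. The slide does not say which log base is used; the natural log is assumed here. fill_with_log is a hypothetical helper and the weights are made-up values.

```python
import math


def fill_with_log(records, field):
    """Return copies of the records with `field` replaced by its natural log."""
    filled = []
    for r in records:
        updated = dict(r)  # copy, so the original records stay unchanged
        updated[field] = math.log(r[field])  # requires a positive value
        filled.append(updated)
    return filled


# Made-up weights, for illustration only.
patients = [{"Weight": 50.0}, {"Weight": 80.0}]
logged = fill_with_log(patients, "Weight")
```

Like the Filler node, this overwrites the existing field rather than adding a new one.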
Derive a new value:

Use Derive node:
Suppose we want to combine Na and K:
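As a sketch of what deriving a combined field means, the Python below adds a Na_Over_K field holding the ratio of Na to K and keeps the original fields. derive_na_over_k is a hypothetical helper, and the concentrations are made-up values.

```python
def derive_na_over_k(records):
    """Return copies of the records with an added Na_Over_K field (Na / K)."""
    derived = []
    for r in records:
        updated = dict(r)  # copy, so the original records stay unchanged
        updated["Na_Over_K"] = r["Na"] / r["K"]
        derived.append(updated)
    return derived


# Made-up concentrations, for illustration only.
patients = [{"Na": 0.75, "K": 0.05}, {"Na": 0.60, "K": 0.03}]
with_ratio = derive_na_over_k(patients)
```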
Remove some fields

Use Filter node
Suppose we have derived a new field Na_Over_K; now we need to remove the fields Na and K:
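In Python terms, removing fields simply means dropping keys from every record. A minimal sketch, with filter_fields as a hypothetical helper and one made-up record:

```python
def filter_fields(records, drop):
    """Return copies of the records without the fields named in `drop`."""
    return [{k: v for k, v in r.items() if k not in drop} for r in records]


# Made-up record, for illustration only.
patients = [{"Age": 23, "Na": 0.75, "K": 0.05, "Na_Over_K": 15.0}]
filtered = filter_fields(patients, drop={"Na", "K"})
```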
Step 4: Modeling Define fields
Define the fields

Use Type node:
Step 4: Modeling Build a Model (1)

It is the core part of data mining.
Supervised Learning:
Train Net (Neural Network)
C5.0 (C5.0 Decision Tree)
Linear Reg. (Linear Regression)
C & R Tree (Classification and Regression Tree, CART)
Unsupervised Learning:
Train Kohonen (Self-Organizing Map, SOM)
Train KMeans (K-means Clustering)
TwoStep (A kind of Hierarchical Clustering)
Others:
GRI (Association Rule mining)
Apriori (Association Rule mining)
Factor / PCA (Factor analysis, attribute selection technique)
Build what model?

Recall that our objective is to determine which drug is suitable for a specific patient. Thus, it is a classification problem (supervised learning)
In this tutorial, we use:

C5.0 and C & R Tree
Note:
There are many complex settings for each model
In this tutorial, we use the default settings
Fine-tuning a model requires solid experience in data mining
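To give a feel for what a decision-tree learner does at each step, the sketch below picks the field whose split yields the largest information gain. This is an illustrative simplification and not Clementine's C5.0 or C & R Tree implementation (CART, for instance, is usually described with the Gini index rather than entropy). The function names and the four patient records are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(records, field, target):
    """Entropy of the target minus the weighted entropy after splitting on `field`."""
    base = entropy([r[target] for r in records])
    groups = {}
    for r in records:
        groups.setdefault(r[field], []).append(r[target])
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return base - remainder


def best_split(records, fields, target):
    """Pick the field with the highest information gain."""
    return max(fields, key=lambda f: information_gain(records, f, target))


# Made-up records, for illustration only.
patients = [
    {"BP": "HIGH", "Sex": "F", "Drug": "drugA"},
    {"BP": "HIGH", "Sex": "M", "Drug": "drugA"},
    {"BP": "LOW", "Sex": "F", "Drug": "drugB"},
    {"BP": "LOW", "Sex": "M", "Drug": "drugB"},
]

chosen = best_split(patients, ["BP", "Sex"], "Drug")
```

In this toy data BP separates the two drugs perfectly while Sex tells us nothing, so BP is the field chosen for the first split. A full tree learner would repeat the same choice recursively inside each branch.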
Step 5: Evaluation (1)
Even if we have learned SOMETHING, it means NOTHING until the knowledge we have learned is ACTIONABLE and VALID
Remember:
The training and testing data sets are ALWAYS different (why?)
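One common way to obtain separate training and testing sets from a single collection is a random hold-out split, sketched below. A model scored on the same records it was trained on tends to look better than it will on new patients, which is why the two sets are kept apart. holdout_split, the 30% test fraction and the seed are assumptions made for illustration; the tutorial itself does not prescribe a splitting method.

```python
import random


def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a repeatable split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Ten placeholder record ids stand in for real patient records.
records = list(range(10))
train_set, test_set = holdout_split(records, test_fraction=0.3, seed=0)
```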
Create the following flow
Note: The flow must be the same as in the training stage
Different results:
Different models can yield completely different results
Choosing and tuning a good model is a difficult job
In this tutorial, we only introduce the process of data mining
Assignment 1
Assignment 1 Problem Statement
Situation:
You are a financial analyst at a bank
You have to predict whether a customer is Good or Bad based on some demographic information
Data Set:
A data set about your past customers has been collected
Each customer is either Good or Bad
Assignment 1 Field definitions

VARIABLE   ROLE     DEFINITION   DESCRIPTION
CHECKING   input    Nominal      Checking account status
HISTORY    input    Nominal      Credit history
AMOUNT     input    Interval     Amount in Bank
SAVINGS    input    Nominal      No. of Savings (bonds, stocks, etc)
EMPLOYED   input    Nominal      Employment Type (Gov., private, etc)
INSTALLP   input    Nominal      Type of installment rate
MARITAL    input    Nominal      Marital status
PROPERTY   input    Nominal      Type of Property
AGE        input    Interval     Age in years
OTHER      input    Nominal      Type of other installment plan
HOUSING    input    Nominal      Type of House
EXISTCR    input    Interval     Number of existing credits
JOB        input    Nominal      Job Nature
FOREIGN    input    Binary       Foreign worker or Local worker
GOOD_BAD   output   Binary       Good or bad credit rating
Assignment 1 Data Mining Process
Data Collection
Please download the CreditRisk data set from http://www.se.cuhk.edu.hk/~ect7470/
Two data sets:
(i) creditRisk1.csv is for training
(ii) creditRisk2.csv is for testing
Data Preprocessing
Please explore the data and think critically whether any data preprocessing is necessary
Hint: Two of the interval variables are highly skewed
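Skewness is normally judged by looking at a histogram, but it can also be checked numerically. The sketch below computes the population skewness (third standardized moment) before and after a log transform. The values are made up and are not taken from the CreditRisk data; skewness is a hypothetical helper name.

```python
import math


def skewness(values):
    """Population skewness: the third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up, strongly right-skewed values, for illustration only.
amounts = [1, 10, 100, 1000, 10000]

raw_skew = skewness(amounts)
log_skew = skewness([math.log(v) for v in amounts])
```

A value near zero suggests a roughly symmetric distribution, while a large positive value indicates a long right tail. For these made-up values the log transform brings the skewness close to zero.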
Assignment 1 Data Mining Process
Modeling
Please build a prediction model using the default settings:
C5.0 Decision Tree
Model Assessment
Please use the testing data set to evaluate the performance of the prediction model
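Two common ways to summarize performance on the testing set are overall accuracy and a confusion matrix of actual versus predicted classes. The sketch below shows both in plain Python. The label lists are made up and do not come from the assignment data, and the slides do not prescribe a particular measure.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of records whose predicted class equals the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) pair."""
    return Counter(zip(actual, predicted))


# Made-up labels, for illustration only.
actual = ["Good", "Good", "Bad", "Bad", "Good"]
predicted = ["Good", "Bad", "Bad", "Good", "Good"]

acc = accuracy(actual, predicted)
matrix = confusion_matrix(actual, predicted)
```

The confusion matrix is worth reading alongside the accuracy, because the two kinds of error (a Bad customer predicted as Good, and the reverse) may not be equally costly to a bank.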
Assignment 1 Submission
Save the stream as id.str

E.g., 00123456.str
Upload your stream to the course account
Deadline:
4 April 2004
This is an individual assignment
Note: We strongly encourage you to submit this assignment during the class!
