UNIVERSITI UTARA MALAYSIA


School of Quantitative Sciences

SEMESTER II A132 2013/2014
KNOWLEDGE ACQUISITION IN DECISION MAKING
SQIT3033 GROUP A
Project Report: FISH


PREPARED FOR: DR. IZWAN NIZAL B MOHD SHAHARANEE

PREPARED BY:
Qamaruzzaman B Mohd Zain 204652
Muhammad Zulqarnain Hakim B Abd Jalal 212476
Muhamad Aidi Taufiq B Idris 212426



Contents
1.0 INTRODUCTION
1.1 Data Background
2.0 PROBLEM STATEMENT
2.1 Techniques To Be Used
3.0 RESEARCH METHODOLOGY
3.1 Knowledge Discovery in Databases (KDD)
3.1.1 Selection
3.1.2 Pre-processing
4.0 RESEARCH SOLUTION
4.1 Data Mining Technique
4.2 Steps Involved
4.2.1 Define EMDATA
4.2.2 Data Partition
4.2.3 Decision Tree
4.2.4 Neural Network
4.2.5 Regression
4.2.6 Assessment
5.0 SUMMARY AND DISCUSSION
REFERENCES








1.0 INTRODUCTION

1.1 Data Background
The data we received come from the Journal of Statistics Education Data Archive
(2006), "Fish Catch" data set (1917). This data set contains measurements of 159 fish
caught in Finland's Lake Laengelmavesi. The data set contains eight variables,
listed below, for the 159 fish caught:
Species = species of fish
Weight = weight of the fish, in grams
Length1 = length of the fish from the nose to the beginning of the tail, in centimeters
Length2 = length of the fish from the nose to the notch of the tail, in centimeters
Length3 = length of the fish from the nose to the end of the tail, in centimeters
Height = maximum height of the fish, in centimeters
Width = maximum width of the fish, in centimeters
Category = category of fish (freshwater or marine)
For the Species and Category variables, the measurements are nominal; the other
variables are interval measurements. There are seven species of fish (Bream, Roach,
Whitefish, Parkki, Perch, Pike, and Smelt) and two categories (freshwater and marine).
This data set contains no missing values, and the distribution of each variable is
approximately normal. The data are processed using SAS Enterprise Miner.






2.0 PROBLEM STATEMENT

According to the data set from the Journal of Statistics Education Data Archive (2006),
the fisherman does not have the knowledge to assign each species of fish to its
corresponding category based on the variables listed.
Based on this problem, we must categorize the fish into their categories using SAS
Enterprise Miner.

2.1 Techniques To Be Used

a) Decision Tree
Decision trees are produced by algorithms that identify various ways of splitting a data
set into branch-like segments. These segments form an inverted decision tree that
originates with a root node at the top of the tree. The object of analysis is reflected in this
root node as a simple, one-dimensional display in the decision tree interface. The name of
the field of data that is the object of analysis is usually displayed, along with the spread or
distribution of the values that are contained in that field.
A decision tree can be used to clarify and find an answer to a complex problem. The
structure allows users to take a problem with multiple possible solutions and display it in
a simple, easy-to-understand format that shows the relationship between different events
or decisions. The furthest branches on the tree represent possible end results.
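As a minimal sketch of the idea (a scikit-learn stand-in, not the SAS Enterprise Miner implementation used later in this report; the file name and column names are assumptions based on the data description in Section 1.1):

```python
# A minimal sketch of fitting a decision tree classifier to the fish data.
# Assumes the data set has been exported to a local CSV file named fish.csv
# with the column names described in Section 1.1 (hypothetical setup).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

fish = pd.read_csv("fish.csv")
X = fish[["Weight", "Length1", "Length2", "Length3", "Height", "Width"]]
y = fish["Category"]  # target: freshwater vs. marine

tree = DecisionTreeClassifier(max_depth=3, random_state=1)
tree.fit(X, y)
print("training misclassification rate:", 1 - tree.score(X, y))
```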






b) Regression
Regression is a statistical measure that attempts to determine the strength of the
relationship between one dependent variable (usually denoted by Y) and a series of other
changing variables (known as independent variables). Regression takes a group of
random variables, thought to be predicting Y, and tries to find a mathematical
relationship between them.
This relationship is typically in the form of a straight line (linear regression) that best
approximates all the individual data points. Regression is often used to determine how
many specific factors such as the price of a commodity, interest rates, particular
industries or sectors influence the price movement of an asset. The process ends when
none of the variables outside the model has a p-value less than the specified entry value
and every variable in the model is significant at the specified stay value.
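As a minimal sketch of the straight-line idea (illustrative only; the sample values below are hypothetical):

```python
# A minimal sketch of simple linear regression y = m*x + c with NumPy:
# find the least-squares slope m and intercept c for a set of points.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, c = np.polyfit(x, y, deg=1)  # degree-1 least-squares fit
print(f"slope m = {m:.3f}, intercept c = {c:.3f}")
print("prediction at x = 6:", m * 6 + c)
```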

c) Neural Network
A Neural Network is a set of connected input/output units where each connection has a
weight associated with it. The general purpose of neural network modeling is to
estimates, classify and make predictions. Neural network modeling is typically designed
in fitting data with an enormous number of records with numerous predictor variables in
the model. An artificial neural network (ANN), often just called a "neural network" (NN),
is a mathematical model or computational model based on biological neural networks, in
other words, is an emulation of biological neural system. It consists of an interconnected
group of artificial neurons and processes information using a connectionist approach to
computation. In most cases an ANN is an adaptive system that changes its structure based
on external or internal information that flows through the network during the learning
phase.
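A minimal sketch of such a network, using a scikit-learn stand-in with one hidden layer of three neurons (the default geometry described in Section 4.2.4); the synthetic data here is an assumption for illustration:

```python
# A minimal sketch of a feed-forward neural network classifier with one
# hidden layer of three neurons, trained on synthetic data.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=159, n_features=6, random_state=1)
nn = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000, random_state=1)
nn.fit(X, y)
print("training misclassification rate:", 1 - nn.score(X, y))
```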



3.0 RESEARCH METHODOLOGY

Knowledge Discovery and Data Mining (KDD) is an interdisciplinary area focusing on
methodologies for extracting useful knowledge from data. The challenge of extracting
knowledge from data draws upon research in statistics, databases, pattern recognition, machine
learning, data visualization, optimization, and high-performance computing to deliver advanced
business intelligence and web discovery solutions.
There are several steps in the KDD process: selection, pre-processing, transformation,
data mining, and interpretation and evaluation.
3.1 Knowledge Discovery in Databases (KDD)

3.1.1 Selection

The main idea is that we select our sample from the given data set to match the
specific scope of our problem. We also need to decide how many samples to use
for training and testing. For our data, we include all variables in our analysis.

3.1.2 Pre-processing

In this stage, data reliability is enhanced. It includes data cleaning, such as
handling missing values and removing noise or outliers. There are many methods,
ranging from doing nothing to this stage becoming the major part (in terms of time
consumed) of a KDD project. It may involve complex statistical methods or the use
of a data mining algorithm in this context. For example, if one suspects that a
certain attribute is insufficiently reliable or has many missing values, then this
attribute could become the target of a supervised data mining algorithm: a
prediction model for the attribute is developed, and the missing data can then be
predicted. The extent to which one pays attention to this stage depends on many
factors.
There are several methods in the pre-processing stage: data cleaning, data
integration, data transformation, and data reduction.

3.1.2.1 Data Cleaning

Data cleaning is a process used to detect inaccurate, incomplete, or
unreasonable data and to improve data quality by correcting the detected
errors and omissions. It includes handling incomplete, noisy, and inconsistent
values in a data set.
For this data set, we found no missing values across the variables. There are
also no inconsistencies in our data, no replication, and no other possible
redundancy. There are no outliers, as all variables are approximately normally
distributed.

3.1.2.2 Data Integration


Data integration is needed when data come from different sources with different
naming standards but the same meaning. If this happens, we handle it by combining
the two sources into one. Another method is to perform correlation analysis between
the two sources and measure the strength of the relationship between them.

In our data set, several variables have similar names but different meanings. For
example, Length1, Length2, and Length3 have almost the same name but differ in
meaning: Length1 is the length of the fish from the nose to the beginning of the tail,
Length2 is the length from the nose to the notch of the tail, and Length3 is the length
from the nose to the end of the tail, all measured in centimeters.


3.1.2.3 Data Transformation

In data mining pre-processing, especially with metadata and data warehousing, we
use data transformation to convert data from a source format into a destination
format. Data transformation can involve (see the sketch after this list):
i) Z-score normalization: the attribute values are scaled based on the attribute's
mean and standard deviation. This is useful when the actual minimum and
maximum are unknown or when outliers dominate the extreme values.
ii) Min-max normalization: performs a linear transformation on the original
data, scaling the values to fall within a small specified range, commonly
[0, 1]. This normalization preserves the relationships among the original
data values.
iii) Decimal scaling: normalizes by moving the decimal point of the values of an
attribute.
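A minimal sketch of the three methods above, with hypothetical sample values:

```python
# A minimal sketch of z-score normalization, min-max normalization, and
# decimal scaling applied to one attribute (hypothetical values).
import numpy as np

x = np.array([120.0, 340.0, 500.0, 700.0, 975.0])  # e.g. weights in grams

z_score = (x - x.mean()) / x.std()              # center on mean, scale by std
min_max = (x - x.min()) / (x.max() - x.min())   # linear map onto [0, 1]
j = int(np.ceil(np.log10(np.abs(x).max())))     # smallest j with max|x|/10^j <= 1
decimal = x / 10**j                             # shift the decimal point

print(z_score, min_max, decimal, sep="\n")
```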
In this project, there are no problems with the given data set: there are no missing
values, no outliers, no redundancy, all variables are approximately normally
distributed, and no data integration between sources is required. Therefore we do
not need to transform the data and can continue with the processing.

3.1.3 Data Mining

Generally, data mining (sometimes called data or knowledge discovery) is the
process of analyzing data from different perspectives and summarizing it into
useful information: information that can be used to increase revenue, cut costs,
or both. Data mining software is one of a number of analytical tools for analyzing
data. It allows users to analyze data from many different dimensions or angles,
categorize it, and summarize the relationships identified. Technically, data mining
is the process of finding correlations or patterns among dozens of fields in large
relational databases.
In this project, we use SAS Enterprise Miner to build the competing models. In
SAS Enterprise Miner, we use the decision tree, regression, and neural network
methods. At the end of this project, we compare the methods and identify which
is the best method to solve our problem.

3.1.4 Interpretation and Evaluation Process

In the interpretation and evaluation process, some data mining output is not in a
human-understandable format and needs interpretation for better understanding.
We therefore convert the output into an easily understood medium so that people
with less knowledge of the subject can readily follow it. Evaluation is needed to
measure the validity of the generated model and to ensure the model is correct.








4.0 RESEARCH SOLUTION

4.1 Data Mining Technique
As mentioned in the problem statement, we use three different predictive modeling
tools from the data mining techniques in SAS Enterprise Miner: decision tree,
neural network, and regression.
4.2 Steps Involved
4.2.1 Define EMDATA
Open the SAS application. Launch Enterprise Miner by clicking Solution >
Analysis > Enterprise Miner. Then click File > New > Project to create a new
project. We rename our project Fish. After that, click Create and rename the
untitled diagram Models.


Figure 1




Figure 2


Figure 3



Figure 4
Then we focus on the Enterprise Miner window. First, drag the Input Data
Source node into the workspace. The Input Data Source node reads data sources and
defines their attributes for later processing by Enterprise Miner. Open the node,
browse for the input data, click Select, and save the node.

Figure 5



4.2.2 Data Partition

Then we drag a Data Partition node and connect it to the Input Data Source node.
The function of this node is to partition the input fish data set into training,
validation, and test sets. The training data set is used for preliminary
model fitting. The validation data set is used to monitor and tune the free model
parameters during estimation and is also used for model assessment. The test data
set is an additional holdout data set that we can use for model assessment. Right-
click on the Data Partition node, click Open, and set the percentage of each
partition; we decided on 70% for training, 0% for validation, and 30% for
testing.
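For comparison, a minimal sketch of an analogous 70/30 partition outside SAS Enterprise Miner (a scikit-learn stand-in; the file and column names are assumptions):

```python
# A minimal sketch of a 70% training / 30% test partition, analogous to the
# Data Partition node settings above (assumed fish.csv export and columns).
import pandas as pd
from sklearn.model_selection import train_test_split

fish = pd.read_csv("fish.csv")
train, test = train_test_split(fish, test_size=0.30, random_state=1,
                               stratify=fish["Category"])
print(len(train), len(test))  # roughly 111 training and 48 test records
```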

Partition:

Figure 6



Figure 7
4.2.3 Decision Tree

The Decision Tree node is used to fit decision tree models to the data. The
implementation includes features found in a variety of popular decision
tree algorithms such as CHAID, CART, and C4.5. The node supports both
automatic and interactive training. When we run the Decision Tree node in
automatic mode, it automatically ranks the input variables based on the strength
of their contribution to the tree. This ranking can be used to select variables for
use in subsequent modeling. We can override any automatic step with the option
to define a splitting rule and prune explicit nodes or sub-trees. Interactive training
enables us to explore and evaluate a large set of trees as we develop them.

Right-click the Decision Tree node and then click Open.


Figure 8


Here are the results for the decision tree:

Figure 9
Here we can see that the misclassification rate is lowest at leaves 3 and 4,
which share the same value, 0.036. Since this is close to zero, the result is good.
Decision Tree Diagram:

Figure 10
From the diagram above, we can see that there are three nodes that represent the
class label.

Then, right-click in the blank space and choose View Competing Splits.


Figure 11
From the table above, we can conclude that the Length1 variable was used for the first
split. Another variable, Width, is used for the next split.

4.2.4 Neural Network
The Neural Network node is used to construct, train, and validate multilayer, feed-
forward neural networks. By default, the Neural Network node automatically
constructs a network that has one hidden layer consisting of three neurons. In
general, each input is fully connected to the first hidden layer, each hidden layer is
fully connected to the next hidden layer, and the last hidden layer is fully
connected to the output. The Neural Network node supports many variations of
this general form. In this project, we click Open at the Neural Network node, and
this is the result:


Figure 12
Misclassification rate in training data set = 0.027
Misclassification rate in test data set = 0.0208
Based on this result, we cannot say that the neural network model is the best, as it
still contains some error.

4.2.5 Regression
After that, we add a Regression node to the workspace and connect the Data
Partition node to the Regression node. The simplest form of regression, linear
regression, uses the formula of a straight line (y = mx + c) and determines the
appropriate values for m and c to predict the value of y based upon a given value
of x. We can use continuous, ordinal, and binary target variables, and we can use
both continuous and discrete input variables. In our data, we set the variable
Category as our target; its measurement level is binary. The function of the
Regression node is to fit both linear and logistic regression models to the data.
Double-click the Regression node and set the regression options.
Variables:

Figure 13
Model Options:

Figure 14
We choose logistic regression because our target variable is binary. Logistic regression
can also be applied to ordered categories (ordinal data), that is, variables with
more than two ordered categories.
Selection Method:


Figure 15
We choose the stepwise selection method. Stepwise selection begins, by default, with
no candidate effects in the model and then systematically adds effects that are
significantly associated with the target. However, after an effect is added to the
model, stepwise selection may remove any effect already in the model that is no
longer significantly associated with the target.
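A simplified, forward-only sketch of this idea using statsmodels (an illustration of stepwise entry, not the SAS Enterprise Miner implementation; the entry threshold, file name, and column names are assumptions):

```python
# A simplified, forward-only sketch of stepwise selection for logistic
# regression: repeatedly add the candidate effect with the smallest p-value
# until none remains below the entry threshold.
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(X, y, entry=0.05):
    selected, candidates = [], list(X.columns)
    while candidates:
        # p-value of each candidate when added to the current model
        pvals = {}
        for c in candidates:
            design = sm.add_constant(X[selected + [c]])
            pvals[c] = sm.Logit(y, design).fit(disp=0).pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= entry:
            break  # no remaining effect is significant at the entry level
        selected.append(best)
        candidates.remove(best)
    return selected

fish = pd.read_csv("fish.csv")  # assumed local export of the data set
X = fish[["Weight", "Length1", "Length2", "Length3", "Height", "Width"]]
y = (fish["Category"] == "Marine").astype(int)
print(forward_stepwise(X, y))
```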
After setting the options, we run the Regression node and analyze the results. Click
the Statistics tab and a results window appears.






Results:

Figure 16
Here we focus on the misclassification rate: for the training data set, the
misclassification rate is 0.018, and for the test data set, the misclassification rate is 0.


Figure 17
This is the result viewer for the Regression node. From this result we can construct
the regression equation. The target is Category = Marine (Y), and the selected
variables are Height (X1), Width (X2), and Length3 (X3).
Regression equation (general logistic form; the fitted coefficients are as reported in Figure 17):
logit(P(Y = Marine)) = b0 + b1(Height) + b2(Width) + b3(Length3)














4.2.6 Assessment
The last step is to add an Assessment node and connect it to the three model
nodes: the Decision Tree node, the Regression node, and the Neural Network node.


Figure 18
Then, click Run at the Assessment node to get the result.


Figure 19



From the result, we can see that the decision tree appears to be the best model so
far, because its misclassification rate on the test data set is 0. Note, however, that
the regression model is not included in this table.
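For comparison, a minimal sketch of assessing the three model types side by side on a common test set (scikit-learn stand-ins trained on synthetic data, so the numbers will not match the report's figures):

```python
# A minimal sketch of comparing models by test-set misclassification rate,
# analogous to the Assessment node (synthetic stand-in data).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=159, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=1)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=1),
    "Neural Network": MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000,
                                    random_state=1),
    "Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, "test misclassification rate:", 1 - model.score(X_te, y_te))
```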



5.0 SUMMARY AND DISCUSSION


Earlier, from the Assessment result, we stated that the decision tree was the best model
because its test misclassification rate was equal to zero. But now we can see that the
regression model also has a zero misclassification rate for the test data.




Figure 21

Model            Training Misclassification Rate    Test Misclassification Rate
Decision Tree    0.036036036                        0
Neural Network   0.027027027                        0.0208333333
Regression       0.018                              0

Figure 20

Figure 21 shows the lift chart for the regression model. From the lift chart, the cumulative %
response is 100% through the 30th percentile. At the 40th percentile, the next observation with
the highest predicted probability is a non-response, so the cumulative % response drops to
91.25%. Thus, regression is the best model.
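A minimal sketch of how a cumulative % response curve is computed: sort the cases by predicted probability, descending, then take the running mean of the actual responses (the scores and responses below are hypothetical):

```python
# A minimal sketch of a cumulative % response calculation by decile:
# each of the 10 observations here stands for one decile of the data.
import numpy as np

y_true = np.array([1, 1, 1, 0, 1, 0, 0, 1, 0, 0])  # actual responses
y_score = np.array([.95, .9, .8, .75, .7, .6, .5, .4, .3, .2])  # predictions

order = np.argsort(-y_score)  # highest predicted probability first
cum_response = np.cumsum(y_true[order]) / np.arange(1, len(y_true) + 1)
for pct, r in zip(range(10, 101, 10), cum_response):
    print(f"{pct}th percentile: cumulative % response = {100 * r:.1f}%")
```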




























REFERENCES


1) Journal of Statistics Education Data Archive (2006), "Fish Catch" data set (1917). Retrieved from http://www.amstat.org/publications/jse/jse_data_archive.html
2) SAS documentation. Retrieved from http://support.sas.com/documentation/cdl/en/stsug/62259/HTML/default/viewer.htm#ugappdatasets_sect8.htm
