
JOURNAL OF COMPUTING, VOLUME 4, ISSUE 3, MARCH 2012, ISSN 2151-9617
https://sites.google.com/site/journalofcomputing / WWW.JOURNALOFCOMPUTING.ORG

Model Selection Criteria as Data Mining Algorithm Selectors: The Selection of Data Mining Algorithms through Model Selection Criteria
Dost Muhammad Khan (1) and Nawaz Mohamudally (2)

(1) Assistant Professor, Department of Computer Science & IT, The Islamia University of Bahawalpur, PAKISTAN, and PhD student, School of Innovative Technologies & Engineering, University of Technology, Mauritius (UTM), MAURITIUS
(2) Associate Professor and Manager, Consultancy & Technology Transfer Centre, University of Technology, Mauritius (UTM), MAURITIUS

Abstract

A selection criterion plays a vital role in choosing the right model for the right dataset. It is a gauge for determining whether a dataset is under-fitted or over-fitted by a model. Both over-fitting and under-fitting are errors: they lead to vague or ambiguous knowledge being extracted from the dataset and therefore need to be addressed properly. Within the discovery process, data is used either to predict future behaviour or to describe patterns in an understandable form. The major issue is how to avoid these problems. There are several approaches for avoiding over- and under-fitting, namely model selection, jittering, weight decay, early stopping and Bayesian estimation. In this paper we discuss only model selection criteria. Furthermore, we focus on how the value of a model selection criterion is used to map the appropriate data mining algorithm to a dataset.

Keywords: AIC, BIC, Overfitting, Underfitting, Model Selection

1. Introduction


The purpose of model selection is to identify the model that best fits the available dataset, with model complexity corrected or penalized. There are two main issues in data mining: the first is bad or wrong data, and the second is controlling the model capacity, making sure it is neither so small that useful and exploitable patterns are missed, nor so large that pattern and noise are confused. The concepts of over-fitting and under-fitting are therefore important in data mining. Over- and under-fitting arise from missing, noisy, inconsistent and redundant values and from the number of attributes in a dataset. These problems can be mitigated by applying upper or lower threshold values, removing attributes below a threshold value, and removing noise and redundant attributes. The best remedy is to use a large training dataset and to make neither too many nor too few assumptions; other possible remedies are model selection, jittering, weight decay, early stopping and Bayesian estimation. Model selection criteria are the subject of this paper.

Several criteria exist for the selection of models, such as the VC (Vapnik-Chervonenkis) dimension, AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), SRM (Structural Risk Minimization) with the VC dimension, CV (cross-validation), the Deviance Information Criterion, the Hannan-Quinn Information Criterion, the Jensen-Shannon divergence, the Kullback-Leibler divergence, and many more. We select only two of them, AIC and BIC. A model is better than another model if it has a smaller AIC or BIC value. Both AIC and BIC have solid theoretical foundations: the Kullback-Leibler distance in information theory for AIC, and the integrated likelihood in Bayesian theory for BIC. If the complexity of the true model does not increase with the size of the dataset, BIC is the preferred criterion; otherwise AIC is preferred.

Since choosing the number of parameters and the number of attributes is itself a model selection problem, one has to take care of these important aspects of a dataset. Using too many parameters can fit the data perfectly, but this can be over-fitting; using too few parameters may not fit the dataset at all, which is under-fitting. This shows the importance of the parameters and the observed data in a given dataset. Variable selection by AIC or BIC provides an answer to this problem, and we illustrate the importance of comparing models with different numbers of parameters using AIC and BIC.

The goal of this paper is to draw a comparison between the two selected model selection criteria, AIC and BIC, and to map the appropriate data mining algorithms to a particular dataset, i.e. the right algorithm for the dataset. The idea of model selection using AIC or BIC has also been applied recently to epidemiology, microarray data analysis, and DNA sequence analysis [21][22][23][24][25].


The rest of the paper is organized as follows: section 2 discusses the model selection criteria AIC and BIC, section 3 describes the methodology, section 4 discusses the results, and the conclusion is drawn in section 5.

2. Model Selection Criteria


A brief introduction to AIC and BIC is given below.

1. Akaike Information Criterion: The AIC is a criterion for model selection, proposed by Hirotugu Akaike in 1974 and grounded in information theory. Suppose that the data is generated by some unknown process f, and consider two candidate models to represent f: g1 and g2. If we knew f, we could find the information lost when using g1 to represent f by calculating the Kullback-Leibler divergence DKL(f, g1); similarly, the information lost when using g2 would be DKL(f, g2). We would then choose the candidate model that minimizes the information loss. The AIC says nothing about how well a model fits the data in an absolute sense: if all the candidate models fit poorly, AIC gives no warning. In other words, AIC trades off the accuracy of the model against its complexity [7][9][10][15][17][18][20]. The AIC is defined below:

$$\mathrm{AIC} = -2\log(\text{likelihood}) + 2k,$$

where $k$ is the number of parameters and $\log(\text{likelihood})$ is the log of the likelihood. The term $-2\log(\text{likelihood})$, whose value gradually approaches 0 as the number of parameters increases (perfect fitting), is also called the model accuracy. Therefore, AIC can be written as:

$$\mathrm{AIC} = \text{No. of parameters} + \text{Model accuracy}.$$
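As a concrete illustration (ours, not from the paper), AIC can be computed directly from this formula; the log-likelihood below is that of a small sample under a Gaussian model with maximum likelihood estimates for its two parameters:

```python
import numpy as np

def aic(log_likelihood: float, k: int) -> float:
    """Akaike Information Criterion: -2*log(likelihood) + 2k."""
    return -2.0 * log_likelihood + 2 * k

# Example: Gaussian log-likelihood of a small sample under MLE estimates.
x = np.array([2.1, 1.9, 2.4, 2.0, 2.2])
mu, sigma = x.mean(), x.std()            # MLE estimates (k = 2 parameters)
ll = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2))
print(aic(ll, k=2))                      # smaller AIC = preferred model
```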


2. Bayesian Information Criterion: The BIC is a criterion for selecting a model from a class of models with different numbers of parameters. When estimating model parameters using maximum likelihood estimation, it is possible to increase the likelihood by adding parameters, which may result in over-fitting. BIC resolves this problem by introducing a penalty term for the number of parameters in the model; this penalty is larger in BIC than in the related AIC. BIC is widely used for model identification in time series and linear regression. The main characteristics of BIC are: it measures the efficiency of the parameterized model in terms of predicting the data; it penalizes the complexity of the model, where complexity refers to the number of parameters in the model; it is exactly equal to the minimum description length criterion but with a negative sign; and it is closely related to other likelihood criteria such as AIC [4][8][12][13][16][19]. The formula for BIC is given below:

$$\mathrm{BIC} = -2\log(\text{likelihood}) + k\log(n),$$

where $k$ is the number of parameters and $n$ is the sample size, i.e. the number of datapoints in the given dataset. The term $-2\log(\text{likelihood})$, whose value gradually approaches 0 as the number of parameters increases, is again the model accuracy, and $k\log(n)$ is the model size. Therefore, BIC can be written as:

$$\mathrm{BIC} = \text{Model size} + \text{Model accuracy}.$$
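BIC differs from AIC only in its penalty term. The hedged sketch below computes both criteria; with the standard definitions, BIC penalizes an extra parameter more heavily than AIC whenever log(n) > 2, i.e. n > e^2 ≈ 7.4, which is why the AIC/BIC gap observed in section 4 widens with the sample size:

```python
import math

def aic(log_likelihood: float, k: int) -> float:
    return -2.0 * log_likelihood + 2 * k

def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian Information Criterion: -2*log(likelihood) + k*log(n)."""
    return -2.0 * log_likelihood + k * math.log(n)

# The log-likelihood cancels in the difference, so the penalty gap
# k*(log(n) - 2) grows with both k and n.
for n in (50, 100, 400):
    for k in (2, 9):
        print(n, k, round(bic(0.0, k, n) - aic(0.0, k), 2))
```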


3. Methodology
Suppose there is a sample $X = \{x_1, x_2, \ldots, x_n\}$, a sequence of $n$ observations coming from a distribution with an unknown probability density function $p(X \mid \theta)$, where $p(X \mid \theta)$ is called a parametric model: all its parameters lie in a finite-dimensional parameter space and are collected together to form a single $m$-dimensional parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_m)$. To use the method of maximum likelihood, one first specifies the joint density function of all observations:

$$p(X \mid \theta) = p(x_1, x_2, \ldots, x_n \mid \theta) = p(x_1 \mid \theta)\, p(x_2 \mid \theta) \cdots p(x_n \mid \theta),$$


where the observed values $x_1, x_2, \ldots, x_n$ are fixed parameters of this function and $\theta$ is the function's variable, allowed to vary freely. From this point of view the distribution function is called the likelihood:

$$\text{likelihood}(\theta \mid x_1, x_2, \ldots, x_n) = p(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$

It is more convenient to work with the logarithm of the likelihood function, called the log-likelihood:

$$\log(\text{likelihood}(\theta \mid x_1, x_2, \ldots, x_n)) = \sum_{i=1}^{n} \log p(x_i \mid \theta),$$

and, when normalized, with the average log-likelihood $\hat{\ell} = \frac{1}{n}\log(\text{likelihood})$.
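For concreteness, here is a minimal sketch (ours, not the paper's code) of maximum likelihood estimation for a Gaussian sample; since scipy's optimizer minimizes, it is applied to the negative log-likelihood:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=200)   # synthetic sample

def neg_log_likelihood(theta):
    mu, log_sigma = theta                       # log-sigma keeps sigma > 0
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                        # close to the true (5.0, 2.0)
print(-result.fun)                              # maximized log-likelihood
```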

The method of maximum likelihood was first proposed by the English statistician and population geneticist R. A. Fisher. It finds the estimate of a parameter that maximizes the probability of observing the data given a specific model for the data; the idea behind maximum likelihood parameter estimation is to determine the parameters that maximize the probability (likelihood) of the sample data. From a statistical point of view, the method of maximum likelihood is considered robust and yields estimators with good statistical properties. It is a flexible method that can be applied to most models and to different types of data. Although the methodology of maximum likelihood estimation is simple, the implementation is mathematically intense [1][2][3][5][6][11][14].

We select a dataset, Cars, which describes different models of car brands from different countries. The number of attributes is 9 and the number of datapoints (records), i.e. the sample size, is 261. The parameters of this dataset are the brands, 30 in total from 3 countries: 14 from the US, 10 from Europe and 6 from Japan. The records from the US make up 62.45% of the dataset, those from Europe 18.00% and those from Japan 19.54%. The brands (parameters) from the US are 46.67% of the total, those from Europe 33.33% and those from Japan 20.00%.

We use the stepwise variable selection method, starting with one variable and then adding or removing a variable if doing so reduces the value of AIC or BIC. Stepwise selection is a locally optimal procedure, so it is tested with different starting sets of parameters to ensure the optimization is not carried to an extreme. The following steps explain the computation of the values of AIC and BIC.

Step 1: Calculate the maximum likelihood of the dataset. The likelihood function is simply the joint probability of observing the data:

$$L(\theta \mid x_1, x_2, \ldots, x_n) = p(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta).$$

Taking the log of this value gives the model accuracy:

$$\text{ModelAccuracy} = -2\log(\text{likelihood}).$$

Step 2: Compute the model size. The formula for the model size is

$$\text{ModelSize} = k \log(n),$$

where $k$ is the number of parameters and $n$ is the number of datapoints.

Step 3: Compute the Minimum Description Length (MDL):

$$\text{MDLScore} = \text{ModelSize} + \text{ModelAccuracy}.$$


The Minimum Description Length (MDL) score is also referred to as BIC (Bayesian Information Criterion). Therefore we can write AIC and BIC as:

$$\mathrm{AIC} = \text{No. of parameters} + \text{ModelAccuracy}, \qquad \mathrm{BIC} = \text{ModelSize} + \text{ModelAccuracy}.$$


The model with the smallest criterion value performs better than the others; in short, smallest is best [1].
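The stepwise variable selection described above can be sketched as follows. This is our illustration rather than the authors' implementation; it assumes an ordinary least squares model, whose fitted results in statsmodels expose an `aic` attribute:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_stepwise_aic(X: pd.DataFrame, y: pd.Series) -> list:
    """Greedily add the variable that lowers AIC; stop when none does."""
    selected, remaining = [], list(X.columns)
    best_aic = sm.OLS(y, np.ones(len(y))).fit().aic   # intercept-only model
    while remaining:
        scores = []
        for col in remaining:
            model = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit()
            scores.append((model.aic, col))
        aic_new, best_col = min(scores)
        if aic_new >= best_aic:                       # no improvement: stop
            break
        best_aic = aic_new
        selected.append(best_col)
        remaining.remove(best_col)
    return selected
```

The same loop works for BIC by swapping the `aic` attribute for `bic`; a backward variant would start from the full variable set and remove one variable at a time.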


4. Results and Discussion


Case 1: We compute the values of AIC and BIC with different sample sizes (small, medium and large, where small is 50, medium is 100 and large is 400 datapoints) and with different numbers of parameters (minimum 2 and maximum 9). In case 1 the number of attributes, i.e. the observed data, is 9. Table 1 shows the values of AIC and BIC for different numbers of parameters when the sample size n is 50.

Table 1. Model selection with n = 50

No. of Parameters | BIC    | AIC
2                 | 15.02  | 13.10
3                 | 32.30  | 29.44
4                 | 54.47  | 50.64
5                 | 76.87  | 72.09
6                 | 100.49 | 94.75
7                 | 124.95 | 118.26
8                 | 150.18 | 142.53
9                 | 176.10 | 167.49

Table 1 shows that, for this small sample size, the values of AIC and BIC increase as the number of parameters increases. At the beginning the gap between the two values is small, but as the number of parameters grows the difference between AIC and BIC also becomes large. For this sample size AIC is the best selection because of its lower value at every number of parameters. The two criteria are plotted against the number of parameters in figure 1 below.
[Figure 1: line graph of the AIC and BIC values against the number of parameters, n = 50]

Figure 1. A comparison of AIC and BIC when the sample size is 50.

The graph in figure 1 shows that as the number of parameters increases, the values of AIC and BIC increase when the sample size is small, i.e. when the given dataset has 50 datapoints. The value of AIC remains below the value of BIC throughout. At the beginning the gap between the two lines is minute; as the number of parameters increases the gap widens, and at the end of the graph it is clearly visible. So AIC is the preferred criterion for this dataset because of its lower value. Table 2 shows the values of AIC and BIC for different numbers of parameters when the sample size n is 100.

Table 2. Model selection with n = 100

No. of Parameters | BIC    | AIC
2                 | 15.71  | 13.10
3                 | 33.29  | 29.38
4                 | 56.50  | 51.29
5                 | 78.83  | 72.32
6                 | 102.16 | 94.35
7                 | 127.67 | 118.55
8                 | 153.18 | 142.76
9                 | 179.07 | 167.35

Table 2 shows that the values of AIC and BIC again increase with the number of parameters. We notice that as the sample size changes from small to medium, the values of AIC and BIC also change. For this sample size AIC is again the best selection because of its lower value at every number of parameters; the widening AIC/BIC gap follows directly from the two penalty terms, as the short derivation below shows. The comparison for n = 100 is plotted in figure 2.
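Using the standard definitions from section 2 (the paper's absolute values appear to be scaled differently, but the pattern is identical), the log-likelihood terms cancel when the two criteria are subtracted, leaving only the penalty terms:

$$\mathrm{BIC} - \mathrm{AIC} = \bigl(-2\log(\text{likelihood}) + k\log n\bigr) - \bigl(-2\log(\text{likelihood}) + 2k\bigr) = k(\log n - 2).$$

For a fixed n the gap grows linearly in k, and for a fixed k it grows with log n. This matches the behaviour across tables 1-3: the AIC/BIC difference widens both with more parameters and with larger samples, and BIC exceeds AIC whenever log n > 2, i.e. n > e^2 ≈ 7.4.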
[Figure 2: line graph of the AIC and BIC values against the number of parameters, n = 100]

Figure 2. A comparison of AIC and BIC when the sample size is 100.

The graph in figure 2 shows that as the number of parameters increases, the values of AIC and BIC increase when the sample size is medium, i.e. when the given dataset has 100 datapoints. The AIC line remains below the BIC line from beginning to end. At the beginning the gap between the two lines is minute; as the number of parameters increases the gap widens, and at the end of the graph it is clearly visible. Another point in this graph is that the gap between the two lines is wider than in figure 1, i.e. as the sample size increases, the difference between the two values increases. Again, AIC is the best selection for this sample size. Table 3 shows the values of AIC and BIC for different numbers of parameters when the sample size n is 400.

Table 3. Model selection with n = 400

No. of Parameters | BIC    | AIC
2                 | 17.06  | 13.09
3                 | 35.33  | 29.37
4                 | 59.06  | 51.12
5                 | 82.92  | 72.99
6                 | 107.33 | 95.42
7                 | 133.37 | 119.48
8                 | 159.05 | 143.16
9                 | 185.17 | 167.30

Table 3 shows that the values of AIC and BIC increase with the number of parameters for the large sample size as well. We observe that as the sample size changes from medium to large, the values of AIC and BIC also change. For this sample size AIC is the best selection because of its lower value at every number of parameters. The two criteria are plotted in figure 3 below.


[Figure 3: line graph of the AIC and BIC values against the number of parameters, n = 400]

Figure 3. A comparison of AIC and BIC when the sample size is 400.

The graph in figure 3 shows that as the number of parameters increases, the values of AIC and BIC increase when the sample size is large, i.e. when the given dataset has 400 datapoints. The AIC line remains below the BIC line from beginning to end. At the beginning the gap between the two lines is minute; as the number of parameters increases the gap widens, and at the end of the graph it is clearly visible. The gap between the two lines is also wider than in figures 1 and 2, i.e. as the sample size increases, the difference between the two values increases. Again, AIC is the best selection for this sample size. We conclude from this case that there is no problem of over-fitting in this dataset.

Case 2: In this case the number of attributes, i.e. the observed data, is the same as in case 1, namely 9. We compute the values of AIC and BIC for the large sample size of 400, but increase the number of parameters, from a minimum of 10 to a maximum of 24. As the number of parameters increases, the values of both selection criteria increase. When the number of parameters reaches 24, the values of AIC and BIC become infinite, which shows that the dataset is over-fitted, even though the number of observed data in this case is not large. The dataset can produce knowledge with up to 23 parameters; that number is high, but the values of the selection criteria remain finite. This also proves that the number of parameters is non-trivial for any dataset: if the user does not take care of the parameters, it is difficult to extract knowledge from the given dataset. Table 4 shows the values of AIC and BIC for different numbers of parameters when the sample size n is 400.

Table 4. Model selection with n = 400

No. of Parameters | BIC      | AIC
10                | 251.18   | 231.32
11                | 290.03   | 268.19
12                | 328.85   | 305.03
13                | 359.69   | 333.88
14                | 399.57   | 371.77
15                | 442.01   | 412.23
16                | 473.04   | 441.27
17                | 517.13   | 483.38
18                | 555.81   | 520.06
19                | 601.18   | 563.46
20                | 647.36   | 607.64
21                | 692.97   | 651.27
22                | 742.68   | 698.99
23                | 786.41   | 740.74
24                | Infinity | Infinity


Table 4 shows that the values of AIC and BIC increase with the number of parameters when the sample size is large, i.e. the given dataset has 400 datapoints. At the beginning the gap between the two values is minute, but as the number of parameters increases the gap widens until it reaches its maximum. The values of AIC and BIC are undetermined from 24 parameters onward. For this sample size AIC is again the best selection. The upper threshold of this model is 23 parameters and the lower threshold is 2, so the usable range lies between 2 and 23; values outside this range create the problem of over- or under-fitting.

Case 3: We take another case in which the number of parameters is small, the observed data is large and the sample size is large. We observe that there is no great difference between the values of AIC and BIC; the choice for this dataset is AIC. The results are shown in table 5 below.

Table 5. Model selection with n = 600

No. of Parameters | BIC    | AIC    | No. of Attributes
5                 | 522.10 | 511.12 | 39

Table 5 shows that, despite the increase in the number of attributes and the sample size, there is no over-fitting problem for this dataset, and the value of AIC remains smaller than that of BIC; therefore AIC is the right choice for this dataset.

The second example of this case has a large number of parameters, a large number of observed data and a medium sample size. We notice that the values of AIC and BIC are infinite, i.e. undetermined, which proves that the dataset is over-fitted even though the sample size is medium. The user has to modify the dataset in order to extract knowledge. The results are shown in table 6 below.

Table 6. Model selection with n = 150

No. of Parameters | BIC      | AIC      | No. of Attributes
20                | Infinity | Infinity | 72

Table 6 shows that with the increase in the number of attributes and the number of parameters, the dataset becomes over-fitted and the values of both AIC and BIC are infinite.

The third example of this case (the DNA dataset) has a small number of parameters, a large number of observed data and a very large sample size. We notice that there is no great difference between the values of AIC and BIC, although the value of AIC is still smaller. Even though the observed data in this dataset is in binary (0, 1) format, the values of both model selection criteria are still computable. The results are shown in table 7 below.

Table 7. Model selection with n = 2000

No. of Parameters | BIC      | AIC     | No. of Attributes
3                 | 648.7343 | 640.333 | 180

The result of all the cases discussed above is that if the number of observed data or the number of parameters is too small, the dataset is under-fitted. Over- and under-fitting are noise in the data which must be removed in order to extract useful knowledge from the dataset. We conclude that the number of parameters and the number of attributes play a vital role in whether a dataset becomes over- or under-fitted; to avoid these problems, the user must prepare the datasets carefully.

Case 4: The model selection criterion AIC is now used to map the appropriate algorithms to a particular dataset in order to extract knowledge. We select three data mining algorithms, K-means, C4.5 and Data Visualization, and five different datasets, Iris, Diabetes, Breastcancer, DNA and Cars, each with a different sample size. We have to select the appropriate algorithms for each dataset, i.e. the best and most suitable algorithm for each. The value of AIC for each dataset is computed and shown in table 8.

Table 8. The value of AIC of the datasets

Dataset      | No. of Parameters | No. of Attributes | Sample Size | Value of AIC
Diabetes     | 2                 | 9                 | 788         | 21.60
Breastcancer | 2                 | 10                | 233         | 24.62
Iris         | 3                 | 5                 | 150         | 26.77
DNA          | 3                 | 180               | 2000        | 922.48
Cars         | 23                | 9                 | 400         | 1054.75

Table 8 shows that the value of AIC for the Diabetes, Breastcancer and Iris datasets is small compared to the DNA and Cars datasets. We also notice that although the number of attributes of the Cars dataset is small, its value of AIC is high, due to its high number of parameters. Similarly, for the DNA dataset the number of parameters is small but the value of AIC is again high, due to the high number of attributes. The conclusion is that if either the number of attributes or the number of parameters is large, the value of AIC will be high, which shows that the dataset is not suitable for knowledge extraction and requires cleansing. The computational and storage complexities of the selected data mining algorithms are shown in table 9.

Table 9. The complexities of the data mining algorithms

Data Mining Algorithm         | Computational Complexity | Storage Complexity
K-means                       | O(nkl)                   | O(n+k)
C4.5                          | O(n·m)                   | O(n)
Data Visualization (2D graph) | O(d·n)                   | O(n)

Table 9 gives the computational and storage complexities of the K-means, C4.5 and Data Visualization data mining algorithms, where n is the sample size, m is the number of attributes, k is the number of clusters, l is the number of iterations and d is the dimension (in our case 2). We take the log of the computational complexities of these algorithms; the value gradually approaches zero as the number of parameters decreases. An example explains the use of the log: if an input dataset containing 10 items takes one second to process, a dataset containing 100 items takes two seconds and a dataset containing 1000 items takes three seconds. This makes the use of the log extremely efficient when dealing with large datasets. There are other well-known uses of taking the log of a value: the log is taken when the transformed data comes closer to satisfying the assumptions of a statistical model; to analyze exponential processes, because the log function is the inverse of the exponential function; to measure the pH or acidity of a chemical solution; to measure the intensity of an earthquake on the Richter scale; and to model many natural processes. Here we use it to match the computational complexity of a data mining algorithm against the value of the model selection criterion AIC, which helps to select the right algorithm for the given dataset.

Table 10. The value of AIC of the Iris dataset and the complexities of the algorithms

Iris / Data Mining Algorithms | Log of Complexity of Algorithm | Value of AIC
K-means                       | 19.39                          | 26.77
C4.5                          | 16.78                          | 26.77
Data Visualization            | 15.46                          | 26.77

Table 10 shows the relationship between the computational complexity of the data mining algorithms and the value of the model selection criterion AIC for the Iris dataset. The value of AIC of the dataset is 26.77; the log of the computational complexity of K-means is 19.39, which is close to the value of AIC. The log of the computational complexity of C4.5 is 16.78, also close to the value of AIC, and similarly the log of the computational complexity of Data Visualization is 15.46, again close to the value of AIC. The result is that the log of the computational complexity of all three data mining algorithms is close to the value of AIC, so these algorithms are the right choice for the Iris dataset. It is clear from the table that K-means is the first choice for the Iris dataset, then C4.5 and finally Data Visualization; a sketch of the log-complexity computation is given below, and figure 4 illustrates the comparison further.
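The paper does not state the log base or the constant factors (such as the iteration count l or the cluster count k) it used to evaluate the complexity expressions, so the sketch below treats those as explicit assumptions; it shows the mechanism only and does not reproduce the figures in table 10 exactly:

```python
import math

# Illustrative constants only: the paper does not publish the log base or
# the values of k (clusters) and l (iterations) behind table 10, so these
# are assumptions chosen for demonstration.
n, m, k, l, d = 150, 5, 3, 1000, 2          # Iris: 150 samples, 5 attributes

complexities = {
    "K-means":            n * k * l,        # O(nkl)
    "C4.5":               n * m,            # O(n·m)
    "Data Visualization": d * n,            # O(d·n)
}
for name, product in complexities.items():
    print(f"{name}: log2 complexity = {math.log2(product):.2f}")
```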


[Figure 4: bar graph comparing the value of AIC with the log complexities of the algorithms for the dataset 'Iris']

Figure 4. The graph between the value of AIC and the complexities of the data mining algorithms.

The graph in figure 4 compares the computational complexity of the data mining algorithms with the value of the model selection criterion AIC for the Iris dataset. The graph shows that the complexity values of the data mining algorithms are close to the value of AIC; therefore these algorithms are the best choice for the Iris dataset. The match between the values is not perfect, but the difference is not huge.

Table 11. The value of AIC of the Breastcancer dataset and the complexities of the algorithms

Breastcancer / Data Mining Algorithms | Log of Complexity of Algorithm | Value of AIC
K-means                               | 20.06                          | 24.62
C4.5                                  | 18.41                          | 24.62
Data Visualization                    | 16.73                          | 24.62

Table 11 relates the computational complexity of the data mining algorithms to the value of the model selection criterion AIC for the Breastcancer dataset. The value of AIC of the dataset is 24.62; the log of the computational complexity of K-means is 20.06, which is close to the value of AIC. The log of the computational complexity of C4.5 is 18.41, also close to the value of AIC, and similarly the log of the computational complexity of Data Visualization is 16.73, again close to the value of AIC. The result is that the log of the computational complexity of all three data mining algorithms is close to the value of AIC, so these algorithms are the right choice for the Breastcancer dataset. It is clear from the table that K-means is the first choice for the Breastcancer dataset, then C4.5 and finally Data Visualization. This is illustrated in figure 5.
[Figure 5: bar graph comparing the value of AIC with the log complexities of the algorithms for the dataset 'Breastcancer']

Figure 5. The graph between the value of AIC and the complexities of the data mining algorithms.

The graph in figure 5 compares the computational complexity of the data mining algorithms with the value of the model selection criterion AIC for the Breastcancer dataset. The graph shows that the complexity values of the data mining algorithms are close to the value of AIC; therefore these algorithms are the best choice for the Breastcancer dataset. The match between the values is not perfect, but the difference is not huge.

Table 12. The value of AIC of the Diabetes dataset and the complexities of the algorithms


Diabetes / Data Mining Algorithms | Log of Complexity of Algorithm | Value of AIC
K-means                           | 23.57                          | 21.60
C4.5                              | 22.41                          | 21.60
Data Visualization                | 20.24                          | 21.60

Table 12 relates the computational complexity of the data mining algorithms to the value of the model selection criterion AIC for the Diabetes dataset. The value of AIC of the dataset is 21.60; the log of the complexity of K-means is 23.57, which is greater than the value of AIC but still close to it. The log of the complexity of C4.5 is 22.41, greater than the value of AIC but not by a large margin, and similarly the log of the complexity of Data Visualization is 20.24, which is almost equal to the value of AIC. The result is that the log of the complexity of all three data mining algorithms is close to the value of AIC, so these algorithms are the right choice for the Diabetes dataset. Figure 6 illustrates this comparison.
[Figure 6: bar graph comparing the value of AIC with the log complexities of the algorithms for the dataset 'Diabetes']

Figure 6. The graph between the value of AIC and the complexities of the data mining algorithms.

The graph in figure 6 compares the computational complexity of the data mining algorithms with the value of the model selection criterion AIC for the Diabetes dataset. The graph shows that the values for two of the algorithms, K-means and C4.5, are greater than the value of AIC, while the value for Data Visualization is less than the value of AIC; the differences are small, however, and the values are very close to each other, so these algorithms are the best choice for the Diabetes dataset. In this case the differences are smaller than in figures 4 and 5, although the value of AIC is exceeded by the K-means and C4.5 complexities.

Table 13. The value of AIC of the DNA dataset and the complexities of the algorithms

DNA / Data Mining Algorithms | Log of Complexity of Algorithm | Value of AIC
K-means                      | 26.84                          | 922.48
C4.5                         | 29.42                          | 922.48
Data Visualization           | 22.93                          | 922.48

Table 13 relates the computational complexity of the data mining algorithms to the value of the model selection criterion AIC for the DNA dataset. The value of AIC of the dataset is 922.48, while the log of the complexity of K-means is 26.84: a huge difference between the two values. The log of the complexity of C4.5 is 29.42, again a big difference, and similarly the log of the complexity of Data Visualization is 22.93, where the difference is also very high. The result is that the log of the complexity of the data mining algorithms is nowhere near the value of AIC, so these algorithms are not suitable for the DNA dataset; in other words, the selected data mining algorithms are not the right choice for this dataset. In order to make the DNA dataset usable, the number of attributes of the dataset must be reduced. This is further illustrated in figure 7.


[Figure 7: bar graph comparing the value of AIC with the log complexities of the algorithms for the dataset 'DNA']

Figure 7. The graph between the value of AIC and the complexities of the data mining algorithms.

The graph in figure 7 compares the computational complexity of the data mining algorithms with the value of the model selection criterion AIC for the DNA dataset. The graph shows a huge difference between the complexity values of the data mining algorithms and the value of AIC; therefore these algorithms are not suitable for the DNA dataset. We can say that in this case there is no comparison between the values, the difference being so large.

Table 14. The value of AIC of the Cars dataset and the complexities of the algorithms

Cars / Data Mining Algorithms | Log of Complexity of Algorithm | Value of AIC
K-means                       | 25.21                          | 1054.75
C4.5                          | 20.46                          | 1054.75
Data Visualization            | 18.29                          | 1054.75

Table 14 relates the computational complexity of the data mining algorithms to the value of the model selection criterion AIC for the Cars dataset. The value of AIC of the dataset is 1054.75, while the log of the complexity of K-means is 25.21, far away from the value of AIC. The log of the complexity of C4.5 is 20.46, a huge difference between the two values, and similarly the log of the complexity of Data Visualization is 18.29, again an enormous gap. The result is that the log of the complexity of the data mining algorithms is nowhere near the value of AIC, so these algorithms are not suitable for the Cars dataset; in other words, the selected data mining algorithms are not the right choice for this dataset. In order to make the Cars dataset usable, the number of parameters of the dataset must be reduced. Figure 8 illustrates this comparison.
[Figure 8: bar graph comparing the value of AIC with the log complexities of the algorithms for the dataset 'Cars']

Figure 8. The graph between the value of AIC and the complexities of the data mining algorithms.

The graph in figure 8 compares the computational complexity of the data mining algorithms with the value of the model selection criterion AIC for the Cars dataset. The graph shows a huge difference between the log complexities of the data mining algorithms and the value of AIC; therefore these algorithms are not suitable for the Cars dataset. We can say that in this case there is no comparison between the values, the difference being so large. The whole selection procedure of this section can be condensed into the short sketch below.
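To summarize the mapping exercise, the following hedged sketch (our own condensation, with the AIC values taken from table 8 and the log-complexities from tables 10-14) applies the 40-percent rule proposed in the conclusion to every dataset/algorithm pair:

```python
# AIC values from table 8 and log-complexities from tables 10-14.
AIC = {"Iris": 26.77, "Breastcancer": 24.62, "Diabetes": 21.60,
       "DNA": 922.48, "Cars": 1054.75}
LOG_CC = {
    "Iris":         {"K-means": 19.39, "C4.5": 16.78, "Data Visualization": 15.46},
    "Breastcancer": {"K-means": 20.06, "C4.5": 18.41, "Data Visualization": 16.73},
    "Diabetes":     {"K-means": 23.57, "C4.5": 22.41, "Data Visualization": 20.24},
    "DNA":          {"K-means": 26.84, "C4.5": 29.42, "Data Visualization": 22.93},
    "Cars":         {"K-means": 25.21, "C4.5": 20.46, "Data Visualization": 18.29},
}

def suitable(aic: float, log_cc: float, threshold: float = 40.0) -> bool:
    """Suitable if the log-complexity is within `threshold` percent of AIC."""
    return abs(aic - log_cc) / aic * 100.0 <= threshold

for ds, aic in AIC.items():
    picks = [alg for alg, cc in LOG_CC[ds].items() if suitable(aic, cc)]
    print(ds, "->", picks if picks else "dataset requires cleansing")
```

With these numbers the rule reproduces the verdicts of this section, except that Data Visualization on Iris falls marginally outside the 40-percent bound (a 42.2% difference), so the threshold is evidently applied approximately rather than strictly.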


5. Conclusion
This paper discusses the non-trivial and important role of the parameters and observed data in a given dataset. If a dataset has very few parameters or a small amount of observed data, it is difficult to produce knowledge from it: the dataset is under-fitted. If a dataset has a large number of parameters or a large amount of observed data, it is again difficult to handle those parameters and produce knowledge: the dataset is over-fitted. Over-fitting and under-fitting are errors in the dataset which show that the dataset has not been properly cleansed. The conclusion is that a middle range, i.e. a number of parameters neither too small nor too large, is required to find knowledge in a dataset.

In order to find out whether a dataset is over-fitted or under-fitted, one has to use a model selection criterion. In this paper we use two such criteria, AIC and BIC, testing them over small, medium and large sample sizes and drawing a comparison. AIC performs better than BIC for all the sample sizes, and as the sample size increases from small to large, the difference between the values of AIC and BIC increases. Therefore we opt for AIC as the selection criterion for our proposed model. Model selection is not hypothesis testing: it does not conclude whether a model is wrong, but explores and ranks alternative models.

We then map the value of the model selection criterion AIC of a dataset to the computational complexities of the data mining algorithms K-means, C4.5 and Data Visualization, which helps to select the appropriate data mining algorithm(s) for a particular dataset. We test the value of AIC for algorithm selection over five datasets: Iris, Breastcancer, Diabetes, DNA and Cars. The number of parameters, the sample size and the number of observed data differ for each dataset. Bar graphs are plotted to compare the value of AIC of these datasets with the computational complexity of the data mining algorithms. The conclusion is that the Iris, Breastcancer and Diabetes datasets are suitable for knowledge extraction and have no over- or under-fitting problem, whereas the DNA and Cars datasets are not suitable for knowledge extraction due to over-fitting. The latter datasets require further cleansing: the number of parameters should be reduced in the Cars dataset and the number of attributes in the DNA dataset. We also recommend a threshold for the selection of a data mining algorithm: if the percentage difference between the value of AIC and the log complexity is 40 or less, the algorithm is suitable for that dataset; otherwise the dataset requires cleansing, i.e. a reduction of the number of parameters or attributes.

In summary, we use the model selection criterion AIC to avoid the over- and under-fitting problems of a dataset, and we map its value to the complexities of data mining algorithms, which helps to select the right algorithm for the right dataset. The results are encouraging and satisfactory. Furthermore, the number of parameters of a dataset can also be used as the number of clusters (the value of k) for the K-means clustering algorithm.
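As a closing illustration of that final recommendation (a sketch under our own assumptions, not the authors' code), the parameter count of a dataset, e.g. 3 for Iris as listed in table 8, can be passed directly to K-means as the number of clusters:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data                     # 150 samples, 4 numeric attributes
n_parameters = 3                         # Iris parameter count from table 8
km = KMeans(n_clusters=n_parameters, n_init=10, random_state=0).fit(X)
print(km.labels_[:10], km.inertia_)      # cluster assignments and fit quality
```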

Acknowledgement
The authors are thankful to The Islamia University of Bahawalpur, Pakistan for providing financial assistance to carry out this research activity under HEC project 6467/F II.

References

[1] URL: http://www.doc.ic.ac.uk/~dfg/ProbabilisticInference/IDAPILecture08.pdf, 2011.
[2] Aldrich, John, "R. A. Fisher and the making of maximum likelihood 1912-1922", Statistical Science 12 (3): 162-176, doi:10.1214/ss/1030037906, MR1617519, 1997.
[3] Andersen, Erling B., "Asymptotic Properties of Conditional Maximum Likelihood Estimators", Journal of the Royal Statistical Society B 32: 283-301, 1970.
[4] Andersen, Erling B., Discrete Statistical Models with Social Science Applications, North Holland, 1980.
[5] Basu, Debabrata, Statistical Information and Likelihood: A Collection of Critical Essays, J. K. Ghosh (ed.), Lecture Notes in Statistics, Volume 45, Springer-Verlag, 1988.
[6] Le Cam, Lucien, "Maximum likelihood: an introduction", ISI Review 58 (2): 153-171, 1990.
[7] Burnham, Kenneth P., and Anderson, David R., Model Selection and Multi-model Inference: A Practical Information-theoretic Approach, 2nd edition, Springer, ISBN 0-387-95364-7, 2002.
[8] Brockwell, P. J., and Davis, R. A., Time Series: Theory and Methods, 2nd edition, Springer, 2009.
[9] Akaike, Hirotugu, "A new look at the statistical model identification", IEEE Transactions on Automatic Control 19 (6): 716-723, doi:10.1109/TAC.1974.1100705, MR0423716, 1974.
[10] Weakliem, David L., "A Critique of the Bayesian Information Criterion for Model Selection", Sociological Methods & Research, vol. 27, no. 3: 359-397, University of Connecticut, 1999.
[11] Cavanaugh, Joseph E., Statistics and Actuarial Science, The University of Iowa, URL: http://myweb.uiowa.edu/cavaaugh/ms_lec_6_ho.pdf, 2009.
[12] Liddle, A. R., "Information criteria for astrophysical model selection", http://xxx.adelaide.edu.au/PS_cache/astro-ph/pdf/0701/0701113v2.pdf
[13] Ernest S. et al., "How to Be a Bayesian in SAS: Model Selection Uncertainty in PROC LOGISTIC and PROC GENMOD", Harvard Medical School, Harvard Pilgrim Health Care, Boston, MA, 2010.
[14] In Jae Myung, "Tutorial on maximum likelihood estimation", Department of Psychology, Ohio State University, USA, Journal of Mathematical Psychology 47: 90-100, 2003.
[15] Guyon, Isabelle, "A practical guide to model selection", ClopiNet, Berkeley, CA 94708, USA, 2010.
[16] Cherkassky, Vladimir, "Comparison of Model Selection Methods for Regression", Dept. of Electrical & Computer Engineering, University of Minnesota, 2010.
[17] Schwarz, G., "Estimating the dimension of a model", Annals of Statistics 6: 461-464, 1978.
[18] Burnham, K. P., and Anderson, D. R., Model Selection and Inference, Springer, 1998.
[19] Parzen, E., Tanabe, K., and Kitagawa, G., Selected Papers of Hirotugu Akaike, Springer, 1998.
[20] Li, W., "DNA segmentation as a model selection process", Proc. RECOMB'01, in press, 2001.
[21] Li, W., "New criteria for segmenting DNA sequences", submitted, 2001.
[22] Li, W., Sherriff, A., and Liu, X., "Assessing risk factors of human complex diseases by Akaike and Bayesian information criteria" (abstract), Am J Hum Genet 67 (Suppl): S222, 2000.
[23] Li, W., and Yang, Y., "How many genes are needed for a discriminant microarray data analysis", Proc. CAMDA'00, in press, 2001.
[24] Li, W., Yang, Y., Edington, J., and Haghighi, F., "Determining the number of genes needed for cancer classification using microarray data", submitted, 2001.
[25] Li, Wentian, and Nyholt, Dale R., "Marker Selection by AIC and BIC", Laboratory of Statistical Genetics, The Rockefeller University, New York, NY, 2010.
