
FACULTY OF AUTOMATION AND COMPUTER SCIENCE COMPUTER SCIENCE DEPARTMENT

ÉVA KOVÁCS

Contributions to the Development of Methods in Data Mining in Large Databases

PHD THESIS Abstract

Scientific advisor Prof. Dr. Ing. IOSIF IGNAT

Cluj-Napoca, 2007

Contents

Chapter 1. Introduction
1.1. Definitions
1.2. Description of the process of data mining
1.3. Preprocessing and transformation of data for data mining
1.4. Operations in data mining
1.5. Clustering methods of the database
1.6. Predictive modeling methods
1.7. Link analysis methods
1.8. Structure of the thesis

Chapter 2. Current State of Data Mining
2.1. Clustering methods
2.2. Classification methods
2.3. Feature selection methods
2.4. Conclusions

Chapter 3. Development of Clustering Methods with Clustering with Prototype Entity Selection
3.1. Clustering with Prototype Entity Selection
3.2. Clustering with Prototype Entity Selection Algorithm
3.3. Optimized CPES using variable radius
3.4. CPES Algorithm with variable radius
3.5. K-means optimized by CPES
3.6. Experimental study
3.7. Results of the experimental study for CPES
3.8. Results of the experimental study for CPES with variable radius
3.9. Results of the experimental study for K-means with CPES
3.10. Analysis of the results
3.11. Conclusions

Chapter 4. Development of Preprocessing Methods by Reduct Equivalent Feature Selection
4.1. Rough Set Theory
4.2. Reduct Equivalent Feature Selection
4.3. Reduct Equivalent Feature Selection Algorithm
4.4. Experimental Study
4.5. Conclusions

Chapter 5. Development of Classification Methods by Reduct Equivalent Rule Induction
5.1. Classification from the perspective of Rough Set Theory
5.2. Reduct Equivalent Rule Induction
5.3. Reduct Equivalent Rule Induction Algorithm
5.4. Experimental Study
5.5. Conclusions

Chapter 6. Final Conclusions
6.1. Future research trends

Bibliography

Annexes

Chapter 1. Introduction

The development of hardware over the last two decades has made it possible to store large amounts of data on computers. The exact volume of these data is impossible to state; all estimates are mere suppositions. Researchers at the University of California, Berkeley have calculated that approximately one exabyte (1 million terabytes) of data is generated and stored each year. Real-time data mining is one of the main research areas in databases, and the volume of databases and their constant growth is an important problem for it. Data mining is a new, still developing discipline that uses the resources and ideas of several fields. In abstract terms, the role of data mining is to discover new and useful information in databases. Data mining techniques are meant to discover models, structures, regularities, etc. in large databases. The models discovered in databases can be characterized according to accuracy, precision, interpretability and expressivity. Knowledge can be defined in several ways, for example:

General term used to describe an object, idea, condition, situation or another fact that can be a number, a letter or a symbol. It can be a chart, an image and/or alphanumeric characters. It suggests elements of information that can be processed or produced by a computer.

Facts, known things, one can draw conclusions from.

There are several definitions of data mining; we mention only a few of them:

“Data mining consists of the extraction of predictive information hidden in a large database.”

“Data mining is the process through which advantageous models are discovered in databases.”

“Data mining is a rapidly growing field that combines the methods of databases, statistics, supervised learning and other related fields in order to extract useful information from existing data.”

“Data mining is the non-trivial process which identifies new, valid, useful and interpretable models in existing data.”

Data mining algorithms can be categorized according to the representation of models, the input data and the field in which the algorithm is used. The model can be represented by decision trees, regression functions, associations or others. The most common classification of data mining algorithms divides them into four operations: predictive modeling, database clustering, link analysis and deviation detection [6][7][8][9].

Data mining is an iterative process that contains several steps, beginning with the understanding and definition of the problem and ending with the analysis of the results and the application of a strategy for the use of the results [14]. A data mining process is illustrated in Fig. 1.


Fig. 1. The data mining process


Chapter 2. Current State of Data Mining

The second chapter describes the present state of research on data mining and the most recent research in the field. The algorithms of the main methods in data mining are then presented, namely those of database clustering, predictive modeling and feature selection.

Among the analyzed database clustering methods, the partitioning methods mentioned are K-means, K-modes, K-medoid, CLARA (Clustering LARge Applications) and CLARANS (Clustering Large Applications based upon RANdomized Search). The K-means algorithm is one of the best-known clustering algorithms used in database clustering. Its disadvantage is that the user must specify the number of clusters. The method is also inadequate for finding clusters of nonconvex shape or of different sizes, and it is sensitive to noise, which influences the mean value of a cluster. The K-medoid algorithm works efficiently with a small data set but is unable to manage a large one; CLARA is used instead of K-medoid in order to manage a large data set. Of the other analyzed clustering methods, SLINK (Single LINKage) and BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) belong to the category of hierarchical methods, and DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and OPTICS (Ordering Points To Identify the Clustering Structure) to the category of density-based methods.

Among the analyzed predictive modeling methods, from the category of algorithms based on decision trees we mention ID3, C4.5, CHAID (Chi-square Automatic Interaction Detection), CART (Classification And Regression Trees) and QUEST (Quick, Unbiased and Efficient Statistical Tree). The most popular neural network classification algorithm used in predictive modeling is the backpropagation algorithm, which learns on a multilayer feed-forward network. Neural networks are more efficient than decision trees owing to the adjustment that takes place during learning. A deficiency of neural networks is that they accept only numeric input, so categorical data must be recoded. Bayesian classifiers are statistical classifiers and also belong to predictive modeling. They have demonstrated high precision and speed and are used for large databases. In practice, however, discrepancies appear, for example because of the assumption that attributes are independent of one another.

The majority of feature selection methods belong to supervised learning. There are two types of feature selection algorithms: filter and wrapper algorithms. The RELIEF algorithm and its extensions are representative of the filter type. There are also feature selection methods that successfully use Rough Set Theory.

Chapter 3. Development of Clustering Methods with Clustering with Prototype Entity Selection

This chapter presents three original clustering methods: Clustering with Prototype Entity Selection (CPES) [16][19], Clustering with Prototype Entity Selection with variable radius [17] and K-means with Clustering with Prototype Entity Selection [18]. The methods are presented first, then the experimental studies, followed by a comparison of the results with other clustering methods from the specialized literature. The CPES method is used in data mining both as an independent clustering process and in combination with K-means, a clustering algorithm often used in data mining.


Clustering with Prototype Entity Selection Algorithm

We propose the use of the CPES method as a clustering method for data mining. When using this method the user does not need to specify the number of clusters; the algorithm determines this number itself. The method also ensures optimal clustering. It is well suited to a data mining process, because it is very important for the user that the method needs only a few input data. In short, the algorithm is as follows:

1. Initialization of constants $d_{av}$, $A$, $r$ and fitness function $f$

2. Generation of clusters: $cluster(x_i) = x_i$, for $i = 1, \dots, n$

3. repeat

   3.1. For each $cluster(x_i)$ a pair $x_j$ is selected

   3.2. if $f(cluster(x_i)) < f(x_j)$ then set $cluster(x_i) = x_j$

   until there are no more changes in the clusters

Then the steps of the algorithm are described:

1. Constants $d_{av}$, $A$, $r$ are initialized and the values of fitness function $f$ are calculated for each object of the data set, using:

\[ f(x_i) = \sum_{k=1}^{n} \frac{1}{d(x_i, x_k) + A} \tag{1} \]

\[ d_{av} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} d(x_i, x_j)}{n(n-1)} \tag{2} \]

\[ A = \frac{d_{av}}{n} \tag{3} \]

\[ r = \frac{d_{av}}{2} \tag{4} \]

2. After initialization n clusters are defined; each object will have its own cluster, namely $cluster(x_i) = x_i$ for each $i = 1, \dots, n$. Here $cluster(x_i)$ is the cluster $x_i$ belongs to. In this step the value of $cluster(x_i)$ is just $x_i$ and there are n clusters with different values.

3. The cycle is repeated until the stop condition is met, i.e. until there are no more changes in the values of $cluster(x_i)$ for each $i = 1, \dots, n$ from one step to the other.

3.1. For each $cluster(x_i)$ a pair is selected. This pair is chosen from the objects $cluster(x_j)$ so that $i \neq j$ and object $cluster(x_j)$ is in the radius r of $cluster(x_i)$, i.e.

\[ cluster(x_j) \in V(cluster(x_i), r) \tag{5} \]

For the selection, proportional selection is used and the chosen pair is marked with $cluster(x_p)$.

3.2. The values of function $f$ are compared for the objects $cluster(x_i)$ and $cluster(x_p)$, and the object that has the highest value of fitness function $f$ is chosen. If the condition $f(cluster(x_i)) < f(cluster(x_p))$ is true, then the values of $cluster(x_i)$ are set with the new value $cluster(x_p)$. If the condition is not true, then the value of $cluster(x_i)$ remains unchanged.

In each step of the algorithm, the number of different clusters will diminish. In the end there will be only m distinct values in $cluster(x_i)$. These distinct values can be considered prototype entities for their clusters. They are marked with $c_j$, $j = 1, \dots, m$, and the objects $x_i$ for which $cluster(x_i) = c_j$ belong to cluster $c_j$. The number of clusters thus obtained is equal to m, and for each object the cluster it belongs to is determined. With the CPES method a clean clustering is obtained with an optimal number of clusters for the input data. If the analyst is not satisfied with the accuracy of the clusters, another clustering algorithm can be performed starting from the results already obtained with the CPES method.
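The steps of the CPES algorithm can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the Euclidean metric, the name `cpes`, the `max_iter` and `seed` parameters, the reading of the vicinity V as "distance at most r" and the reading of proportional selection as a fitness-weighted random choice are all assumptions. Objects are identified by their index, so `cluster[i]` holds the index of the prototype of object i, and only object i is relabelled when the comparison in step 3.2 succeeds.

```python
import math
import random


def cpes(points, max_iter=1000, seed=0):
    """Sketch of CPES; returns cluster[i] = index of the prototype of object i.

    Assumes at least two objects that are not all identical.
    """
    rng = random.Random(seed)
    n = len(points)
    # Distance matrix (Euclidean metric is an assumption; the text only names d).
    d = [[math.dist(p, q) for q in points] for p in points]

    d_av = sum(map(sum, d)) / (n * (n - 1))  # Eq. (2): average distance
    A = d_av / n                             # Eq. (3)
    r = d_av / 2                             # Eq. (4): fixed radius
    # Eq. (1): fitness of every object
    f = [sum(1.0 / (d[i][k] + A) for k in range(n)) for i in range(n)]

    cluster = list(range(n))                 # step 2: cluster(x_i) = x_i
    for _ in range(max_iter):                # step 3: repeat
        changed = False
        for i in range(n):
            ci = cluster[i]
            # Step 3.1, condition (5): prototypes of other objects that lie
            # within radius r of cluster(x_i). Skipping ci itself is a small
            # shortcut, since picking it could never change anything.
            candidates = sorted({cluster[j] for j in range(n)
                                 if j != i and cluster[j] != ci
                                 and d[ci][cluster[j]] <= r})
            if not candidates:
                continue
            # Proportional selection, read here as fitness-weighted choice.
            weights = [f[c] for c in candidates]
            p = rng.choices(candidates, weights=weights, k=1)[0]
            if f[ci] < f[p]:                 # step 3.2
                cluster[i] = p
                changed = True
        if not changed:                      # no changes in a full pass
            break
    return cluster


# Two well-separated groups of three points each (hypothetical data).
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
labels = cpes(points)
```

Because the partner is drawn at random, a pass without changes can occur before every group has collapsed to a single prototype, so the number of distinct values in `labels` may exceed the number of visually obvious groups. In this example the radius is smaller than the gap between the two groups, so a prototype is always taken from the object's own group.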

Optimized CPES Using Variable Radius

This section presents an improvement of the original clustering method, Clustering with Prototype Entity Selection: a radius with variable value is used to optimize the CPES method. The CPES method starts with an initialization phase in which radius r is calculated. During the CPES algorithm, radius r remains constant and the partner of object $x_i$ is chosen from the vicinity $V(x_i, r)$ of $x_i$. As the pair of object $x_i$ is chosen from the vicinity $V(x_i, r)$, there will be no massive migration among clusters. Still, there is some migration among clusters, and therefore the CPES algorithm clusters some of the objects badly.

In order to optimize the CPES algorithm, instead of radius r we use a radius with variable value. At the beginning the radius is set to the average distance divided by 4, $r = r_1 = d_{av}/4$. After each step of the algorithm that is performed, the value of the radius increases, so the migration among clusters will be smaller than in the case of a radius with fixed value.

An update scheme is used to increase the value of the radius, defined by the following equation:

\[ r_{k+1} = \frac{r_k}{\alpha} \tag{6} \]

where α is a positive constant smaller than 1.
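The growth of the radius can be illustrated with a few lines of Python. The function name `radius_schedule` and its `steps` parameter are hypothetical; the sketch only lists the successive radii, starting from the average distance divided by 4 and dividing by α after each step.

```python
def radius_schedule(d_av, alpha, steps):
    """Return [r_1, ..., r_steps] with r_1 = d_av / 4 and r_{k+1} = r_k / alpha."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must be a positive constant smaller than 1")
    radii = [d_av / 4]                    # initial radius r_1
    for _ in range(steps - 1):
        radii.append(radii[-1] / alpha)   # Eq. (6): the radius grows each step
    return radii


# With d_av = 8 and alpha = 0.5 the radius doubles after every step.
print(radius_schedule(8.0, 0.5, 4))  # [2.0, 4.0, 8.0, 16.0]
```

Since α is smaller than 1, dividing by it enlarges the radius, which matches the statement that the radius increases after each step.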

K-means Optimized by CPES


This section shows how the K-means algorithm is improved by using Clustering with Prototype Entity Selection (CPES), so that the user does not need to supply the number of clusters. The method was developed to be used in data mining. The section presents the advantages of the CPES algorithm in boosting the performance of the K-means method, and the experimental results obtained as compared to the K-means method.


By using the CPES method we find both the optimal number of clusters, m, in the data set proposed for data mining and the prototype entities $c_j$, $j = 1, \dots, m$. These data will be used as input data for K-means.

The K-means method will not start with the arbitrary selection of a representative object for each cluster; instead it will use the prototype entities $c_j$, $j = 1, \dots, m$, obtained from the CPES algorithm. These prototype entities $c_j$ represent the means, the centres of the clusters. As they are local maxima, there will be no massive migration among the clusters formed during the first step, and the K-means algorithm will need a short running time. Because the number of clusters is generated by the CPES method, and not specified by the user, an optimal number of clusters will be generated. From here K-means is carried out as usual: each object is assigned to the cluster it resembles most, the mean of each cluster is recalculated, and each object is then reassigned to the clusters according to the new cluster means. Consequently the K-means algorithm is improved by CPES in two ways:

The user does not need to state the number of clusters, as it will be generated by CPES.

The algorithm does not need to choose the first representative objects at random; they will be the prototype entities generated by CPES.

The described optimization was presented within the framework of the K-means method. Nevertheless it can also be applied to other clustering methods that need the number of clusters as input data, for example K-medoid.
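A minimal Python sketch of this combination follows. It assumes that the prototype entities are already available as coordinate tuples, for example from a CPES run, and that distance is Euclidean; the name `kmeans_from_prototypes` and the `max_iter` parameter are hypothetical. Apart from the initial centres, it is ordinary K-means.

```python
import math


def kmeans_from_prototypes(points, prototypes, max_iter=100):
    """K-means whose initial centres are given prototypes (e.g. from CPES)."""
    centers = [tuple(p) for p in prototypes]   # m = len(prototypes) clusters
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the cluster whose centre it is closest to.
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:       # no migration: converged
            break
        assignment = new_assignment
        # Recalculate the mean of each non-empty cluster.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment


# Hypothetical data: two groups, seeded with one prototype from each group.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centers, assignment = kmeans_from_prototypes(points, [(0, 0), (10, 10)])
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

In this example the loop stops on the second pass because the first assignment is already stable, which illustrates the short running time expected when the seeds are good prototypes.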

Chapter 4. Development of Preprocessing Methods by Reduct Equivalent Feature Selection

A major problem in data mining is the high dimensionality of data. It is therefore very important that, during the preprocessing of data for data mining, irrelevant attributes be eliminated without modifying the consistency of the original data set. Feature selection is the process of searching for an optimal subset of attributes that satisfies certain criteria. This chapter presents Reduct Equivalent Feature Selection (REFS) [20], a new heuristic method of feature selection based on Rough Set Theory. It describes the mathematical essence of the method and presents the algorithm and the experimental results obtained.

Reduct Equivalent Feature Selection

Rough Set Theory is successfully used in data mining, more precisely in prediction, in the discovery of associations and in the reduction of attributes. With the motto “Let data speak for themselves”, this approach is not invasive for the processed data set, as it uses only the information in the data set without presuming the existence of different models in it [28][30]. The reduction of attributes using Rough Set Theory means finding a reduct set within the original set of attributes. Data mining algorithms then run not on the original set of attributes but on this reduct set, which is equivalent to the original set. Two original theorems were devised, Theorem 1 and Theorem 2, both developed to find a reduct quickly. Theorem 2 is used to find a reduct set for a data set; the algorithm that uses this theorem is called Reduct Equivalent Feature Selection, a new heuristic method of feature selection presented in what follows.


Reduct Equivalent Feature Selection Algorithm

Reduct Equivalent Feature Selection is a new heuristic method of feature selection based on Rough Set Theory. It uses forward selection: it begins with an empty set of attributes, to which a new attribute is added in each step of the algorithm until a reduct set of attributes is obtained. In each step the method generates a new set of attributes and evaluates it, i.e. verifies whether the generated set is a reduct set. If it is not, the generation is resumed and another element is added to the proposed set of attributes, which is then evaluated again. If the proposed set of attributes is a reduct set, it is validated and can from then on be used as a set equivalent to the original set of attributes.

The input data of the algorithm are the data set $U_0 = U$, the condition attributes $C = R_0$ and the decision attributes D. The method results in a reduct set R of the condition attributes C, taking into consideration the decision attributes D.

1. set $R = \emptyset$, $i = 0$, $U_0 = U$, $R_0 = C$

2. repeat

   2.1. repeat

        if $CORE(R_i) = \emptyset$ then set $R_i = R_i - \{a\}$, $a \in R_i$

        else set $R = R \cup CORE(R_i)$

        until $CORE(R_i) \neq \emptyset$

   2.2. set $i = i + 1$

   2.3. set $R_i = R_{i-1} - CORE(R_{i-1})$

   2.4. set $U'_i = \pi_{R_i}(U_{i-1})$

   2.5. set $U_i^{inc} = \sigma_{f(x,R_i) = f(y,R_i) \,\wedge\, f(x,D) \neq f(y,D)}(U'_i)$

   2.6. set $U_i = U'_i - U_i^{inc}$

   until $POS_R(D) = POS_C(D)$

In the first step of the algorithm the reduct set R is initialized with an empty set of attributes and the variable i with 0.

The second step is repeated until $POS_R(D) = POS_C(D)$, i.e. until the set R of attributes is a reduct. In this step the core of the set of attributes $R_i$ is calculated. If the core of $R_i$ is an empty set, it means that any attribute a can be removed from $R_i$ without modifying the indiscernibility relation: there is no attribute a in $R_i$ that would be indispensable, $POS_C(D) = POS_{C - \{a\}}(D)$ for any $a \in R_i$. Thus an attribute is removed from $R_i$ and the core of the new $R_i$ is searched for until the core is no longer an empty set. This core is added to set R. Then i is increased by 1 and the new $R_i$ and $U_i$ are calculated for the next step of the cycle.
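The rough set notions the algorithm relies on, the positive region, the core and the stop test $POS_R(D) = POS_C(D)$, can be sketched in Python using their standard textbook definitions. This is not the REFS algorithm itself, whose Theorem 2 is not reproduced in this abstract; the function names, the dictionary-based rows, the single decision attribute and the small data set are assumptions made for illustration.

```python
from collections import defaultdict


def positive_region(rows, attrs, decision):
    """POS_attrs(D): indices of rows whose indiscernibility class on `attrs`
    contains a single decision value."""
    blocks = defaultdict(list)
    for idx, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(idx)
    pos = set()
    for block in blocks.values():
        if len({rows[i][decision] for i in block}) == 1:
            pos.update(block)
    return pos


def core(rows, attrs, decision):
    """Attributes whose removal changes the positive region (indispensable)."""
    full = positive_region(rows, attrs, decision)
    return {a for a in attrs
            if positive_region(rows, [b for b in attrs if b != a], decision) != full}


def preserves_positive_region(rows, subset, attrs, decision):
    """Stop test of the outer loop: POS_subset(D) == POS_attrs(D)."""
    return (positive_region(rows, list(subset), decision)
            == positive_region(rows, attrs, decision))


# Hypothetical decision table: the decision d depends on attribute b only.
rows = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 0, "d": "yes"},
    {"a": 1, "b": 0, "c": 1, "d": "no"},
    {"a": 1, "b": 1, "c": 1, "d": "yes"},
]
C = ["a", "b", "c"]
print(core(rows, C, "d"))                              # {'b'}
print(preserves_positive_region(rows, ["b"], C, "d"))  # True
```

Here the core alone already preserves the positive region, so the set {b} passes the stop test and can stand in for the full attribute set C.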


Chapter 5. Development of Classification Methods by Reduct Equivalent Rule Induction

Rough Set Theory is successfully used in data mining for prediction, classification [1][3][22][23][29], discovery of associations [11] and reduction of attributes [27][32]. This chapter presents original contributions to the classification of the objects of a database using the elements of Rough Set Theory. A new classification method based on Rough Set Theory is presented: Reduct Equivalent Rule Induction (RERI). The chapter describes the mathematical essence of the method, as well as the algorithm and the experimental results obtained, and shows how Rough Set Theory was used to develop it. The reduct sets defined in Rough Set Theory produce minimal classification rules among the attributes of the analyzed data set. A classification rule determined by a reduct contains a minimal number of attributes in its condition part, so that the set of classification rules defines the decision classes.

Reduct Equivalent Rule Induction

We look for classification rules for a data set U. If all the attributes of the set are taken into consideration when generating classification rules, too many rules will appear. They will also be too detailed and incapable of predicting the class to which a new entry belongs. If too few attributes are taken into consideration to characterize the whole data set through classification rules, the rules thus obtained will be too general, and they too will be unable to predict correctly the class to which a new entry belongs.

We propose the use of a reduct set to generate classification rules, which can then be used to classify a new record. As a reduct is a minimal set of attributes that maintains the positive region of the decision classes U/IND(D) with respect to the condition attributes in C, the quality of the classification rules obtained remains uncompromised. Reducing the number of attributes used in the condition part of the rules still ensures a precise classification, because only redundant or irrelevant attributes are eliminated.

This chapter describes a new, original method for classifying the data of a data set U, namely Reduct Equivalent Rule Induction. The method uses the Reduct Equivalent Feature Selection algorithm described in Chapter 4 to generate the reduct set, which maintains the positive region of the decision classes U/IND(D) with respect to the condition attributes in C. From the reduct set, the Reduct Equivalent Rule Induction method generates the minimal classification rules that will be used to classify a new entry.
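The idea of generating classification rules from a reduct and applying them to a new entry can be sketched in Python as follows. The abstract does not give the details of RERI, so this is only an illustration of rule generation from a given reduct: the names `induce_rules` and `classify`, the rule representation as a dictionary and the sample data are assumptions.

```python
def induce_rules(rows, reduct, decision):
    """One rule per combination of reduct values that leads to a single
    decision value: {(values on reduct): decision value}."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in reduct)
        seen.setdefault(key, set()).add(row[decision])
    # Keep only consistent combinations (objects in the positive region).
    return {key: next(iter(vals)) for key, vals in seen.items() if len(vals) == 1}


def classify(rules, reduct, record, default=None):
    """Classify a new record by matching its reduct values against the rules."""
    return rules.get(tuple(record[a] for a in reduct), default)


# Hypothetical decision table whose reduct is the single attribute b.
table = [
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 0, "b": 1, "c": 0, "d": "yes"},
    {"a": 1, "b": 0, "c": 1, "d": "no"},
    {"a": 1, "b": 1, "c": 1, "d": "yes"},
]
rules = induce_rules(table, ["b"], "d")
print(rules)                                              # {(0,): 'no', (1,): 'yes'}
print(classify(rules, ["b"], {"a": 1, "b": 1, "c": 0}))   # yes
```

Using all three attributes would give four rules, and the new record (a=1, b=1, c=0) would match none of them; with the reduct only two rules remain and the record is classified, which is the trade-off between overly detailed and overly general rules described above.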



Chapter 6. Final Conclusions

The general objective of this thesis was the development of data mining methods for large databases. Data mining techniques are used to discover models, structures, regularities, etc. in very large databases. In order to accomplish the objectives of this thesis, new and original methods useful for data mining have been developed.

Chapter 2 describes the present state of research in data mining. It presents the main data mining methods, as well as the most recent research in this field, and the algorithms of the main methods in data mining, namely those of database clustering, predictive modeling and link analysis.

Chapter 3 presents original contributions to the development of data mining methods. Three new clustering methods are presented: Clustering with Prototype Entity Selection (CPES), Clustering with Prototype Entity Selection with variable radius and K-means with Clustering with Prototype Entity Selection. These methods were developed to be used in data clustering. Their main advantage is that they do not need the number of clusters as input data: the algorithms find the optimal number of clusters by themselves, then assign all the objects to these clusters. The methods can be used easily by the analyst and give results as good as or better than those of methods that need complicated parameter settings. Chapter 3 contains the following personal contributions:

1. I proposed the use of certain elements of genetic algorithms, such as fitness function and proportional selection to create a new clustering algorithm for data mining.

2. I proposed a new algorithm Clustering with Prototype Entity Selection to be used in data clustering, an algorithm that does not need as input data the number of clusters.

3. I proposed the use of a variable radius to optimize Clustering with Prototype Entity Selection, formulating the new algorithm as follows: Clustering with Prototype Entity Selection with variable radius.

4. I optimized K-means method by using Clustering with Prototype Entity Selection algorithm, formulating thus the new algorithm, K-means with Clustering with Prototype Entity Selection. In order to optimize K-means method I used the number of clusters and prototype entities generated by CPES.

5. The presented optimization in the context of K-means method can be applied to other methods of clustering that need as input data the number of clusters, for example in K-medoid method.

6. I implemented the three proposed methods: Clustering with Prototype Entity Selection, Clustering with Prototype Entity Selection with variable radius and K-means with Clustering with Prototype Entity Selection.

7. I compared the results of the methods proposed in Chapter 3 with other clustering algorithms used in data mining, namely the K-means and SLINK algorithms.

Chapter 4 presents original contributions to dimensionality reduction in the preprocessing of data for data mining. It describes Reduct Equivalent Feature Selection, a new, original, heuristic method of feature selection based on Rough Set Theory. This method eliminates irrelevant attributes without modifying the consistency of the original data set. Chapter 4 contains the following personal contributions:

1. I devised a new theorem that offers an original calculation method of a reduct set, Theorem 1.

2. Generalizing Theorem 1, I devised Theorem 2, which is also a new method for calculating a reduct set.

3. Using Theorem 2, I proposed a new algorithm: Reduct Equivalent Feature Selection to be used in feature selection.

4. I implemented the proposed method, Reduct Equivalent Feature Selection.

5. I tested the method proposed in Chapter 4 on 10 different data sets. These data sets were created and used by several researchers, are reference examples for data mining and belong to the UCI collection.


Chapter 5 presents original contributions to the classification of the objects of a database using elements of Rough Set Theory. A new classification method based on Rough Set Theory is presented, Reduct Equivalent Rule Induction (RERI). Chapter 5 contains the following personal contributions:

1. Using Theorem 2 described in Chapter 4, I proposed a new algorithm of classification, Reduct Equivalent Rule Induction.

2. I implemented the proposed method, Reduct Equivalent Rule Induction.

3. I tested the method proposed in Chapter 5 on 10 different data sets. These data sets were created and used by several researchers, are reference examples for classification and data mining and belong to the UCI collection.

4. I compared the accuracy of the new algorithm RERI with other classification methods from the specialized literature, CHAID and QUEST.

Data mining is a field that can be developed in many directions. The data mining process should be as automatic as possible, without excessive intervention by the analyst. To achieve the highest possible degree of automation, the system must be capable of preprocessing the data correctly and optimally, so that data mining algorithms can extract useful information from it. On the basis of the study carried out, drawing both on the specialized literature and on the results obtained in this thesis, I propose some new research directions:

Optimization of clustering methods that need the number of clusters as input data.

Use of Rough Set Theory in data mining for the discovery of associations.

Use of Rough Set Theory in reducing the dimension of the data set proposed for data mining.

Selected references

[1] Abidi, S. S. R., Hoe, K. M., Goh, A., Analyzing Data Clusters: A Rough Set Approach to Extract Cluster Defining Symbolic Rules, Lecture Notes in Computer Science 2189: Advances in Intelligent Data Analysis, Fourth International Conference of Intelligent Data Analysis, Cascais, Portugal, Springer Verlag, pages 248-257, 2001.

[2] Blake, C. L., Merz, C. J., UCI Repository of machine learning databases, Irvine, CA: University of California, Department of Information and Computer Science, 1998.

[3] Deogun, J. S., Raghavan, V. V., Sever, H., Rough Set Based Classification Methods and Extended Decision Tables, Third International Workshop on Rough Sets and Soft Computing, The Society for Computer Simulation, pages 302-306, 1995.

[4] Elkan, C., Using the Triangle Inequality to Accelerate k-Means, Proceedings of the Twentieth International Conference on Machine Learning (ICML'03), pages 147-153, 2003.

[5] Farnstrom, F., Lewis, J., Elkan, C., Scalability for Clustering Algorithms Revisited, ACM SIGKDD Explorations, Volume 2, Issue 1, pages 51-57, August 2000.

[6] Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., From Data Mining to Knowledge Discovery: An Overview, Advances in Knowledge Discovery and Data Mining, pages 1-43, AAAI Press/The MIT Press, 1996.

[7] Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R., Advances in Knowledge Discovery and Data Mining, AAAI Press/The MIT Press, 1996.

[8] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., From Data Mining to Knowledge Discovery in Databases, AI Magazine 17, pages 37-54, 1996.

[9] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., The KDD Process for Extracting Useful Knowledge from Volumes of Data, Communications of the ACM, Volume 39, No. 11, November 1996.

[10] Fisher, R. A., The use of multiple measurements in taxonomic problems, Annals of Eugenics, 7, Part II, pages 179-188, 1936.

[11] Guan, J. W., Bell, D. A., Liu, D. Y., The Rough Set Approach to Association Rule Mining, Proceedings of the Third IEEE International Conference on Data Mining, pages 529-532, 2003.

[12] Hegland, M., Data Mining - Challenges, Models, Methods and Algorithms, May 2003.

[13] Holte, R. C., Very Simple Classification Rules Perform Well on Most Commonly Used Datasets, Machine Learning, Vol. 11, pages 63-90, 1993.

[14] John, G. H., Enhancements to the Data Mining Process, PhD thesis, Stanford University, 1997.

[15] Kanungo, T., Mount, D. M., Netanyahu, N. S., Piatko, C. D., Silverman, R., Wu, A. Y., An Efficient k-Means Clustering Algorithm: Analysis and Implementation, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, pages 881-892, July 2002.

[16] Kovács, É., Ignat, I., Clustering With Prototype Entity Selection Compared With K-Means, accepted by The Journal of Control Engineering and Applied Informatics, March 2007.

[17] Kovács, É., Ignat, I., Clustering with Prototype Entity Selection in Data Mining, Proceedings of the IEEE International Conference on Automation, Quality and Testing, Robotics, Cluj-Napoca, Romania, pages 415-419, 25-28 May 2006, ISBN 1-4244-0360-X.

[18] Kovács, É., Ignat, I., Clustering with Prototype Entity Selection with variable radius, Proceedings of the 2nd IEEE International Conference on Intelligent Computer Communication and Processing, Cluj-Napoca, Romania, pages 17-21, 1-2 September 2006, ISBN (10) 973-662-233-9.

[19] Kovács, É., Ignat, I., Clustering with Prototype Entity Selection to Improve K-Means, Advances in Intelligent Systems and Technologies, Selected Papers of the 4th European Conference on Intelligent Systems and Technologies, Iasi, Romania, pages 37-47, 20-23 September 2006, ISBN (10) 973-730-246-X.

[20] Kovács, É., Ignat, I., Reduct Equivalent Feature Selection Based On Rough Set Theory, Proceedings of the microCAD International Scientific Conference, Miskolc, Hungary, pages 47-52, 22-23 March 2007, ISBN 978-963-661-742-4.

[21] Kovács, É., Ignat, I., Reduct Equivalent Rule Induction Based On Rough Set Theory, in print.

[22] Lingras, P., Rough Set Clustering for Web Mining, Proceedings of 2002 IEEE International Conference on Fuzzy Systems, pages 12-17, Honolulu, Hawaii, May 12-17, 2002.

[23] Lingras, P., Yan, R., West, C., Comparison of conventional and rough k-means clustering, Proceedings of Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, 9th International Conference, RSFDGrC 2003, Chongqing, China, pages 130-137, May 26-29, 2003.

[24] MacQueen, J., Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297, Berkeley, California, University of California Press, 1967.

[25] Magidson, J., Vermunt, J. K., Latent class modeling as a probabilistic extension of K-means clustering, Quirk's Marketing Research Review, pages 77-80, March 2002.

[26] Magidson, J., Vermunt, J. K., Latent class models for clustering: A comparison with K-means, Canadian Journal of Marketing Research, pages 36-43, 2002.

[27] Mazlack, L., He, A., Zhu, Y., Coppock, S., A Rough Set Approach In Choosing Partitioning Attributes, Proceedings of the ISCA 13th International Conference (CAINE-2000), November 2000.

[28] Pawlak, Z., On Rough Sets, Bulletin of the European Association for Theoretical Computer Science 24, pages 94-109, 1984.

[29] Pawlak, Z., Rough Classification, International Journal of Man-Machine Studies 20, pages 469-483, 1984.

[30] Raghavan, V. V., Sever, H., The State of Rough Sets for Database Mining Applications, Proceedings of 23rd Computer Science Conference Workshop on Rough Sets and Database Mining, pages 1-11, San Jose, CA, 1995.

[31] Ultsch, A., Clustering with SOM: U*C, Proceedings of the Workshop on Self-Organizing Maps, Paris, France, pages 75-82, 2005.

[32] Zhang, M., Yao, J. T., A Rough Sets Based Approach to Feature Selection, Proceedings of the 23rd International Conference of NAFIPS, Banff, Canada, pages 434-439, 2004.