
Gerstner Laboratory for Intelligent Decision Making and Control Czech Technical University in Prague

Series of Research Reports


Report No:

GL 157/02

Machine Learning and Data Mining

Jiří Palouš
palous@labe.felk.cvut.cz http://cyber.felk.cvut.cz/gerstner/reports/GL157.pdf

Gerstner Laboratory, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, Technická 2, 166 27 Prague 6, Czech Republic, tel. (+420-2) 2435 7421, fax: (+420-2) 2492 3677, http://gerstner.felk.cvut.cz

Prague, 2002 ISSN 1213-3000

Contents

1 Introduction
2 Machine Learning
  2.1 Main Machine Learning Methods
    2.1.1 Decision Trees
    2.1.2 Neural Networks
    2.1.3 Bayesian Methods
    2.1.4 Reinforcement Learning
    2.1.5 Inductive Logic Programming
    2.1.6 Case-Based Reasoning
    2.1.7 Support Vector Machines
    2.1.8 Genetic Algorithms
  2.2 Machine Learning and Data Mining
  2.3 Data Preprocessing
  2.4 Learning
  2.5 Testing
  2.6 Results Evaluation & Model Exchange
    2.6.1 Area under ROC Curve
    2.6.2 Predictive Model Markup Language
3 iBARET Instance-BAsed REasoning Tool
  3.1 iBARET structure
  3.2 CQL Server
  3.3 Consultation
  3.4 Testing Set Evaluation
    3.4.1 Classification Task
    3.4.2 Regression Task
  3.5 IBR Model Tuning
    3.5.1 Sequential Algorithm
    3.5.2 Genetic Algorithm
  3.6 Future Work
4 Experiments
5 Future Research
6 Conclusion
References

List of Figures

1 ROC curve for the example in Table 1 (ROC area 0.8)
2 The iBARET block structure
3 Example of prediction values of the sinus function
4 PMML utilization
5 Procedure 6 - iBARET's training error
6 Procedure 6 - iBARET's performance on testing data
7 Procedure 6 - iBARET's performance with reduced number of patient groups

List of Tables

1 Example of calculating a ROC curve
2 Four-fold table
3 Example of a symbol distance table
4 Example of the second type of a symbol distance table
5 Example RMSE of different LWR methods

1 Introduction

Artificial Intelligence (AI) is the area of computer science focusing on creating machines that can engage in behaviors that humans consider intelligent. The ability to create intelligent machines has attracted humans since ancient times, and today, with the huge expansion of computers and 50 years of research into AI programming techniques, the dream of intelligent machines is becoming a reality.

Machine Learning (ML) is the area of Artificial Intelligence that focuses on developing principles and techniques for automating the acquisition of knowledge. Some machine learning methods can dramatically reduce the cost of developing knowledge-based software by extracting knowledge directly from existing databases. Other machine learning methods enable software systems to improve their performance over time with minimal human intervention. These approaches are expected to enable the development of effective software for autonomous systems that can operate in poorly understood environments.

The aim of this work is to give a short overview of the most frequently used Machine Learning methods and to introduce our research focus. The report can be divided into two main parts. The first one concentrates mainly on the ML methods in use. Each important ML method is briefly described and its significance appraised. Then we show the relation to the neighboring branch of artificial intelligence, Data Mining. After that we focus on the problems arising in each phase of the Machine Learning process. Some options for preprocessing are shown and a powerful preprocessing tool, SumatraTT, is introduced. Then we discuss problems in the general learning phase and the testing phase, and show possibilities for results evaluation. At the end of the first part, a popular language for predictive model exchange, PMML, is mentioned.

The second part describes our tool for classification and prediction, iBARET. All techniques that are covered by iBARET are explained, from the consultation process to model tuning by a genetic algorithm. Then we sketch the future evolution of the tool, and at the end we show the most recent experiment made with iBARET on SPA data. The final part of this work includes a reflection on possible directions of further research, where we try to find an interesting topic to work on within the frame of a PhD thesis.

2 Machine Learning

The first part of this section contains an overview of machine learning methods. We focus on the basic, frequently used methods and on several methods that are most popular at present. After this introduction to Machine Learning (ML) we describe the relationship between ML and Data Mining (DM). Then we follow the whole ML/DM process from data preprocessing, through the learning itself, to testing and results evaluation. In each phase of the learning process we try to sketch the problems that can be encountered there.

2.1 Main Machine Learning Methods

This section contains only a very short description of ML methods and their brief characteristics. Most of them are described in more detail in [41, 19]. In recent years, attention has been paid to generating and combining different but still homogeneous classifiers with techniques called bagging, boosting or bootstrapping [11, 17]. They are based on repeated generation of the same type of model over evolving training data. These methods enable a reduction of model variance. The authors show that they cannot be applied to the instance-based learning cycle, because that method is stable with respect to perturbations of the data.

2.1.1 Decision Trees

Decision tree learning [43] is a method for approximating a discrete function by a decision tree. The inner nodes of the tree test attributes and the leaves hold values of the discrete function. A decision tree can be rewritten as a set of if-then rules. Tree learning methods are popular inductive inference algorithms, mostly used for a variety of classification tasks (for example, for diagnosing medical cases). For tree generation, the entropy-based information gain of an attribute is often used to select the test in each node. The best known methods are ID3, C4.5, etc.
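To make the attribute-selection step concrete, here is a minimal sketch of ID3-style information gain; the function names and the toy data are our own illustration, not code from any of the cited systems:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute_index):
    """Entropy reduction achieved by splitting on one attribute."""
    total = entropy(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attribute_index], []).append(label)
    remainder = sum(len(subset) / len(labels) * entropy(subset)
                    for subset in by_value.values())
    return total - remainder

# Toy example: pick the more informative of two attributes.
rows = [("sunny", "hot"), ("sunny", "cool"), ("rain", "hot"), ("rain", "cool")]
labels = ["no", "yes", "no", "yes"]
best = max(range(2), key=lambda i: information_gain(rows, labels, i))  # -> 1
```

2.1.2 Neural Networks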

Neural network learning methods [9] provide a robust approach to approximating real-valued, discrete-valued and vector-valued functions. The well-known Backpropagation algorithm uses gradient descent to tune network parameters to best fit a training set of input-output pairs. The method is inspired by neurobiology: it imitates the function of the brain, where many neurons are interconnected. The instances are represented by many input-output pairs. NN learning is robust to errors in training data and has been successfully applied to problems such as speech recognition, face recognition, etc.

2.1.3 Bayesian Methods

Bayesian reasoning [8] provides a probabilistic approach to inference. It provides the basis for learning algorithms that directly manipulate probabilities, as well as a framework for analyzing the operation of other algorithms. Bayesian learning algorithms that calculate explicit probabilities for hypotheses, such as the naive Bayes classifier, are among the most practical approaches to certain types of learning problems. The Bayes classifier is competitive with other ML algorithms in many cases. For example, in learning to classify text documents, the naive Bayes classifier is one of the most effective.

2.1.4 Reinforcement Learning

Reinforcement learning [28] solves the task of how an agent (one that can sense and act in an environment) can learn to choose optimal actions to reach its goal. Each time the agent performs an action in its environment, a trainer may provide a reward or penalty to indicate the desirability of the resulting state. For example, when an agent is trained to play a game, the trainer might provide a positive reward when the game is won, a negative reward when it is lost, and zero reward in other states. The task of the agent is to learn from this delayed reward, so as to choose sequences of actions that produce the greatest cumulative reward. An algorithm that can acquire optimal control strategies from delayed rewards is called Q-learning. This method can solve problems like learning to control a mobile robot, learning to optimize operations in factories, learning to plan therapeutic procedures, etc.

2.1.5 Inductive Logic Programming

Inductive logic programming [18] has its roots in concept learning from examples, a relatively straightforward form of induction. The aim of concept learning is to discover, from a given set of pre-classified examples, a set of classification rules with high predictive power. The theory of ILP is based on proof theory and model theory for the first order predicate calculus. Inductive hypothesis formation is characterized by techniques including inverse resolution, relative least general generalisations, inverse implication, and inverse entailment. The method can be used for creating logic programs from a training data set; the final program should be able to generate that data back. Creating logic programs is very dependent on task complexity, and in many cases the method is not usable without many restrictions imposed on the final program. ILP is most successfully used in Data Mining for finding rules in huge databases.

2.1.6 Case-Based Reasoning

Case-Based Reasoning (CBR) [1, 34] is a lazy learning algorithm that classifies a new query instance by analyzing similar instances while ignoring instances that are very different from the query. The method holds all previous instances in a case memory. The instances/cases can be represented by values, symbols, trees, various hierarchical structures or other structures. It is a non-generalizing approach. CBR works in the cycle: case retrieval, reuse, solution testing, learning. The method is inspired by biology, concretely by human reasoning using knowledge from old similar situations; this style of learning is also known as Learning by Analogy. The CBR paradigm covers a range of different methods. Widely used is the Instance-Based Reasoning (IBR) algorithm [2, 49], which differs from general CBR mainly in the representation of instances: the representation is simple, usually a vector of numeric or symbolic values. Instance-based learning includes the k-Nearest Neighbors (k-NN) and Locally Weighted Regression (LWR) methods.

2.1.7 Support Vector Machines

Support Vector Machines (SVM) have become a very popular method for classification and optimization in recent years. SVM were introduced by Vapnik et al. in 1992 [10]. The method combines two main ideas. The first one is the concept of an optimum linear margin classifier, which constructs a separating hyperplane that maximizes the distances to the training points. The second one is the concept of a kernel. In its simplest form, the kernel is a function which calculates the dot product of two training vectors. Kernels calculate these dot products in feature space, often without explicitly calculating the feature vectors, operating directly on the input vectors instead. When we use a feature transformation that reformulates the input vector into new features, the dot product is calculated in feature space even if the new feature space has higher dimensionality, so the linear classifier is unaffected. Margin maximization provides a useful counterweight to pure training-set accuracy, which could otherwise easily lead to overfitting of the training data. SVM are well applicable to learning tasks where the number of attributes is large with respect to the number of training examples.

2.1.8 Genetic Algorithms

Genetic algorithms [40] provide a learning method motivated by an analogy to biological evolution. The search for an appropriate hypothesis begins with a population of initial hypotheses. Members of the current population give rise to the next generation population by operations such as selection, crossover and mutation. At each step, a collection of hypotheses called the current population is updated by replacing some fraction of the population by offspring of the fittest current hypotheses. Genetic algorithms have been applied successfully to a variety of learning tasks and optimization problems. For example, they can be used inside other ML methods, such as Neural Networks or Instance-Based Reasoning (see section 3.5.2), for optimal parameter setting.

2.2 Machine Learning and Data Mining

Data mining (DM, also known as Knowledge Discovery in Databases, KDD) has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [20]. It uses machine learning, statistical and visualization techniques to discover and present knowledge in a form which is easily comprehensible to humans. Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential. For example, it can help companies and institutions to focus on the most important information in their data warehouses. Data mining tools predict future trends and behaviors, allowing businesses to make proactive, knowledge-driven decisions. DM tools can answer business questions that were traditionally too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations. DM technology can generate new business opportunities by providing these capabilities:

- Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data quickly.
- Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together.

The most commonly used techniques in data mining are:

- Artificial neural networks
- Decision trees: Classification and Regression Trees (CART), Chi Square Automatic Interaction Detection (CHAID)
- Genetic algorithms

- Nearest neighbor method
- Rule induction: the extraction of useful if-then rules from data based on statistical significance

Here we can see that DM makes rich use of ML methods: going through a huge database and extracting some advanced knowledge can be managed only by an intelligent method that is able to learn. The capabilities of DM are now evolving to integrate directly with industry-standard data warehouse and OLAP (On-Line Analytical Processing) platforms. A Gartner Group Advanced Technology Research Note [22] listed data mining and artificial intelligence at the top of the five key technology areas that will clearly have a major impact across a wide range of industries within the next 3 to 5 years.

2.3 Data Preprocessing

Almost always when we solve a task, we get raw data that have to be prepared for the concrete learning process. According to the problem domain, learning method and software tool, we need to determine an appropriate data structure and then a data preprocessing method. A very important point is to define how to deal with undefined or unknown data and with uncertainty. There are several ways to transform input data:

- simple mapping: the simplest method, mapping one field to another (cannot merge fields),
- expressions: mapping one line from one source to one line in the destination,
- interpreted language: a programming language that can operate with more sources; very powerful but more complex,
- compiler: fast running, but highly dependent on the platform and requiring a complex implementation.

One of the most powerful preprocessing tools is SumatraTT [3, 4], developed at the Department of Cybernetics at CTU. SumatraTT uses the SumatraScript language for defining data transformations and is platform independent. The Sumatra language is inspired by C++ and Java, which are well known to programmers, so it is easy to learn. Preprocessing, as a phase of ML or DM, determines the success of the whole learning/mining process. So-called metadata, i.e. information about the data, are frequently used for automatic preprocessing and can often be helpful.

ML theory suggests many different approaches for dealing with the available data when a model is generated. They differ particularly in the type of learning, which determines constant model characteristics, and in the size of the data set itself. In all cases, they follow two fundamental intentions: to enable generation of a model with predictive power as high as possible, and to give a chance to independently estimate its performance on future data and guarantee the model validity over these unseen data. In order to meet these two goals, not all the data can be used within model generation (adjusting model settings). The hold-out method divides data into two distinct sets: a training set used in learning and a testing set used in evaluation.

We can encounter several unpleasant problems in data. Typically, in a classification task it can easily happen that the data are not equally distributed among the final classes. With most learning methods, this fact leads to a simple classifier whose accuracy is high for the best represented class but which is not able to classify the less

represented class. This state occurs often in medical applications, for example when predicting the results of Coronary Artery Bypass Graft surgery [30]. The most interesting cases for us are the patients that will probably die, but those patients make up only 1% of the data. This can be addressed by an appropriate evaluation function, for example one based on the ROC curve, or improved by data preprocessing. Such preprocessing should adjust the distribution of data into classes, for example by randomly removing cases from the best represented class. Another problem appears if we have many attributes and not much data, see section 2.4. In this case the Probably Approximately Correct (PAC) analysis [16] can help us to determine how good our solution can be; the question is with what probability we are able to predict results with a given accuracy. This data problem can also be partially eliminated by preprocessing, mainly by detecting and removing irrelevant features, or by selecting the most relevant ones [25, 26]. This approach is called Feature Reduction. There are many methods for selecting relevant attributes, for example methods using PAC [23] or correlation [24]. An example of such data can be found in section 4, where we try to predict, on SPA data, the capacity required for therapeutic utilities.
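The random undersampling mentioned above is simple to implement. The following is a minimal sketch under our own naming (the function and the class labels are hypothetical, not part of any tool described here):

```python
import random

def undersample(cases, labels, majority_class, keep_ratio, seed=0):
    """Randomly drop cases of the majority class so that the class
    distribution becomes less skewed; keep_ratio is the fraction of
    majority-class cases to retain."""
    rng = random.Random(seed)
    kept = []
    for case, label in zip(cases, labels):
        if label != majority_class or rng.random() < keep_ratio:
            kept.append((case, label))
    return kept

# E.g. keep only 5% of the survivors to balance a 1% mortality class:
# balanced = undersample(cases, labels, majority_class="survives", keep_ratio=0.05)
```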

2.4 Learning

The term learning usually corresponds to fitting the designed model. Through the process of learning, we improve the model prediction accuracy as fast and as far as possible. After learning we expect a model that fits the input data best. Methods that are not very robust to noise usually give very good results on training data, but on testing or real data they perform rather poorly. When this happens, we talk about over-fitting. The rate of over-fitting is also very dependent on the input data. It mostly appears when we do not have much data compared with the number of attributes, or when the data are noisy. Noise can be brought into the data by, for example, subjective data providing, uncertainty, acquisition errors, etc. If we have too many attributes and not much data, then the state space for finding the optimal model is too wide; we can easily lose the right way and finish in a local optimum. The problem of over-fitting can be partially eliminated by suitable preprocessing and by using an adequate learning method.

Here we would like to mention a very powerful tool for machine learning: WEKA [51, 47]. Weka collects plenty of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform. The algorithms can either be applied directly to a dataset or called from one's own Java code. Weka is also well suited for developing new machine learning schemes. It is open source software issued under the GNU General Public License. It incorporates about ten different methods for classification (Bayes, LWR, SVM, IBR, Perceptron, ...), another six methods for numeric prediction (linear regression, LWR, IBR, multi-layer perceptron, ...) and several so-called meta-schemes (bagging, stacking, boosting, ...). Also included are clustering methods and an association rule learner. Apart from the actual learning schemes, Weka contains a large variety of tools that can be used for pre-processing datasets.

2.5 Testing

After learning, before using our prediction model in practice, we have to check its classification or prediction accuracy. The most frequently used approach is dividing the input data into training and testing data sets. The N-fold cross-validation method divides the data into N partitions. The learning process runs in N steps; in each step i, all the partitions except the i-th are used in learning, and the i-th partition is

used for testing. The leave-one-out method is a special case of cross-validation: each partition consists of just one case, so the number of learning steps equals the number of data records. This method is very easy to implement with the instance-based techniques, as it is trivial to ignore a single case when searching for the most similar cases.
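A minimal sketch of the splitting step, with our own function name; leave-one-out falls out as the special case where the number of folds equals the number of cases:

```python
def cross_validation_splits(n_cases, n_folds):
    """Yield (train_indices, test_indices) pairs for N-fold cross-validation;
    with n_folds == n_cases this degenerates to leave-one-out."""
    indices = list(range(n_cases))
    fold_size, remainder = divmod(n_cases, n_folds)
    start = 0
    for fold in range(n_folds):
        size = fold_size + (1 if fold < remainder else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 10-fold: cross_validation_splits(1000, 10)
# leave-one-out: cross_validation_splits(1000, 1000)
```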

2.6 Results Evaluation & Model Exchange

The evaluation of results is a key problem of the learning process. The evaluation gives feedback to learning, and thus it can distinctly affect the progress of model improvement. The fitness function of some tasks is given by their principle, but for most of them it is not. For numeric prediction tasks the measure of success is often some type of error rate: RMSE, MAPE, or MAE (section 3.4.2). For classification tasks a probabilistic approach is often used, and especially in medical applications ROC curves are used.

2.6.1 Area under ROC Curve

The receiver operating characteristic (ROC) is defined as a curve of false positive and false negative results at various threshold values, indicating the quality of a decision method. This quality is expressed by the area under the ROC curve [27]. This area can be between 0 and 1: zero corresponds to perfectly inverted classification, 0.5 corresponds to no apparent accuracy of the tested method, and 1.0 means perfect classification accuracy. In other words, a ROC curve is a graphical representation of the trade-off between the false negative (FN) and false positive (FP) rates for every possible cut-off. By tradition, the plot shows FP on the X-axis and 1 - FN (which is the true positive (TP) rate) on the Y-axis. Let us consider a simple example: a decision method predicts the probability that testing examples belong among the positive cases (class 1) as Table 1 shows.

Prediction  0  0.05  0.2  0.22  0.3  0.31  0.4  0.42  0.7  0.8
Real class  0  0     1    0     1    0     0    1     1    1

Table 1: Example of calculating a ROC curve

The area under the ROC curve can be calculated by setting down the false positive and false negative rates for each possible decision threshold. First, the threshold is set to 0, i.e. all the examples are considered to be positive. Then, the threshold is set to 0.025, so that the leftmost testing example is considered to be negative while the others remain positive. The third threshold is set to 0.125, and the two leftmost examples are classified as negative. All in all, ten different thresholds can be derived. For each threshold a four-fold table is calculated, and the FP and TP rates are derived from it:

Real class / Classification   0   1
0                             a   b
1                             c   d

FP = b/(a + b),  TP = d/(c + d)

Table 2: Four-fold table

The resulting ROC curve can be seen in Figure 1. The area under the curve is 0.8, which can be interpreted as an apparent relation between the decision method prediction and the real classification.

Figure 1: ROC curve for the example in Table 1, plotting TP (sensitivity) against FP (1-specificity); ROC area 0.8
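As an illustration of the threshold sweep described above, the following sketch (our own, not taken from the report) computes the area under the ROC curve for the data of Table 1 by trapezoidal integration over the (FP, TP) points:

```python
def roc_area(predictions, classes):
    """Area under the ROC curve obtained by sweeping a decision threshold
    over all predictions (trapezoidal rule over the (FP, TP) points)."""
    pos = sum(classes)
    neg = len(classes) - pos
    # Thresholds between consecutive predictions, plus the two extremes.
    thresholds = [min(predictions) - 1] + sorted(predictions) + [max(predictions) + 1]
    points = []
    for t in thresholds:
        tp = sum(1 for p, c in zip(predictions, classes) if p > t and c == 1)
        fp = sum(1 for p, c in zip(predictions, classes) if p > t and c == 0)
        points.append((fp / neg, tp / pos))
    points.sort()
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Data from Table 1:
preds = [0, 0.05, 0.2, 0.22, 0.3, 0.31, 0.4, 0.42, 0.7, 0.8]
labels = [0, 0, 1, 0, 1, 0, 0, 1, 1, 1]
print(roc_area(preds, labels))  # -> 0.8
```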

2.6.2 Predictive Model Markup Language

After creating and fitting a predictive model, we sometimes want to be able to carry that model over to another application or tool, for example for final use, visualization, some kind of post-processing, etc. For that purpose a special language for holding different predictive models has recently been developed. The Predictive Model Markup Language (PMML) is a markup language based on XML which can be used for holding predictive models and for their interchange between compliant vendors' applications. PMML is defined by the Data Mining Group (DMG) and its specification is published on the DMG web page [39]. PMML is a platform and vendor independent, forward-looking data mining standard. It allows users to develop models within one vendor's application and to use other vendors' applications to visualise, analyse, evaluate or otherwise use the models. PMML uses XML [14] to represent mining models; the structure of a PMML document is described by a Document Type Definition (DTD). A single PMML document can contain more than one mining model. If an application supports mining model selection, the user can specify the model for current use; otherwise the first one is used as the default. In addition, PMML can hold data statistics, directives for feature normalization, results, and other meta-data. It is possible to include comments in PMML; some applications can use this to improve their functionality. The language is easy to understand and manipulate, and it is able to hold nearly all of the most frequently used data-mining models. In the future this interchange format should be supported by nearly all Data Mining applications.


3 iBARET Instance-BAsed REasoning Tool

Development of the iBARET system follows the work done in a Master thesis [42], which was focused on a system for predicting the results of cardiac operations using the CBR method [1]. Since that time, we have been developing a more general tool for use in many domains [32].

In contrast to learning methods that construct a general, explicit description of the target function when training examples are provided, instance-based learning (IBL) methods simply store the training examples. Generalizing beyond these examples is postponed until a new instance must be classified. Each time a new query instance is encountered, its relationship to the previously stored examples is examined in order to assign a target function value for the new instance. Instance-based learning includes the k-nearest neighbor (kNN) and locally weighted regression (LWR) methods; these methods assume that instances can be represented as points in a Euclidean or other space. It also includes case-based reasoning (CBR) methods that use more complex, symbolic representations for instances. Instance-based methods are sometimes referred to as lazy learning methods because they delay processing until a new instance must be classified. A key advantage of this kind of delayed, or lazy, learning is that instead of estimating the target function once for the entire instance space, these methods can estimate it locally and differently for each new instance to be classified [41].

kNN performance is highly sensitive to the definition of its distance function. In order to reduce this sensitivity, it is advisable to parametrize the distance function with feature weights. The authors in [48] argue that methods which use performance feedback to assign weight settings demonstrate three advantages over other methods: they require less pre-processing, perform better in the presence of interacting features, and generally require less training data to learn good settings. iBARET represents a batch performance-feedback optimizer. It utilizes genetic algorithms (GAs) and a sequential algorithm to effectively search the space of possible weight settings. The idea of utilizing GAs to find an optimal weight setting is not new [29, 44, 36]. iBARET concentrates on utilization of proper genetic operators with respect to the time needed for processing, and on the possibility to set its parameters through the user interface, i.e. on universality of its application.

Universality is the other overall system accomplishment. iBARET can be used to solve classification as well as regression tasks. For classification tasks it offers two different methodologies to calculate the fitness function of a weight setting. The first one is a probabilistic method, which derives the value of the fitness function of the current weight setting from the average prediction accuracy reached for the different classes. The second one makes use of receiver operating characteristic (ROC) curves [27], taken from radiology and medicine. Both of them give a chance to effectively process tasks with non-uniform class distribution. More details about iBARET and the methods used can be found in [31].

Data preprocessing represents a crucial issue of successful iBARET application. In particular, symbolic features must be properly transformed into numeric features, or a proper distance table has to be defined and used within CQL Server. iBARET does not offer any data preprocessing possibilities, therefore all the preprocessing has to take place outside of iBARET prior to its application. A user should always be aware whether the data can be used directly or only after preprocessing.

3.1 iBARET structure

The iBARET system consists of two main units, CQL Server and IBR Interface (see Figure 2). CQL Server receives queries in the Case Query Language (CQL) format. Every single query corresponds to a single case. At the same time, it contains feature weights and the number of

requested neighbors. CQL Server finds the nearest neighbors of the case contained in the query and sends a CQL response to IBR Interface. CQL Server has its origin in CBR Works 3, which was used as the server application at first. The CBR Works 3 system is a commercial product of TECINNO GmbH created in the frame of the INRECA project. Later on we found that this system reacts too slowly to CQL queries, and therefore we developed our own server application, CQL Server. In order to keep compatibility with IBR Interface, we use the same TELNET communication protocol and the same CQL. Nevertheless, CQL is not fully supported in our server: it implements only the commands for domain model representation, the command for sending a query and the command for generating answers. In fact, CQL Server is a special database engine that can identify the most similar cases (the nearest neighborhood). It works with the domain model, which can be loaded from a text file in the same way as the tested cases.
Figure 2: The iBARET block structure (CQL Server holding the domain model, case memory and CQL Server Engine; IBR Interface holding the CQL communicator, the evaluation unit with probabilistic and ROC curve evaluation, the attribute weights tuning unit with sequential and genetic algorithms, and the experiment settings unit)

IBR Interface is used for finding the optimal parameters of the k-nearest neighbor method, namely the feature weights and the parameter k. The parameter k is set by the user through the GUI (Graphical User Interface). IBR Interface generates a set of weights and applies them to queries sent to CQL Server. It uses these weights for all the samples included in a testing set. It processes the responses from CQL Server, classifies the testing examples according to the reported neighborhood, and finally evaluates the classification or regression accuracy reached with the given set of method parameters.

IBR Interface consists of four main units. The CQL communicator automatically generates queries and decodes answers from CQL Server for further evaluation; it also allows a query to be sent manually. The evaluation unit derives a solution of a single query (classification or regression) from the received neighborhood, and after evaluating all the cases it appoints the overall classification or regression accuracy. The final output of this unit is a single value that indicates the quality of the model and its ability to classify or predict; a probabilistic method or the ROC curve can be used for this purpose. The value we get from the evaluation unit is used in the next unit for tuning of attribute weights. This unit implements two methods of feature weight optimization: the first algorithm we implemented is a sequential algorithm; the better and more useful one is the genetic algorithm. Weights of attributes and other settings are stored in the experiment settings unit.


3.2 CQL Server

The CQL Server [46] is an application that is able to service requests in CQL syntax, sent over any TCP/IP network, for specific database (case-base) information. It runs on Windows 9x/NT/XP systems. On receiving a valid CQL request, it starts a sequential search for the nearest neighbors of the case that was transmitted with the request. According to the parameters passed over by the client, the neighbors are found and the compiled results are sent back over the net. CQL Server is actually a replacement for the CBR Works 3 database server, which does not deliver satisfactory speed and power.

When starting to work with a new task (i.e. a new case-base), a CQL Model has to be created and loaded first. The CQL Model represents a description of a case-base in terms of CQL. It defines the data types (integer, real, symbol, custom, ...) and the slots (names and data types of case-base features) contained in the case-base file. At the same time it defines whether the given case-base feature (slot) is used when the distance is calculated or not (determined by a "not discriminant" statement). A distance weight can be adjoined to each slot as well. Which feature is the class is not implicitly specified within the model; this decision is postponed to IBR Interface. In case the class feature is constant within the processed domain, a "not discriminant" statement can be added to its slot definition for safety reasons. According to this data, the system generates a custom binary format, which is stored in memory. This format saves memory space and it also makes searching easier and faster.

After the CQL Model definition, a case-base can be loaded into the server. The case-base is a simple text file consisting of cases; features are separated by commas, cases by line breaks. The first feature on each line corresponds to the first slot in the CQL Model and so on. The number of features and their data types must correspond to the values predefined in the CQL Model. No symbolic description (feature names, types) is included; the first line directly represents the first case.

The CQL Server uses a Euclidean distance metric to identify the nearest neighbors. The distance between two cases is calculated by adding up all the distances based on the values of the respective slots. For integers/reals, their distance on the real line is simply taken; for symbolic variables, supplied table metrics are used (from external files). When a query is initiated, each slot is assigned a weight by the client. The respective distances are multiplied by this weight, and the resulting value is then divided by the maximum distance between values of that slot in the entire database. The similarity value is obtained by dividing the sum of distances by the sum of all used weights and subtracting the result from 1. The similarity is therefore dependent on the context (the other subjects) in the case-base:
\[ \mathrm{sim}(i,j) = 1 - \frac{\sum_{k=1}^{m} w_k^{impl}\, w_k^{q}\, \mathrm{dis}(x_{ik}, x_{jk})}{\sum_{k=1}^{m} w_k^{impl}\, w_k^{q}} \qquad (1) \]

Where: sim(i, j) is the similarity between the i-th and j-th case, m is the number of discriminative features, w_k^{impl} is an implicit weight of the k-th feature taken from the CQL Model, w_k^{q} is a user weight of the k-th feature taken from the actual query, dis is a distance function, x_{ik} is the value of the k-th feature in the i-th case, and x_{jk} is the value of the k-th feature in the j-th case.

For numeric data types (integer, real), the distance function is defined as follows:


\[ \mathrm{dis}(x_{ik}, x_{jk}) = \frac{|x_{ik} - x_{jk}|}{\max_l x_{lk} - \min_l x_{lk}} \qquad (2) \]

For symbolic unordered data types, the distance function must be defined with the aid of a symbol distance table. The symbol table is specified as a lower triangular matrix rather than an upper triangular one (this saves typing tabs), which implies that the values are symmetrical. The tables have a very simple and intuitive format: they are stored as text, tabs separate fields and line breaks separate table rows. The problematics of symbolic feature weighting is discussed in [15]. Let us assume a symbolic type with the values {Renault, Buick, Rolls-Royce, Skoda}; the table then gives the similarities between these values as Table 3 shows.

             Renault   Buick   Rolls-Royce   Skoda
Renault      1
Buick        0.8       1
Rolls-Royce  0.7       0.9     1
Skoda        0.3       0.2     0.0           1

Table 3: Example of a symbol distance table

The main diagonal consists of 1.0 only, which means that a Buick is 100% similar to a Buick, and so on. The other similarity values express, for example, that a Skoda is the most distant car type from the other types.

There is one more type of symbol distance table implemented in the system. This table makes it possible to define briefly a metric that returns a certain value if the feature values are the same and another one if they are different. The metric is symbolically specified in a table of another format, see Table 4.

+   similarity if same
-   similarity if different

Table 4: Example of the second type of a symbol distance table

This measure might be used when the difference cannot be determined for individual values. Let us take up the car example again: in the symbol set {Blue, Red, Silver} it is more logical to set the distance to some value if the colors match and to some other value if they do not. It might be a problem to quantify the distances Red-Blue and Red-Silver: which is more different? In these cases the last metric may apply.

If the tables are in the same directory as the model file and they have names in the form "feature name.tbl", then they can be auto-loaded: the system automatically looks up these files, loads them into memory and binds them to the symbols of that particular type. Some types are implicitly considered non-discriminative (not taken into account in the total similarity measure); these are especially types which have the parent type CQL SYMBOL directly. Such types are implicitly shown in red, which indicates that they are non-discriminative. On clicking on such a type, a dialog will open requesting the filename of the table to be loaded for this type; the type then becomes discriminative.


Range checking is done on numeric variables; out-of-range values are reported in the system log. Symbols are checked against the supplied symbol list; unrecognized symbols are reported in the system log as well. Simple statistics are provided to evaluate the efficiency of the server and of the connection.

3.3 Consultation

CQL is a structured language used for communication between Interface and Server. This section explains the way queries are constructed and answers understood. The syntax of CQL is completely described in the documentation of the CBR-Works system [45]. IBR Interface generates queries itself, so the user does not need any knowledge of it. Let us overview the base data necessary for automatic generation of queries. First of all, the names of attributes (slots), their values and weights are needed. The names of slots are read from the attribute file. Values of the attributes are taken from another input text file; the attribute values should be divided by commas or semi-colons on each row. Two more data entries are necessary for query generation: the threshold and the number of cases to retrieve. The number of cases to retrieve defines the maximum number of returned neighbors; the CQL Server can return fewer cases if fewer cases exceed the threshold value of similarity with the queried case.

Server Answers. After a query is sent and processed by Server, Interface gets an answer in CQL format. The answer has to be decoded, and the data necessary for classification of the queried case have to be acquired. The first value that Interface reads is a return code. If this value is not zero, communication between Interface and Server failed and no case is appended. The most frequent fault code is -2, which means that the query format was wrong. When the return code is 0, the most similar cases with their similarity values and full description (all the attribute values) follow after the key word "cases". For further processing, only the case similarity and the value of the attribute that represents the result (class or predicted value) are considered. Just for easier control of the process, Interface also extracts the case names from the answer.

Sometimes the same cases are used both in Server's database and in the testing set for Interface. Of course, when classifying the queried case, the nearest case then has to be removed from the corresponding Server answer, as it is certainly the queried case itself; otherwise the error estimation of the method would be optimistically biased. This technique of error estimation is referred to as the Leave-One-Out Cross-Validation (LOOCV) method. When the similarity of the nearest neighbor equals 1, Interface ignores the nearest case from the answer.

Consultation with the server can be executed manually or automatically. Manual consultation means that the user can create and send a custom-made query directly in CQL, and the result is shown entirely in CQL again (without processing the answer). This approach presumes knowledge of CQL and suits mainly testing purposes. Mostly, automatic consultation is used: it constructs queries from the input cases and sends them step by step to the server. It can be used for model training, model testing or final consultation. When the domain model is trained, the consultation with the whole training data set is executed repeatedly for many different experiment settings. Within automatic query sending, simple processing of answers for further evaluation is performed.

3.4 Testing Set Evaluation

After the answer is processed by the CQL communicator unit, the queried case can be classified (or predicted, when the desired value is numeric) by means of the Locally Weighted Regression (LWR) method [13]. For this purpose, the CQL communicator unit outputs the names of the nearest neighbors together with their similarity to the queried case and their classification.

At first, Interface determines a so-called modified similarity for each of the nearest neighbors. This modified similarity weights the neighbor when classifying the queried case. The simplest way of weighting is direct utilization of the returned similarity. The second possibility is to apply scaling. Scaling is useful namely when many neighbors show very similar similarity values and similarity therefore has only little effect upon the classification of the queried case. In the present version of Interface, the user can choose the linear rank scaling method (LRank), which calculates the modified similarity from the order of cases by similarity: the first nearest neighbor's modified similarity equals the number of returned neighbors, and for each next nearest neighbor Interface decreases the modified similarity step by step by one.

\[ mod\_sim_i = \begin{cases} \mathrm{sim}(i,q) & \text{no scaling} \\ n - i + 1 & \text{LRank scaling} \end{cases} \qquad (3) \]

Where: mod_sim_i is the modified similarity of the i-th nearest neighbor, q is the queried case, i is the index of the i-th neighbor of q, and n is the total number of neighbors returned in the answer.

For the evaluation we have to know which task type is executed (classification or regression), because the LWR and evaluation methods differ for each task type. For a classification task we calculate, for each classification class, the probability that the new case belongs to that class; the class with the highest probability is the predicted result. The training/testing set can then be evaluated probabilistically or with a ROC curve. For a regression task the result is determined by a simple locally weighted regression method based on a weighted average calculation; the evaluation of the training/testing set is then calculated as the mean absolute or relative prediction error. The interface also implements the possibility to work with relative values of attributes, normalized by a chosen attribute.

3.4.1 Classification Task

Provided that a task is classicatory, Interface calculates probability for each classication class that the queried case belongs to this class. This probability is calculated as follows:
\[ P_{q,j} = \frac{\sum_{i=1}^{n} mod\_sim_i \; r_{ij}}{\sum_{i=1}^{n} mod\_sim_i} \cdot 100\ [\%] \qquad (4) \]

Where: P_{q,j} is the probability that the queried case q belongs to the j-th class, n is the total number of neighbors returned in the answer, mod_sim_i is the modified similarity of the i-th nearest neighbor, and

\[ r_{ij} = \begin{cases} 1 & \text{if the classification of the } i\text{-th nearest neighbor is } j \\ 0 & \text{otherwise} \end{cases} \]

From these probability values, Interface determines the class of the queried case: it is the class with the maximum probability. The index of the class predicted for the queried case is:

\[ r = \arg\max_j \, (P_{q,j}) \qquad (5) \]


Where: r is the index of the classification class with maximum probability, and P_{q,j} is the probability that the queried case q belongs to the j-th class.

That is the way Interface evaluates a single query. For the purpose of feature weights optimization, the classification accuracy for the whole testing/training set has to be calculated. Interface offers two basic methodologies to do this: the probabilistic method and the ROC curve method.

Probabilistic Method. Within this method, the user can choose how to deal with a possible difference between the predicted class and the right class of the queried case. For each queried case, Interface first evaluates a result R_j. The first possibility is to set the result directly to the probability P_{q,j}, where j is the index of the right class. Otherwise, the result can be set to 100% if the classification is correct (j = r, where j is the index of the right class) and to zero if it is not. Then, the individual probability results R_j of each case have to be transformed into a uniform evaluation of the testing set. The simplest way of evaluating the overall classification accuracy is calculating a simple average over all results R_j; this can be used only when working with equally distributed classes. In most tasks the method using a weighted average should be used instead: it calculates the probability of correct classification of each class separately and weights them according to the user's setting. First, the probability of successful classification of the k-th class is computed as a simple average of the partial classification successes:
\[ P_k = \frac{\sum_{j=1}^{N} R_j \; r_{jk}}{\sum_{j=1}^{N} r_{jk}} \qquad (6) \]

Where: P_k is the success probability of the k-th classification class, N is the number of queried cases (the number of cases in the testing set), R_j is the probability of correct classification of the j-th case, and

\[ r_{jk} = \begin{cases} 1 & \text{if case } j \text{ is classified in the } k\text{-th class} \\ 0 & \text{otherwise} \end{cases} \]

Then, Interface calculates a weighted average on the basis of these probabilities. The weights of the classification classes can be set by the user. The final probabilistic value P evaluates the classification accuracy reached with the given setting on the entire testing set:
\[ P = \frac{\sum_{k=1}^{M} P_k \; w_k}{\sum_{k=1}^{M} w_k} \qquad (7) \]

Where: P is the probability estimate of the overall classification accuracy, P_k is the probability of successful classification for the k-th class, and w_k is the weight of the k-th classification class.
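The following sketch chains equations (3)-(7) together for one testing set. It is our own illustration with hypothetical names, not iBARET code:

```python
def classify(neighbors, classes, use_lrank=False):
    """Equations (3)-(4): neighbors is a list of (similarity, class) pairs,
    ordered from most to least similar; returns per-class probabilities."""
    n = len(neighbors)
    probs = {c: 0.0 for c in classes}
    total = 0.0
    for i, (sim, cls) in enumerate(neighbors, start=1):
        mod_sim = (n - i + 1) if use_lrank else sim   # equation (3)
        probs[cls] += mod_sim
        total += mod_sim
    return {c: 100.0 * v / total for c, v in probs.items()}  # equation (4)

def weighted_accuracy(results, true_classes, class_weights):
    """Equations (6)-(7): results[j] is R_j in percent, true_classes[j]
    is the right class of case j, class_weights maps class -> user weight."""
    per_class = {}
    for r, c in zip(results, true_classes):
        per_class.setdefault(c, []).append(r)
    p_k = {c: sum(v) / len(v) for c, v in per_class.items()}        # (6)
    num = sum(p_k[c] * w for c, w in class_weights.items() if c in p_k)
    den = sum(w for c, w in class_weights.items() if c in p_k)
    return num / den                                                # (7)

# One query with three neighbors and two classes:
p = classify([(0.9, "dies"), (0.8, "survives"), (0.7, "survives")],
             classes=["dies", "survives"])
predicted = max(p, key=p.get)                                       # equation (5)
```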


Obviously, utilization of the weighted average of the classification accuracies reached for the individual classes decreases the influence of an unequal distribution of training/testing cases among the classification groups. A typical example of such a distribution is the task of mortality prediction: just a few percent of the patients actually die, but we want to be very exact in these cases. That is why it is very beneficial to remarkably increase the weight of successful classification into the class "dies, will die".

Area under the Receiver Operating Characteristic (ROC) Curve. The method of the area under the ROC curve was introduced in section 2.6.1. This measure comes to use mainly when it is not desirable to make the model predictions distinct, although the final classes are. The area under the ROC curve gives a good chance to convert a complex and balanced comparison of all the predictions and real classifications into a single number. It can be shown that the area represents the probability that a randomly chosen positive subject is correctly rated or ranked with greater suspicion than a randomly chosen negative subject. In medical imaging studies the ROC curve is usually constructed as follows: images from diseased and non-diseased patients are thoroughly mixed, then presented in this random order to a decision system which is asked to rate each on a scale ranging from "definitely normal" to "definitely abnormal". The scale can be either continuous or discrete ordinal. The points required to produce the ROC curve are obtained by successively considering broader and broader categories of "abnormal" in terms of the decision system scale, and comparing the proportions of diseased and non-diseased patients.

The ROC curve in the example in section 2.6.1 is not smooth because of the low number of testing examples and the consequently low number of possible thresholds. In real applications we mostly have thousands of cases, so we would get a ROC curve made of thousands of points. This number of points is too high, so we do not calculate a point for each possible decision threshold; we decrease the number of points by merging cases into groups. Interface simply uses a user-given value defining the number of points on the ROC curve. There are two possibilities to divide the testing examples among the defined number of intervals (sketched in the code below). The first method, Group by percent, merges cases into equally sized groups, i.e. each interval contains the same number of testing examples. For example, if we have 1000 cases and we want to construct a ROC curve from 50 points, we make 50 intervals of 20 cases. The dividing boundaries are related to an equal number of examples in each interval rather than to an equal range of the classification probabilities P_{q,j} of the individual examples. The other method, Group by value, merges cases by their values: Interface simply divides the interval of case values (the classification probabilities of the individual examples) into equally sized sub-intervals and merges the cases falling into the same sub-interval. With this method it can happen that the examples are spread very unequally among the individual intervals.

Moreover, the method Group by percent can be further modified by using a correction: when it is used, the cases having the same P_{q,j} must belong to the same interval. When Interface assigns the last case to an interval, it also checks the next case; if the value P_{q,j} of that case equals the value of the last case, Interface adds that next case to the current interval, and then continues with the next case. This option enables a more appropriate and stable estimate of the area under the ROC curve for tasks where many cases show the same value of P_{q,j}. The best results can usually be acquired with the Group by percent method with the correction selected. Similarly to the probabilistic method, the area under the ROC curve is used as a fitness function for automatic tuning of attribute weights.
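A possible reading of the Group by percent method with the correction, again as our own hedged sketch rather than the actual Interface code:

```python
def group_by_percent(sorted_probs, n_groups, correction=True):
    """Split a sorted list of P_qj values into n_groups intervals of (nearly)
    equal size; with correction, equal values never straddle a boundary."""
    groups = []
    size = max(1, len(sorted_probs) // n_groups)
    i = 0
    while i < len(sorted_probs):
        j = min(i + size, len(sorted_probs))
        if correction:
            # Extend the interval while the next case has the same value.
            while j < len(sorted_probs) and sorted_probs[j] == sorted_probs[j - 1]:
                j += 1
        groups.append(sorted_probs[i:j])
        i = j
    return groups

print(group_by_percent([10, 10, 10, 20, 30, 40], n_groups=3))
# -> [[10, 10, 10], [20, 30], [40]]
```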


3.4.2 Regression Task

Dealing with a regression task, Interface outputs a numeric value as the prediction for the queried case. One of the simplest locally weighted regression methods is used: a simple weighted average of the output attribute values of the nearest neighbors found for the given query.
\[ R = \frac{\sum_{i=1}^{n} mod\_sim_i \; r_i}{\sum_{i=1}^{n} mod\_sim_i} \qquad (8) \]

Where: R is the numeric prediction for the queried case, n is the total number of neighbors returned in the answer, mod_sim_i is the modified similarity of the i-th nearest neighbor, and r_i is the output attribute value of the i-th nearest neighbor of q.

As weights, the modified similarities of the neighbors are used. This method is quite simple but gives good results: it is noise resistant and more accurate than a simple average. The following example compares several methods of calculating the result of the regression method. The task is to predict values of the sinus function. From the function we know only 13 values and we would like to predict the value of the sinus between the known points. We have the sinus function with and without noise; in real data there are often noise and uncertainty, and the predictor should be able to handle such data. At first we tested the 2-NN method, which gets good results on clear (not noisy) data. But for most tasks the parameter k = 2 is too low and, as we can see, not noise resistant enough. For our example the best solution is the 4-NN method with weight averaging: on clear data it has a greater error (though not greater than the simple average), but on noisy data it is the best of all the methods in this example. The predictions of all three methods on the sinus function and their prediction errors are shown in Figure 3. The calculated Root Mean Square Errors are in Table 5.

                Value   2-NN    4-NN    4-avg
without noise   0       0.0241  0.121   0.125
with noise      0.260   0.154   0.085   0.090

Table 5: Example RMSE of different LWR methods

The evaluation of the current case has to be further converted into an evaluation of the entire testing set. For the evaluation it is necessary to know the right results; we get them from the input file with the training cases. The probability evaluation cannot be used for a regression task; the best solution is to calculate the error of prediction. For this purpose we implemented methods that calculate the Mean Absolute Error (MAE) or the Mean Absolute Percentage Error (MAPE):
\[ MAE = \frac{1}{n} \sum_{i=1}^{n} |R_i - R_{0i}|, \qquad MAPE = \frac{1}{n} \sum_{i=1}^{n} \frac{|R_i - R_{0i}|}{R_{0i}} \]

That value can then be directly used as a fitness function for tuning the weights of attributes.
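A minimal sketch of equation (8) together with the MAE/MAPE evaluation; the names and structure are our own:

```python
def predict(neighbors):
    """Equation (8): weighted average of neighbor outputs;
    neighbors is a list of (modified_similarity, output_value) pairs."""
    total = sum(sim for sim, _ in neighbors)
    return sum(sim * value for sim, value in neighbors) / total

def mae(predictions, true_values):
    """Mean Absolute Error over the testing set."""
    return sum(abs(r - r0) for r, r0 in zip(predictions, true_values)) / len(predictions)

def mape(predictions, true_values):
    """Mean Absolute Percentage Error (true values must be non-zero)."""
    return sum(abs(r - r0) / abs(r0)
               for r, r0 in zip(predictions, true_values)) / len(predictions)

# Predict one point from its 4 nearest neighbors:
print(predict([(0.95, 0.8), (0.90, 0.7), (0.85, 0.9), (0.80, 0.6)]))
```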


Figure 3: Example of prediction values of the sinus function: (a) prediction of the sinus, (b) error of prediction, (c) prediction of the noisy sinus, (d) error of prediction; each panel compares the known values with the 2-NN, 4-NN and 4-avg methods

3.5 IBR Model Tuning

Once the training set is evaluated, Interface can use the resulting value to adapt the learning parameters. This learning process actually tries to tune up the domain model. The adaptation of the parameters lies mainly in tuning the feature weights. The whole process of feature weights tuning is as follows: an initial set of weights is generated and the training set is evaluated; from the knowledge of this evaluation, Interface generates another set of feature weights and evaluates it again; it continues iteratively until an appropriate weight setting is reached. Two algorithms for tuning the weights have been implemented: the first is a sequential algorithm and the second is a genetic algorithm.

3.5.1 Sequential Algorithm

This algorithm works in cycles. In each cycle it tries to change the weights of all attributes to values that bring a better evaluation. It starts with predefined attribute weights that are loaded from a text file.


At first we evaluate the training cases with the initial weight setting. Then we take the original weight of the first attribute, decrease it by some value (depending on the setting and the number of cycles) and evaluate again; we may decrease the weight only if its resulting value stays greater than zero. Next we take the original weight of that attribute, increase it by the same value and evaluate once more. Then we compare the evaluations obtained with the original, decreased and increased weight of the first attribute; the weight with the greatest evaluation is set as the new original. We keep the greatest evaluation, because it serves as the evaluation for the original weight of the second attribute. For the second attribute we again try to decrease the original weight, evaluate, increase the original weight, evaluate, and compare all three evaluations; the weight with the greatest evaluation is again set as the original weight of the second attribute. We repeat these steps for all attributes. This method is a modification of the well known hill climbing algorithm.

After the first cycle is done, it is necessary to change the value by which we decrease and increase the attribute weights (we call it the difference). In most cases we want to reduce that value, which we can do either by decreasing it or by dividing it; after selecting the method, we can set the step of decreasing, or the value by which we divide the difference. If we choose to decrease the difference, then the evolution of the difference in the next cycles can be written as an arithmetic progression:

\[ d_i = d_0 - (i - 1)\, d_s \qquad (9) \]

If we choose the second method, dividing the difference, then the evolution of the difference can be written as a geometric progression:

\[ d_i = \frac{d_0}{d_s^{\,i-1}} \qquad (10) \]

Where: d_i is the difference in the i-th cycle, d_0 is the initial value of the difference, and d_s is the step of decreasing (or the divisor of) the difference. In each cycle we calculate the difference and try to change the weights of the attributes from the first to the last. The algorithm runs until at least one of the termination conditions is reached.
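A compact sketch of our reading of this hill-climbing variant, not the actual Interface implementation; evaluate is assumed to be a callable returning the training-set fitness for a weight vector:

```python
def sequential_cycle(weights, evaluate, difference):
    """One cycle: for each attribute, try decreasing and increasing its
    weight by `difference` and keep the best of the three evaluations."""
    best_score = evaluate(weights)
    for k in range(len(weights)):
        original = weights[k]
        candidates = [original + difference]
        if original - difference > 0:      # decreased weight must stay positive
            candidates.append(original - difference)
        for candidate in candidates:
            weights[k] = candidate
            score = evaluate(weights)
            if score > best_score:
                best_score, original = score, candidate
        weights[k] = original              # keep the best value found
    return weights, best_score

def tune(weights, evaluate, d0, ds, n_cycles):
    """Run several cycles, shrinking the step geometrically, equation (10);
    using d0 - (i - 1) * ds instead would give equation (9)."""
    for i in range(1, n_cycles + 1):
        weights, score = sequential_cycle(weights, evaluate, d0 / ds ** (i - 1))
    return weights, score
```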
3.5.2 Genetic Algorithm
Another way to obtain optimal attribute weights is to use a genetic algorithm (GA) [40, 6, 7]. It is a stochastic optimization method [21] inspired by nature. The method works with individuals and with a population composed of them. Each individual represents one candidate solution of the problem being solved; in our case an individual represents the weights of all attributes. All weights therefore have to be encoded into one string called a chromosome. We use a simple bit string, so the encoding is straightforward. The length of the chromosome can be controlled by setting the number of bits per attribute: with k bits per attribute, the total length of the chromosome (in bits) is L = kn, where n is the number of attributes. At the beginning of the algorithm the population has to be initialized. We initialize it pseudo-randomly: the bits of the chromosomes are set at random, but the generated population has to satisfy the following condition:


    |f_1(i) - f_0(i)| <= 1,  for i = 1, ..., L,  where  f_1(i) = sum_{k=1}^{PopSize} bit(k, i)  and  f_0(i) = PopSize - f_1(i).    (11)

Here i is the index of a bit in the chromosome, f_1(i) is the number of i-th bits set to 1 over all individuals' chromosomes, f_0(i) the number of i-th bits set to 0, k the index of an individual in the population, bit(k, i) the value of the i-th bit in the k-th individual, and PopSize the size of the population.
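One way to realize this balanced pseudo-random initialization is sketched below, under the assumption that condition (11) requires the counts of ones and zeros at every bit position to differ by at most one; generating each bit column as a shuffled half-and-half pattern satisfies this by construction.

import random

def init_population(pop_size, chrom_len, rng=random.Random(0)):
    # Pseudo-random population in which, for every bit position i,
    # |f1(i) - f0(i)| <= 1 holds (condition (11)).
    population = [[0] * chrom_len for _ in range(pop_size)]
    for i in range(chrom_len):
        # Half ones and half zeros per column (one extra 1 if pop_size is odd),
        # assigned to the individuals in random order.
        column = [1] * (pop_size - pop_size // 2) + [0] * (pop_size // 2)
        rng.shuffle(column)
        for k in range(pop_size):
            population[k][i] = column[k]
    return population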

At the beginning of the GA cycle the individuals are evaluated by a numerical value we call fitness. It is the number obtained from the training/testing set evaluation, see Section 3.4. If desired, the fitness values can be scaled. We have implemented only one simple scaling function, Linear Ranking (LRank), in which a scaled fitness is assigned to each individual according to its order in the population:

    f_i = PopSize - i + 1

where i denotes the order of the i-th individual in the population after sorting by the old fitness value. The scaled fitness of the best individual is thus PopSize and the worst individual is assigned the value 1.

The next step of a GA is selection [5]. We have implemented several selection methods, from the simplest to more capable ones. The first method, Roulette wheel, is implemented only for historical reasons and is not of much use. Another implemented method is Remainder Stochastic Sampling with/without Replacement (RSS with/without R). For this method so-called expected values have to be computed for the individuals. The expected value is a real number indicating how many copies of the individual should enter the inter-population. It is usually defined by the expression

    EV_i = f_i / f_avg

where f_i is the fitness of the i-th individual and f_avg is the average fitness over the whole population. In the first phase of the algorithm each individual is selected as many times as the integer part of its expected value indicates. The rest of the individuals are selected by roulette wheel, where the fitness values are the fractional parts of the expected values. With RSS without R, each individual can be selected by the roulette wheel only once; this is ensured simply by setting the fitness of a selected individual to zero, so that it cannot be selected again. With RSS with R, an individual can be selected more than once. The last implemented selection method is Tournament. This method has a parameter N, which can differ for each parent and determines how many individuals take part in the tournament: N individuals are chosen at random and the best of them is selected as a parent. This is repeated PopSize times (PopSize is the size of the population). Once the individuals are selected, recombination operators such as crossover and mutation are applied, and the new population is completed from the recombined individuals.
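The following sketch illustrates the linear-ranking scaling and the RSS selection just described. It is a plausible reconstruction rather than the iBARET code; the with_replacement flag corresponds to the with/without R variants, and positive fitness values are assumed.

import random

def lrank(fitness):
    # Linear Ranking: the best individual gets PopSize, the worst gets 1.
    order = sorted(range(len(fitness)), key=lambda i: fitness[i], reverse=True)
    scaled = [0] * len(fitness)
    for rank, i in enumerate(order):           # rank 0 is the best individual
        scaled[i] = len(fitness) - rank
    return scaled

def rss_select(population, fitness, rng, with_replacement=True):
    # Remainder Stochastic Sampling: integer parts of the expected values
    # give guaranteed copies, fractional parts feed a roulette wheel.
    f_avg = sum(fitness) / len(fitness)
    expected = [f / f_avg for f in fitness]
    selected = []
    for i, ev in enumerate(expected):
        selected += [population[i]] * int(ev)
    frac = [ev - int(ev) for ev in expected]
    while len(selected) < len(population):     # fill the rest by roulette wheel
        r = rng.uniform(0, sum(frac))
        acc = 0.0
        for i, w in enumerate(frac):
            acc += w
            if r <= acc:
                selected.append(population[i])
                if not with_replacement:
                    frac[i] = 0.0              # without R: cannot be drawn again
                break
    return selected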

It is possible to apply a 1-point or a 2-point crossover method. In addition, the 2-point crossover has a 2 pp option; when 2 pp is selected, the possible crossing points lie only between attributes (a crossing point cannot fall inside the binary representation of an attribute weight). Crossover and mutation are performed with probabilities set by the user. After crossover we apply the next recombination operator, mutation. We take all offspring individuals together with the selected individuals that have not been used for crossover and mutate them: each bit of an individual's chromosome is negated with probability P_mut. Finally, the old population is replaced by the new one. For this we use a simple method: into the new population we insert all offspring individuals and the individuals that were not used for crossover (that is, all individuals after mutation). A slightly different method is used when Elitism is checked: the inter-population is then selected with size (PopSize - 1) instead of PopSize, and as the last individual the best individual of the old population is inserted into the new population unchanged (without crossover and mutation). New populations are generated in cycles until some termination condition is fulfilled; the run can be terminated after a certain number of generated populations or when a certain success level is reached.

The newest GA we have implemented is the Genetic Algorithm with Limited Convergence (GALCO) [35]. This algorithm differs somewhat from the GA described above in that it does not replace the whole population at once. Only two parents are selected, by tournament selection. Crossover (with P_cross = 100%) is applied to these parents and mutation is skipped. The offspring are evaluated, and if one of them is better than the better of its parents, the parents are replaced by the offspring. Otherwise some bad individual is chosen, either the worst one or by a tournament in which the worse individual wins. The bad individual is then tentatively replaced by the first offspring, but the following condition has to be fulfilled:

    |f_1(i) - f_0(i)| <= B,  for i = 1, ..., L    (12)

where i, f_1(i), f_0(i) and L have the same meaning as in (11) and B is the balance deviation parameter. The new individual is evaluated, and if it is better than the old bad one the replacement is confirmed; otherwise the old individual is restored. The same procedure is applied to the second offspring. After both offspring have been processed, new parents are selected and the cycle continues. This algorithm has one big advantage: it does not converge to a single local extreme, and the final population holds diverse results, all of which represent very good solutions. It is able to find very good solutions in a shorter time than the other implemented methods.
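A condensed sketch of one GALCO step follows. It reflects the description above under our assumptions: evaluate is the fitness function, crossover returns two offspring, tournament returns the index of a selected individual, and balance_ok implements condition (12).

def balance_ok(population, candidate, replaced, B):
    # Condition (12): after swapping `replaced` for `candidate`, the counts of
    # ones and zeros may differ by at most B at every bit position.
    for i in range(len(candidate)):
        f1 = sum(ind[i] for ind in population) - replaced[i] + candidate[i]
        f0 = len(population) - f1
        if abs(f1 - f0) > B:
            return False
    return True

def galco_step(population, fitness, evaluate, crossover, tournament, B, rng):
    # One GALCO iteration: tournament parents, 100% crossover, no mutation.
    i1, i2 = tournament(fitness, rng), tournament(fitness, rng)
    children = crossover(population[i1], population[i2], rng)
    child_fit = [evaluate(c) for c in children]
    if max(child_fit) > max(fitness[i1], fitness[i2]):
        # An offspring beats the better parent: the parents are replaced.
        for idx, c, f in zip((i1, i2), children, child_fit):
            population[idx], fitness[idx] = c, f
        return
    for c, f in zip(children, child_fit):
        bad = min(range(len(population)), key=fitness.__getitem__)  # worst one
        if f > fitness[bad] and balance_ok(population, c, population[bad], B):
            population[bad], fitness[bad] = c, f   # replacement confirmed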

3.6 Future Work

We would like to extend this tool with several more powerful methods for learning and reasoning. Implementing support for PMML [39] would be very helpful. This language could store the whole IBR model including the cases. The big advantage would be easy cooperation with other applications that support PMML, see Figure 4. On iBARET's input, some PMML-compliant application could for example preprocess data, build clusters, etc. If the model includes cluster representatives instead of cases, iBARET should be able to work with them as with ordinary data. If iBARET can produce PMML output, the model could in turn be used by another PMML-compliant application, for example for visualization or consultation.


[Diagram: a PMML data flow. Clustering software turns unclassified data into classified data and a PMML clustering model, which feed iBARET; iBARET in turn serves other PMML-compliant applications as an IBR classifier.]
Figure 4: PMML utilization

The advantages of PMML could also be used in another group of methods we plan to implement, namely case reduction techniques. A great number of reduction techniques have been suggested; an overview can be found in [50]. We would like to implement two case memory reduction algorithms. The first was suggested in [12]; its underlying idea is very similar to hierarchical clustering, so applying this algorithm resembles loading an external PMML model generated by a hierarchical clustering algorithm. The algorithm deals with prototypes. Initially, each instance in the case memory represents a single prototype. Then the two nearest prototypes with the same class are merged into a single prototype using a weighted averaging scheme; the new prototype is located somewhere between the two original prototypes. The representation of a prototype is straightforward for linear features. If the instance descriptive vector contains symbolic features, these features have to be dichotomized, and the prototype is then represented by its share of the individual symbolic feature values. For example, the symbolic feature Gender (with values male, female) is replaced by two new features, Male and Female. When merging prototype Pa(1) with Male = 0, Female = 1 (Pa represents a single original instance) and prototype Pb(2) with Male = 1, Female = 0 (Pb represents two original instances), the new prototype Pc(3) has Male = 0.67 and Female = 0.33. An advantage of the algorithm is its applicability to regression tasks; in that case the condition that the merged prototypes share the same classification is replaced by a requirement on the similarity of their numerical outputs. A sketch of the merging step is given at the end of this section.

The second algorithm has its origin in the TIBL system suggested in [52]. This algorithm attempts to keep the typical instances representing the individual classes, which enables a huge instance reduction and smooth decision boundaries. The resulting PMML model is compact and robust in the presence of noise; a disadvantage of the algorithm is its overgeneralization on problems with complex decision surfaces. The typicality of an instance is defined as the ratio of its average similarity to instances of the same class to its average similarity to instances of other classes, where similarity is defined in terms of the chosen distance function, as 1 - distance(i1, i2). The reduction algorithm proceeds iteratively as follows. Start with an empty reduced instance set S. Pick the most typical instance x of the original set T which is not in S and is not correctly classified by the instances in S. Find the most typical instance y in T - S which causes x to be correctly classified and add it to S. Repeat this process until all instances in T are classified correctly.

To improve the evaluation of regression tasks we would like to implement more evaluation functions, at least the Mean Square Error (MSE) and the Root Mean Square Error (RMSE). These measures will help us compare our results with those of other methods and applications.

With the help of PMML the user interface could also become more intuitive in loading inputs. Currently, before starting experiments, two files have to be loaded into the server and two files into the interface. With PMML we could load only one file into the interface and send the data and the model to the server through the CQL connection. We would also like to make some other improvements in the iBARET controls to make the tool more user friendly.

The basic idea behind the iBARET implementation is the effort to develop a general IBR tool that can easily be applied to a wide variety of classification and regression tasks. Our further plan is to run the system on benchmark data to find out its performance and compare it with the performance of other learning methods.
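The prototype merging from [12], as described above, can be sketched as follows; this is an illustration of the averaging scheme with the dichotomized Gender example, not the planned iBARET implementation.

def merge_prototypes(pa, wa, pb, wb):
    # Merge two prototypes given as feature dictionaries; wa and wb are the
    # numbers of original instances each prototype represents.  The result
    # is the weighted average located between the original prototypes.
    merged = {f: (wa * pa[f] + wb * pb[f]) / (wa + wb) for f in pa}
    return merged, wa + wb

# Symbolic feature Gender dichotomized into Male/Female shares:
pa = {"Male": 0.0, "Female": 1.0}   # Pa, one original instance
pb = {"Male": 1.0, "Female": 0.0}   # Pb, two original instances
pc, wc = merge_prototypes(pa, 1, pb, 2)
print(pc, wc)   # Pc: Male = 0.67, Female = 0.33 (rounded), weight 3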

4 Experiments

The first real-life application where we tried an instance-based modelling tool was the prediction of the outcome of Coronary Artery Bypass Graft (CABG) surgery. This task was solved with the commercial CBR-Works 3.0 Professional system [45] extended with an original module providing an automated interface; at that time we did not yet have the CQL server. The results can be found in [42, 30].

The most recent task we are solving is a challenge in the SPA domain. The object of research is a timely prediction of the capacity requirements of the therapeutic utilities. In other words, we want to predict how many individual procedures are going to be prescribed in the following time period, on the assumption that we know the basic patient group characteristics for the regarded period (the overall number of patients and their structure).

The presented work is based on the CTU weeks dataset (produced by SumatraTT [3]). This dataset provides a weekly representation of the SPA problem: each record represents a single week. It consists of attributes expressing the number of patients belonging to each of 128 pre-defined distinct patient groups in the given week (Gr1, Gr2, ...). These attributes are followed by the numbers of applications of the individual procedures in the given week; there are 38 different procedures we would like to predict (Proc1, Proc2, ...). Consequently, the prediction model built on this dataset works with the patient structure (patients divided among 128 distinct groups) and outputs weekly predictions of the individual procedures (38 procedures in total).

Up to now, iBARET has been applied to predict the sixth procedure (Proc6) only. The feature weights are first learned on the training data; predictions for all the available weeks are then constructed (LOOCV is used for this purpose). The results can be seen in Figure 5. The model delivers results with the following characteristics:

MAE = 31.3 (i.e., a mean absolute error of about 30 procedures per week),
MAPE = 9.8% (i.e., the model makes about a 10% error in an average week).
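For reference, the two error measures used throughout this section can be computed as follows (a minimal sketch; the function names are ours):

def mae(actual, predicted):
    # Mean absolute error, here in procedures per week.
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    # Mean absolute percentage error; assumes no week has zero applications.
    return 100.0 * sum(abs(a - p) / a
                       for a, p in zip(actual, predicted)) / len(actual)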

The question is how good and reliable this output is. Let us try to apply a simple regression model that does not consider any structural patient information and works purely with the overall patient number. The model looks as follows:

    Proc_j_num_i_pred = a_1 * Gr_all_i + a_0,    (a_1 = 0.185, a_0 = 66)

where Proc_j_num_i_pred is the predicted number of applications of procedure j in week i, and Gr_all_i is the overall number of patients staying in the spa in week i. This model delivers results with the following characteristics:

MAE = 76.4 (i.e., a mean absolute error of about 76 procedures per week),
MAPE = 21.4% (i.e., the model makes about a 20% error in an average week).
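Such a baseline can be obtained by an ordinary least-squares fit. The sketch below uses numpy's polyfit on synthetic stand-in data, since the real dataset columns are not reproduced here.

import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the dataset columns (illustration only).
gr_all = rng.integers(300, 600, size=125).astype(float)    # patients per week
proc6 = 0.185 * gr_all + 66 + rng.normal(0, 30, size=125)  # noisy linear response

a1, a0 = np.polyfit(gr_all, proc6, deg=1)   # ordinary least-squares line fit
pred = a1 * gr_all + a0
print(f"a1={a1:.3f}  a0={a0:.0f}  MAE={np.abs(proc6 - pred).mean():.1f}")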

[Plot: number of Proc6 applications per week; series Proc6-real, Proc6_pred1, Proc6_pred2.]
Figure 5: Procedure 6 - iBARET's training error

In Figure 5 the predictive result of the simple regression model is shown for comparison. A mutual comparison of the predictive quality of the two models leads to a logical explanation: the overall number of patients is a very significant attribute, but the result quality can still be improved by considering the structural information.

The models developed above were tested on unseen data that were not available during training. These data represent another 21-week time period, i.e., the testing set consists of 21 additional records. The following graph (Figure 6) shows the performance of different modifications of the iBARET predictor.
[Plot: number of Proc6 applications per week on the testing data; series Proc6-real, Proc6-iBARET, Proc6-iBARET-win, Proc6-iBARET-equal weights.]
Figure 6: Procedure 6 - iBARET's performance on testing data

The red line represents a setting (further referred to as setting 1) in which the set of feature weights learned on the training data was applied; only the training data were included in the case memory when predicting the testing instances. The dark blue line uses the same set of weights but adds a sort of windowing: the case memory always contains all the records that precede the currently predicted testing record (setting 2). Finally, the green line represents a setting in which uniform weights are used (w_i = 1 for all i) and no windowing is applied


(setting 3 ). The presented results leads to following conclusions: The performance of setting 1 (M AE = 100 procedures/week, M AP E = 23%) and its comparison with the performance of the setting 3 (M AE = 91 procedures/week, M AP E = 21%) suggest that the feature weights learned upon the training set are denitely over-trained (their utilization brings no gain in comparison with the uniform weights). At the same time it holds that the prediction error estimated above on basis of the training error was overly optimistic for the given type of predictor.
[Plot: number of Proc6 applications per week on the testing data; series Proc6-real, Proc6-iBARET-win, Proc6-iBARET-reduced.]
Figure 7: Procedure 6 - iBARET's performance with reduced number of patient groups

Windowing can bring some gain: setting 2 shows MAE = 81 procedures/week and MAPE = 18%. This suggests that recent history (the last week in particular) is important when predicting a forthcoming week.

Obviously, learning 128 feature weights on 125 examples leaves much room for over-training. That is why we tried to identify the patient groups that are principal for the given procedure and thus reduce the number of features. It was decided to use 10 features only, namely the groups with the largest learned feature weights. This approach (without windowing) brought a substantial improvement (Figure 7): MAE = 52 procedures/week and MAPE = 16% on the testing data. It follows that this reduced-feature approach, combined with windowing, represents the most promising setting of iBARET.

For the sake of simplicity, this section focuses on procedure 6 only. The other procedures were predicted as well (Proc8, Proc13, Proc16, Proc19, Proc22, and Proc39) and the results are in accord with those reached on procedure 6. Generally, it can be concluded that we have constructed predictive models that are likely to predict with a MAPE lower than 20%.
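Selecting the principal groups amounts to keeping the features with the largest learned weights. A minimal sketch (the function name is ours):

def reduce_features(weights, keep=10):
    # Indices of the `keep` patient groups with the largest learned weights;
    # the remaining features are dropped from the distance computation.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(ranked[:keep])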


5 Future Research

Let us now summarize how our further research could be oriented and what we would like to work on. As written in Section 3.6, we would like to continue developing iBARET. The tool should include further AI techniques to become more flexible and powerful. First we would like to implement several case reduction techniques and run benchmark experiments; this will require a larger intervention in both parts of our client-server application.

On the theoretical level there are several possibilities to focus on. Besides case reduction there is another very interesting research area, feature reduction. Both reduction techniques are very important for achieving relevant results, so we would like to gather more information about them and try to include them in the ordinary learning process. Case reduction and feature reduction could work together with respect to PAC analysis.

Another very interesting area of k-NN-like methods is Locally Weighted Regression (LWR). Here we would like to focus on the problem of selecting a suitable LWR method according to the task and the input data. It would be interesting to find out whether an LWR method can be suggested from the number of features, the amount of data, and some additional knowledge about the data (meta-data). As the experiments in Section 3.4.2 show, noise in the data influences the results as well; the question is whether the optimal LWR method can be determined from knowledge of the data and of the noise in them (the noise distribution function). First, we could try to suggest the k parameter of the k-NN method from the data and a noise estimate.

It could also be helpful to learn more about techniques mostly used in computer vision, such as Support Vector Machines, and try to utilize their advantages in machine learning. For example, we could investigate the use of kernel functions (as used in SVMs) in some ML methods.

6 Conclusion

This work includes a short state of the art of Machine Learning and summarizes two years of ongoing research. The first part is mostly theoretical: it describes Machine Learning and Data Mining techniques and their common subtasks, such as data preprocessing, testing, and results evaluation. The main part of this work describes the iBARET system. The system has been extensively expanded over the last two years and now covers many methods for LWR, case set evaluation, feature weight tuning, etc. Almost all of these methods are explained here, and the end of that section presents a vision of iBARET's future development. The functionality of iBARET is demonstrated by the most recent experiment on the SPA data. We have tried to point out some problems linked with Machine Learning and Data Mining; we would like to focus on some of them and thus help to enlarge the boundaries of Artificial Intelligence.


References

[1] A. Aamodt and E. Plaza. Case-Based Reasoning: Foundational Issues, Methodological Variations, and System Approaches. AI Communications, 7(1):39-59, 1994.
[2] D. Aha, D. W. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine Learning, 6:37-66, 1991.
[3] P. Aubrecht. Sumatra Basics. Technical report GL 121/00, Czech Technical University, Department of Cybernetics, December 2000.
[4] P. Aubrecht. Tutorial of Sumatra Embedding. Technical report GL 101/00, Czech Technical University, Department of Cybernetics, October 2000.
[5] J. E. Baker. Reducing Bias and Inefficiency in the Selection Algorithm. In Proceedings of the Second International Conference on Genetic Algorithms, pages 14-21. Lawrence Erlbaum Associates, Hillsdale, NJ, 1987.
[6] D. Beasley, D. R. Bull, and R. R. Martin. An Overview of Genetic Algorithms: Part 1, Fundamentals. Technical report, University Computing, 1993.
[7] D. Beasley, D. R. Bull, and R. R. Martin. An Overview of Genetic Algorithms: Part 2, Research Topics. Technical report, University Computing, 1993.
[8] D. A. Berry. Statistics: A Bayesian Perspective. Duxbury Press, Belmont, California, 1996.
[9] Ch. M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
[10] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144-152. ACM, 1992.
[11] L. Breiman. Bagging Predictors. Machine Learning, 24(2):123-140, 1996.
[12] C. L. Chang. Finding prototypes for nearest neighbor classifiers. IEEE Transactions on Computers, 23:1179-1184, 1974.
[13] W. Cleveland and S. Devlin. Locally-weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83:596-610, 1988.
[14] The World Wide Web Consortium (W3C). http://www.w3c.org. Web page.
[15] S. Cost and S. Salzberg. A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features. Machine Learning, 10:57-78, 1993.
[16] A. Dhagat and L. Hellerstein. PAC learning with irrelevant attributes. In Proc. of the 35th Annual Symposium on Foundations of Computer Science, pages 64-74. IEEE Computer Society Press, Los Alamitos, CA, 1994.
[17] B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, London, UK, 1993.
[18] P. A. Flach. The logic of learning: a brief introduction to Inductive Logic Programming. In Proceedings of the CompulogNet Area Meeting on Computational Logic and Machine Learning, pages 1-17. University of Manchester, 1998.
[19] P. A. Flach. On the state of the art in machine learning: A personal review. Artificial Intelligence, 131(1-2):199-222, September 2001.
[20] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus. Knowledge discovery in databases - an overview. AI Magazine, 13:57-70, 1992.
[21] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA, 1989.
[22] Gartner Group. http://www.gartner.com. Web page.
[23] D. Guijarro, J. Tarui, and T. Tsukiji. Finding Relevant Variables in PAC Model with Membership Queries. In Proc. 10th International Conference on Algorithmic Learning Theory (ALT '99), volume 1720, pages 313-322. Springer-Verlag, 1999.
[24] M. Hall. Correlation-based Feature Selection for Machine Learning. PhD thesis, Waikato University, Department of Computer Science, Hamilton, NZ, 1998.
[25] M. Hall and L. Smith. Practical feature subset selection for Machine Learning. In Proceedings of the Australian Computer Science Conference. University of Western Australia, February 1996.
[26] M. A. Hall. Feature selection for discrete and numeric class machine learning.
[27] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29-36, 1982.
[28] L. P. Kaelbling, M. L. Littman, and A. P. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.
[29] J. D. Kelley and L. Davies. A Hybrid Genetic Algorithm for Classification. In Proceedings of the Twelfth International Conference on Artificial Intelligence (IJCAI-91), volume 2, 1991.
[30] J. Klíma, L. Lhotská, O. Štěpánková, and J. Palouš. Instance-Based Modelling in Medical Systems. In R. Trappl, editor, Cybernetics and Systems 2000, volume 2, pages 365-370, Vienna, Austria, April 2000. Austrian Society for Cybernetics Studies. ISBN 3-85206-151-2.
[31] J. Klíma and J. Palouš. iBARET: Instance-Based Reasoning Tool. Research report GL 113/00, CTU FEE, Department of Cybernetics, The Gerstner Laboratory, 2000.
[32] J. Klíma and J. Palouš. iBARET: Instance-Based Reasoning Tool. In ELITE Foundation, editor, European Symposium on Intelligent Technologies, Hybrid Systems and Their Implementation on Smart Adaptive Systems, volume 1, page 55, Susterfelderstrasse 83, Aachen, December 2001. Verlag Mainz, Wissenschaftsverlag.
[33] J. Klíma and J. Palouš. Případové usuzování a rozhodování (Case-based reasoning and decision making). In Sborník Znalosti 2001, volume 1, pages 231-238. Vysoká škola ekonomická, Praha, June 2001. ISBN 80-245-0190-2.
[34] J. Kolodner. Case-Based Reasoning. Morgan Kaufmann, San Mateo, California, 1993.
[35] J. Kubalík, L. J. M. Rothkrantz, and J. Lažanský. Genetic Algorithms with Limited Convergence. To be published in Proceedings of the Fourth International Workshop on Frontiers in Evolutionary Algorithms (FEA 2002), Research Triangle Park, North Carolina, USA, March 8-13, 2002.
[36] D. Lowe. Similarity metric learning for a variable kernel classifier. Neural Computation, 7(1):72-85, 1995.
[37] V. Mařík, O. Štěpánková, and J. Lažanský a kol. Umělá inteligence (1) (Artificial Intelligence, Vol. 1). Academia, Praha, 1993.
[38] Z. Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag, Berlin Heidelberg, second edition, 1994.
[39] Data Mining Group. http://www.dmg.org. Web page.
[40] M. Mitchell. An Introduction to Genetic Algorithms. The MIT Press, Cambridge, Massachusetts, 1998.
[41] T. Mitchell. Machine Learning. McGraw-Hill Co., 1997.
[42] J. Palouš. Využití případového usuzování pro lékařské aplikace (Case-based reasoning in medical applications). Master's thesis, CTU FEE, Prague, 2000. In Czech.
[43] J. R. Quinlan. Induction of Decision Trees. Machine Learning, 1:81-106, 1986.
[44] D. Skalak. Prototype and Feature Selection by Sampling and Random Mutation Hill Climbing Algorithms. In Proceedings of the 11th International Conference on Machine Learning, pages 293-301, 1994.
[45] TECINNO GmbH, Kaiserslautern, Germany. CBR Works 3.0 documentation, 1998.
[46] M. Vejmelka. CQL server download and description at http://phobos.spaceports.com/~vejmelka/cql/main.html. Web page.
[47] Weka 3: machine learning software in Java. http://www.cs.waikato.ac.nz/~ml/weka/. Web page.
[48] D. Wettschereck, D. Aha, and T. Mohri. A review and empirical evaluation of feature-weighting methods for a class of lazy learning algorithms. Artificial Intelligence Review, 11:273-314, April 1997.
[49] D. Wilson. Advances in instance-based learning algorithms. PhD thesis, Brigham Young University, Provo, UT, 1997.
[50] D. R. Wilson and T. R. Martinez. Reduction Techniques for Instance-Based Learning Algorithms. Machine Learning, 38(3):257-286, 2000.
[51] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, October 1999.
[52] J. Zhang. Selecting Typical Instances in Instance-Based Learning. In Proceedings of the Ninth International Machine Learning Workshop, Aberdeen, Scotland, pages 470-479. Morgan Kaufmann, San Mateo, CA, 1992.

