
Lenses Dataset

Table 1 shows the Lenses data from the UCI Machine Learning Repository, which can be used to predict a contact lens recommendation. The data conforms to the following encoding:¹

-- 3 Classes
   1 : the patient should be fitted with hard contact lenses,
   2 : the patient should be fitted with soft contact lenses,
   3 : the patient should not be fitted with contact lenses.

1. age of the patient: (1) young, (2) pre-presbyopic, (3) presbyopic
2. spectacle prescription: (1) myope, (2) hypermetrope
3. astigmatic: (1) no, (2) yes
4. tear production rate: (1) reduced, (2) normal
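As a small illustration of the encoding above, the following sketch maps the integer codes of one record back to readable values. The names `CLASS_NAMES`, `ATTRIBUTES`, and `decode_record` are hypothetical helpers introduced here, not part of the assignment.

```python
# Lookup tables transcribed from the encoding described above.
CLASS_NAMES = {
    1: "hard contact lenses",
    2: "soft contact lenses",
    3: "no contact lenses",
}
ATTRIBUTES = [
    ("age", {1: "young", 2: "pre-presbyopic", 3: "presbyopic"}),
    ("prescription", {1: "myope", 2: "hypermetrope"}),
    ("astigmatic", {1: "no", 2: "yes"}),
    ("tear rate", {1: "reduced", 2: "normal"}),
]


def decode_record(values, label):
    """Map four integer attribute codes and a class code to readable strings."""
    decoded = {name: mapping[v] for (name, mapping), v in zip(ATTRIBUTES, values)}
    decoded["label"] = CLASS_NAMES[label]
    return decoded


# Record with id 4 in Table 1: age=1, prescription=1, astigmatic=2, tear rate=2, label=1
print(decode_record([1, 1, 2, 2], 1))
```

Keeping the attribute order in a list mirrors the column order of the data, so a row can be decoded positionally.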

Credit Approval Dataset


The Credit Approval (CA) dataset describes creditworthiness data (i.e., a binary classification problem).² Much like the Lenses data, we have split the available data into a training set, crx.data.training, and a testing set, crx.data.testing. The dataset contains both numerical and categorical features, and it is relatively balanced (meaning a roughly equal number of positive and negative examples).

When you look at the dataset, you will see that there are some missing values (question marks). The first step in working with the CA dataset is to process the data. For example, consider feature 1. If its value is missing, we may choose either a or b to replace the question mark. You may simply choose the most frequent value among all the records, or among those records that have the same label.³ For real-valued features, just replace missing values with the label-conditioned mean (i.e., $\mu(x_1 \mid +)$ for instances labeled as positive).

The second aspect you need to consider is normalizing features. Nominal features can be left in their given form, where we define the distance to be a constant value (e.g., 1) if the two values are different, and 0 if they are the same. However, it is often wise to normalize real-valued features. For the purpose of this assignment, we will use z-scaling, where
\[
z_i^{(m)} = \frac{x_i^{(m)} - \mu_i}{\sigma_i} \tag{1}
\]

¹ This text was taken directly from the UCI website (modulo a clear typo in the class encoding): http://archive.ics.uci.edu/ml/datasets/Lenses
² http://archive.ics.uci.edu/ml/datasets/Credit+Approval
³ Note that you will also have to do this with the testing data.
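The missing-value handling described for the Credit Approval data can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function names, the in-memory row format (lists of strings with `?` for missing values), and the `real_cols` parameter are assumptions made here. Reusing the fill values learned on the training set when processing the testing set is also an assumption of this sketch.

```python
from collections import Counter

MISSING = "?"


def fit_imputer(rows, labels, real_cols):
    """Learn label-conditioned fill values from the given rows.

    rows: list of lists of strings; labels: one class label per row;
    real_cols: set of column indices holding real-valued features (assumed known).
    Returns a dict {(label, col): fill_value}.
    """
    fills = {}
    n_cols = len(rows[0])
    for label in set(labels):
        subset = [r for r, y in zip(rows, labels) if y == label]
        for col in range(n_cols):
            observed = [r[col] for r in subset if r[col] != MISSING]
            if not observed:
                continue  # nothing to learn from; such cells are left untouched
            if col in real_cols:
                # Label-conditioned mean for real-valued features.
                fills[(label, col)] = sum(float(v) for v in observed) / len(observed)
            else:
                # Most frequent value among records with the same label.
                fills[(label, col)] = Counter(observed).most_common(1)[0][0]
    return fills


def impute(rows, labels, real_cols, fills):
    """Return a copy of rows with '?' replaced and real-valued columns cast to float."""
    out = []
    for r, y in zip(rows, labels):
        new_row = []
        for col, v in enumerate(r):
            if v == MISSING:
                v = fills.get((y, col), v)
            elif col in real_cols:
                v = float(v)
            new_row.append(v)
        out.append(new_row)
    return out


# Toy data (made up for illustration): column 0 is nominal, column 1 is real-valued.
train_rows = [["a", "30.0"], ["a", "?"], ["?", "20.0"], ["b", "40.0"], ["b", "?"]]
train_labels = ["+", "+", "+", "-", "-"]
fills = fit_imputer(train_rows, train_labels, real_cols={1})
train_clean = impute(train_rows, train_labels, {1}, fills)
```

The same `fills` dictionary can then be passed to `impute` together with the testing rows and their labels.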

id    age   prescription   astigmatic   tear rate   label

Training
1     1     1              1            1           3
2     1     1              1            2           2
5     1     2              1            1           3
6     1     2              1            2           2
7     1     2              2            1           3
8     1     2              2            2           1
9     2     1              1            1           3
10    2     1              1            2           2
12    2     1              2            2           1
13    2     2              1            1           3
15    2     2              2            1           3
16    2     2              2            2           3
18    3     1              1            2           3
19    3     1              2            1           3
20    3     1              2            2           1
21    3     2              1            1           3
23    3     2              2            1           3
24    3     2              2            2           3

Testing
3     1     1              2            1           3
4     1     1              2            2           1
11    2     1              2            1           3
14    2     2              1            2           2
17    3     1              1            1           3
22    3     2              1            2           2

Table 1: Lenses data for Problem 2

In Equation (1), $z_i^{(m)}$ denotes the scaled value of feature $i$ for instance $m$ (similarly, $x_i^{(m)}$ is the raw input), $\mu_i$ is the average value of feature $i$ over all instances, and $\sigma_i$ is the corresponding standard deviation over all instances.
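The z-scaling of Equation (1) and the per-feature distance convention for nominal features can be sketched as below. The function names and the row format are assumptions of this sketch. The text does not say whether the standard deviation divides by N or N-1; the population form (divide by N) is assumed here, and constant features are mapped to 0 to avoid division by zero.

```python
import math


def fit_z_scaler(rows, real_cols):
    """Compute (mean, standard deviation) for each real-valued column."""
    stats = {}
    for col in real_cols:
        values = [r[col] for r in rows]
        mu = sum(values) / len(values)
        # Population standard deviation (divide by N); an assumption of this sketch.
        sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
        stats[col] = (mu, sigma)
    return stats


def z_scale(rows, stats):
    """Apply z = (x - mu) / sigma to the real-valued columns; other columns are copied."""
    out = []
    for r in rows:
        new_row = list(r)
        for col, (mu, sigma) in stats.items():
            # Guard against constant features, where sigma is 0.
            new_row[col] = (r[col] - mu) / sigma if sigma > 0 else 0.0
        out.append(new_row)
    return out


def feature_distance(a, b, is_nominal):
    """Per-feature distance: 0/1 mismatch for nominal values, absolute difference otherwise."""
    if is_nominal:
        return 0.0 if a == b else 1.0
    return abs(a - b)


# Toy rows (made up for illustration): column 0 is nominal, column 1 is real-valued.
rows = [["a", 1.0], ["b", 3.0]]
stats = fit_z_scaler(rows, real_cols=[1])
scaled = z_scale(rows, stats)
```

Computing the statistics once and passing them to `z_scale` lets the same means and standard deviations be applied to another set of rows, should that be desired.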