Introduction to Data Mining

Spring 2015
Hands-On Class Assignment 1
Data preprocessing is a key step when you build data mining models. In real-world business settings, a great
proportion of the time you spend in a data mining project will be associated with tasks like data consolidation, data
cleaning, feature construction/transformation, feature selection, etc. More importantly, the quality of the resulting
predictive models will largely depend on your ability to adequately preprocess the raw data and to create
meaningful features from it. In a recent cross-selling application, experiments conducted in a mailing campaign in
the publishing industry showed that about 50-70% of the accuracy of the predictive models built in these
experiments can be, at least indirectly, explained by data preprocessing decisions (sampling, coding of categorical
variables, scaling, etc.).
The purpose of this assignment is to familiarize you with some of the preprocessing tools you may need for your
projects. You should already have Weka installed. For this assignment, you will need to download the TRAIN2.arff
and TRAIN2.csv datasets from the course website.
PART I FEATURE CONSTRUCTION
Open the TRAIN2.arff file found on the course website in Weka.
You should see 9 attributes in the attributes section on the Preprocess tab. Click on each attribute one by one.
You should notice that the statistics in the selected attribute section change according to the attribute you select.
You should be able to see information about the number of missing values, the attribute type (Nominal vs.
Numeric), the number of unique values, and so on.
Transformation
Now click on pgift. You should see from the selected attribute window that pgift is a numeric attribute. You
should also see that the distribution is skewed. Let's transform this attribute. Go to the filter section and click on
the Choose button. Go to the folder filters.unsupervised.attribute. Click on the word NumericTransform in the
white text box. In the filter section, click on the box right next to the Choose button, as indicated in the figure:

In the popup (see the figure below), change the method name to log to take the natural logarithm of the values in
attribute pgift. Set the invert selection flag to False. Put the number 4 for attribute 4 in the attributeIndices text
box. Click on the More button to see how you might transform multiple attributes at a time. Click the OK button.
Now click Apply.

Click on pgift again. You should see that the distribution is now much closer to normal.
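
To see why the log transform helps, here is a minimal sketch outside Weka. The values below are hypothetical stand-ins for pgift (not taken from TRAIN2.arff), and the `skewness` helper is our own illustration. It computes skewness before and after taking the natural logarithm.

```python
import math
import statistics


def skewness(values):
    """Third standardized moment (population form) of a list of numbers."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Hypothetical right-skewed values standing in for pgift.
pgift = [0.05, 0.06, 0.08, 0.1, 0.12, 0.15, 0.2, 0.3, 0.6, 1.5]

# Natural log, the same function java.lang.Math.log computes.
log_pgift = [math.log(v) for v in pgift]

print("skewness before:", round(skewness(pgift), 2))
print("skewness after: ", round(skewness(log_pgift), 2))
```

A skewness closer to zero after the transform corresponds to the more symmetric histogram you see in Weka. Note that the logarithm is only defined for strictly positive values.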
Nominal to Binary
Now click on attribute rfa_2f. In the selected attribute window, you should see that rfa_2f is a nominal attribute
and it has four possible values. Go to the NominalToBinary filter the same way you went to the
NumericTransform filter above. Set the parameters as follows.

Apply the filter.

Question 1: How many attributes do you now see in the attributes window? What possible values do the new
attributes take?
There are twelve attributes in the attributes window, which are as follows:

The new attributes take four possible values, which are as follows:


Attribute: rfa_2f = 1

Attribute: rfa_2f = 2

Attribute: rfa_2f = 3

Attribute: rfa_2f = 4

Now click the Undo button to roll back the change. Go back to the NominalToBinary filter and set the
binaryAttributesNominal flag to True. Apply the filter again.
Question 2: Now, how many attributes do you see in the attributes window? What possible values do the new
attributes take?
There are twelve attributes in the attributes window, which are:

The new attributes take two possible values, which are as follows:

Attribute: rfa_2f = 1

Attribute: rfa_2f = 2

Attribute: rfa_2f = 3

Attribute: rfa_2f = 4
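
The attribute count above can be checked with a small sketch of the general idea behind nominal-to-binary coding: one indicator attribute per level. This is an illustration only, not Weka's implementation, and the `nominal_to_binary` helper is a hypothetical name.

```python
def nominal_to_binary(name, levels, value):
    """Expand one nominal value into {"name=level": 0 or 1} indicators."""
    return {f"{name}={level}": int(value == level) for level in levels}


# rfa_2f has the four levels 1, 2, 3 and 4; encode a record whose value is 3.
row = nominal_to_binary("rfa_2f", ["1", "2", "3", "4"], "3")
print(row)

# 9 original attributes, minus rfa_2f itself, plus one indicator per level.
print("attributes after the filter:", 9 - 1 + len(row))
```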

Discretize
Now click on attribute firstdate. You will see that it is a numeric attribute. Let's discretize this attribute using
the Discretize filter. Set the parameters as follows:

Click on the More button to learn more about the parameters that you may set. For right now, we will leave the
default bins setting at 10.
Question 3: Select the attribute and look at the selected attribute box. What type of attribute do you now
have? What is the label for the first category? What is the category with the least number of observations?
The attribute (Firstdate) is of type Nominal.

The label for the first category is '(-inf-7719.3]'

The category with the least number of observations is '(7719.3-7928.6]'
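
The bin labels above are consistent with equal-width binning, which the following sketch reproduces. The range endpoints 7510 and 9603 are an assumption inferred from the labels (ten bins of width 209.3), not values stated in the assignment, and the helper names are hypothetical.

```python
import bisect


def equal_width_cuts(lo, hi, bins):
    """Return the bins-1 interior cut points that split [lo, hi] evenly."""
    width = (hi - lo) / bins
    return [lo + i * width for i in range(1, bins)]


def bin_index(value, cuts):
    """Index of the right-closed interval (cuts[i-1], cuts[i]] holding value."""
    return bisect.bisect_left(cuts, value)


# Assumed range, inferred from the labels '(-inf-7719.3]' ... '(9393.7-inf)'.
cuts = equal_width_cuts(7510, 9603, 10)
print([round(c, 1) for c in cuts])

print(bin_index(7600, cuts))  # falls in the first bin, index 0
print(bin_index(9603, cuts))  # falls in the last bin, index 9
```

Equal-width bins can end up with very uneven counts when the data are clustered, which is why some categories contain far fewer observations than others.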

PART II: SAMPLING


Unsupervised Sampling
Select the Resample instance filter in the filters.unsupervised.instance folder. Notice that the current relation
section shows you have 9541 instances. For the Resample filter, set the parameters as indicated in the figure:

Select OK and Apply.


Question 4: How many instances does Weka show in your dataset after sampling?
Weka shows 6678 instances in our dataset after sampling.
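
The instance count can be reproduced with a short sketch of sampling 70% of the records without replacement. The settings (seed 1, 70%, no replacement) are read from the relation name saved later in TRAIN2new.arff. Python's random generator differs from Weka's, so the rows selected will not match, only the count. The helper name is hypothetical.

```python
import random


def resample_without_replacement(rows, percent, seed=1):
    """Draw percent% of rows at random, each row at most once."""
    rng = random.Random(seed)
    k = int(len(rows) * percent / 100)  # round down to whole instances
    return rng.sample(rows, k)


rows = list(range(9541))  # stand-ins for the 9541 instances
sample = resample_without_replacement(rows, 70)
print(len(sample))
```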

Remove an attribute
Click on the check box next to target_d. Now, click the Remove button at the bottom of the window.
Supervised Sampling
Weka assumes that the last column in your data is the target variable (NOTE: you can change this to another
attribute when running classification and feature selection methods). Our data has a uniform class distribution (I
already sampled from a larger data set so that we would have approximately the same number of ones and zeros).
However, if your dataset were skewed with respect to your class label, you could perform supervised sampling to
bias your sample to a uniform class.
You'll perform supervised sampling. Select the Resample instance filter in the filters.supervised.instance folder,
and use the parameter settings as indicated in the figure:

Select OK and Apply. Save your updated .arff file now. Click on the Save button to save the .arff file as
TRAIN2new.arff.
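
To illustrate what biasing a sample toward a uniform class distribution means, here is a simplified sketch. It is not Weka's algorithm: it draws the same number of rows from every class, with replacement. The toy data and the `balanced_resample` helper are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balanced_resample(rows, label_of, percent, seed=1):
    """Sample percent% of the data, split evenly across the classes."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    per_class = int(len(rows) * percent / 100) // len(by_class)
    sample = []
    for members in by_class.values():
        # With replacement, so a rare class can still fill its share.
        sample.extend(rng.choices(members, k=per_class))
    return sample


# Hypothetical skewed data: 10% of rows have class "1", 90% have class "0".
rows = [(i, "1" if i % 10 == 0 else "0") for i in range(1000)]
balanced = balanced_resample(rows, lambda r: r[1], 80)
print(Counter(label for _, label in balanced))
```

Because the rare class is drawn with replacement, some of its rows appear more than once in the result.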
PART III: FEATURE SELECTION
Now click on the Select Attributes tab in Weka. The default evaluator is CfsSubsetEval and the default Search
Method is BestFirst. Change these to InfoGainAttributeEval and Ranker respectively. Click Start. The
attributes in your data set are ranked by information gain with respect to the class.
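
As a rough illustration of the quantity being ranked, the sketch below computes information gain for nominal attributes: the entropy of the class minus the weighted entropy of the class within each attribute value. The toy data and attribute names are hypothetical and unrelated to TRAIN2.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels):
    """Class entropy minus the weighted entropy within each attribute value."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: one attribute predicts the class, the other does not.
target = ["1", "1", "0", "0"]
attributes = {
    "perfect": ["A", "A", "B", "B"],
    "useless": ["A", "B", "A", "B"],
}
gains = {name: info_gain(vals, target) for name, vals in attributes.items()}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking, gains)
```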
Question 5: What are the first three attributes ranked by information gain?
Following are the first three attributes ranked by information gain:

You may also try Principal Components (PC) as the evaluator. Note that PC creates dummies for all of the
attribute/value pairs before performing the analysis.
Go back to the default settings of CfsSubsetEval and BestFirst. (These settings will perform the forward selection
method we discussed in class). Press Start.
Question 6: Which attributes were selected?
Following attributes were selected:

Now go back to the Preprocess tab and select the attributes that you found in the step above. You will need to
check the check boxes next to the attributes as well as the check box next to your target variable (target_b). Now,
click on the Invert button at the top of the attributes section. Click the Remove button. Feel free to go to the
Classify tab and play around with some of the classification methods we will discuss in class (Decision Trees,
Naive Bayes, MultiLayerPerceptrons, LogisticRegression, K-Nearest Neighbor, etc.) or some of the unsupervised
methods like Clustering. For fun, click through the folders to explore the algorithms that are part of the Weka package.
Question 7: (NOT A QUESTION BUT REQUIRES ACTION) Open the TRAIN2new.arff file you created in a
text editor (e.g. MS Word). Cut and paste the first 20 lines of the file to your homework assignment.
@relation learn-weka.filters.unsupervised.attribute.Remove-R10-weka.filters.supervised.instance.Resample-B1.0-S1-Z10.0-weka.filters.unsupervised.attribute.NumericTransform-R4-Cjava.lang.Math-Mlog-weka.filters.unsupervised.attribute.Discretize-B10-M-1.0-R2-weka.filters.unsupervised.instance.Resample-S1-Z70.0-no-replacement-weka.filters.unsupervised.attribute.Remove-R9-weka.filters.supervised.instance.Resample-B1.0-S1-Z80.0
@attribute Income {0,1,2,3,4,5,6,7}
@attribute Firstdate {'\'(-inf-7719.3]\'','\'(7719.3-7928.6]\'','\'(7928.6-8137.9]\'','\'(8137.9-8347.2]\'','\'(8347.2-8556.5]\'','\'(8556.5-8765.8]\'','\'(8765.8-8975.1]\'','\'(8975.1-9184.4]\'','\'(9184.4-9393.7]\'','\'(9393.7-inf)\''}
@attribute Lastdate numeric
@attribute pgift numeric
@attribute rfa_2f {1,2,3,4}
@attribute rfa_2a {A,B,C,D,E,F,G}
@attribute pepstrfl {X,0}
@attribute target_b {1,0}
@data
4,'\'(9393.7-inf)\'',9603,-2.054125,3,F,0,1
4,'\'(9393.7-inf)\'',9602,-2.484911,1,F,0,1
0,'\'(9393.7-inf)\'',9602,-2.639051,1,F,0,1
1,'\'(9393.7-inf)\'',9511,-2.351376,1,F,0,0
4,'\'(9184.4-9393.7]\'',9601,-2.079442,2,G,0,0
4,'\'(9393.7-inf)\'',9602,-1.871803,2,F,0,1
3,'\'(9393.7-inf)\'',9510,-1.568618,4,E,X,0
5,'\'(9393.7-inf)\'',9512,-1.504078,4,G,0,0
6,'\'(8556.5-8765.8]\'',9602,-0.889262,3,D,X,1
PART IV: DATA CLEANING
If you haven't already noticed, Weka uses .arff data files. If you open the TRAIN2.arff data file in a text editor,
you will see that it has the following header:
@relation learn-weka.filters.unsupervised.attribute.Remove-R10-weka.filters.supervised.instance.Resample-B1.0-S1-Z10.0
@attribute Income {0,1,2,3,4,5,6,7}
@attribute Firstdate numeric
@attribute Lastdate numeric
@attribute pgift numeric
@attribute rfa_2f {1,2,3,4}
@attribute rfa_2a {A,B,C,D,E,F,G}
@attribute pepstrfl {X,0}
@attribute target_b {1,0}
@attribute target_d numeric

@data
In .arff files, the first line must start with @relation, followed by a line for each attribute. Each attribute line
begins with @attribute followed by the name of the attribute and then either the word numeric if numeric or a set
of attribute values separated by commas enclosed in curly brackets if nominal. After the attributes are declared, a
line with @data follows indicating the end of the header.
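
The header layout just described can be read with a few lines of code. The following is a minimal sketch, not a full ARFF parser: it assumes attribute names without spaces or quotes and handles only numeric and nominal declarations. The `parse_arff_header` helper and the short demo header are hypothetical.

```python
def parse_arff_header(text):
    """Return (relation name, [(attribute name, type or list of values)])."""
    relation, attributes = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            if spec.startswith("{") and spec.endswith("}"):
                # Nominal attribute: comma-separated values in curly brackets.
                spec = [v.strip() for v in spec[1:-1].split(",")]
            attributes.append((name, spec))
        elif keyword == "@data":
            break  # end of the header
    return relation, attributes


header = """@relation demo
@attribute Income {0,1,2,3,4,5,6,7}
@attribute pgift numeric
@attribute target_b {1,0}
@data
"""
relation, attributes = parse_arff_header(header)
print(relation)
for name, spec in attributes:
    print(name, spec)
```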
Following the header, you will see a comma delimited set of rows. You may be familiar with comma separated
files from Excel. If not, .csv is a filetype that you may use to both save and read files in Excel.
You can save comma separated files (csv) using Excel and then read them into Weka easily. The nice thing is that
Weka will automatically detect most nominal attributes and their corresponding values. Once you read a .csv file
into Weka, you can save it in .arff format and edit the heading according to your needs in a text editor.
This part addresses the top 5 things that will stump you (in Weka) when working with new, dirty data. The error
descriptions can be a bit cryptic at times (after all, the software is free). But here are some things to be aware of.
1. Records of different length
2. Missing values not set to question mark (All missing values must be denoted by a question mark as
opposed to a space). For example, a row with 5 columns and 2 missing values like 4,A,,,B must be
formatted to 4,A,?,?,B for Weka.
3. Non-alphanumeric characters must be removed
4. Non-nominal target variable. For classification, you want your target value (the attribute you are trying to
predict) to be of type Nominal. If Weka detects your target attribute to be numeric, you can discretize the
attribute into two bins. However, you can also make sure that the values are detected as non-numeric from
your .csv file by giving the values text names. For example, you can call the positive examples (pos) and
negative examples (neg) instead of assigning them values of ones and zeros respectively.
5. Incompatible training and test sets. You make transformations to attributes in your training set and forget to
make the same transformations to the attributes in your test set (we won't deal with this problem just yet).
Open TRAIN2.csv in Weka
Weka will complain. Open TRAIN2.csv in Excel and inspect the data for errors:
1. Make sure all records have the same length
2. Make sure there are no blank cells (you may need to find blanks and replace them with ?)
3. Make sure there are no bothersome characters (,*,@, etc.). ALSO, for future reference, note numeric
values with commas cause major trouble!
4. In the last column, target_b, replace all ones with pos and all zeros with neg.
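
If you prefer to script these checks rather than do them by hand in Excel, a minimal sketch follows. It covers steps 1, 2 and 4 only (record length, blank cells, relabeling the target) and uses a tiny hypothetical CSV rather than TRAIN2.csv. The `clean_rows` helper is our own illustration.

```python
import csv
import io


def clean_rows(rows, target_index=-1):
    """Check record lengths, fill blanks with '?', relabel a 1/0 target."""
    header, *records = rows
    cleaned = [header]
    for number, record in enumerate(records, start=2):
        if len(record) != len(header):
            raise ValueError(
                f"line {number}: expected {len(header)} fields, got {len(record)}"
            )
        # Weka expects '?' for missing values, not an empty cell.
        record = [field.strip() or "?" for field in record]
        target = record[target_index]
        record[target_index] = {"1": "pos", "0": "neg"}.get(target, target)
        cleaned.append(record)
    return cleaned


# Hypothetical three-column file with one blank cell.
raw = "Income,rfa_2a,target_b\n1,D,1\n0,,0\n"
cleaned = clean_rows(list(csv.reader(io.StringIO(raw))))
for record in cleaned:
    print(",".join(record))
```

Raising an error on a record of the wrong length, rather than silently padding it, makes it easier to find and inspect the offending line.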
Question 8: (NOT A QUESTION BUT REQUIRES ACTION) Once the data are clean, open the file in Weka
and save the file as a TRAIN2new2.arff file. Open the file in a text editor and cut and paste the heading plus
the first 4 lines of data to your homework assignment.
@relation TRAIN2
@attribute Income numeric
@attribute Firstdate numeric
@attribute Lastdate numeric
@attribute pgift numeric
@attribute rfa_2f numeric
@attribute rfa_2a {D,F,G,E}
@attribute pepstrl {X,0.0}
@attribute target_b numeric
@data
1,8609,9601,0.463415,?,D,X,1
1,?,9512,0.27027,?,?,X,1
0,9301,9504,0.0625,1,F,0.0,1
0,?,9507,?,1,F,X,0

These exercises were meant to get you familiar with Weka (not to cause you data cleaning pain). Feel free to play
around with additional filters and feature selection methods. Next week we will actually start building some
models!