
EXPERIMENT NO. 11
OBJECTIVE: Implementation of binning method of data cleaning.

Binning, or discretization, is the process of transforming numerical variables into
categorical counterparts. An example is to bin values for Age into categories such as
20-39, 40-59, and 60-79. Numerical variables are usually discretized in modeling methods
based on frequency tables (e.g., decision trees). Moreover, binning may improve the
accuracy of predictive models by reducing noise or non-linearity. Finally, binning allows
easy identification of outliers and of invalid or missing values of numerical variables.
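
As an illustration of the Age example, the following minimal sketch maps a numeric age to one of the three categories named above. The names `EDGES`, `LABELS`, and `age_category` are hypothetical, and returning `None` for missing or out-of-range ages is an assumption made here to show how such values stand out after binning.

```python
from bisect import bisect_right

# Lower edge of each bin, followed by an exclusive upper limit (assumed layout).
EDGES = [20, 40, 60, 80]
LABELS = ["20-39", "40-59", "60-79"]

def age_category(age):
    """Return the label of the bin containing age, or None if no bin contains it."""
    if age is None or not (EDGES[0] <= age < EDGES[-1]):
        return None  # missing or out-of-range value
    # bisect_right counts how many edges are <= age; subtract 1 for the bin index.
    return LABELS[bisect_right(EDGES, age) - 1]

for age in (25, 40, 79, 130, None):
    print(age, "->", age_category(age))
```

Values that come back as `None` are candidates for the outlier, invalid, or missing cases mentioned above.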

There are two types of binning, unsupervised and supervised.

Binning methods smooth a sorted data value by consulting its “neighborhood,” that is, the
values around it. The sorted values are distributed into a number of “buckets,” or bins.

For example

Price = 4, 8, 15, 21, 21, 24, 25, 28, 34

Partition into (equal-frequency) bins:

Bin a: 4, 8, 15

Bin b: 21, 21, 24

Bin c: 25, 28, 34

In this example, the data for price are first sorted and then partitioned into equal-frequency
bins of size 3.
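
The partitioning step can be sketched in a few lines of Python. This is only one possible implementation; the function name `equal_frequency_bins` is hypothetical, and the sketch assumes the number of values is a multiple of the bin size (otherwise the last bin is simply shorter).

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values, then cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

price = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(price, 3)

for label, contents in zip("abc", bins):
    print(f"Bin {label}: {contents}")
```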

Smoothing by bin means:


Bin a: 9, 9, 9

Bin b: 22, 22, 22

Bin c: 29, 29, 29

In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
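
A minimal sketch of smoothing by bin means, assuming the bins have already been formed as above. The function name `smooth_by_bin_means` is hypothetical.

```python
def smooth_by_bin_means(bins):
    """Replace every value in a bin with the arithmetic mean of that bin."""
    smoothed = []
    for b in bins:
        bin_mean = sum(b) / len(b)
        smoothed.append([bin_mean] * len(b))
    return smoothed

bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
smoothed_means = smooth_by_bin_means(bins)

for label, contents in zip("abc", smoothed_means):
    print(f"Bin {label}: {contents}")
```

The means come out as floating-point numbers (for example 9.0 rather than 9), which is a side effect of using ordinary division here.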

Smoothing by bin boundaries:

Bin a: 4, 4, 15

Bin b: 21, 21, 24

Bin c: 25, 25, 34

In smoothing by bin boundaries, the minimum and maximum values in each bin are taken as the
bin boundaries, and each value in the bin is replaced by the closest boundary value.
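
A minimal sketch of smoothing by bin boundaries, assuming the bins have already been formed. The function name `smooth_by_bin_boundaries` is hypothetical, and sending a value that is equally far from both boundaries to the lower one is an assumption, since the text does not say how ties are broken.

```python
def smooth_by_bin_boundaries(bins):
    """Replace each value with the nearer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (assumed tie-breaking rule).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
smoothed_boundaries = smooth_by_bin_boundaries(bins)

for label, contents in zip("abc", smoothed_boundaries):
    print(f"Bin {label}: {contents}")
```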
EXPERIMENT NO. 12
OBJECTIVE: Implementation of the z-score method of data cleaning.

Z-scores are linearly transformed data values having a mean of zero and a standard
deviation of 1. Z-scores are also known as standardized scores; they are scores (or data
values) that have been given a common standard: a mean of zero and a standard deviation
of 1.

Z-Scores - Standardization

We suggested earlier on that giving scores a common standard of zero mean and unit
standard deviation facilitates their interpretation. We can do just that by

• first subtracting the mean over all scores from each individual score and
• then dividing each remainder by the standard deviation over all scores.
These two steps are the same as the following formula:
$$Z_x = \frac{X_i - \bar{X}}{S_x}$$
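
The two steps can be sketched in Python as follows. The function name `z_scores` and the sample data are hypothetical, and the sketch assumes the sample standard deviation (`statistics.stdev`); the text does not say whether the sample or the population version is intended, and `statistics.pstdev` would give the latter.

```python
from statistics import mean, stdev

def z_scores(scores):
    """Standardize scores: subtract the mean, then divide by the standard deviation."""
    centre = mean(scores)
    spread = stdev(scores)  # sample standard deviation (assumption)
    return [(x - centre) / spread for x in scores]

scores = [2, 3, 3, 4, 5, 6]  # hypothetical test scores, not taken from the text
standardized = z_scores(scores)

print([round(z, 2) for z in standardized])
print(round(mean(standardized), 6), round(stdev(standardized), 6))
```

Up to rounding error, the standardized values have a mean of zero and a standard deviation of 1, which is the common standard described above.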

Example.

A group of 100 people took some IQ test. My score was 5. So is that good or bad? At this
point, there's no way of telling because we don't know what people typically score on this
test. However, if my score of 5 corresponds to a z-score of 0.91, you'll know it was pretty
good: it's roughly a standard deviation higher than the average (which is always zero for
z-scores).
What we see here is that standardizing scores facilitates the interpretation of a single test
score. Let's see how that works.

Our 100 scores have a mean of 3.45 and a standard deviation of 1.70. By entering these
numbers into the formula, we see why a score of 5 corresponds to a z-score of 0.91:
$$Z_x = \frac{5 - 3.45}{1.70} = 0.91$$
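
The same arithmetic can be checked with a short sketch. The function name `z_score` is hypothetical; the mean and standard deviation are the values quoted in the example.

```python
def z_score(score, mean, sd):
    """Return how many standard deviations a score lies from the mean."""
    return (score - mean) / sd

z = z_score(5, 3.45, 1.70)
print(round(z, 2))  # prints 0.91
```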

In a similar vein, the screenshot below shows the z-scores for all distinct values of our first
IQ test added to the data.
