
Statistics Overview

11.8.17
Course evaluation

Component PESU PESIT


Test 1 15 20
Test 2 15 20
Assignments & quizzes 10 10
Attendance 0 0
Project 10 10
Final exam 50 40
Total 100 100

The question papers for PESIT and PESU students may even be different.
Different Sections will have different question papers.
Questions
1. What are the two important properties of a
statistical distribution that are common across
all distributions?
2. The procedure to compute the mean and
standard deviation is different for different
distributions [T/F]
3. What is the relationship between standard
deviation and variance?
Questions
4. In the case of a uniform distribution over (0, 1), the
probability density function = 1 throughout
the range. When you add up (integrate) the
probabilities, what happens?
5. How do you measure inter-quartile range
value?
6. How to find outliers in a distribution?
Questions
7. What statistical approach is generally used to do the
grading of students for a subject?
8. Can we use the concept of outliers to decide who should
get an S grade and who should get an F grade? Try it out
on an example distribution of marks.
9. What is central limit theorem? Why is it important? What is
its significance?
10. What is the difference between exponential function and
exponential distribution?
11. What is coefficient of variation?
12. What is coefficient of determination?
13. What is a Variate when we talk about distributions?
Measure of variability
Range = difference between the maximum and the minimum
Simplest measure of variability
Sensitive to changes at the extremes

Interquartile range (IQR) = difference between the third quartile and the first quartile
Range of the middle 50% of the data
Overcomes the sensitivity to extremes
Example: 53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53  ->  IQR = 68 - 53 = 15

Variance = average of the squared differences between the data values and the mean

Coefficient of variation = (stddev / mean) * 100; it indicates how large the standard deviation is in relation to the mean
Today's Lecture (11/8/2016)
Practical Data Science with R
R Data types
Chapter 2 (Loading Data into R)
Chapter 3 (Exploring Data)
Chapter 4(Managing Data)
Basic statistics
Basic statistics
sampling, normalization, transformation

1. Mean (and median, mode)


2. Standard deviation (and variance)
3. Skew (a mention of kurtosis)
4. Checking for missing data
5. Concept of outliers
6. Sampling data (random sampling with/without replacement, systematic sampling, i.e., selecting
the kth element, stratified sampling - pros and cons); downsampling and upsampling (unless we
collect new data, there cannot be any 'increase' in the data; usually, sampling in statistics implies
selecting a subset from a given set of data); recap of Central limit theorem
7. Normalization - range normalization - why is it needed? how can numerical data in the range (a,
b) be mapped to a different range (c,d)? Also, zero mean and unit variance normalization and
energy normalization (square root of the sum of squares adding to unity).
8. Transformations - sqrt, 1/x, log, |x|, x^k, e^x, etc. (a sketch of the normalizations in item 7 follows below)
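As a quick illustration of item 7 (not from the original slides), here is one way range normalization, zero-mean/unit-variance normalization, and energy normalization might look in R; the helper name range_normalize is made up for this sketch.

# Map values from the range (a, b) to the range (c, d); a hypothetical helper
range_normalize <- function(x, a = min(x), b = max(x), c = 0, d = 1) {
  c + (x - a) * (d - c) / (b - a)
}
x <- c(2, 5, 9, 14)
range_normalize(x, c = 0, d = 1)   # rescaled to [0, 1]
scale(x)                           # zero mean and unit variance
x / sqrt(sum(x^2))                 # energy normalization: squared values sum to 1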
R Basics
R is object based.
Types of objects: scalars, vectors, matrices and arrays.
Assignment of objects.
Building a data frame.
R is like the game of chess: easy to learn, hard to
master.
R as a Calculator
> 1550+2000
[1] 3550

or various calculations in the same row

> 2+3; 5*9; 6-6


[1] 5
[1] 45
[1] 0
Operation Symbols

Symbol   Meaning
+        Addition
-        Subtraction
*        Multiplication
/        Division
%%       Modulo (gives the remainder of a division)
^        Exponentiation
Numbers in R: NAN and NA
NaN (not a number)
NA (missing value)
Basic handling of missing values:
> x
[1] 1 2 3 4 5 6 7 8 NA
> mean(x)
[1] NA
> mean(x,na.rm=TRUE)
[1] 4.5
Objects in R
Objects in R obtain values by assignment.
This is achieved with the "gets" arrow, <-, rather than
the equals sign, =.
Objects can be of different kinds.
Built in Functions
R has many built-in functions that compute
different statistical procedures.
Functions in R are followed by ( ).
Inside the parentheses we write the object
(vector, matrix, array, data frame) to which we
want to apply the function.
Vectors
Vectors are variables with one or more values
of the same type.
A variable with a single value is known as a
scalar. In R a scalar is a vector of length 1.
There are at least three ways to create vectors
in R: (a) sequence, (b) concatenation function,
and (c) scan function.
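For instance, the three ways just mentioned look like this (scan() is shown commented out because it reads interactively from the console):

v1 <- 1:5                  # (a) sequence operator
v2 <- seq(0, 1, by = 0.25) # (a) seq() for general sequences
v3 <- c(53, 55, 70, 58)    # (b) concatenation function c()
# v4 <- scan()             # (c) scan() reads values typed at the console
length(v3)                 # 4; a single value such as 7 is a vector of length 1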
Arrays
Arrays are numeric objects with dimension
attributes.
The difference between a matrix and an array
is that arrays have more than two dimensions.
Matrices
A matrix is a two dimensional array.
The command colnames() sets or retrieves the column names of a matrix.
Factors
Factors store category values. For example, blood group can
be A, B, AB, or O.
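A small sketch of how the blood-group example can be stored as a factor:

blood <- factor(c("A", "O", "AB", "B", "O", "A"),
                levels = c("A", "B", "AB", "O"))
levels(blood)   # the category values: "A" "B" "AB" "O"
table(blood)    # counts per category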
Data Frame
Data Frame
Specifically for datasets
Rows = observations (persons), instances
Columns = variables (age, name, ...)
Contain elements of different types
Elements in same column: same type
Data Frame
name age child
Anne 28 FALSE
Pete 30 TRUE
Frank 21 TRUE
Julia 39 FALSE
Cath 35 TRUE
> name <- c("Anne", "Pete", "Frank", "Julia", "Cath")
> age <- c(28, 30, 21, 39, 35)
> child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)
> df <- data.frame(name, age, child)
data.frame()
> df
name age child
1 Anne 28 FALSE
2 Pete 30 TRUE
3 Frank 21 TRUE
4 Julia 39 FALSE
5 Cath 35 TRUE
column names match variable names
Plots
str(countries)
'data.frame': 194 obs. of 5 variables:
$ name : chr "Afghanistan" "Albania" "Algeria" ...
$ continent : Factor w/ 6 levels "Africa","Asia", ...
$ area : int 648 29 2388 0 0 1247 0 0 2777 2777
...
$ population: int 16 3 20 0 0 7 0 0 28 28 ...
$ religion : Factor w/ 6 levels
"Buddhist","Catholic" ...
Plots
Categorical: religion
Numerical: population or area
2x numerical: look at both population and area
2x categorical: look at religion and continent at the same time
String Characters
In R, string variables are defined by double
quotation marks.
> letters[1:3]
[1] "a" "b" "c"
Subscripts and Indices
Select only one or some of the elements in a
vector, a matrix or an array.
We can do this by using subscripts in square
brackets [ ].
In matrices or dataframes the first subscript
refers to the row and the second to the
column.
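A few illustrative selections on made-up objects:

x <- c(10, 20, 30, 40, 50)
x[2]          # second element: 20
x[c(1, 3)]    # first and third elements
x[-1]         # everything except the first element
m <- matrix(1:6, nrow = 2)
m[1, 3]       # row 1, column 3
m[, 2]        # all rows of column 2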
Dataframe
Researchers work mostly with dataframes .
With the previous knowledge you can build data frames in R.
You can also import data frames into R.
Summaries
Statistics -> Summaries
Mean & Standard Deviation
Mean: everybody knows it.

Standard deviation (σ): the standard deviation of the list x1, x2, x3, ..., xn is given by the formula
σ = sqrt(((x1 - avg(x))^2 + (x2 - avg(x))^2 + ... + (xn - avg(x))^2) / n).
It is a measure of dispersion and the square root of the variance.

The formula above is used when all of the values in the population are known.
If the values x1, ..., xn are a random sample chosen from the population, then
the sample standard deviation is calculated with the same formula,
except that (n - 1) is used as the denominator.
Median & Mode
Median (arithmetic, statistics): the middle number in a
given sequence of numbers, taken as the
average of the two middle numbers when the
sequence has an even number of numbers. For example, 4 is
the median of 1, 3, 4, 8, 9.
Mode: the value of the variate at which a
relative or absolute maximum occurs in the
frequency distribution of the variate.
Why divide by (n - 1) instead of by n when we are calculating
the sample standard deviation?

The POPULATION VARIANCE σ² is a PARAMETER of the population.
The SAMPLE VARIANCE s² is a STATISTIC of the sample.
We use the sample statistic to estimate the population parameter:
the sample variance s² is an estimate of the population variance σ².
Suppose we have a population with N individuals or items.
Suppose that we want to take samples of size n individuals or items
from that population
IF we could list all possible samples of n items that could be
selected from the population of N items, then we could find the
sample variance for each possible sample.
We would want the following to be true: the average of the sample
variances over all possible samples should equal the population variance.
It seems like a logical property and a reasonable thing to expect.
An estimator with this property is called unbiased.
When we divide by (n - 1) when calculating the sample variance,
it turns out that the average of the sample variances over all possible
samples does equal the population variance.
So the sample variance is what we call an unbiased estimate of the
population variance.
If instead we were to divide by n (rather than n - 1) when
calculating the sample variance, then the average over all possible
samples would NOT equal the population variance: dividing by n does
not give an unbiased estimate of the population variance.
Dividing by n - 1 satisfies this property of being unbiased, but
dividing by n does not. Therefore we prefer to divide by n - 1 when
calculating the sample variance. A small simulation follows.
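The simulation below (an illustration, not part of the original notes) draws many small samples from a population with known variance 4 and compares the two divisors:

set.seed(1)
n <- 5
samples <- replicate(20000, rnorm(n, mean = 0, sd = 2))         # population variance is 4
var_n1 <- apply(samples, 2, var)                                # divides by n - 1
var_n  <- apply(samples, 2, function(x) mean((x - mean(x))^2))  # divides by n
mean(var_n1)  # close to 4 (unbiased)
mean(var_n)   # close to 4 * (n - 1) / n = 3.2 (biased low)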
Skewness is a measure of symmetry, or more precisely,
the lack of symmetry. A distribution, or data set, is
symmetric if it looks the same to the left and right of
the center point.
Kurtosis is a measure of whether the data are heavy-
tailed or light-tailed relative to a normal distribution.
That is, data sets with high kurtosis tend to have heavy
tails, or outliers. Data sets with low kurtosis tend to
have light tails, or lack of outliers. A uniform
distribution would be the extreme case.
The histogram is an effective graphical technique for
showing both the skewness and kurtosis of a data set.
Standard Normal Distribution
A normal distribution where mean is zero and
stddev is 1.
Quartiles
Quartiles divide the data, ordered from low to high, into four ranges.
For example, salaries can be distributed into 4 ranges: the lowest is Q1 and
the highest is Q4.
skewness
For univariate data Y1, Y2, ..., YN, the formula for skewness is:
g1 = [ sum(i=1 to N) (Yi - Ybar)^3 ] / (N * s^3)
where Ybar is the mean, s is the standard deviation,
and N is the number of data points. Note that in
computing the skewness, s is computed with N in
the denominator rather than N - 1. The above formula
for skewness is referred to as the Fisher-Pearson
coefficient of skewness. There is an adjusted skewness
G1 = g1 * sqrt(N * (N - 1)) / (N - 2).
The adjustment factor approaches 1 for large N.
The skewness for a normal distribution is zero, and any
symmetric data should have a skewness near zero. Negative
values for the skewness indicate data that are skewed left
and positive values for the skewness indicate data that are
skewed right. By skewed left, we mean that the left tail is
long relative to the right tail. Similarly, skewed right means
that the right tail is long relative to the left tail. If the data
are multi-modal, then this may affect the sign of the
skewness.
Some measurements have a lower bound and are skewed
right. For example, in reliability studies, failure times
cannot be negative.
It should be noted that there are alternative definitions of
skewness in the literature.
Basic statistics and
data preprocessing

Population
Mean, median, mode
Standard deviation, variance
Skew and kurtosis
Missing data
Outliers
Sampling data and the CLT
Normalization
Population
A collection of people, items or events about
which you want to make an inference
Expectation: informally, a belief that something is going to happen;
statistically, the weighted average of all the possible values a
random variable x can take:
E(x) = sum over all x of x * P(x) for a discrete variable
(the integral of x * f(x) dx for a continuous one)
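For instance, the expected value of a fair six-sided die is this weighted average:

x <- 1:6           # possible values
p <- rep(1/6, 6)   # their probabilities
sum(x * p)         # E(x) = 3.5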
Moments in statistics
The nth moment of a real-valued continuous function f(x) of
a real variable x about a value c is the integral of (x - c)^n * f(x) dx.

The 1st moment about 0 of a pdf is the mean.
The 2nd moment about the mean of a pdf is the variance.
The 3rd moment about the mean of a pdf gives the skewness (after standardizing).
The 4th moment about the mean of a pdf gives the kurtosis, or flatness (after standardizing).
Kurtosis
For univariate data Y1, Y2, ..., YN, the formula for (excess) kurtosis is:
kurtosis = [ sum(i=1 to N) (Yi - Ybar)^4 ] / (N * s^4) - 3
where Ybar is the mean, s is the standard deviation,
and N is the number of data points. Note that in
computing the kurtosis, the standard deviation is
computed using N in the denominator rather than N - 1.
The -3 term is used so that the standard normal
distribution has a kurtosis of zero. With this definition,
positive kurtosis indicates a "heavy-tailed" distribution and
negative kurtosis indicates a "light-tailed" distribution.
17/8/2016 Class
Standard Normal Distribution
Normal distribution with mean 0 and stddev 1.
On the Y-axis you can have the probability density
function or the frequency function.
The Y-axis scale then runs from 0 to 1.
Z-score (standard score)
z-score = (x - μ) / σ
This is useful for normalizing any distribution:
at x = μ the z-score is 0, and at x = μ + σ
the z-score is 1. (Example below.)
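A quick sketch of z-score normalization, using the marks from the interquartile-range example as illustrative data:

marks <- c(53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53)
z <- (marks - mean(marks)) / sd(marks)   # same idea as scale(marks)
round(z, 2)
mean(z); sd(z)                           # approximately 0 and exactly 1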
Sample statistics
Statistics computed on the sample!
A statistic computed on the sample is called a point
estimator of the corresponding population parameter
Mean
Simple average
Median
Odd number of values, even number of values
Whenever data has extreme values, median is the
preferred measure of the central value
Mode
Can there be more than one mode?
Modes
There can be more than one place where
maxima for frequency exist, such distributions
are multi-modal.
Percentile
i = (p/100) * n
Arrange the n data points in ascending order; the pth percentile is the value in the ith position:
If i is not an integer, round up.
If i is an integer, the percentile is the average of the ith and (i+1)th entries.

Quartiles are specific percentiles:
Q1, first quartile = 25th percentile
Q2, second quartile = 50th percentile (the median)
Q3, third quartile = 75th percentile
(See the quantile() example below.)
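The same quartiles can be read off with quantile(); note that R's default interpolation rule differs slightly from the round-up convention above, so the numbers need not match exactly:

x <- c(53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53)
quantile(x, probs = c(0.25, 0.50, 0.75))   # Q1, median, Q3
IQR(x)                                      # Q3 - Q1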
Measure of variability
Range = difference between the maximum and the minimum
Simplest measure of variability
Sensitive to changes at the extremes

Interquartile range (IQR) = difference between the third quartile and the first quartile
Range of the middle 50% of the data
Overcomes the sensitivity to extremes
Example: 53, 55, 70, 58, 64, 57, 53, 69, 57, 68, 53  ->  IQR = 68 - 53 = 15

Variance = average of the squared differences between the data values and the mean

Coefficient of variation = (stddev / mean) * 100; it indicates how large the standard deviation is in relation to the mean
Is this histogram unimodal or bimodal?
Different bin sizes (i.e., intervals)?
Univariate Gaussian
Importance of standard deviation

https://en.wikipedia.org/wiki/Gaussian_function
Bivariate Gaussian

https://en.wikipedia.org/wiki
/Gaussian_function
Sections of a Bivariate Gaussian

http://www.byclb.com/TR/Tutorials/neural_networks/ch4_1.htm
Covariance matrix for the bivariate Gaussian
Correlation coefficient:
ρ = cov(x, y) / (σx * σy) = σxy / (σx * σy)
Chalkboard: three possibilities (zero, +ve, -ve)


A bimodal function
mixture of 2 Gaussians

Mixtools
package in R

http://exploringdatablog.blogspot.in/2011/08/fitting-mixture-distributions-with-r.html
Otsu method

http://groveslab.cchem.berkeley.edu/programming/20140310_imagej2/
Separating mixtures or
decision boundaries

https://www.cs.ubc.ca/~murphyk/Teaching/CS340-Fall07/gaussClassif.pdf
Boundaries in higher dimensions

http://www.cs.toronto.edu/~urtasun/courses/CSC411/09_naive_bayes.pdf
Bivariate Gaussians not equiprobable
Indistinguishable mixture

http://cacm.acm.org/magazines/2012/2/145412-disentangling-gaussians/abstract
Tests for bimodality
Necessary condition:
Kurtosis - (skewness)^2 >= 1 (Pearson's criterion)
Equality holds in the extreme case of two Diracs

Is the distribution anything other than unimodal?
Dip test:
p < 0.05: significant multimodality
0.05 < p < 0.10: multimodality with marginal significance
Mixtools, flexmix, mcclust, mcdist
Density estimation
If the density function (family) is known:
Maximum likelihood estimation (MLE) - an example follows below
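A minimal MLE sketch (not from the slides): fit a normal density to simulated data by numerically maximizing the log-likelihood. The parameter names and starting values are arbitrary choices.

set.seed(42)
x <- rnorm(500, mean = 10, sd = 3)
negloglik <- function(theta) {
  mu <- theta[1]
  sigma <- exp(theta[2])   # optimize log(sigma) so that sigma stays positive
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}
fit <- optim(c(0, 0), negloglik)
c(mu = fit$par[1], sigma = exp(fit$par[2]))   # close to (10, 3)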
Notion of correlation

http://www.jerrydallal.com/lhsp/corr.htm
Sampling a population
It is not always convenient or possible to
examine every member of an entire population:
Bruises on every apple picked in an orchard
vs
Bruises on a subset of apples picked in an orchard
Sampling = selecting a subset of data from a
population about which we can make an
inference
Sampling
Systematic sampling: pick every kth sample
Simple random sampling without replacement
Simple random sampling with replacement
Cluster sampling
Stratified sampling
Stratified sampling example: software engineers divided into strata by age;
the samples are more likely to be representative
Sampling in DSP (reconstruction of a signal)
Downsampling vs upsampling
(Sketches of these sampling schemes follow below.)
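Minimal sketches of some of these schemes on a toy population of 100 units (the stratum labels are invented for the illustration):

set.seed(7)
pop <- 1:100
srswor <- sample(pop, 10)                               # simple random sample without replacement
srswr  <- sample(pop, 10, replace = TRUE)               # simple random sample with replacement
systematic <- pop[seq(sample(1:10, 1), 100, by = 10)]   # every 10th element from a random start
# stratified: the same number of units from each (hypothetical) age stratum
df <- data.frame(id = 1:100, stratum = rep(c("20s", "30s", "40s", "50s"), each = 25))
strat <- do.call(rbind, lapply(split(df, df$stratum),
                               function(s) s[sample(nrow(s), 5), ]))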
Sampling
Sampling is the process of selecting a subset of a population to
represent the whole, during analysis and modeling. In the current
era of big datasets, some people argue that computational power
and modern algorithms let us analyze the entire large dataset
without the need to sample.
We can certainly analyze larger datasets than we could before, but
sampling is a necessary task for other reasons. When you're in the
middle of developing or refining a modeling procedure, it's easier to
test and debug the code on small subsamples before training the
model on the entire dataset. Visualization can be easier with a
subsample of the data; ggplot runs faster on smaller datasets, and
too much data will often obscure the patterns in a graph. And often
it's not feasible to use your entire customer base to train a model.
Sampling
It's important that the dataset that you do use is an accurate representation
of your population as a whole. For example, your customers might come
from all over the United States. When you collect your custdata dataset, it
might be tempting to use all the customers from one state, say
Connecticut, to train the model. But if you plan to use the model to make
predictions about customers all over the country, it's a good idea to pick
customers randomly from all the states, because what predicts health
insurance coverage for Texas customers might be different from what
predicts health insurance coverage in Connecticut. This might not always
be possible (perhaps only your Connecticut and Massachusetts branches
currently collect the customer health insurance information), but the
shortcomings of using a nonrepresentative dataset should be kept in mind.
More on Sampling
Sampling can be used as a data reduction technique
because it allows a large data set to be represented by a
much smaller random data sample (or subset). Suppose
that a large data set, D, contains N tuples. Let's look at the
most common ways that we could sample D for data
reduction:
Simple random sample without replacement (SRSWOR) of size
s: This is created by drawing s of the N tuples from D (s < N),
where the probability of drawing any tuple in D is 1/N, that is, all
tuples are equally likely to be sampled.
Simple random sample with replacement (SRSWR) of size s: This
is similar to SRSWOR, except that each time a tuple is drawn
from D, it is recorded and then replaced. That is, after a tuple is
drawn, it is placed back in D so that it may be drawn again.
Test and Training Samples
When you're building a model to make
predictions, like our model to predict the
probability of health insurance coverage, you
need data to build the model. You also need
data to test whether the model makes correct
predictions on new data. The first set is called
the training set, and the second set is called
the test (or hold-out) set.
Sampling
Cluster sample: If the tuples in D are grouped into M
mutually disjoint clusters, then an SRS (simple
random sample) of s clusters can be obtained, where s
< M. For example, tuples in a database are usually
retrieved a page at a time, so that each page can be
considered a cluster. A reduced data representation
can be obtained by applying, say, SRSWOR to the
pages, resulting in a cluster sample of the tuples. Other
clustering criteria conveying rich semantics can also be
explored. For example, in a spatial database, we may
choose to define clusters geographically based on how
closely different areas are located.
Sampling
Stratified Sampling
Stratified sample: If D is divided into mutually
disjoint parts called strata, a stratified sample of D
is generated by obtaining an SRS at each stratum.
This helps ensure a representative sample,
especially when the data are skewed. For
example, a stratified sample may be obtained
from customer data, where a stratum is created
for each customer age group. In this way, the age
group having the smallest number of customers
will be sure to be represented.
Training and Test Sets
The training set is the data that you feed to the
model-building algorithm (regression,
decision trees, and so on) so that the
algorithm can set the correct parameters to
best predict the outcome variable. The test set
is the data that you feed into the resulting
model, to verify that the model's predictions
are accurate.
Systematic sampling
Systematic sampling is a statistical method
involving the selection of elements from an
ordered sampling frame. The most common
form of systematic sampling is an equal-
probability method. In this approach,
progression through the list is treated
circularly, with a return to the top once the
end of the list is passed
Systematic Sampling
Example: We have a population of 5 units (A to E). We
want to give unit A a 20% probability of selection, unit
B a 40% probability, and so on up to unit E (100%).
Assuming we maintain alphabetical order, we allocate
each unit to the following interval:
A: 0 to 0.2 B: 0.2 to 0.6 (= 0.2 + 0.4) C: 0.6 to 1.2 (= 0.6
+ 0.6) D: 1.2 to 2.0 (= 1.2 + 0.8) E: 2.0 to 3.0 (= 2.0 +
1.0) If our random start was 0.156, we would first
select the unit whose interval contains this number
(i.e. A). Next, we would select the interval containing
1.156 (element C), then 2.156 (element E). If instead
our random start was 0.350, we would select from
points 0.350 (B), 1.350 (D), and 2.350 (E).
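The same selection can be reproduced in a few lines of R; this sketch simply mirrors the worked example above:

size   <- c(A = 0.2, B = 0.4, C = 0.6, D = 0.8, E = 1.0)   # desired selection probabilities
upper  <- cumsum(size)                                     # interval upper bounds: 0.2, 0.6, 1.2, 2.0, 3.0
start  <- 0.156                                            # random start in [0, 1)
points <- start + 0:2                                      # 0.156, 1.156, 2.156
names(upper)[findInterval(points, c(0, upper), rightmost.closed = TRUE)]   # "A" "C" "E"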
Sampling
An advantage of sampling for data reduction is
that the cost of obtaining a sample is
proportional to the size of the sample, s, as
opposed to N, the data set size. Hence, sampling
complexity is potentially sublinear to the size of
the data. Other data reduction techniques can
require at least one complete pass through D. For
a fixed sample size, sampling complexity
increases only linearly as the number of data
dimensions, n, increases, whereas techniques
using histograms, for example, increase
exponentially in n.
Sampling
When applied to data reduction, sampling is most
commonly used to estimate the answer to an
aggregate query. It is possible (using the central
limit theorem) to determine a sufficient sample
size for estimating a given function within a
specified degree of error. This sample size, s, may
be extremely small in comparison to N. Sampling
is a natural choice for the progressive refinement
of a reduced data set. Such a set can be further
refined by simply increasing the sample size
Central Limit Theorem
In probability theory, the central limit theorem (CLT) states that, given certain
conditions, the arithmetic mean of a sufficiently large number of iterates
of independent random variables, each with a well-defined (finite) expected
value and finite variance, will be approximately normally distributed, regardless of
the underlying distribution.[1][2] To illustrate what this means, suppose that
a sample is obtained containing a large number of observations, each observation
being randomly generated in a way that does not depend on the values of the
other observations, and that the arithmetic average of the observed values is
computed. If this procedure is performed many times, the central limit theorem
says that the computed values of the average will be distributed according to
the normal distribution (commonly known as a "bell curve"). A simple example of
this is that if one flips a coin many times, the probability of getting a given number
of heads should follow a normal curve, with mean equal to half the total number
of flips.
The central limit theorem has a number of variants. In its common form, the
random variables must be identically distributed. In variants, convergence of the
mean to the normal distribution also occurs for non-identical distributions or for
non-independent observations, given that they comply with certain conditions.
IID variables
A lot of machine learning algorithms assume the data are IID:
independent and identically distributed.
Otherwise, if you learn from one sample, it
may have no bearing on the rest of the population.
Contingency Table & Correlation
A table showing the distribution of one variable in rows
and another in columns, used to study the correlation
between the two variables.

Gender     Right handed   Left handed   Total
Male            43              9          52
Female          44              4          48
Total           87             13         100

The term "contingency table" was first used by Karl Pearson in "On the Theory of
Contingency and Its Relation to Association and Normal Correlation" in 1904.
Little bit of R
Loading data into R
It will accept a variety of file formats such as CSV.
Tabular forms are referred to as structured values.
Car data viewed as a table
Reading UCI data
uciCar <- read.table(
  'http://www.win-vector.com/dfiles/car.data.csv',
  sep=',',
  header=T
)
Examining the Car data
class() - tells you what type of R object you
have. In our case, class(uciCar) tells us the
object uciCar is of class data.frame.
help() - gives you the documentation for a
class. In particular, try help(class(uciCar)) or
help("data.frame").
summary() - gives you a summary of almost
any R object. summary(uciCar) shows us a lot
about the distribution of the UCI car data.
Summary gives out distribution of
different variables in the data-set
summary(uciCar)
buying maint doors
high :432 high :432 2 :432
low :432 low :432 3 :432
med :432 med :432 4 :432
vhigh:432 vhigh:432 5more:432
persons lug_boot safety
2 :576 big :576 high:576
4 :576 med :576 low :576
more:576 small:576 med :576
rating
acc : 384
good : 69
unacc:1210
vgood: 65
Other data formats
XLS/XLSX - http://cran.r-project.org/doc/manuals/R-data.html#Reading-Excel-spreadsheets
JSON - http://cran.r-project.org/web/packages/rjson/index.html
XML - http://cran.r-project.org/web/packages/XML/index.html
MongoDB - http://cran.r-project.org/web/packages/rmongodb/index.html
SQL - http://cran.r-project.org/web/packages/DBI/index.html
Transforming data in R: first load the data

d <- read.table(
  paste('http://archive.ics.uci.edu/ml/',
        'machine-learning-databases/statlog/german/german.data', sep=''),
  stringsAsFactors=F, header=F)
print(d[1:3,])
Setting Column Names
colnames(d) <- c('Status.of.existing.checking.account',
'Duration.in.month', 'Credit.history', 'Purpose',
'Credit.amount', 'Savings account/bonds',
'Present.employment.since',
'Installment.rate.in.percentage.of.disposable.income',
'Personal.status.and.sex', 'Other.debtors/guarantors',
'Present.residence.since', 'Property', 'Age.in.years',
'Other.installment.plans', 'Housing',
'Number.of.existing.credits.at.this.bank', 'Job',
'Number.of.people.being.liable.to.provide.maintenance.for',
'Telephone', 'foreign.worker', 'Good.Loan')
d$Good.Loan <- as.factor(ifelse(d$Good.Loan==1,'GoodLoan','BadLoan'))
print(d[1:3,])
Mapping Loan Codes
mapping <- list(
'A40'='car (new)',
'A41'='car (used)',
'A42'='furniture/equipment',
'A43'='radio/television',
'A44'='domestic appliances',

)
Transforming data in R
for(i in 1:(dim(d))[2]) {
  if(class(d[,i])=='character') {
    d[,i] <- as.factor(as.character(mapping[d[,i]]))
  }
}
Data
You can also work with relational databases and load data into R.
options( java.parameters = "-Xmx2g" )
library(RJDBC)
drv <- JDBC("org.h2.Driver",
"h2-1.3.170.jar",
identifier.quote="'")
options<-";LOG=0;CACHE_SIZE=65536;LOCK_MODE=0;UNDO_LOG=0"
conn <- dbConnect(drv,paste("jdbc:h2:H2DB",options,sep=''),"u","u")
dhus <- dbGetQuery(conn,"SELECT * FROM hus WHERE ORIGRANDGROUP<=1")
dpus <- dbGetQuery(conn,"SELECT pus.* FROM pus WHERE pus.SERIALNO IN \
(SELECT DISTINCT hus.SERIALNO FROM hus \
WHERE hus.ORIGRANDGROUP<=1)")
dbDisconnect(conn)
save(dhus,dpus,file='phsample.RData')
Key take aways
Data frames are your friend.
Use read.table() to load small, structured
datasets into R.
You can use a package like RJDBC to load
data into R from relational databases, and to
transform or aggregate the data before
loading using SQL.
Always document data provenance
Exploring Data
Suppose your goal is to build a model to predict which
of your customers don't have health insurance;
perhaps you want to market inexpensive health
insurance packages to them. You've collected a dataset
of customers whose health insurance status you know.
You've also identified some customer properties that
you believe help predict the probability of insurance
coverage: age, employment status, income,
information about residence and vehicles, and so on.
You've put all your data into a single data frame called
custdata that you've input into R. Now you're
ready to start building the model to identify the
customers you're interested in.
Missing Values
Typical problems revealed by data summaries:
missing values, invalid values and outliers,
and data ranges that are too wide or too
narrow.
Missing Values
A few missing values may not really be a
problem, but if a particular data field is largely
unpopulated, it shouldn't be used as an input
without some repair.
Rows could be dropped
Unpopulated fields could be dropped.
Missing Values
The variable is.employed is missing for about a third of
the data. Why?
Is employment status unknown? Did the company start
collecting employment data only recently? Does NA
mean "not in the active workforce" (for example,
students or stay-at-home parents)?
is.employed
Mode :logical
FALSE:73
TRUE :599
NA's :328
Missing Values
The variables housing.type, recent.move, and
num.vehicles are only missing a few values.
It's probably safe to just drop the rows that
are missing values, especially if the missing
values are all in the same 56 rows.
Homeowner free and clear :157
Homeowner with mortgage/loan:412
Occupied with no rent : 11
Rented :364
NA's : 56
Both variables have 56 rows with
missing data
recent.move num.vehicles
Mode :logical Min. :0.000
FALSE:820 1st Qu.:1.000
TRUE :124 Median :2.000
NA's :56 Mean :1.916
3rd Qu.:2.000
Max. :6.000
NA's :56
Invalid values and Outliers
Even when a column or variable isn't missing any
values, you still want to check that
the values that you do have make sense. Do you
have any invalid values or outliers?
Examples of invalid values include negative values
in what should be a non-negative
numeric data field (like age or income), or text
where you expect numbers. Outliers
are data points that fall well out of the range of
where you expect the data to be.
Negative Values
Negative values for income could indicate
bad data. They might also have a special
meaning, like amount of debt.
Either way, you should check how prevalent
the issue is, and decide what to do: Do you
drop the data with negative income? Do
you convert negative values to zero?
Outliers
Customers of age zero, or customers of an age
greater than about 110 are outliers. They fall
out of the range of expected customer values.
Outliers could be data input errors. They
could be special sentinel values: zero might
mean "age unknown" or "refuse to state".
And some of your customers might be
especially long-lived.
Often, invalid values are simply bad data input.
Negative numbers in a field like age, however,
could be a sentinel value to designate
"unknown". Outliers might also be data errors
or sentinel values. Or they might be valid but
unusual data points; people do
occasionally live past 100.
As with missing values, you must decide the
most appropriate action: drop the data field,
drop the data points where this field is bad, or
convert the bad data to a useful value. Even if
you feel certain outliers are valid data, you
might still want to omit them from model
construction (and also collar the allowed
prediction range), since the usual achievable
goal of modelling is to predict the typical case
correctly.
Data Range
Income ranges from zero to over half a million dollars; a
very wide range.
summary(custdata$income)
Min. 1st Qu. Median Mean 3rd Qu.
-8700 14600 35000 53500 67000
Max.
615000
You also want to pay attention to how much the values in
the data vary. If you believe that age or income helps to
predict the probability of health insurance coverage, you
should make sure there is enough variation in the age and
income of your customers for you to see the relationships.
Is the data range wide? Is it narrow?
Data Range
Data can be too narrow, too. Suppose all your
customers are between the ages of
50 and 55. It's a good bet that age range
wouldn't be a very good predictor of the
probability of health insurance coverage for that
population, since it doesn't vary
much at all.
Data Range
How narrow is too narrow a data range?
Of course, the term "narrow" is relative. If we were predicting
the ability to read for children between the ages of 5 and
10, then age probably is a useful variable as-is.
For data including adult ages, you may want to transform or
bin ages in some way, as you don't expect a significant
change in reading ability between ages 40 and 50. You
should rely on information about the problem domain to
judge if the data range is narrow, but a rough rule of thumb
is the ratio of the standard deviation to the mean. If that
ratio is very small, then the data isn't varying much.
Unit of measurement
Checking units can prevent inaccurate results:
hourly wage, annual wage, or monthly wage.
Key Takeaways
Take the time to examine your data before diving into
the modelling.
The summary command helps you spot issues with
data range, units, data type, and missing or invalid
values.
Visualization additionally gives you a sense of data
distribution and relationships among variables.
Visualization is an iterative process and helps answer
questions about the data. Time spent here is time not
wasted during the modeling process.
Managing Data
Cleaning Data
Treating Missing Values
To Drop or not to Drop
Remember that we have a dataset of 1,000 customers; 56 missing values
represents about 6% of the data. That's not trivial, but it's not huge,
either. The fact that three variables are all missing exactly 56 values
suggests that it's the same 56 customers in each case. That's easy enough
to check.
The repair rule is: if the is.employed value is missing, assign the value
"missing"; otherwise, if is.employed == TRUE, assign the value "employed",
or else the value "not employed". A sketch follows.
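A minimal sketch of the ifelse() call those steps describe, assuming the data frame and column names used elsewhere in these notes (custdata and is.employed):

custdata$is.employed.fix <- ifelse(is.na(custdata$is.employed),
                                   "missing",                     # NA becomes its own category
                                   ifelse(custdata$is.employed == TRUE,
                                          "employed",
                                          "not employed"))
summary(as.factor(custdata$is.employed.fix))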
When values are missing randomly
> meanIncome <- mean(custdata$Income, na.rm=T)
> Income.fix <- ifelse(is.na(custdata$Income),
meanIncome,
custdata$Income)
> summary(Income.fix)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 35000 66200 66200 66200 615000
Don't forget the argument "na.rm=T"! Otherwise, the
mean() function will include the NAs by default, and
meanIncome will be NA.
Assuming that the customers with missing income are distributed
the same way as the others, this estimate will be correct on
average, and you'll be about as likely to have overestimated
customer income as underestimated it. It's also an easy fix to
implement.
This estimate can be improved when you remember that income is
related to other variables in your data; for instance, you know
from your data exploration in the previous chapter that there's a
relationship between age and income. There might be a
relationship between state of residence or marital status and
income, as well. If you have this information, you can use it.
Note that the method of imputing a missing value of an input
variable can be used even for categorical data.
Missing values
It's important to remember that replacing missing values by
the mean, as well as many more sophisticated methods for
imputing missing values, assumes that the customers with
missing income are in some sense random (the "faulty
sensor" situation).
It's possible that the customers with missing income data
are systematically different from the others. For instance, it
could be the case that the customers with missing income
information truly have no income, because they're not in
the active workforce.
If this is so, then filling in their income information by
using one of the preceding methods is the wrong thing to
do. In this situation, there are two transformations you can
try.
Transformations
Method       Math operation   Good for:                          Bad for:
Log          ln(x), log10(x)  Right-skewed data; log10(x) is     Zero values,
                              especially good at handling        negative values
                              higher-order powers of 10
                              (e.g. 1000, 100000)
Square root  sqrt(x)          Right-skewed data                  Negative values
Square       x^2              Left-skewed data                   Negative values
Cube root    x^(1/3)          Right-skewed data,                 Not as effective at
                              negative values                    normalizing as the log transform
Reciprocal   1/x              Making small values bigger         Zero values,
                              and big values smaller             negative values
Transformations
In this situation, there are two transformations you can try.
WHEN VALUES ARE MISSING SYSTEMATICALLY
One thing you can do is to convert the numeric
data into categorical data, and then use the
methods that we discussed previously. In this
case, you would divide the income into some
income categories of interest, such as "below
$10,000" or "from $100,000 to $250,000", using
the cut() function, and then treat the NAs as we
did when working with missing categorical values.
What if the NAs are simply replaced with zeros?
What about actual zeros?
You need a way to differentiate the two cases:
an extra categorical variable (see the sketch below).
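A sketch of that approach: bin income with cut() and keep the NAs as their own category. The bin boundaries and the "no income" label are illustrative choices, and custdata$Income is assumed as above.

breaks <- c(0, 10000, 50000, 100000, 250000, 1000000)          # illustrative bin boundaries
Income.groups <- cut(custdata$Income, breaks, include.lowest = TRUE)
Income.groups <- as.character(Income.groups)
Income.groups <- ifelse(is.na(Income.groups), "no income", Income.groups)   # extra category for NAs
summary(as.factor(Income.groups))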
Normalization
custdata$income.norm <- with(custdata,
income/Median.Income)
summary(custdata$income.norm)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.1791 0.2729 0.6992 1.0820 1.3120 11.6600
summary(medianincome)
State Median.Income
: 1 Min. :37427
Alabama : 1 1st Qu.:47483
Alaska : 1 Median :52274
Arizona : 1 Mean :52655
Arkansas : 1 3rd Qu.:57195
California: 1 Max. :68187
(Other) :46
custdata <- merge(custdata, medianincome,
by.x="state.of.res", by.y="State")
Merge median income information into the
custdata data frame by matching the column
custdata$state.of.res to the column
medianincome$State
summary(custdata[,c("state.of.res", "income",
"Median.Income")])
state.of.res income Median.Income
California :100 Min. : -8700 Min. :37427
New York : 71 1st Qu.: 14600 1st Qu.:44819
Pennsylvania: 70 Median : 35000 Median :50977
Texas : 56 Mean : 53505 Mean :51161
Michigan : 52 3rd Qu.: 67000 3rd Qu.:55559
Ohio : 51 Max. :615000 Max. :68187
(Other) :600
The need for data transformation can also
depend on which modeling method you plan to
use. For linear and logistic regression, for
example, you ideally want to make sure that the
relationship between input variables and output
variable is approximately linear, and that the
output variable has constant variance (the variance
of the output variable is independent of the input
variables). You may need to transform some of
your input variables to better meet these
assumptions.
Other transformations
Converting continuous to discrete
Converting age into ranges
Normalization and rescaling
Normalization is useful when absolute quantities are
less meaningful than relative ones.
We've already seen an example of normalizing
income relative to another meaningful quantity
(median income). In that case, the meaningful
quantity was external (it came from the analyst's
domain knowledge); but it can also be internal
(derived from the data itself).
Centering on the mean age
> summary(custdata$age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 38.0 50.0 51.7 64.0 146.7
> meanage <- mean(custdata$age)
> custdata$age.normalized <-
custdata$age/meanage
> summary(custdata$age.normalized)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.7350 0.9671 1.0000 1.2380 2.8370
A value for age.normalized that is much less
than 1 signifies an unusually young customer;
much greater than 1 signifies an unusually old
customer. But what constitutes
"much less" or "much greater" than 1? That
depends on how wide an age spread your
customers tend to have.
Standard Deviation
The common interpretation of standard deviation as a unit
of distance implicitly assumes that the data is distributed
normally. For a normal distribution, roughly two-thirds of
the data (about 68%) is within plus/minus one standard
deviation from the mean. About 95% of the data is within
plus/minus two standard deviations from the mean. In
figure 4.3, a 35-year-old is (just barely) within one standard
deviation from the mean in population1, but more than
two standard deviations from the mean in population2.
You can still use this transformation if the data isn't
normally distributed, but the standard deviation is most
meaningful as a unit of distance if the data is unimodal and
roughly symmetric around the mean.
LOG Transformations
LOG TRANSFORMATIONS FOR SKEWED AND
WIDE DISTRIBUTIONS
Monetary amounts (incomes, customer value,
account, or purchase sizes) are some of the most
commonly encountered sources of skewed
distributions in data science applications. The log
of such data is often close to normally distributed, which leads
to the idea that taking the log of the data can
restore symmetry to it.
A density plot of income that is skewed (with a long tail of
large values) can be made more symmetric by plotting density
versus log income (here log to the base 10).
Of course, taking the logarithm only works if the data is
non-negative. There are
other transforms, such as arcsinh, that you can use to
decrease the data range if you have zero or negative values. We
don't always use arcsinh, because we don't find the values
of the transformed data to be meaningful. In applications
where the skewed data is
monetary (like account balances or customer value), we
instead use what we call a signed logarithm. A signed
logarithm takes the logarithm of the absolute value of the
variable and multiplies it by the appropriate sign. Values
strictly between -1 and 1 are mapped to zero.
Signed log lets you visualize non-positive data.
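A sketch of such a signed base-10 logarithm, matching the description above:

signedlog10 <- function(x) {
  ifelse(abs(x) <= 1, 0, sign(x) * log10(abs(x)))   # values strictly between -1 and 1 map to zero
}
signedlog10(c(-1000, -5, 0.3, 0, 250))   # -3.00 -0.70 0.00 0.00 2.40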
Sample Group Column
A convenient way to manage random sampling is to add a
sample group column to the data frame. The sample group
column contains a number generated uniformly from zero
to one, using the runif function. You can draw a random
sample of arbitrary size from the data frame by using the
appropriate threshold on the sample group column.
For example, once you've labeled all the rows of your data
frame with your sample group column (let's call it gp), then
the set of all rows such that gp < 0.4 will be about four-
tenths, or 40%, of the data. The set of all rows where gp is
between 0.55 and 0.70 is about 15% of the data (0.7 - 0.55
= 0.15). So you can repeatably generate a random sample
of the data of any size by using gp (sketched below).
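A sketch of the sample-group approach, again assuming the custdata data frame; the 10% threshold is just an example:

set.seed(25643)                                   # fix the seed so the split is reproducible
custdata$gp <- runif(dim(custdata)[1])            # one uniform draw per row
testSet  <- subset(custdata, custdata$gp <= 0.1)  # about 10% of the rows
trainSet <- subset(custdata, custdata$gp > 0.1)   # the remaining ~90%
dim(testSet)[1]; dim(trainSet)[1]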
R also has a function called sample that draws
a random sample (a uniform random
sample, by default) from a data frame. Why
not just use sample to draw training and
test sets? You could, but using a sample group
column guarantees that you'll draw the
same sample group every time.
This reproducible sampling is convenient when
you're debugging code. In many cases, code will
crash because of a corner case that you forgot
to guard against. This corner case might show up
in your random sample. If you're using a different
random input sample every time you run the
code, you won't know if you will tickle the bug
again. This makes it hard to track down and fix
errors.
You also need repeatable samples for regression
testing.
Record Grouping
One caveat is that the preceding trick works if every object
of interest (every customer, in this case) corresponds to a
unique row. But what if you're interested less in which
customers don't have health insurance, and more in
which households have uninsured members? If you're
modeling a question at the household level rather than the
customer level, then every member of a household should
be in the same group (test or training). In other words,
the random sampling also has to be at the household level.
Dataset as per households
Here gp is the group number
Data Provenance
You'll also want to add a column (or columns) to
record data provenance: when your dataset was
collected, perhaps what version of your data
cleaning procedure was used on the data before
modeling, and so on. This is akin to version
control for data. It's handy information to have,
to make sure that you're comparing apples to
apples when you're in the process of improving
your model, or comparing different models or
different versions of a model.
Key Take Aways
What you do with missing values depends on how many
there are, and whether they're missing randomly or
systematically.
When in doubt, assume that missing values are missing
systematically.
Appropriate data transformations can make the data
easier to understand and easier to model.
Normalization and rescaling are important when relative
changes are more important than absolute ones.
Data provenance records help reduce errors as you
iterate over data collection, data treatment, and modeling
Class of 19/8/2016
Correlation
Correlation Coefficient
Cor(A,B) = [ sum(i=1 to n) (ai - mean(A)) * (bi - mean(B)) ] / (n * stddev(A) * stddev(B))
n = number of tuples
ai, bi are the respective values of A and B in the ith tuple
A and B are the two (numeric) attributes
Equivalently, Cor(A,B) = [ (sum(i=1 to n) ai*bi) - n*mean(A)*mean(B) ] / (n * stddev(A) * stddev(B))
If Cor(A,B) > 0 they are positively correlated; if Cor(A,B) < 0 they are
negatively correlated.
If Cor(A,B) = 0, they are uncorrelated (zero correlation does not by itself guarantee independence).
The higher the absolute value, the stronger the correlation.
With high correlation we can consider one of them redundant.

Correlation does not imply causality.


Covariance
E(A) = mean of A and E(B) = mean of B
The covariance between A and B is defined as
Cov(A,B) = E[(A - mean(A)) * (B - mean(B))]
         = [ sum(i=1 to n) (ai - mean(A)) * (bi - mean(B)) ] / n
Cov(A,B) = E(A·B) - mean(A) * mean(B)
Cor(A,B) = Cov(A,B) / (stddev(A) * stddev(B))
What happens to Cor and Cov when A and B both follow the
standard normal distribution (i.e., are standardized)? See the check below.
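A quick numeric check of these relationships on simulated data (illustrative only):

set.seed(3)
A <- rnorm(1000)               # standard normal
B <- 0.6 * A + rnorm(1000)     # positively correlated with A
cov(A, B)
cor(A, B)                      # equals cov(A, B) / (sd(A) * sd(B))
cov(A, B) / (sd(A) * sd(B))
cov(scale(A), scale(B))        # after standardizing, the covariance equals the correlation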
Dimensionality Reduction
Nowadays, accessing data is easier and cheaper than ever before. This has
led to the proliferation of data in organizations' data warehouses and on
the Internet.
Analyzing this data is not a trivial task, as its quantity often makes analysis
difficult or impractical. For instance, the data is often more abundant than
available memory on the machines.
The available computational power is also often not enough to analyze the
data in a reasonable time frame. One solution is to have recourse to
technologies that deal with high dimensionality in data (Big Data).
These solutions typically use the memory and computing power of several
machines for analysis (computer clusters). But most organizations do not
have such an infrastructure.
Therefore, a more practical solution is to reduce the dimensionality of the
data while keeping the essential information intact.
Dimensionality Reduction
Another reason to reduce dimensionality is that, in some
cases, there are more attributes (columns) than
observations (rows). If scientists were to store the genome
of all inhabitants of Europe and the United States, the
number of cases (approximately 1 billion) would be much
less than the 3 billion base pairs in the human DNA.
Most analyses do not work well or at all when observations
are fewer than attributes. Confronted with this problem,
data analysts might select groups of attributes that go
together (for instance, height and weight) according to
their domain knowledge, and reduce the dimensionality of
the dataset.
Excessive sparseness when there are too many dimensions.
Sampling vs Dimensionality Reduction
Sampling is essentially selecting a subset of rows.
Dimensionality reduction has to do with columns,
or with coming up with new dimensions.
Attributes (columns) and dimensions are not one and the same.
We need fewer dimensions.
Dimensionality reduction
With a fixed number of training samples, predictive
power reduces with increase in dimensions
Curse of dimensionality - overfitting vs generalization

http://blog.fliptop.com/blog/2015/03/02/bias-variance-and-overfitting-machine-learning-
overview/
Curse of dimensionality
When dimensionality increases, data becomes
increasingly sparse in the space that it
occupies
Definitions of density and distance
between points, which is critical for clustering
and outlier detection, become less meaningful
Some approaches
Too many missing values in one or more attributes: drop those columns.
An attribute with very low variance: it is not really separating the data points in any way, so drop the attribute.
Highly correlated attributes: perhaps one gives the same information as the other, so drop one of the attributes.
Some more approaches
Step-wise discriminant analysis, or feature-wise selection:
choose an attribute, add another, and see if there is a
significant improvement.
Step-wise dropping of features:
drop features one at a time and see if any combination
gives results that are as good or better.
A linear combination of features that gives a good
approximation to the data without as many attributes?
Dimensionality Reduction
Purpose:
Avoid curse of dimensionality
Reduce amount of time and memory required by data
mining algorithms
Allow data to be more easily visualized
May help to eliminate irrelevant features or reduce
noise
Techniques
Principal Component Analysis
Singular Value Decomposition
Others: supervised and non-linear techniques
Principal Component Analysis
The data is often structured across several relatively independent dimensions,
with each dimension measured using several attributes. This is where Principal
Component Analysis (PCA) is an essential tool, as it permits each observation to
receive a score on each dimension (determined by PCA itself) while allowing us
to discard the attributes from which the dimensions are computed.

What is meant here is that, for each of the obtained dimensions, values (scores)
are produced that combine several attributes. These can be used for further
analyses.

For example, participants' self-reports on items such as "lively", "excited",
"enthusiastic" (and many more) can be combined in a single dimension we call
positive arousal.

In this sense, PCA performs both dimensionality reduction (it discards the
attributes) and feature extraction (it computes the dimensions).
PCA
Another use of PCA is to check that the underlying
structure of the data corresponds to a theoretical
model. For instance, in a questionnaire, a group of
questions might assess construct A, another group
construct B, and so on.
PCA will find two different factors, if indeed there is
more similarity in the answers of participants within
each group of questions compared to the overall
questionnaire. Researchers in fields such as psychology
use PCA mainly to test their theoretical model in this
fashion
PCA
Principal Component Analysis aims at finding
the dimensions (principal components) that
explain most of the variance in a dataset.
Once these components are found, a principal
component score is computed for each row
and each principal component.
These scores can be understood as
summaries (combinations) of the attributes
that compose the data frame.
PCA
PCA produces the principal components by
computing the eigenvalues of the covariance
matrix of a dataset. There is one eigenvalue
for each row in the covariance matrix.
The computation of eigenvectors is also
required to compute the principal component
scores.
PCA
The eigenvalues and eigenvectors are
computed using the following equation,
where A is the covariance matrix of interest, I
is the identity matrix, k is a positive integer, λ
is the eigenvalue and v is the eigenvector:
(A - λI)^k v = 0
PCA
What is important to understand for current purposes
is that the principal components are sorted by
descending order as a function of their eigenvalues
(each row is a principal component): the higher the
eigenvalues, the higher the variance explained.
The more the variance is explained, the more useful
the principal component is in summarizing the data.
The part of variance explained by a principal
component is computed as a function of the
eigenvalues by dividing the eigenvalue of the principal
component of interest by the sum of the eigenvalues.
The equation is as follows, where
partVar_i is the part of variance explained by the ith component
and eigen_i is its eigenvalue:
partVar_i = eigen_i / (eigen_1 + eigen_2 + ... + eigen_m)
Command
myPCA = function (df) {
  eig = eigen(cov(df))
  means = unlist(lapply(df, mean))
  scores = scale(df, center = means) %*% eig$vectors
  list(values = eig$values, vectors = eig$vectors, scores = scores)
}
In this function, we first compute in line 2 the eigenvalues
and eigenvectors using function eigen(). We then create on
line 3 a vector called means containing the means
for each of the attributes of the original dataset. In line 4,
we then use this vector to center the values around the
means and multiply (matrix multiplication) the resulting
data frame by the eigenvectors data frame to produce the
principal component scores.
In line 5, we create an object (then returned by the
function) of class list containing the eigenvalues, the
eigenvectors, and the principal component scores.
Try it with iris dataset
my_pca = myPCA(iris[1:4])
my_pca
Iris (plants) dataset created by
R. A. Fisher
Number of Instances: 150 (50 in each of three classes)
Number of Attributes: 4 numeric, predictive attributes
and the class.
Attribute Information:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
-- Iris Setosa -- Iris Versicolour -- Iris Virginica
Iris dataset
Missing Attribute Values: None Summary
Statistics:
Min Max Mean SD Class Correlation
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565(high!)
Class Distribution: 33.3% for each of 3 classes
PCA computation using mypca
function for Iris data-set
PCA of Iris dataset
The purpose of principal component analysis
is to find the best low-dimensional
representation of the variation in multi-variate
data set. To carry out a PCA on a multi-variate
data set, first standardize the variables under
study using scale() function. This is necessary
if input variables have very different variances.
pc1 = prcomp(iris[,1:4], scale=T)
Here scale option scales all variables such that
they have zero mean and variance 1
Rotation
Standard deviations: 1.7083611 0.9560494 0.3830886 0.1439265

              PC1     PC2     PC3     PC4
Sepal.Length  0.521  -0.377   0.719   0.261
Sepal.Width  -0.269  -0.923  -0.244  -0.123
Petal.Length  0.580  -0.024  -0.142  -0.801
Petal.Width   0.564  -0.066  -0.063   0.523
Explaining Rotations
              PC1     PC2     PC3     PC4
Sepal.Length  0.521  -0.377   0.719   0.261
Sepal.Width  -0.269  -0.923  -0.244  -0.123
Petal.Length  0.580  -0.024  -0.142  -0.801
Petal.Width   0.564  -0.066  -0.063   0.523

PC1 = 0.521*z1 - 0.269*z2 + 0.580*z3 + 0.564*z4
PC2 = -0.377*z1 - 0.923*z2 - 0.024*z3 - 0.066*z4
And so on.
Here z1, z2, z3 and z4 are the standardized versions of the attributes, i.e. with mean = 0 and standard
deviation = 1.
The squares of the loadings in each column sum to 1 (sum the squares vertically).
Importance of components

                        PC1     PC2     PC3      PC4
Standard deviation      1.7084  0.9560  0.38309  0.14393
Proportion of variance  0.7296  0.2285  0.0369   0.00518
Cumulative proportion   0.7296  0.9581  0.99482  1.0000

This gives us the standard deviation of, and the proportion of variance
explained by, each component. THIS SHOWS THAT THE FIRST TWO
(or at most the first three) COMPONENTS SHOULD BE RETAINED.
Now let's compare the eigenvalues in our
function and the corresponding values when
using princomp(). We first run princomp() on the
data and assign the result to the pca object. The
sdev component of the princomp objects is the
square root of the eigenvalues. In order to obtain
a comparable metric, we apply this
transformation to the eigenvalues in the my_pca
object for comparison using the sqrt() function:
pca = princomp(iris[1:4], scores = T)
cbind(unlist(lapply(my_pca[1], sqrt)), pca$sdev)
PCA
Importance of components
                        Comp.1     Comp.2      Comp.3      Comp.4
Standard deviation      2.0494032  0.49097143  0.27872586  0.153870700
Proportion of variance  0.9246187  0.05306648  0.01710261  0.005212184
Cumulative proportion   0.9246187  0.97768521  0.99478782  1.000000000
First component explains the most
variance
Principal component analysis
Simplify data
Understand relationship between variables
Get an insight to patterns
PCA example

http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Principal component analysis
Get the data
Subtract the mean
Compute the covariance matrix
Find the eigenvalues and eigenvectors
Select the principal eigenvectors
Project the data onto the selected eigenvectors
Plot the data
Covariance, Eigen analysis

http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Choosing an appropriate axis

http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
A new representation

http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
Other preprocessing
Categorical data - alphabetical ordering, regular expressions to identify
typographical errors, reassigning IDs, etc., will not make a difference to the information

Numerical data -
quantization (binning): real to discrete, ranges
binarization or thresholding (loss of information)

Using ratios - scale invariance
