
## Learning Themes: Missing Data and Outliers - Finding and Modifying

Imagine that you are comparing the temperature highs in August for Dallas, Fort
Worth, Houston, and El Paso. You've hired a research assistant to visit weather.com
and record the values. The data is stored in a csv file and read into the data frame
below.
```{r}
competition <- read.csv(file = file.choose())
dim(competition)
names(competition)
head(competition)
str(competition)

```

### Finding Missing Values and Errors


A missing value (from the technical perspective of R) means that the row and column
exist but no value is assigned to the cell. R marks such a cell as NA, which can be
thought of as 'not available'.

To identify the existence of a missing value, we can use the is.na function and pass
as a parameter whatever set of values we wish to test.
```{r}
is.na(competition) # returns a logical matrix showing which cells are NA
sum(is.na(competition)) # returns the total number of NA values in the data
any(is.na(competition)) # a single TRUE/FALSE for the overall data frame
```
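
As a small self-contained illustration of these calls, the sketch below builds a toy data frame (the values and the name `toy` are made up for this example, not taken from the data file) and also counts the NA values per column with colSums.

```{r}
# Toy data frame (hypothetical values) with two missing cells
toy <- data.frame(
  city = c("Dallas", "Houston", "El Paso"),
  high = c(101, NA, 98),
  low  = c(78, 80, NA)
)

na_cells   <- is.na(toy)         # logical matrix with the same shape as toy
na_total   <- sum(na_cells)      # TRUE counts as 1, so this counts the NA cells
na_per_col <- colSums(na_cells)  # number of NA values in each column

na_cells
na_total
na_per_col
```

The per-column counts are often a quicker first look than the full logical matrix when the data frame has many rows.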

To check which row has the NA, we can start with a single column and use the which
function to test for NA and return the row number if one is found.
```{r}
which(is.na(competition$Ti.C)) # row numbers of the NA values in a single column

# Side note: which can work on a data frame, but it treats all the columns as a
# single internal vector and returns positions that do not map cleanly to the
# rows of an individual column.
which(is.na(competition[, 1:26]))

# which returns the row numbers; we then use this result as a row index into the
# data frame to display those rows
competition[which(is.na(competition$Ti.C)), ]

# returns the rows that are not complete cases (at least one NA)
competition[which(!complete.cases(competition)), ]
```

Moving through the data column by column to look through rows is not an efficient
approach to finding NAs. Using the which function with a data frame returns the
index as if the data frame were one long vector. Setting the arr.ind argument
(array index) to TRUE returns the row and column numbers instead.
```{r}
which(is.na(competition), arr.ind = TRUE) #shows exactly where the NA values are
```
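
To make the difference concrete, here is a minimal sketch on a made-up two-column data frame (the name `toy` and its values are assumptions for illustration only). It compares the flattened positions with the row and column positions.

```{r}
# Hypothetical data: one NA in each column
toy <- data.frame(
  high = c(101, NA, 98),
  low  = c(78, 80, NA)
)

linear_pos <- which(is.na(toy))                  # positions in the flattened matrix
cell_pos   <- which(is.na(toy), arr.ind = TRUE)  # matrix with "row" and "col" columns

linear_pos
cell_pos
```

The flattened positions count down the first column and then continue into the second, which is why they cannot be used directly as row numbers.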

The data entry clerk used 0 and the string 'none' to represent missing values, but
from an R technical point of view these values are not considered to be missing.

We will consider the 0 as an outlier for the moment and seek to identify 'none' as an
incorrect data type.
```{r}
competition$region
competition$Ti.C

# which returns the row number of the offending value; we then use this result as a
# row index to display that row
competition[which(competition$region == 'none'), ]

```
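
Once a placeholder such as 'none' has been located, one option is to turn it into a real NA so that the NA tools shown earlier can see it. The sketch below assumes a toy data frame whose region column was read in as text; the name `toy` and the values are hypothetical.

```{r}
# Hypothetical column in which "none" was typed instead of leaving the cell blank
toy <- data.frame(
  region = c("1", "2", "none", "4"),
  Ti.C   = c(0.41, 0, 0.38, 0.40),
  stringsAsFactors = FALSE
)

bad_rows <- which(toy$region == "none")  # rows holding the placeholder text
toy$region[bad_rows] <- NA               # turn the placeholder into a real NA
toy$region <- as.integer(toy$region)     # the column can now be stored as integers

bad_rows
toy
```

If the placeholder strings are known before the file is read, the na.strings argument of read.csv (for example na.strings = c("NA", "none")) can perform the same conversion at import time.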

Scanning the factor values may not be practical as each unique value will be
displayed. Still another approach we can use is to convert the factor vector to a
numeric field in which case the text will be converted to a missing value.
```{r}
# strtoi parses text as integers; not a memory efficient solution, but not a
# concern with this volume of data
competition.region.temp <- strtoi(as.character(competition$region))
competition.region.temp # the 'none' has been turned into an NA since there is no
# integer value for 'none'

competition.tic.temp <- as.numeric(competition$Ti.C)
# If Ti.C was read in as a factor, this conversion will not work: R stores factors
# as integers 'in the background' and maps these integers to the values we see.
# When we convert the factor values to numeric, R converts the underlying integers.
```
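
The factor pitfall described in the comment above is easier to see on a tiny example. This sketch uses a made-up factor `f` (not a column from the data file) and compares the direct conversion with going through as.character first.

```{r}
# Hypothetical factor that mixes numbers-as-text with the placeholder "none"
f <- factor(c("10", "25", "none", "10"))

codes  <- as.numeric(f)  # underlying level codes, not the values we see
values <- suppressWarnings(as.numeric(as.character(f)))  # the labels; "none" becomes NA

levels(f)
codes
values
```

suppressWarnings is used here only to hide the expected "NAs introduced by coercion" warning; in real work that warning is a useful signal that some text could not be converted.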

### Finding Outliers


```{r}
boxplot(competition$Ti.C) # Outlier identified
boxplot(competition$avg.maintenance.cost.monthly) # Outlier identified
boxplot(competition$Pi.Psia) # Outlier identified
boxplot(competition$Vx.g) # Outlier identified
boxplot(competition$Vy.g) # Outlier identified
boxplot(competition$Tm.C) # Outlier identified
boxplot(competition$MOR.Ohm) # Outlier identified
boxplot(competition$Lv.V) # Outlier identified

```
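
A boxplot shows outliers visually; to get the flagged values as numbers, one option is boxplot.stats, which applies the same whisker rule that boxplot draws. The vector `x` below is hypothetical data made up for this sketch.

```{r}
x <- c(12, 14, 15, 13, 16, 14, 15, 90)  # hypothetical readings; 90 looks suspicious

flagged      <- boxplot.stats(x)$out  # values beyond the whiskers (1.5 * IQR by default)
flagged_rows <- which(x %in% flagged) # positions of those values in x

flagged
flagged_rows
```

Whether a flagged value is an error or a naturally extreme reading still has to be judged from knowledge of the data.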

### Addressing Missing Values

One option to address errors is to ignore the row with the error. Some functions will
not work with missing values, but many of these functions include a parameter that,
if set, will ignore the missing values. In some cases, simply ignoring a missing value
may be acceptable.
```{r}
mean(competition$Ti.C, na.rm = TRUE) # using na.rm ignores the rows with NA values
sd(competition$Ti.C, na.rm = TRUE)
sum(competition$Ti.C, na.rm = TRUE)
```
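
The sketch below, on a made-up data frame named `toy`, contrasts the default behaviour with na.rm = TRUE, and shows na.omit as one way to drop incomplete rows entirely when that is acceptable.

```{r}
# Hypothetical data with one missing high temperature
toy <- data.frame(
  high = c(100, NA, 98),
  low  = c(78, 80, 76)
)

mean_default <- mean(toy$high)                # NA, because one value is missing
mean_no_na   <- mean(toy$high, na.rm = TRUE)  # mean of the two observed values
complete     <- na.omit(toy)                  # drops every row that contains an NA

mean_default
mean_no_na
complete
```

Note that na.omit discards the whole row, including the valid low value in it, whereas na.rm only skips the missing cell for that one calculation.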

### Addressing Errors - Using Representative Values


To create values for missing data or replace outliers we've decided are errors (as
opposed to naturally occurring extreme values), we need an approach to determine
representative values to use. One candidate approach is to simply use the average
of the values (without the outlier or missing value). If we want to be a bit fancier, we
can calculate a stochastic average.
```{r}

competition$Ti.C[is.na(competition$Ti.C)] <- mean(competition$Ti.C, na.rm = TRUE)


#replaces all NA values in Ti.C with the average of all the values in the column.
sum(is.na(competition$Ti.C))

```
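
The sketch below shows plain mean imputation on a made-up vector, followed by one possible reading of a stochastic average: the mean plus random noise drawn from the observed spread. That interpretation, the normal distribution, and the seed are assumptions made for illustration and are not specified in the text above.

```{r}
set.seed(1)  # arbitrary seed so the noisy version is repeatable
x <- c(10, 12, NA, 11, NA, 13)  # hypothetical readings

missing <- is.na(x)
mu      <- mean(x, na.rm = TRUE)  # average of the observed values
sigma   <- sd(x, na.rm = TRUE)    # spread of the observed values

x_mean <- x
x_mean[missing] <- mu  # plain mean imputation

x_noisy <- x
x_noisy[missing] <- rnorm(sum(missing), mean = mu, sd = sigma)  # assumed "stochastic" variant

x_mean
x_noisy
```

Filling every gap with the same mean shrinks the variability of the column; adding noise is one way to soften that effect, at the cost of results that depend on the random draw.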

Finally, the last column tracks the number of days that high temperatures have
been recorded. Because the data included a comma, R read the data in as a factor.
While we can tell R not to use factors using the parameter and value
stringsAsFactors = FALSE, the data will still be read in as characters. We cannot use
the as.numeric(as.character) functions as R does not know what to do with the
commas during the numeric conversion. We must use still another function to
remove the commas and then perform the conversion.
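
The comma handling described here can be sketched with gsub, which is one function that fits the description (the text does not name the function it has in mind). The vector `days_raw` is a hypothetical stand-in for the last column.

```{r}
days_raw <- c("1,204", "987", "2,310")  # hypothetical values as read from the file

# Remove the commas as literal text, then convert the result to numbers
days_clean <- as.numeric(gsub(",", "", days_raw, fixed = TRUE))

days_clean
```

gsub converts its input to character first, so the same call should also work if the column was read in as a factor.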
```{r}
summary(competition$Ti.C) # these summary functions show the quartiles and min and max
summary(competition$avg.maintenance.cost.monthly)

summary(competition$Pi.Psia)
summary(competition$Vx.g)
summary(competition$Vy.g)
summary(competition$Tm.C)
summary(competition$MOR.Ohm)
summary(competition$Lv.V)

# possible way to remove outliers (basically turn them into NA, which would then
# exclude them from the boxplot)
competition$Ti.C[competition$Ti.C > .9] <- NA
competition$avg.maintenance.cost.monthly[competition$avg.maintenance.cost.monthly > 8000] <- NA

```

Other packages of interest for tidying data: tidyr (especially useful for switching
rows to columns and vice versa), mice
