Imagine that you are comparing the temperature highs in August for Dallas, Fort
Worth, Houston, and El Paso. You've hired a research assistant to visit weather.com
and record the values. The data are stored in a CSV file and read into the data frame
below.
```{r}
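competition <- read.csv(file = file.choose()) # pick the CSV file interactively
dim(competition)   # number of rows and columns
names(competition) # column names
head(competition)  # first six rows
str(competition)   # type and sample values of each column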
```
To identify the existence of a missing value, we can use the is.na function and pass
as a parameter whatever set of values we wish to test.
```{r}
is.na(competition) # returns a logical matrix; TRUE marks each empty cell
sum(is.na(competition)) # returns the total number of NA values in the data
any(is.na(competition)) # a single TRUE/FALSE for the whole data frame
```
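As a minimal sketch of what these calls return, the chunk below uses a small made-up data frame. The name `toy` and its columns are hypothetical and stand in for the real file; `colSums` is added here as one convenient way to count NAs per column.

```{r}
# Hypothetical toy data, standing in for the real file
toy <- data.frame(
  city = c("Dallas", "Houston", NA, "El Paso"),
  high = c(101, NA, 98, 104)
)

na.mask <- is.na(toy)   # logical matrix with the same shape as toy
na.mask
sum(na.mask)            # number of NA cells in the whole data frame
any(na.mask)            # TRUE if at least one cell is NA
colSums(na.mask)        # NA count per column
```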
To check which row holds the NA, we can start with a subset of columns and use the
which function to test for NA and return the position of each one found.
```{r}
which(is.na(competition[, 1:26]))
# Side note: which() works on a data frame, but it treats all the columns as a
# single internal vector and returns a position in that vector, which does not
# map cleanly to a row of an individual column.
```
Moving through the data column by column to look for rows with NAs is not an
efficient approach. Applied to a data frame, the which function returns each index
as if the data frame were one long vector. Setting the arr.ind (array index)
argument to TRUE returns the row and column numbers instead.
```{r}
which(is.na(competition), arr.ind = TRUE) #shows exactly where the NA values are
```
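The difference between the two forms of which is easier to see on a small made-up data frame. In this sketch `toy` and `na.pos` are hypothetical names; the last line is one possible way to map the column numbers back to column names.

```{r}
# Hypothetical toy data with NAs in known positions
toy <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, 5, 6)
)

which(is.na(toy))                  # linear (column-major) indices: 2 and 4
na.pos <- which(is.na(toy), arr.ind = TRUE)
na.pos                             # one row per NA, with columns "row" and "col"
names(toy)[na.pos[, "col"]]        # names of the columns that hold each NA
```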
The data entry clerk used 0 and the string 'none' to represent missing
values, but from an R technical point of view these values are not considered to be
missing.
We will consider the 0 as an outlier for the moment and seek to identify 'none' as an
incorrect data type.
```{r}
competition$region
competition$Ti.C
```
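If a placeholder such as 'none' is known in advance, one option is to declare it as missing when the file is read, using the `na.strings` argument of `read.csv`. The sketch below runs on made-up CSV text (`raw.text` and `toy` are hypothetical), and it only counts the zeros rather than recoding them, since whether 0 is truly missing is an assumption that needs checking against the real data.

```{r}
# Hypothetical CSV contents in which 'none' marks a missing value
raw.text <- "region,temp
1,0.4
none,0.5
3,0"

# Declare the placeholder as missing while reading
toy <- read.csv(text = raw.text, na.strings = c("NA", "none"))
str(toy)                            # region is read as integer, with an NA in row 2
which(is.na(toy), arr.ind = TRUE)   # the former 'none' is now a real NA

# Count zeros so they can be reviewed as possible outliers
sum(toy$temp == 0, na.rm = TRUE)
```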
Scanning the factor values may not be practical as each unique value will be
displayed. Still another approach we can use is to convert the factor vector to a
numeric field in which case the text will be converted to a missing value.
```{r}
# Not a memory-efficient solution, but not a concern with this volume of data
competition.region.temp <- strtoi(as.character(competition$region))
# 'none' has been turned into an NA since it has no integer value
competition.region.temp
```
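`strtoi` only parses integers, so for a factor that may hold decimals a common alternative is `as.numeric(as.character(x))`. A minimal sketch on a made-up factor follows (`region.f` and `region.num` are hypothetical names). It also shows why the conversion has to go through character first.

```{r}
# Hypothetical factor with one non-numeric entry
region.f <- factor(c("10", "20", "none", "30"))

as.integer(region.f)   # pitfall: returns the internal level codes, not the values

# Convert via character; 'none' becomes NA (R also emits a coercion warning)
region.num <- suppressWarnings(as.numeric(as.character(region.f)))
region.num
which(is.na(region.num))   # position of the entry that failed to convert
```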
One option to address errors is to ignore the row with the error. Some functions will
not work with missing values, but many of them include a parameter that, when set,
tells the function to skip the missing values. In some cases, simply ignoring a
missing value may be acceptable.
```{r}
mean(competition$Ti.C, na.rm = TRUE) # na.rm = TRUE ignores the rows with NA values
sd(competition$Ti.C, na.rm = TRUE)
sum(competition$Ti.C, na.rm = TRUE)
```
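To see the effect of `na.rm`, and one way to drop incomplete rows altogether, here is a short sketch on made-up data. The data frame `toy` is hypothetical, and `complete.cases` is shown as an alternative to passing `na.rm` to each function.

```{r}
# Hypothetical toy data
toy <- data.frame(
  temp = c(0.2, NA, 0.4, 0.6),
  cost = c(100, 200, NA, 400)
)

mean(toy$temp)                 # NA, because one value is missing
mean(toy$temp, na.rm = TRUE)   # mean of the three observed values

# Drop every row that contains at least one NA
toy.complete <- toy[complete.cases(toy), ]
nrow(toy.complete)             # only rows 1 and 4 remain
```

Dropping rows discards the observed values in those rows as well, so `na.rm` is usually the lighter-touch choice when only one column is being summarized.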
Finally, the last column tracks the number of days that high temperatures have
been recorded. Because the values include commas, R read the column in as a factor.
We can tell R not to create factors by passing stringsAsFactors = FALSE, but the
column will then be read in as character strings. Calling as.numeric(as.character(x))
does not help either, because R cannot parse the commas and returns NA for those
values. We must first remove the commas with another function, such as gsub, and
then perform the conversion.
```{r}
# These summary calls show the quartiles, mean, min, and max of each column
summary(competition$Ti.C)
summary(competition$avg.maintenance.cost.monthly)
summary(competition$Pi.Psia)
summary(competition$Vx.g)
summary(competition$Vy.g)
summary(competition$Tm.C)
summary(competition$MOR.Ohm)
summary(competition$Lv.V)
# Possible way to remove outliers: turn them into NA, which then excludes them
# from the boxplot
competition$Ti.C[competition$Ti.C > .9] <- NA
competition$avg.maintenance.cost.monthly[
  competition$avg.maintenance.cost.monthly > 8000] <- NA
```
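The comma removal described before this chunk can be sketched with `gsub`. The values below are made up (`days.chr` and `days.num` are hypothetical names); they only imitate how a comma-formatted count might look after being read in as character data.

```{r}
# Hypothetical values as they might appear after read.csv(..., stringsAsFactors = FALSE)
days.chr <- c("1,204", "987", "12,450")

suppressWarnings(as.numeric(days.chr))   # entries containing commas become NA

# Strip the commas first, then convert
days.num <- as.numeric(gsub(",", "", days.chr, fixed = TRUE))
days.num
```

If the column was read in as a factor, wrapping it in `as.character` before the `gsub` call should give the same result.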
Other packages of interest for tidying data: tidyr (especially useful for switching
rows to columns and vice versa) and mice (multiple imputation of missing values).