
CS 910 Exercise Sheet 2: Trying out tools

Question 1

## Loading the .csv file into R as data frame Auto
Auto <- read.csv("C:\\Users\\Akhilesh Pandey\\Desktop\\Automobiles.csv", header = F, sep = ",")

## Taking the first letter of the make column and storing the counts as a data frame
count <- as.data.frame(table(substr(Auto$V3, start = 1, stop = 1)))
## Taking the count of rows with first letter M or m

subset(count, Var1 == "m" | Var1 == "M")

##   Var1 Freq
## 8    m   39

The number of cars whose make starts with M is 39.
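The same count can be cross-checked without building the full frequency table. The following is a minimal sketch on a small made-up vector; the `makes` values are hypothetical and are not taken from Automobiles.csv. It assumes the make column is a character or factor vector.

```r
# Hypothetical sample of car makes; stands in for Auto$V3
makes <- c("mazda", "Mercedes-benz", "audi", "mitsubishi", "bmw", "mercury")

# Case-insensitive test on the first letter
starts_with_m <- toupper(substr(as.character(makes), 1, 1)) == "M"

# Number of makes starting with M or m
m_count <- sum(starts_with_m)
m_count
```

Summing a logical vector counts its TRUE entries, so no intermediate table is needed.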

Question 2 (a)

## Storing the combinations of the required columns in a separate data frame count
count <- as.data.frame(table(Auto$V4, Auto$V5, Auto$V6, Auto$V7, Auto$V8, Auto$V9))
## Removing the combinations with zero occurrences
count <- subset(count, as.numeric(count$Freq) > 0)
## Taking the count of rows
nrow(count)

## [1] 36

The total number of unique combinations, including those with one or more missing values in one of the columns, is 36.

(b)

## Storing the combinations of the required columns in a separate data frame count
count <- as.data.frame(table(Auto$V4, Auto$V5, Auto$V6, Auto$V7, Auto$V8, Auto$V9))
## Removing the combinations with zero occurrences
count <- subset(count, as.numeric(count$Freq) > 0)
## Saving the list of columns and removing the rows with ? in any field
collist <- c("Var1", "Var2", "Var3", "Var4", "Var5", "Var6")
sel <- apply(count[, collist], 1, function(row) !"?" %in% row)
count <- count[sel, ]
## Taking the count of rows
nrow(count)

## [1] 34

The total number of unique combinations with no missing values in any of the columns is 34.
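Because `table()` enumerates every possible cross-combination before the zero-frequency rows are filtered out, the same two counts can also be obtained from the distinct rows directly. This is only a sketch on a hypothetical toy data frame (`cars` and its column names are made up for illustration); it assumes, as in the code above, that missing values are encoded as "?".

```r
# Hypothetical toy data; "?" marks a missing value
cars <- data.frame(
  fuel  = c("gas", "gas", "diesel", "gas", "gas"),
  doors = c("two", "two", "four", "?", "four"),
  body  = c("sedan", "sedan", "sedan", "sedan", "wagon"),
  stringsAsFactors = FALSE
)

# Distinct combinations that actually occur
combos <- unique(cars)
n_all <- nrow(combos)

# Keep only combinations without "?" in any column
has_missing <- apply(combos == "?", 1, any)
n_complete <- sum(!has_missing)

c(all = n_all, complete = n_complete)
```

With `unique()` there are no zero-frequency combinations to remove, which matters when the columns have many levels.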


Question 3

## Selecting cars with four doors
q3.Auto <- subset(Auto, Auto$V6 == "four")
## Converting the price column to numeric
q3.Auto$V26 <- as.numeric(as.character(q3.Auto$V26))
## Removing the NA values and displaying the median
median(q3.Auto$V26, na.rm = TRUE)

## [1] 11245

The median price of four-door cars is 11245.

## Removing the NA values and displaying the mean
mean(q3.Auto$V26, na.rm = TRUE)

## [1] 13565.67

The mean price of four-door cars is 13565.67.
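The `as.numeric(as.character(...))` step is needed because the "?" placeholders stop the price column from being read as numbers. One possible alternative, sketched here on a hypothetical inline CSV (the rows below are invented, not taken from Automobiles.csv), is to declare the placeholder at load time with the `na.strings` argument of `read.csv`.

```r
# Hypothetical inline CSV with three columns: make, doors, price
csv_text <- "audi,four,13950
bmw,two,16430
mazda,four,?
volvo,four,12940
saab,four,11850"

# na.strings = "?" turns the placeholder into NA, so the price column is numeric
toy <- read.csv(text = csv_text, header = FALSE, na.strings = "?")

# Median and mean price of four-door cars, ignoring missing prices
four_door <- subset(toy, V2 == "four")
med_price <- median(four_door$V3, na.rm = TRUE)
mean_price <- mean(four_door$V3, na.rm = TRUE)
c(median = med_price, mean = mean_price)
```

Handling the placeholder once at load time avoids repeating the conversion for every numeric column.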

Question 4

## Loading the .csv file into R as data frame Abal
Abal <- read.csv("C:\\Users\\Akhilesh Pandey\\Desktop\\Abalone.csv", header = T, sep = ",")

## Plotting the graph of the Height and Length columns
plot(Abal$Height, Abal$Length, main = "Scatterplot showing Height and Length of Abalone",
     xlab = "Height", ylab = "Length", pch = 1, ylim = c(0, 1.2))
abline(lm(Abal$Length ~ Abal$Height), col = "red")
lines(lowess(Abal$Height, Abal$Length), col = "blue")


[Figure: Scatterplot showing Height and Length of Abalone. x-axis: Height (0.0 to 1.0); y-axis: Length (0.0 to 1.2).]

## Equation of the regression line fitted to the scatterplot
equation <- lm(formula = Abal$Length ~ Abal$Height)
equation
##
## Call:
## lm(formula = Abal$Length ~ Abal$Height)
##
## Coefficients:
## (Intercept)  Abal$Height
##      0.1925       2.3761

Outliers are values in a dataset that differ markedly from the bulk of the data and therefore stand out. They can arise for many reasons, e.g. data being entered incorrectly or missing values. In our plot, the points (0, 0.43) and (0, 0.315) are outliers because the Height has been entered as 0 for these observations. The points (0.515, 0.705) and (1.13, 0.455) are also outliers, as they lie very far from the regression line.
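The two criteria used above (an implausible zero height, and a large distance from the regression line) can also be applied programmatically rather than by eye. The sketch below uses hypothetical measurements, not the Abalone data, and the cut-off of 2 on the standardized residuals is an assumed rule of thumb rather than something prescribed by the exercise.

```r
# Hypothetical measurements (not the Abalone data)
height <- c(0, (1:10) / 20)
len <- 0.2 + 2 * height
len[1] <- 0.43          # zero height with a plausible length: likely an entry error
len[6] <- len[6] + 0.5  # a point far above the trend

toy <- data.frame(height = height, len = len)

# Rule 1: a height of exactly zero is physically implausible
zero_height <- toy$height == 0

# Rule 2: large standardized residuals from a line fitted to the remaining points
fit <- lm(len ~ height, data = toy[!zero_height, ])
far_from_line <- abs(rstandard(fit)) > 2   # assumed threshold

# Row numbers flagged by each rule
zero_rows <- which(zero_height)
flagged_rows <- names(which(far_from_line))
zero_rows
flagged_rows
```

Fitting the line after excluding the zero heights keeps those entries from pulling the fit toward themselves.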


Question 5

## Taking numeric columns
nAbal <- Abal[sapply(Abal, is.numeric)]
## Making combinations of 2 columns
combo <- combn(colnames(nAbal), 2)
## Calculating the PPCC for each combination
PPCCnAbal <- apply(combo, 2, function(x) cor(nAbal[, x[1]], nAbal[, x[2]]))
## Storing the result as a data frame
PPCCnAbal <- as.data.frame(PPCCnAbal)
## Taking the transpose to convert columns into rows
combo <- t(combo)
## Binding with the result
soln <- cbind(combo, PPCCnAbal)
## Filtering as per the condition
subset(soln, as.numeric(as.character(soln$PPCCnAbal)) > 0.95)

##               1              2 PPCCnAbal
## 1        Length       Diameter 0.9868116
## 19 Whole.weight Shucked.weight 0.9694055
## 20 Whole.weight Viscera.weight 0.9663751
## 21 Whole.weight   Shell.weight 0.9553554

The combinations for which the Pearson product-moment correlation coefficient is greater than 0.95 are (Length, Diameter), (Whole.weight, Shucked.weight), (Whole.weight, Viscera.weight) and (Whole.weight, Shell.weight).
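A shorter route to the same kind of result is to compute the whole correlation matrix with a single `cor()` call and read off the entries above the diagonal. This is a sketch on a hypothetical toy data frame (columns `a`, `b`, `c` are invented), not a rerun of the Abalone analysis.

```r
# Hypothetical numeric data standing in for the numeric Abalone columns
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2.1, 3.9, 6.2, 8.0, 9.9),  # nearly 2 * a, so strongly correlated with a
  c = c(5, 1, 4, 2, 3)             # only weakly related to a and b
)

# Full Pearson correlation matrix in one call
cm <- cor(toy)

# Look only above the diagonal so that each pair appears once
idx <- which(upper.tri(cm) & cm > 0.95, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
high_pairs
```

Restricting to `upper.tri(cm)` drops the diagonal (always 1) and the mirrored lower half, so no pair is reported twice.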


Question 6

## Taking rows with sex as Males
Abal_m <- subset(Abal, as.character(Abal$Sex) == "M")
## Calculating the ECDF of the subset
ecdf.male.rings <- ecdf(Abal_m$Rings)
## Taking rows with sex as Females
Abal_f <- subset(Abal, as.character(Abal$Sex) == "F")
## Calculating the ECDF of the subset
ecdf.female.rings <- ecdf(Abal_f$Rings)
## Taking rows with sex as Infants
Abal_i <- subset(Abal, as.character(Abal$Sex) == "I")
## Calculating the ECDF of the subset (from the infant rows, Abal_i)
ecdf.infant.rings <- ecdf(Abal_i$Rings)
## Plotting the ECDF for males
plot(ecdf.male.rings, main = "Empirical CDF of various Sexes", ylab = "Quantiles of diff Sexes",
     xlab = "Number of Rings", pch = 19, col = "blue")
## Adding female ECDF
lines(ecdf.female.rings, pch = 20, col = "red")
## Adding infant ECDF
lines(ecdf.infant.rings, pch = 20, col = "green")
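Repeating the subset-and-`ecdf` step by hand for each sex makes it easy to pass the wrong subset to one of the calls. A sketch that builds all three ECDFs in one pass with `split()` and `lapply()` follows; the ring counts are hypothetical, not the Abalone data, and the axis labels are illustrative choices.

```r
# Hypothetical ring counts by sex (not the Abalone data)
toy <- data.frame(
  Sex   = c("M", "M", "M", "F", "F", "F", "I", "I", "I"),
  Rings = c(9, 10, 12, 10, 11, 13, 5, 6, 8)
)

# One ECDF per sex, each built from its own subset of Rings
ecdfs <- lapply(split(toy$Rings, toy$Sex), ecdf)

# Plot the male curve first, then overlay the female and infant curves
plot(ecdfs[["M"]], main = "Empirical CDF by sex", xlab = "Number of Rings",
     ylab = "Cumulative proportion", xlim = range(toy$Rings), col = "blue")
lines(ecdfs[["F"]], col = "red")
lines(ecdfs[["I"]], col = "green")

# Proportion of each group with at most 8 rings
at_most_8 <- sapply(ecdfs, function(f) f(8))
at_most_8
```

Since each element of `ecdfs` is named after its group, the curves are looked up by sex rather than by separately named variables.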

[Figure: Empirical CDF of various Sexes. x-axis: Number of Rings (0 to 30); y-axis: Quantiles of diff Sexes (0.0 to 1.0).]
