Deepak Gupta
10/23/2019
Table of Contents
2 Assumptions
4 Conclusion
OBJECTIVE :-
The main objective of Problem 1 is to work out the penalty to be charged to the AMC
company, and to find the temperature variance, if any, over the whole year.
The main objective of Problem 2 is to find out whether the complaints coming from
customers are due to temperature irregularities or to the sourcing of the material.
There are 2 problem statements in the report, and each is answered from the data
provided in one of the following .csv files:-
Problem 1: Use only the dataset: Cold_Storage_Temp_Data.csv
Problem 2: Use only the dataset: Cold_Storage_Mar2018.csv
ASSUMPTIONS :-
We assume that the temperature data is normally distributed. This can be checked
with a distribution (density) graph or a histogram: if the plot is bell-shaped, the data
is approximately normally distributed.
The data provided is clean and there are no missing values in it.
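Both assumptions can also be checked numerically rather than only by eye. The following is a minimal sketch, not the method used in this report: the vector `temps` is simulated stand-in data (drawn using the mean and standard deviation reported later in this report), not the actual cold storage readings.

```r
# Simulated stand-in data (an assumption for illustration);
# in practice use mydata$Temperature instead.
set.seed(42)
temps <- rnorm(365, mean = 2.96, sd = 0.51)

# Histogram: a roughly symmetric bell shape supports the normality assumption.
hist(temps, breaks = 20, main = "Histogram of temperature", xlab = "Temperature")

# Q-Q plot: points close to the reference line indicate approximate normality.
qqnorm(temps)
qqline(temps, col = "red")

# Shapiro-Wilk test: a small p-value (e.g. below 0.05) is evidence against normality.
sw <- shapiro.test(temps)
sw$p.value

# Check for the "no missing values" assumption.
n_missing <- sum(is.na(temps))
n_missing
```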
EXPLORATORY DATA ANALYSIS:-
Loading the necessary packages :-
##==========================================##
> ## ##
> ## EXPLORATORY DATA ANALYSIS - COLD STORAGE ##
> ## ##
> ##==========================================##
> ### Loading the Library (ggplot2)
>
> library(ggplot2)
> library(dplyr)
Variable Identifications :-
We used the following functions for our exploratory data analysis:-
summary – to see how the data is set up and to see at a glance the
mean, median, 1st quartile (1st Qu.), 3rd quartile (3rd Qu.), minimum
and maximum of any numerical variable.
Mean :- The mean function is used to find the mean of any given
numerical variable. Averages are useful because they summarise
a large amount of data into a single value, around which the
original data varies.
The mean is calculated with the formula:-
mean = (x1 + x2 + ... + xn) / n, where n is the number of observations.
Boxplot :- This type of graph shows the shape of the
distribution, its central value, and its variability. In a box and
whisker plot, the ends of the box are the upper and lower quartiles,
so the box spans the interquartile range (IQR), and the median is
marked by a line inside the box.
A box plot shows several measures in a single frame, i.e. the median,
1st quartile, 3rd quartile, the IQR, the whiskers and the outliers.
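The quantities a box plot draws can also be computed directly. Below is a small sketch using base R's `boxplot.stats`, which flags points more than 1.5 times the box height beyond the box as outliers. The vector `x` holds made-up illustrative values, not readings from the data set.

```r
# Made-up illustrative values (not from the cold storage data).
x <- c(2.1, 2.5, 2.7, 2.9, 3.0, 3.1, 3.3, 3.5, 5.0)

# coef = 1.5 is the default whisker rule: points further than
# 1.5 * (upper hinge - lower hinge) from the box are reported as outliers.
bs <- boxplot.stats(x, coef = 1.5)

bs$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out    # values plotted as outliers (here the 5.0 reading)
```

Note that `boxplot.stats` works with hinges, which are close to, but not always identical to, the quartiles returned by `quantile`.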
By using the summary function we can see that there are 4 variables, namely :-
Season, Month, Date, and Temperature.
Season consists of Summer with 120 days, Rainy with 122 days and Winter with
123 days, which makes 365 days in total.
Month holds the month name, and the summary shows the number of days in each month.
Date is not stored as a date; it is read as an integer, so it would need to be
converted before it can be used as a date.
Temperature has a minimum of 1.70°C and a maximum of 5°C, and the mean
temperature across all seasons is 2.963°C. The maximum lies well above the mean,
which suggests there may be some outliers in the data; we check this during the
analysis.
Structure (str) function :-
The str function shows the class and the structure of the complete data set.
For the provided "Cold Storage temp" data set, it shows a data frame with
4 variables and 365 observations.
Season (3 levels: "Rainy", "Summer" and "Winter") and Month (12 levels, one per
month name) are of class "factor", Date is an integer, and Temperature is a
numeric variable.
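The same inspection can be reproduced on a small example. This is a sketch on a made-up data frame `toy` (hypothetical values, same column names as the data set). One point worth knowing: since R 4.0.0, `data.frame` and `read.csv` no longer convert text columns to factors by default, so `stringsAsFactors = TRUE` is needed to get the factor columns described above.

```r
# Hypothetical miniature of the data set (values invented for illustration).
toy <- data.frame(
  Season = c("Summer", "Rainy", "Winter", "Winter"),
  Month = c("May", "Aug", "Dec", "Jan"),
  Date = c(1L, 2L, 3L, 4L),
  Temperature = c(3.2, 3.0, 2.6, 2.7),
  stringsAsFactors = TRUE  # needed on R >= 4.0.0 to get factor columns
)

str(toy)      # class of each column and the number of observations
summary(toy)  # counts per factor level, quartiles for numeric columns

# Factor levels are sorted alphabetically by default.
season_levels <- levels(toy$Season)
season_levels
```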
Univariate Analysis
1. Find the mean cold storage temperature for the Summer, Winter and Rainy seasons.
We used R's mean function to find the mean temperature of each season.
> mean(Temperature[Season=="Winter"])
[1] 2.700813
> mean(Temperature[Season=="Summer"])
[1] 3.153333
> mean(Temperature[Season=="Rainy"])
[1] 3.039344
Although the Summer season has only 120 days, its mean temperature is the
highest of the three seasons. This may be because in Summer the temperature is
above 3.5°C on 22 days, whereas in the Rainy season it is above 3.5°C on only
20 days. This difference of 2 days may have contributed to the difference in
the means.
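The per-season means and the number of days above 3.5°C can each be obtained in one step with `tapply`. The sketch below runs on a tiny made-up data frame `toy` (hypothetical values, same column names as the data set); on the real data, `mydata` would take its place.

```r
# Hypothetical miniature data (values invented for illustration).
toy <- data.frame(
  Season = c("Summer", "Summer", "Summer", "Rainy", "Rainy", "Winter", "Winter"),
  Temperature = c(3.0, 3.6, 3.8, 2.9, 3.7, 2.5, 2.8)
)

# Mean temperature per season.
season_means <- tapply(toy$Temperature, toy$Season, mean)
season_means

# Number of days above 3.5 degrees per season (TRUE counts as 1 when summed).
days_above <- tapply(toy$Temperature > 3.5, toy$Season, sum)
days_above
```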
The box plot (boxplot1) shows some outliers in the Winter and Rainy seasons, but none
in the Summer season. Exploring the data set in Excel showed that in September the
temperature was at 5°C for 2 days, which explains the outlier in the Rainy season.
### Preparation of Box Plot to see the Outliers and the other parameters
boxplot1
2. Finding overall mean for the full year.
We used R's mean function to find the mean temperature for the full year.
> mean(mydata$Temperature)
[1] 2.96274
The mean is the total of all the temperature readings divided by the number of
readings, i.e. mean = (sum of the 365 daily temperatures) / 365.
The distribution plot (plot1) shows that the data is approximately normally
distributed. The mean temperature is marked on it with the red dashed line.
> sd(mydata$Temperature)
[1] 0.508589
Bi-Variate Analysis
1. Assume Normal distribution, what is the probability of temperature having
fallen below 2°C?
Here µ = 2.96, σ = 0.51 and X = 2; we need P(X < 2), the probability of the
temperature going below 2°C.
We used the pnorm function in RStudio to find this probability, which comes to
about 2.91%.
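A minimal sketch of such a call, using the unrounded mean and standard deviation printed earlier in this report (the variable names are illustrative):

```r
mu <- 2.96274      # overall mean from mean(mydata$Temperature)
sigma <- 0.508589  # standard deviation from sd(mydata$Temperature)

# pnorm returns the lower-tail probability P(X < q) by default.
p_below_2 <- pnorm(2, mean = mu, sd = sigma)

# Expressed as a percentage; this comes out at roughly 2.9%.
round(100 * p_below_2, 2)
```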
2. Assume Normal distribution, what is the probability of temperature having gone
above 4°C?
Here µ = 2.96, σ = 0.51 and X = 4; we need P(X > 4), the probability of the
temperature going above 4°C.
We used the pnorm function in RStudio to find this probability, which comes to
about 2.07%.
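For the upper tail, `pnorm` needs `lower.tail = FALSE` (or, equivalently, the result subtracted from 1). A minimal sketch with the unrounded mean and standard deviation printed earlier in this report (variable names are illustrative):

```r
mu <- 2.96274      # overall mean from mean(mydata$Temperature)
sigma <- 0.508589  # standard deviation from sd(mydata$Temperature)

# Upper-tail probability P(X > 4).
p_above_4 <- pnorm(4, mean = mu, sd = sigma, lower.tail = FALSE)

# Same quantity written as a complement of the lower tail.
p_above_4_alt <- 1 - pnorm(4, mean = mu, sd = sigma)

# Expressed as a percentage; this comes out at roughly 2.1%.
round(100 * p_above_4, 2)
```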
3. What will be the penalty for the AMC Company?
As per the contract with the AMC company, it was agreed that if it was
statistically proven that the probability of the temperature going outside the
2°C - 4°C range during the one-year contract was above 2.5% and less than 5%,
then the penalty would be 10% of the AMC (annual maintenance charges) fee.
In case it exceeded 5%, the penalty would be 25% of the AMC fee.
=NORMDIST(4,2.96,0.51,1)-NORMDIST(2,2.96,0.51,1)
= 0.95 (approximately)
= about 95% probability that the temperature is maintained between
2°C and 4°C.
= 1 - 0.95 = 0.05 (approximately) probability of going outside that range.
Adding the two tail probabilities (2.91% below 2°C and 2.07% above 4°C)
gives 4.98%, which is greater than 2.5% and less than 5%.
Hence the penalty will be 10% of the AMC (annual maintenance charges).
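Because the result sits very close to the 5% threshold, the decision is sensitive to rounding of the inputs. The sketch below recomputes the out-of-range probability in R and maps it to a penalty rate. The helper names `p_outside` and `penalty_rate` are hypothetical, and treating a probability of exactly 5% as falling in the 10% band is an assumption, since the contract wording does not cover that case.

```r
# Probability of the temperature lying outside [lo, hi] under a normal model.
p_outside <- function(mu, sigma, lo = 2, hi = 4) {
  pnorm(lo, mean = mu, sd = sigma) +
    pnorm(hi, mean = mu, sd = sigma, lower.tail = FALSE)
}

# Penalty as a fraction of the AMC fee.
# Assumption: exactly 5% is treated as part of the 10% band.
penalty_rate <- function(p) {
  if (p > 0.05) {
    0.25
  } else if (p > 0.025) {
    0.10
  } else {
    0
  }
}

# Unrounded mean and standard deviation printed earlier in this report.
p_exact <- p_outside(2.96274, 0.508589)

# Rounded values, as used in the spreadsheet formula.
p_rounded <- p_outside(2.96, 0.51)

penalty_rate(p_exact)
penalty_rate(p_rounded)
```

With the unrounded inputs the probability comes out just under 5%, while the rounded inputs 2.96 and 0.51 push it slightly above 5%, so the unrounded values are the safer basis for the penalty decision.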
Conclusion of the report
The EDA leads to the following conclusions:
1. The mean temperature over the year is 2.96°C with a standard deviation of about
0.51°C, so there is not much fluctuation in the temperature maintained.
2. There are only 2 incidences where the temperature went up to 5°C, both in
September.
3. The probability of the temperature going below 2°C is 2.91%, and of going
above 4°C is 2.07%, which means that large temperature variations are unlikely
during the year.
4. The penalty to be charged to the AMC company would be 10%, as the probability
of the temperature going outside the 2°C - 4°C range is above 2.5% and less than 5%.
Appendix A – Source Code
##==========================================##
> ## ##
> ## EXPLORATORY DATA ANALYSIS - COLD STORAGE ##
> ## ##
> ##==========================================##
>
> ### Loading the Library (ggplot2)
>
> library(ggplot2)
> library(dplyr)
> ## Setting up the working directory
> setwd("C:/Users/RMCL/Desktop/BABI Program/Assignments/Project 1")
> getwd()
[1] "C:/Users/RMCL/Desktop/BABI Program/Assignments/Project 1"
> ## Importing data by read.csv
> mydata = read.csv("Cold_Storage_Temp_Data.csv", header = TRUE)
> ### attaching the mydata
> attach(mydata)
The following objects are masked from mydata (pos = 3):
> sd(mydata$Temperature)
[1] 0.508589
> #### Preparation of the frequency / distribution chart with overall mean value
> plot1 = ggplot(mydata, aes(x=Temperature)) +
geom_density(fill = 4)
> plot1
> plot1 = plot1 +geom_vline(xintercept = mean(Temperature),
size =1, colour = "#FF3721", linetype ="dashed")+
ggtitle("Distribution of Temperature with mean Value")
> plot1
> ### Preparation of the Box Plot to see the outliers and the other parameters
> boxplot1= ggplot(mydata, aes(x=Season, y=Temperature, fill=Season))+
geom_boxplot()+ggtitle("Box Plot for Season and Temperature")
> boxplot1