
2019

Cold Storage - Project

Deepak Gupta
10/23/2019
Table of Contents

1 Objective of the report
2 Assumptions
3 Exploratory Data Analysis – Step by step approach
  3.1 Environment Set up and Data
      Install necessary Packages and Invoke Libraries
      Set up working Directory
      Import and Read the Dataset
  3.2 Variable Identification
      3.2.1 Variable Identification – Inferences
  3.3 Univariate Analysis
      Solving problem 1 (finding the Mean and Standard Deviation)
  3.4 Bi-Variate Analysis
      Solving problem 1 (finding out the Penalty %)
      Solving problem 2 (finding the probability)
4 Conclusion
5 Appendix A – Source Code

OBJECTIVE :-
There are 2 problem statements in the report, and the objective is to find out the details
from the data provided in the following .csv files:-
Problem 1: Use only the dataset: Cold_Storage_Temp_Data.csv
Problem 2: Use only the dataset: Cold_Storage_Mar2018.csv
The main objective in Problem 1 is to find out the penalty to be charged to the AMC
company, by calculating the mean and standard deviation of the cold storage temperature
and checking for any temperature variation during the whole year.
The main objective in Problem 2 is to find out whether the complaints coming from
customers are due to temperature irregularities or to the sourcing of the material.

ASSUMPTIONS :-
We assume that the data is normally distributed. This can be checked with the help of
a distribution graph or a histogram: if the plot is roughly bell-shaped, the data can be
treated as approximately normally distributed.

The data provided is clean and there are no missing values in it.
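
As a rough numerical complement to the visual check described above, the sketch below tests both assumptions on simulated readings. The vector x is generated for illustration only and is not the real temperature data; the 68% and 95% figures are the usual rule of thumb for a normal distribution.

```r
set.seed(1)  # make the simulated example reproducible

# Simulated readings standing in for the real temperatures (ASSUMPTION: not real data)
x <- rnorm(10000, mean = 2.96, sd = 0.51)

# Check for missing values
print(anyNA(x))

# For a normal distribution about 68% of values lie within one standard
# deviation of the mean and about 95% within two
within_1sd <- mean(abs(x - mean(x)) <= sd(x))
within_2sd <- mean(abs(x - mean(x)) <= 2 * sd(x))
print(within_1sd)
print(within_2sd)
```

Proportions far from these reference values would suggest that the normality assumption needs a closer look.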

EXPLORATORY DATA ANALYSIS:-
• Installing the necessary packages :-

We installed and loaded the necessary packages / libraries “ggplot2” and “dplyr” to get
the required output from the provided data.

##==========================================##
> ## ##
> ## EXPLORATORY DATA ANALYSIS - COLD STORAGE ##
> ## ##
> ##==========================================##
> ### Loading the Library (ggplot2)
>
> library(ggplot2)
> library(dplyr)

• Setting up the working directory, importing and reading the data sheet

After loading the necessary libraries in RStudio, we set the working directory with the
command “setwd()”, filling in the path as shown below.
On importing the data into RStudio we gave it the name “mydata” and attached it with
attach(). Attaching lets us refer to the columns directly by name, which makes the code
and function calls that follow shorter to write.

> ## Setting up the working directory
> setwd("C:/Users/RMCL/Desktop/BABI Program/Assignments/Project 1")
> getwd()
[1] "C:/Users/RMCL/Desktop/BABI Program/Assignments/Project 1"
> ## Importing data by read.csv
> mydata = read.csv("Cold_Storage_Temp_Data.csv", header = TRUE)
>
> ### attaching the mydata
> attach(mydata)
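
As a hedged aside, the same columns can be reached without attach(), which avoids the "objects are masked" message that R prints when a data frame is attached more than once. The sketch below uses a small made-up data frame named toy (an assumption for illustration, not the real file) and shows two common alternatives.

```r
# Toy stand-in for mydata (values are illustrative, not from the real file)
toy <- data.frame(
  Season = c("Winter", "Winter", "Summer", "Rainy"),
  Temperature = c(2.4, 2.8, 3.2, 3.0)
)

# Option 1: refer to the columns explicitly with $
winter_mean_1 <- mean(toy$Temperature[toy$Season == "Winter"])

# Option 2: evaluate an expression inside the data frame with with()
winter_mean_2 <- with(toy, mean(Temperature[Season == "Winter"]))

print(winter_mean_1)
print(winter_mean_2)
```

Both options give the same result; the report itself uses attach(), so this is only an alternative style.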

• Variable Identification :-
We used the following functions for our exploratory data analysis:-
o summary – shows how the data is set up and gives, at a glance, the minimum,
1st quartile, median, mean, 3rd quartile and maximum value of every
numerical variable.

o structure (str) – gives the structure of the data: the class of each
variable, the number of variables and the number of observations in the data.

o Mean :- The mean function is used to find the mean (average) of any given
numerical variable. Averages are useful because they summarise a large
amount of data in a single value, around which the original data varies.
The mean is calculated with the formula:-
mean = (sum of all the values) / (number of values)

o Standard Deviation :- The standard deviation is a number that tells
how measurements for a group are spread out from the average
(mean), or expected value. A low standard deviation means that
most of the numbers are close to the average. A high standard
deviation means that the numbers are more spread out.

The sd function in R calculates the sample standard deviation with the following formula:-
sd = sqrt( sum of (value − mean)^2 / (n − 1) ), where n is the number of values.

o pnorm function :- pnorm returns cumulative probabilities from a normal
distribution with mean µ and standard deviation σ.

The pnorm function is called in R as under:-
pnorm(x, mean = , sd = , lower.tail = )
With lower.tail = TRUE (the default) it returns P(X ≤ x); with lower.tail = FALSE
it returns P(X > x).

o Histogram :- A histogram provides a visual representation of the
distribution of the data. It groups a large amount of data into intervals
and shows the frequency of the values that fall in each interval.

o Boxplot :- This type of graph is used to show the shape of the
distribution, its central value and its variability. In a box and
whisker plot the ends of the box are the upper and lower quartiles,
so the box spans the interquartile range (IQR), and the median is marked by a
line inside the box.

A box plot shows these measures in a single frame: the median, the 1st
quartile, the 3rd quartile, the IQR, the whiskers and the outliers.
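
To make the descriptions of mean, sd and pnorm above concrete, here is a minimal sketch on a short made-up vector. The values in x are illustrative and do not come from the cold storage data; the manual formulas are the standard ones that R's mean() and sd() implement.

```r
# Toy temperature readings (illustrative values, not from the real data set)
x <- c(2.4, 2.3, 2.8, 3.1, 3.4)
n <- length(x)

# Mean: sum of the values divided by the number of values
manual_mean <- sum(x) / n

# Sample standard deviation: denominator is n - 1, as in R's sd()
manual_sd <- sqrt(sum((x - manual_mean)^2) / (n - 1))

# Both manual results should agree with the built-in functions
print(all.equal(manual_mean, mean(x)))
print(all.equal(manual_sd, sd(x)))

# pnorm: lower and upper tail probabilities at the same point sum to 1
p_lower <- pnorm(2.5, mean = manual_mean, sd = manual_sd, lower.tail = TRUE)
p_upper <- pnorm(2.5, mean = manual_mean, sd = manual_sd, lower.tail = FALSE)
print(p_lower + p_upper)
```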

• Variable Identification – Inferences

o Summary Function :-
> summary(mydata)
    Season        Month          Date        Temperature
 Rainy :122   Aug    : 31   Min.   : 1.00   Min.   :1.700
 Summer:120   Dec    : 31   1st Qu.: 8.00   1st Qu.:2.500
 Winter:123   Jan    : 31   Median :16.00   Median :2.900
              Jul    : 31   Mean   :15.72   Mean   :2.963
              Mar    : 31   3rd Qu.:23.00   3rd Qu.:3.300
              May    : 31   Max.   :31.00   Max.   :5.000
              (Other):179

Using the summary function we can see that there are 4 variables, namely
Season, Month, Date and Temperature.

Season consists of Summer with 120 days, Rainy with 122 days and Winter with
123 days, which makes 365 days in total.

Month holds the month name; the count shown against each month is the number of days
(observations) in that month.

Date is not stored as a date: it is an integer holding the day of the month (1 to 31),
so it would have to be converted before it can be used as a date.

Temperature has a minimum of 1.70⁰C and a maximum of 5⁰C, and the mean temperature
across all the seasons is 2.963⁰C. The maximum is well above the 3rd quartile of 3.30⁰C,
which suggests that there may be some outliers in the data; we check this during the
analysis.
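
One possible way to turn the integer Date column into a real date is sketched below. It is not part of the original analysis: the data frame toy is made up, the column name FullDate is hypothetical, and the calendar year is an assumption because the data set does not state it. It also assumes the month names are three-letter English abbreviations, as the str output suggests.

```r
# Toy rows shaped like the Month and Date columns (illustrative values only)
toy <- data.frame(
  Month = c("Jan", "Feb", "Sep"),
  Date  = c(1L, 15L, 30L)
)

# ASSUMPTION: the calendar year is not stated in the data; 2017 is a placeholder
assumed_year <- 2017L

# month.abb is R's built-in vector "Jan", "Feb", ..., "Dec"
month_number <- match(as.character(toy$Month), month.abb)

# Build ISO-formatted strings and convert them to Date objects
toy$FullDate <- as.Date(sprintf("%04d-%02d-%02d", assumed_year, month_number, toy$Date))

str(toy$FullDate)
```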

o Structure (str) function :-

> ## Finding out the class and structure of the data
> str(mydata)
'data.frame': 365 obs. of 4 variables:
$ Season : Factor w/ 3 levels "Rainy","Summer",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Date : int 1 2 3 4 5 6 7 8 9 10 ...
$ Temperature: num 2.4 2.3 2.4 2.8 2.5 2.4 2.8 2.3 2.4 2.8 ...

Using the structure function we can see the class and the structure of the complete data
set.

The “Cold Storage temp” data set is a data frame with 4 variables and 365 observations.

Season, with 3 levels (“Rainy”, “Summer” and “Winter”), and Month, with 12 levels
holding the month names, are of class “factor”. Date is an integer, whereas Temperature
is a numerical variable.

Univariate Analysis
1. Find the mean cold storage temperature for the Summer, Winter and Rainy seasons.

We used R's mean function to find the mean temperature of each season.

> mean(Temperature[Season=="Winter"])
[1] 2.700813
> mean(Temperature[Season=="Summer"])
[1] 3.153333
> mean(Temperature[Season=="Rainy"])
[1] 3.039344

Mean temp. of Winter    Mean temp. of Summer    Mean temp. of Rainy
2.70⁰C                  3.15⁰C                  3.04⁰C
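
The three separate mean() calls can also be collapsed into one grouped summary. The sketch below uses base R's tapply on a made-up data frame named toy, so the numbers it prints are not the report's results.

```r
# Toy stand-in for mydata (illustrative values, not the real readings)
toy <- data.frame(
  Season = c("Winter", "Winter", "Summer", "Summer", "Rainy", "Rainy"),
  Temperature = c(2.4, 2.8, 3.0, 3.4, 2.9, 3.1)
)

# tapply splits Temperature by Season and applies mean to each group
season_means <- tapply(toy$Temperature, toy$Season, mean)
print(round(season_means, 2))
```

On the real data the same call would be tapply(mydata$Temperature, mydata$Season, mean).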

Although the Summer season has only 120 days, its mean temperature is the highest of
the three seasons. This may be because in the Summer season the temperature stayed
above 3.5⁰C on 22 days, compared with only 20 days in the Rainy season. This
difference of 2 days may have contributed to the difference in the means.

The box plot shows some outliers in the Winter and Rainy seasons, but none in the Summer
season. While exploring the data set in Excel we found that in the month of Sep. the
temperature was at 5⁰C for 2 days, which explains the outliers in the Rainy season.

> ### Preparation of Box Plot to see the Outliers and the other parameters
> boxplot1 = ggplot(mydata, aes(x=Season, y=Temperature, fill=Season))+
geom_boxplot()+ggtitle("Box Plot for Season and Temperature")
> boxplot1
2. Finding overall mean for the full year.

We used R's mean function to find the mean temperature for the full year.

## Finding the overall mean of the temperature

> mean(mydata$Temperature)
[1] 2.96274

The overall mean temperature is 2.96⁰C.

The mean is calculated by adding up all the temperature readings and dividing the total
by the number of readings, i.e.

Mean (M) = sum of the readings (1081.4) / number of readings (365)

Mean (M) = 2.96

The distribution chart shows that most of the data is approximately normally distributed.
The mean temperature is denoted by the red dashed line.

The chart was produced in RStudio using the ggplot2 library as under :-

> #### Preparation of the distribution chart with overall mean value
> plot1 = ggplot(mydata, aes(x=Temperature)) + geom_density(fill = 4)
> plot1
> plot1 = plot1 +geom_vline(xintercept = mean(Temperature), size =1,
colour = "#FF3721", linetype ="dashed")+
ggtitle("Distribution of Temperature with mean Value")
> plot1
3. Find Standard Deviation for the full year
We used R's sd function to find the standard deviation for the full year.

### Finding out the standard Deviation of the temperature

> sd(mydata$Temperature)
[1] 0.508589

The standard deviation of the temperature is 0.51⁰C. This is small relative to the
mean of 2.96⁰C, so we can say that the readings are tightly clustered around the mean.

Bi-Variate Analysis
1. Assume Normal distribution, what is the probability of temperature having
fallen below 2 C?

The normal distribution is defined by the following probability density function,

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

where μ is the population mean and σ² is the variance.

Here µ = 2.96 and σ = 0.51 (the standard deviation, so σ² ≈ 0.26), and X = 2, since we
want the probability of the temperature going below 2⁰C.

We used the pnorm function in RStudio to find the probability as under :-

> ### Assume Normal distribution, what is the probability of temperature having fallen below 2 C?
> pnorm(2,mean=2.96274, sd=0.508589, lower.tail = TRUE)
[1] 0.02918142

This can also be graphically represented as below :-

The probability of the temperature going below 2⁰C is only 2.92%.
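
The same probability can be reached by standardising the threshold first, which shows what pnorm does with the mean and sd arguments. This is a minimal sketch using the mean and standard deviation reported above; the variable names are chosen for illustration.

```r
mu    <- 2.96274   # overall mean from the report
sigma <- 0.508589  # overall standard deviation from the report

# Standardise the 2 degree threshold: how many standard deviations below the mean?
z <- (2 - mu) / sigma

# Probability from the standard normal distribution ...
p_from_z <- pnorm(z)

# ... and the direct call used in the report
p_direct <- pnorm(2, mean = mu, sd = sigma, lower.tail = TRUE)

print(z)
print(p_from_z)
print(p_direct)
```

The threshold lies about 1.89 standard deviations below the mean, and both calls return the 0.02918142 shown in the console output.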

2. Assume Normal distribution, what is the probability of temperature having gone
above 4 C?

The normal distribution is defined by the following probability density function,

$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) $$

where μ is the population mean and σ² is the variance.

Here µ = 2.96 and σ = 0.51 (the standard deviation, so σ² ≈ 0.26), and X = 4, since we
want the probability of the temperature going above 4⁰C.

We used the pnorm function in RStudio to find the probability as under :-

> ### Assume Normal distribution, what is the probability of temperature having gone above 4 C?
> pnorm(4,mean=2.96274, sd=0.508589, lower.tail = FALSE)
[1] 0.02070079

This can also be graphically represented as below :-

The probability of the temperature going above 4⁰C is only 2.07%.

3. What will be the penalty for the AMC Company?
As per the contract with the AMC company, it was agreed that if it was
statistically proven that the probability of the temperature going outside the 2 - 4 C
range during the one-year contract was above 2.5% and less than 5%, then the penalty
would be 10% of the AMC (annual maintenance contract) fee. In case it exceeded 5%,
the penalty would be 25% of the AMC fee.

We calculate the probability of staying within the range with the Excel function
NORMDIST (NORM.DIST in newer versions of Excel), as under :-

=NORMDIST(4,2.96,0.51,1)-NORMDIST(2,2.96,0.51,1)
≈ 0.95

So there is about a 95% probability that the temperature is maintained between
2⁰C and 4⁰C, and about a 5% probability that it goes outside this range.

Because this value lies very close to the 5% threshold, we take the unrounded
probabilities from the two pnorm results above:

P(below 2⁰C) + P(above 4⁰C) = 0.0292 + 0.0207 = 0.0499

Hence the probability of the temperature going outside the 2⁰C – 4⁰C range is 4.99%,
which is above 2.5% and less than 5%, so the penalty will be 10% of the AMC fee.
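
The penalty decision can also be checked directly in R. The sketch below is an illustration and not part of the original analysis: penalty_rate is a hypothetical helper, and its handling of the edge cases (2.5% or less, exactly 5%) is an assumption, since the contract wording quoted above does not spell them out.

```r
mu    <- 2.96274   # overall mean from the report
sigma <- 0.508589  # overall standard deviation from the report

# Probability of the temperature falling outside the 2 to 4 degree range
p_below <- pnorm(2, mean = mu, sd = sigma, lower.tail = TRUE)
p_above <- pnorm(4, mean = mu, sd = sigma, lower.tail = FALSE)
p_outside <- p_below + p_above

# Hypothetical helper encoding the contract bands quoted above.
# ASSUMPTION: a probability of 2.5% or less attracts no penalty, and exactly 5%
# falls in the lower band; the contract text does not cover these edge cases.
penalty_rate <- function(p) {
  if (p > 0.05) {
    0.25
  } else if (p > 0.025) {
    0.10
  } else {
    0
  }
}

print(round(p_outside, 4))
print(penalty_rate(p_outside))
```

With the unrounded mean and standard deviation the probability comes out just under 5%, so the result is sensitive to rounding of the inputs.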

Conclusion of the report

The EDA leads to the following conclusions:

1. The mean temperature maintained over the year is 2.96⁰C with a standard deviation
of 0.51⁰C, so there is not much fluctuation in the temperature.
2. There are only 2 instances where the temperature went up to 5⁰C, both in the month
of Sep.
3. The probability of the temperature going below 2⁰C is 2.92%, and of it going
above 4⁰C is 2.07%, which means that there is not much temperature
variation during the year.
4. The penalty to be charged to the AMC company is 10% of the AMC fee, as the
probability of the temperature going outside the 2⁰C – 4⁰C range is above 2.5% and
less than 5%.

Appendix A – Source Code

##==========================================##
> ## ##
> ## EXPLORATORY DATA ANALYSIS - COLD STORAGE ##
> ## ##
> ##==========================================##
>
> ### Loading the Library (ggplot2)
>
> library(ggplot2)
> library(dplyr)
> ## Setting up the working directory
> setwd("C:/Users/RMCL/Desktop/BABI Program/Assignments/Project 1")
> getwd()
[1] "C:/Users/RMCL/Desktop/BABI Program/Assignments/Project 1"
> ## Importing data by read.csv
> mydata = read.csv("Cold_Storage_Temp_Data.csv", header = TRUE)
> ### attaching the mydata
> attach(mydata)
The following objects are masked from mydata (pos = 3):

Date, Month, Season, Temperature

> ## Finding out the summary of the data
> summary(mydata)
    Season        Month          Date        Temperature
 Rainy :122   Aug    : 31   Min.   : 1.00   Min.   :1.700
 Summer:120   Dec    : 31   1st Qu.: 8.00   1st Qu.:2.500
 Winter:123   Jan    : 31   Median :16.00   Median :2.900
              Jul    : 31   Mean   :15.72   Mean   :2.963
              Mar    : 31   3rd Qu.:23.00   3rd Qu.:3.300
              May    : 31   Max.   :31.00   Max.   :5.000
              (Other):179
> ## Finding out the class and structure of the data
> str(mydata)
'data.frame': 365 obs. of 4 variables:
$ Season : Factor w/ 3 levels "Rainy","Summer",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Month : Factor w/ 12 levels "Apr","Aug","Dec",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Date : int 1 2 3 4 5 6 7 8 9 10 ...
$ Temperature: num 2.4 2.3 2.4 2.8 2.5 2.4 2.8 2.3 2.4 2.8 ...
> #### Finding out the mean temperature of Different Seasons
> mean(Temperature[Season=="Winter"])
[1] 2.700813
> mean(Temperature[Season=="Summer"])
[1] 3.153333
> mean(Temperature[Season=="Rainy"])
[1] 3.039344
> ## Finding the overall mean of the temperature
> mean(mydata$Temperature)
[1] 2.96274
> ### Finding out the standard Deviation of the temperature
> sd(mydata$Temperature)
[1] 0.508589
> ### Assume Normal distribution, what is the probability of temperature having fallen below 2 C?
> pnorm(2,mean=2.96274, sd=0.508589, lower.tail = TRUE)
[1] 0.02918142
> ### Assume Normal distribution, what is the probability of temperature having gone above 4 C?
> pnorm(4,mean=2.96274, sd=0.508589, lower.tail = FALSE)
[1] 0.02070079

> #### Preparation of the Frequency / distribution chart with overall mean value
> plot1 = ggplot(mydata, aes(x=Temperature)) +
geom_density(fill = 4)
> plot1
> plot1 = plot1 +geom_vline(xintercept = mean(Temperature),
size =1, colour = "#FF3721", linetype ="dashed")+
ggtitle("Distribution of Temperature with mean Value")
> plot1

> ### Preparation of the Box Plot to see the Outliers and the other parameters
> boxplot1= ggplot(mydata, aes(x=Season, y=Temperature, fill=Season))+
geom_boxplot()+ggtitle("Box Plot for Season and Temperature")
> boxplot1

-------** END OF REPORT **------
