
Project:​ ​Predictive​ ​Analytics​ ​Capstone

Complete​ ​each​ ​section.​ ​When​ ​you​ ​are​ ​ready,​ ​save​ ​your​ ​file​ ​as​ ​a​ ​PDF​ ​document​ ​and​ ​submit​ ​it
here:
https://coco.udacity.com/nanodegrees/nd008/locale/en-us/versions/1.0.0/parts/7271/project

Task​ ​1:​ ​Determine​ ​Store​ ​Formats​ ​for​ ​Existing​ ​Stores


1. What​ ​is​ ​the​ ​optimal​ ​number​ ​of​ ​store​ ​formats?​ ​How​ ​did​ ​you​ ​arrive​ ​at​ ​that​ ​number?

The​ ​optimal​ ​number​ ​of​ ​store​ ​formats​ ​is​ ​3.​ ​To​ ​arrive​ ​at​ ​this​ ​number​ ​the​ ​following​ ​steps​ ​were
taken:

a) Extract​ ​store​ ​sales​ ​data​ ​for​ ​2015​ ​and​ ​sum​ ​the​ ​values​ ​for​ ​each​ ​store​ ​by​ ​category
b) Use​ ​K-means​ ​clustering​ ​to​ ​determine​ ​the​ ​optimum​ ​number​ ​of​ ​store​ ​clusters
based​ ​on​ ​the​ ​percentage​ ​of​ ​sales​ ​per​ ​category​ ​for​ ​2015
c) ​ ​From​ ​the​ ​K-means​ ​analysis​ ​report,​ ​the​ ​following​ ​output​ ​was​ ​produced:

d) Based on the output, the optimal number of clusters is 3 because it has:
i) The highest Adjusted Rand Index, indicating the best cluster stability
ii) The highest CH (Calinski-Harabasz) Index, indicating the best distinctness and compactness
iii) Minimal spread and a high median in both indices, with both indices agreeing on
the number of clusters
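
Steps (a), (b) and (d) can be sketched in plain Python. This is only an illustration, not the workflow that produced the report: the function names and the toy data are hypothetical, and the sketch computes a single CH value for one clustering, whereas the diagnostics above compare distributions of the index across several candidate cluster counts.

```python
from collections import defaultdict


def percent_of_sales(rows):
    """rows: iterable of (store, category, sales) tuples for one year.

    Returns {store: {category: percent of that store's total sales}}.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for store, category, sales in rows:
        totals[store][category] += sales
    result = {}
    for store, by_category in totals.items():
        store_total = sum(by_category.values())
        result[store] = {c: 100.0 * v / store_total for c, v in by_category.items()}
    return result


def calinski_harabasz(points, labels):
    """CH index: between-cluster dispersion over within-cluster dispersion,
    each divided by its degrees of freedom. Higher is better."""
    n = len(points)
    dim = len(points[0])
    clusters = sorted(set(labels))
    k = len(clusters)
    overall = [sum(p[d] for p in points) / n for d in range(dim)]
    between = 0.0
    within = 0.0
    for c in clusters:
        members = [p for p, label in zip(points, labels) if label == c]
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * sum((centroid[d] - overall[d]) ** 2 for d in range(dim))
        within += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim)) for p in members)
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical toy data: one store, two categories.
toy_rows = [("S0001", "Bakery", 25.0), ("S0001", "Bakery", 25.0), ("S0001", "Dairy", 50.0)]
toy_shares = percent_of_sales(toy_rows)

# Hypothetical toy points: two tight, well-separated groups.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_split = calinski_harabasz(toy_points, [1, 1, 2, 2])
bad_split = calinski_harabasz(toy_points, [1, 2, 1, 2])
```

A labelling that matches the natural grouping scores far higher than one that cuts across it, which is the sense in which a high CH index indicates distinct and compact clusters.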
2. How​ ​many​ ​stores​ ​fall​ ​into​ ​each​ ​store​ ​format?

The number of stores that fall into each format (cluster):

Cluster    No. of Stores
1          23
2          29
3          33

3. Based​ ​on​ ​the​ ​results​ ​of​ ​the​ ​clustering​ ​model,​ ​what​ ​is​ ​one​ ​way​ ​that​ ​the​ ​clusters​ ​differ​ ​from
one​ ​another?
The​ ​image​ ​below​ ​shows​ ​a​ ​section​ ​from​ ​the​ ​output​ ​of​ ​K-Means​ ​Clustering:

Looking at Percent of Sales for Bakery (PCSales_Bakery) and Percent of Sales for
Dairy (PCSales_Dairy), cluster 1 has a strongly negative value while cluster 2 has a
strongly positive value, so the two clusters sit at opposite ends of these variables.
Cluster 3 has a value in between clusters 1 and 2.
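
A percentage of sales cannot itself be negative, so the negative and positive values are most likely standardized scores. The sketch below assumes a z-score standardization before clustering, which the report does not state. The function name, the sample values, and the use of the population standard deviation are all hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical bakery shares (percent of sales) for three stores.
bakery_share = [8.0, 10.0, 12.0]
standardized = z_scores(bakery_share)
```

Under this assumption, a store with a below-average bakery share gets a negative score and one with an above-average share gets a positive score, which matches the opposite signs seen for clusters 1 and 2.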

4. Please​ ​provide​ ​a​ ​Tableau​ ​visualization​ ​(saved​ ​as​ ​a​ ​Tableau​ ​Public​ ​file)​ ​that​ ​shows​ ​the
location​ ​of​ ​the​ ​stores,​ ​uses​ ​color​ ​to​ ​show​ ​cluster,​ ​and​ ​size​ ​to​ ​show​ ​total​ ​sales.

https://public.tableau.com/views/Clustervisualizationv2/StoreLocationbyClusterandTotalSales?:embed=y&:display_count=yes&publish=yes
Task​ ​2:​ ​Formats​ ​for​ ​New​ ​Stores
1. What​ ​methodology​ ​did​ ​you​ ​use​ ​to​ ​predict​ ​the​ ​best​ ​store​ ​format​ ​for​ ​the​ ​new​ ​stores?​ ​Why
did​ ​you​ ​choose​ ​that​ ​methodology?​ ​(Remember​ ​to​ ​Use​ ​a​ ​20%​ ​validation​ ​sample​ ​with
Random​ ​Seed​ ​=​ ​3​ ​to​ ​test​ ​differences​ ​in​ ​models.)

Store demographic data was used to predict the best store format for the new stores.
The data contains 44 variables related to customer demographics, including:
Age, Education, Household Size, Household Income Range, Population Percentage,
Home Value, Population Density
These variables were used to develop a multinomial classification model for the store
format (cluster). Three models were compared:
i)​ ​Decision​ ​Tree​ ​Model
ii)​ ​Random​ ​Forest​ ​Model
iii)​ ​Boosted​ ​Model

The models were validated using a 20% validation sample with random seed = 3. The
validation results are as follows:

Model            Accuracy    F1        Accuracy_1    Accuracy_2    Accuracy_3
Random Forest    0.8235      0.8251    0.7500        0.8000        0.8750
Decision Tree    0.8235      0.8251    0.7500        0.8000        0.8750
Boosted          0.8235      0.8543    0.8000        0.6667        1.0000
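
The hold-out split behind these figures can be sketched as follows. This is a minimal illustration using Python's own random number generator, so seed 3 here will not reproduce the exact sample drawn by the tool used in the report. The function name and the store IDs S0001 to S0085 are hypothetical (the count of 85 comes from the cluster sizes 23 + 29 + 33).

```python
import random


def split_validation(records, validation_fraction=0.2, seed=3):
    """Shuffle with a fixed seed and hold out a fraction for validation.

    Returns (estimation_records, validation_records).
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_validation = round(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Hypothetical IDs for the 85 existing stores.
store_ids = [f"S{i:04d}" for i in range(1, 86)]
estimation, validation = split_validation(store_ids)
```

With 85 stores, a 20% sample holds out 17 stores. An overall accuracy of 0.8235 is consistent with 14 of those 17 being classified correctly.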


The​ ​confusion​ ​matrix​ ​is​ ​shown​ ​below:

The three models show the same overall accuracy (0.8235), but the boosted model has the
best F1 score at 0.8543. Since F1 combines precision and recall, the boosted model was
selected and used to predict the format of the new stores.
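
To show how accuracy and F1 can diverge, the sketch below computes both from a confusion matrix. The matrix is hypothetical and is not the one from the report. Macro-averaging the per-class F1 is also an assumption, since the report does not say how its F1 column was aggregated.

```python
def per_class_f1(confusion):
    """confusion[i][j] = number of cases with actual class i predicted as class j."""
    k = len(confusion)
    scores = []
    for c in range(k):
        true_positive = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(k))
        actual = sum(confusion[c])
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return scores


def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(confusion)
    return sum(scores) / len(scores)


def accuracy(confusion):
    """Share of all cases that lie on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical 3-class confusion matrix with 17 validation cases, 14 correct.
example_matrix = [
    [4, 0, 0],
    [1, 4, 1],
    [0, 1, 6],
]
example_accuracy = accuracy(example_matrix)
example_f1 = per_class_f1(example_matrix)
example_macro_f1 = macro_f1(example_matrix)
```

Two models can place the same number of cases on the diagonal and still differ in F1, because F1 depends on how the errors are distributed across classes.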
2. What​ ​format​ ​do​ ​each​ ​of​ ​the​ ​10​ ​new​ ​stores​ ​fall​ ​into?​ ​Please​ ​fill​ ​in​ ​the​ ​table​ ​below.

Store Number    Segment
S0086           3
S0087           2
S0088           1
S0089           2
S0090           2
S0091           1
S0092           2
S0093           1
S0094           2
S0095           2

Task​ ​3:​ ​Predicting​ ​Produce​ ​Sales


1.​ ​What​ ​type​ ​of​ ​ETS​ ​or​ ​ARIMA​ ​model​ ​did​ ​you​ ​use​ ​for​ ​each​ ​forecast?​ ​Use​ ​ETS(a,m,n)​ ​or
ARIMA(ar,​ ​i,​ ​ma)​ ​notation.​ ​How​ ​did​ ​you​ ​come​ ​to​ ​that​ ​decision?

The same candidate models were considered for all three clusters because their time
series decomposition plots share similar characteristics. The plots are shown on the
following pages.

Both​ ​ETS​ ​and​ ​ARIMA​ ​models​ ​were​ ​compared​ ​based​ ​on​ ​in-sample​ ​validation,​ ​AIC​ ​value,
and​ ​holdout​ ​sample​ ​validation​ ​to​ ​decide​ ​which​ ​model​ ​was​ ​used​ ​for​ ​the​ ​actual​ ​forecast.
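
The error measures used in the comparison can be sketched as below. This is a minimal illustration under stated assumptions: the function name and the sample numbers are hypothetical, and MASE is scaled here by the in-sample error of a seasonal naive forecast, which may differ from the scaling used by the tool that produced the report.

```python
from math import sqrt


def forecast_errors(actual, forecast, train, season=12):
    """Compare forecasts with actual holdout values.

    train is the series the model was fitted on; it is only used to scale MASE.
    """
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    rmse = sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mpe = 100.0 * sum(e / a for e, a in zip(errors, actual)) / n
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    # Mean absolute error of a naive forecast that repeats the value one season back.
    naive_mae = sum(
        abs(train[i] - train[i - season]) for i in range(season, len(train))
    ) / (len(train) - season)
    return {"RMSE": rmse, "MAE": mae, "MPE": mpe, "MAPE": mape, "MASE": mae / naive_mae}


# Hypothetical numbers, with season=1 to keep the example short.
example_errors = forecast_errors(
    actual=[100.0, 200.0],
    forecast=[90.0, 220.0],
    train=[10.0, 20.0, 30.0, 40.0],
    season=1,
)
```

For a 12-month holdout, actual would be the last 12 observations and train everything before them.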
Summary of Model Selection (explanation below):

Cluster (Store Format)    Model
1                         ETS(M,N,M)
2                         ETS(M,N,M)
3                         ETS(M,N,A)

ETS​ ​Model​ ​Building:

For the ETS model, two options were applied and compared, based on the decomposition plots:

Error: the magnitude of the error varies over time, so it is applied multiplicatively.
Trend: the trend appears to cancel out, indicating there is no trend.
Seasonality: there is a seasonal component whose magnitude varies slightly, so both the
multiplicative and the additive option were investigated.

ETS(M,N,M) and ETS(M,N,A) models were therefore applied and compared.
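
The difference between the two candidates can be sketched with the standard state-space recursions for ETS models with multiplicative error and no trend. This is a minimal illustration only: the function name, starting states, and smoothing parameters are hypothetical, and the sketch does not estimate parameters the way a forecasting tool would.

```python
def ets_no_trend_filter(y, level, seasonal, alpha, gamma, seasonal_type="M"):
    """One-step-ahead fitted values for ETS(M,N,M) or ETS(M,N,A).

    level: initial level; seasonal: initial seasonal states, one per period.
    alpha, gamma: smoothing parameters for level and seasonality.
    Returns (fitted_values, final_level, final_seasonal_states).
    """
    seasonal = list(seasonal)
    m = len(seasonal)
    fitted = []
    for t, observed in enumerate(y):
        s = seasonal[t % m]
        # Multiplicative seasonality scales the level; additive shifts it.
        mu = level * s if seasonal_type == "M" else level + s
        relative_error = (observed - mu) / mu  # multiplicative error term
        fitted.append(mu)
        if seasonal_type == "M":
            level = level * (1 + alpha * relative_error)
            seasonal[t % m] = s * (1 + gamma * relative_error)
        else:
            level = level + alpha * mu * relative_error
            seasonal[t % m] = s + gamma * mu * relative_error
    return fitted, level, seasonal


# Hypothetical 3-period season, repeated twice, with no noise.
mult_series = [80.0, 100.0, 120.0, 80.0, 100.0, 120.0]
mult_fit, mult_level, _ = ets_no_trend_filter(
    mult_series, level=100.0, seasonal=[0.8, 1.0, 1.2], alpha=0.3, gamma=0.1
)

add_series = [90.0, 100.0, 110.0]
add_fit, add_level, _ = ets_no_trend_filter(
    add_series, level=100.0, seasonal=[-10.0, 0.0, 10.0], alpha=0.3, gamma=0.1,
    seasonal_type="A",
)
```

When the seasonal swing grows with the level of the series, the multiplicative form tends to fit better; when the swing stays roughly constant, the additive form does.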
Cluster​ ​1​ ​Time​ ​Series​ ​and​ ​Time​ ​Decomposition​ ​Plots:
Cluster​ ​2​ ​Time​ ​Series​ ​and​ ​Time​ ​Decomposition​ ​Plots:
Cluster​ ​3​ ​Time​ ​Series​ ​and​ ​Time​ ​Decomposition​ ​Plots:
ARIMA​ ​Model​ ​Building:

The ACF and PACF of the original time series showed significant autocorrelation, indicating
that the data is not stationary.

The data was differenced once for the seasonal component, D(1), and once for the
non-seasonal component, d(1), resulting in a stationary series.
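
The two differencing steps can be sketched as follows. The function name and the synthetic monthly series are hypothetical.

```python
def difference(series, lag=1):
    """Return the series differenced at the given lag: x[t] - x[t - lag]."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Hypothetical monthly series: a linear trend plus a repeating 12-month pattern.
monthly = [2 * t + 5 * (t % 12) for t in range(36)]

seasonal_diff = difference(monthly, lag=12)   # D(1): removes the seasonal pattern
stationary = difference(seasonal_diff, lag=1)  # d(1): removes the remaining trend effect
```

Each differencing step shortens the series by its lag, so 36 monthly values leave 23 after D(1) and d(1).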

For the non-seasonal component, the ACF and PACF show a significant value at lag 1.
As it is negative, an MA(1) term is applied.

For the seasonal component, the ACF and PACF show a significant value at lag 12. As it
is negative, a seasonal MA(1) term is applied.
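
Checking a single lag of the ACF can be sketched as below. The function names and the sample series are hypothetical, and the significance bound is the usual large-sample approximation of 1.96 divided by the square root of the series length.

```python
from math import sqrt


def acf(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mu = sum(series) / n
    denominator = sum((x - mu) ** 2 for x in series)
    numerator = sum((series[t] - mu) * (series[t - lag] - mu) for t in range(lag, n))
    return numerator / denominator


def significance_bound(n):
    """Approximate 95% bound for the autocorrelation of white noise."""
    return 1.96 / sqrt(n)


# Hypothetical differenced series that flips sign every period.
alternating = [1.0, -1.0] * 10
lag1 = acf(alternating, 1)
bound = significance_bound(len(alternating))
suggests_ma_term = lag1 < -bound  # significant and negative at lag 1
```

The same check at lag 12 on monthly data corresponds to the seasonal term.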

The manually specified model is ARIMA(0,1,1)(0,1,1)[12]. It was compared with a fully
automatic ARIMA model for each cluster, for example ARIMA(0,1,0)(0,0,0)[12] for cluster 1.

Cluster​ ​1​ ​Time​ ​Series​ ​ACF/PACF:


Cluster​ ​1​ ​after​ ​d(1)​ ​and​ ​D(1)​ ​-​ ​Time​ ​series,​ ​ACF​ ​and​ ​PACF:
Cluster​ ​2​ ​Time​ ​Series​ ​ACF/PACF:

Cluster​ ​2​ ​after​ ​d(1)​ ​and​ ​D(1)​ ​-​ ​Time​ ​series,​ ​ACF​ ​and​ ​PACF:
Cluster​ ​3​ ​Time​ ​Series​ ​ACF/PACF:
Cluster​ ​3​ ​after​ ​d(1)​ ​and​ ​D(1)​ ​-​ ​Time​ ​series,​ ​ACF​ ​and​ ​PACF:
Cluster​ ​1​ ​Model​ ​Summary:
In-Sample​ ​Validation
Cluster 1                         AIC      RMSE        MASE    MAPE
ETS(M,N,M)                        807.7    16431.12    0.37    4.4
ETS(M,N,A)                        831.8    22234.61    0.45    5.33
ARIMA(0,1,1)(0,1,1)[12]           481.6    11807.02    0.2     2.54
ARIMA(0,1,0)(0,0,0)[12] (auto)    766.3    25487.98    0.57    6.8

Holdout​ ​Sample​ ​Validation​ ​Tables​ ​(12​ ​months​ ​holdout​ ​sample):


Holdout​ ​sample​ ​visualization​ ​-​ ​comparing​ ​all​ ​models

Cluster 1: Final Model Selection: ETS(M,N,M). This is based on it having the lowest
holdout sample RMSE, MAE, MPE and MAPE of the models compared. In addition, the
visualization shows that it closely follows the actual values throughout the holdout
period.

Cluster​ ​2​ ​Model​ ​Summary:


In-Sample​ ​Validation
Cluster 2                         AIC      RMSE        MASE    MAPE
ETS(M,N,M)                        787.2    11997.14    0.41    3.31
ETS(M,N,A)                        822.1    19691.19    0.56    4.48
ARIMA(0,1,1)(0,1,1)[12]           478      10849.46    0.29    2.33
ARIMA(1,0,0)(0,0,0)[12] (auto)    779.7    20917.39    0.71    5.75

Holdout​ ​Sample​ ​Validation​ ​(12​ ​months​ ​holdout​ ​sample)


Holdout​ ​sample​ ​visualization​ ​-​ ​comparing​ ​all​ ​models

Cluster 2: Final Model Selection: ETS(M,N,M). This is based on it having the lowest
holdout sample RMSE, MAE, MPE and MAPE of the models compared. In addition, the
visualization shows that it closely follows the actual values throughout the holdout
period.
Cluster​ ​3​ ​Model​ ​Summary:
In-Sample​ ​Validation
Cluster 3                         AIC      RMSE        MASE    MAPE
ETS(M,N,M)                        787.1    11935.17    0.38    3.97
ETS(M,N,A)                        808.7    15394.64    0.46    4.91
ARIMA(0,1,1)(0,1,1)[12]           475.6    10210.29    0.23    2.48
ARIMA(0,1,1)(0,1,0)[12] (auto)    480.1    15837.35    0.36    3.92

Holdout​ ​Sample​ ​Validation​ ​(12​ ​months​ ​holdout​ ​sample)


Holdout​ ​sample​ ​visualization​ ​-​ ​comparing​ ​all​ ​models

Cluster 3: Final Model Selection: ETS(M,N,A). This is based on it having the lowest
holdout sample RMSE, MAE, MPE and MAPE of the models compared. In addition, the
visualization shows that it closely follows the actual values throughout the holdout
period.

2.​ ​Please​ ​provide​ ​a​ ​Tableau​ ​Dashboard​ ​(saved​ ​as​ ​a​ ​Tableau​ ​Public​ ​file)​ ​that​ ​includes​ ​a
table​ ​and​ ​a​ ​plot​ ​of​ ​the​ ​three​ ​monthly​ ​forecasts;​ ​one​ ​for​ ​existing,​ ​one​ ​for​ ​new,​ ​and​ ​one​ ​for
all​ ​stores.​ ​Please​ ​name​ ​the​ ​tab​ ​in​ ​the​ ​Tableau​ ​file​ ​"Task​ ​3".

https://public.tableau.com/views/ProduceSalesDashboard/ExistingandForecastProduceSales?:embed=y&:display_count=yes&publish=yes
Before​ ​you​ ​submit

Please​ ​check​ ​your​ ​answers​ ​against​ ​the​ ​requirements​ ​of​ ​the​ ​project​ ​dictated​ ​by​ ​the​ ​rubric.
Reviewers​ ​will​ ​use​ ​this​ ​rubric​ ​to​ ​grade​ ​your​ ​project.
