Udacity Business Analyst Project 8

Project: Predictive Analytics Capstone
Complete each section. When you are ready, save your file as a PDF document and submit it
here:
https://coco.udacity.com/nanodegrees/nd008/locale/en-us/versions/1.0.0/parts/7271/project
Task 1: Determine Store Formats for Existing Stores

1. What is the optimal number of store formats? How did you arrive at that number?
The optimal number of store formats is 3. The following steps were taken to arrive at this
number:
a) Extracted store sales data for 2015 and summed the values for each store by category.
b) Used K-means clustering to determine the optimal number of store clusters,
based on each category's percentage of a store's total 2015 sales.
c) From the K-means analysis report, the following output was produced:
d) Based on the output, the optimal number of clusters is 3 because it has:
i) The highest Adjusted Rand Index, indicating the best cluster stability
ii) The highest Calinski-Harabasz (CH) Index, indicating the most distinct and compact clusters
iii) Minimal spread and a high median, with both indices pointing to the same
number of clusters
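The preparation and scoring steps above can be sketched in a few lines. This is a minimal illustration and not the tool output used in the report: the function names `percent_of_sales` and `calinski_harabasz`, the category totals, and the four sample points are all hypothetical. It shows how category totals become percentages of store sales, and how the CH index rewards clusters that are compact and well separated.

```python
from collections import defaultdict


def percent_of_sales(category_totals):
    """Convert one store's category totals into percentages of its total sales."""
    total = sum(category_totals.values())
    return {cat: 100.0 * value / total for cat, value in category_totals.items()}


def calinski_harabasz(points, labels):
    """CH index: between-cluster dispersion over within-cluster dispersion,
    each scaled by its degrees of freedom. Higher means more distinct, compact clusters."""
    n = len(points)
    dims = len(points[0])
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    k = len(groups)
    overall = [sum(p[d] for p in points) / n for d in range(dims)]
    between = 0.0
    within = 0.0
    for members in groups.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dims)]
        between += len(members) * sum((centroid[d] - overall[d]) ** 2 for d in range(dims))
        within += sum(
            sum((p[d] - centroid[d]) ** 2 for d in range(dims)) for p in members
        )
    return (between / (k - 1)) / (within / (n - k))


# Made-up category totals for a single store (not taken from the project data).
shares = percent_of_sales({"Bakery": 20.0, "Dairy": 30.0, "Produce": 50.0})

# Made-up 2-D points: one labelling separates the two groups, the other mixes them.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = calinski_harabasz(points, [0, 0, 1, 1])
bad = calinski_harabasz(points, [0, 1, 0, 1])
```

In this toy case the labelling that keeps nearby points together scores far higher than the one that mixes them, which is the behaviour relied on when comparing candidate numbers of clusters.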
2. How many stores fall into each store format?
The number of stores that fall into each format (cluster):
Cluster   No. of Stores
1         23
2         29
3         33
3. Based on the results of the clustering model, what is one way that the clusters differ from
one another?
The image below shows a section from the output of K-Means Clustering:
Referring to the percent of sales for Bakery (PCSales_Bakery) and for Dairy
(PCSales_Dairy), cluster 1 has a strongly negative value while cluster 2 has a strongly
positive value, so the two clusters sit at opposite ends for these categories. Cluster 3
has a value in between clusters 1 and 2.
4. Please provide a Tableau visualization (saved as a Tableau Public file) that shows the
location of the stores, uses color to show cluster, and size to show total sales.
https://public.tableau.com/views/Clustervisualizationv2/StoreLocationbyClusterandTotalSales?:embed=y&:display_count=yes&publish=yes
Task 2: Formats for New Stores
1. What methodology did you use to predict the best store format for the new stores? Why
did you choose that methodology? (Remember to Use a 20% validation sample with
Random Seed = 3 to test differences in models.)
Store demographic data was used to predict the best store format for the new stores.
The data contains 44 variables related to customer demographics, including:
Age, Education, Household Size, Household Income Range, Population Percentage,
Home Value, Population Density
The variables were used to develop a multinomial classification model for the store
format (cluster). Three models were compared:
i) Decision Tree Model
ii) Random Forest Model
iii) Boosted Model
The models were validated using a 20% validation sample with random seed = 3. The results of
the validation are as follows:
Model          Accuracy  F1      Accuracy_1  Accuracy_2  Accuracy_3
Random Forest  0.8235    0.8251  0.7500      0.8000      0.8750
Decision Tree  0.8235    0.8251  0.7500      0.8000      0.8750
Boosted        0.8235    0.8543  0.8000      0.6667      1.0000

The confusion matrix is shown below:
The three models show the same overall accuracy (0.8235), but the boosted model has the best
F1 score at 0.8543. Since F1 combines precision and recall, the boosted model was selected
and used to predict the format of the new stores.
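To make the comparison concrete, the sketch below derives overall accuracy and a macro-averaged F1 score from a three-class confusion matrix. The matrix is illustrative only and is not the confusion matrix from the report; the helper names and the choice of macro averaging are assumptions, since the report does not state how its F1 value was averaged.

```python
def accuracy(confusion):
    """Share of samples on the diagonal (actual class equals predicted class)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_f1(confusion):
    """confusion[i][j] = number of samples with actual class i predicted as class j."""
    k = len(confusion)
    scores = []
    for c in range(k):
        true_pos = confusion[c][c]
        predicted_c = sum(confusion[r][c] for r in range(k))  # column total
        actual_c = sum(confusion[c])                          # row total
        precision = true_pos / predicted_c if predicted_c else 0.0
        recall = true_pos / actual_c if actual_c else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(confusion)
    return sum(scores) / len(scores)


# Hypothetical 17-store validation sample (rows: actual cluster, columns: predicted).
example = [
    [4, 0, 0],
    [1, 4, 0],
    [0, 2, 6],
]
example_accuracy = accuracy(example)
example_f1 = macro_f1(example)
```

Two models can share the same accuracy while differing in F1, because F1 depends on how the errors are spread across classes rather than only on how many there are.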
2. What format do each of the 10 new stores fall into? Please fill in the table below.
Store Number   Segment
S0086          3
S0087          2
S0088          1
S0089          2
S0090          2
S0091          1
S0092          2
S0093          1
S0094          2
S0095          2
Task 3: Predicting Produce Sales

1. What type of ETS or ARIMA model did you use for each forecast? Use ETS(a,m,n) or
ARIMA(ar, i, ma) notation. How did you come to that decision?
The model parameters were similar across clusters because their time series
decomposition plots shared similar characteristics. These are shown in the pages that follow.
Both ETS and ARIMA models were compared on in-sample validation, AIC value,
and holdout sample validation to decide which model to use for the actual forecast.
Summary of Model Selection (explanation below):
Cluster (Store Format)   Model
1                        ETS(M,N,M)
2                        ETS(M,N,M)
3                        ETS(M,N,A)
ETS Model Building:
For the ETS model, two options were applied and compared:

The error component is multiplicative, as its magnitude varies over time.
The trend component appears to cancel out, indicating that there is no trend.
There is a seasonal component whose magnitude varies slightly, so both multiplicative
and additive seasonality were investigated.
ETS(M,N,M) and ETS(M,N,A) models were applied and compared.
Cluster 1 Time Series and Time Decomposition Plots:
ARIMA Model Building:
The time series ACF and PACF showed significant autocorrelation, indicating that the data
is not stationary.
The data was differenced once for the seasonal component and once for the
non-seasonal component (d = 1 and D = 1), resulting in a stationary series.
For the non-seasonal component, the ACF and PACF show a significant value at lag 1.
As it is negative, an MA(1) term is applied.
For the seasonal component, the ACF and PACF show a significant value at lag 12. As it
is negative, a seasonal MA(1) term is applied.
The ARIMA model selected is ARIMA(0,1,1)(0,1,1)[12], which was compared with a fully
automatic ARIMA model, ARIMA(0,1,0)(0,0,0)[12].
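The differencing step (d = 1 and D = 1 with a 12-month season) can be illustrated with a small sketch. The series here is synthetic, built from a linear trend plus a repeating 12-month pattern, and the function names are hypothetical; it is not the produce sales data.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def seasonal_then_first_difference(series, season=12):
    """Apply one seasonal difference (D = 1) followed by one first difference (d = 1)."""
    return difference(difference(series, lag=season), lag=1)


# Synthetic monthly series: linear trend plus a fixed 12-month seasonal pattern.
pattern = [5, 3, 0, -2, -4, -6, -5, -3, 0, 2, 4, 6]
monthly = [100 + 2 * t + pattern[t % 12] for t in range(36)]

stationary = seasonal_then_first_difference(monthly, season=12)
```

The seasonal difference removes the repeating pattern and turns the linear trend into a constant; the first difference then removes that constant. Each differencing pass shortens the series, so 36 observations leave 23 values after both passes.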
Cluster 1 Time Series ACF/PACF:

Cluster 1 after d(1) and D(1) - Time series, ACF and PACF:
Cluster 1 Model Summary:
In-Sample Validation
Cluster 1                        AIC    RMSE      MASE  MAPE
ETS(M,N,M)                       807.7  16431.12  0.37  4.4
ETS(M,N,A)                       831.8  22234.61  0.45  5.33
ARIMA(0,1,1)(0,1,1)[12]          481.6  11807.02  0.2   2.54
ARIMA(0,1,0)(0,0,0)[12] (auto)   766.3  25487.98  0.57  6.8
Holdout Sample Validation Tables (12 months holdout sample):

Holdout sample visualization - comparing all models
Cluster 1: Final Model Selection: ETS(M,N,M) - this is based on the lowest holdout
sample values for RMSE, MAE, MPE and MAPE compared to the other models. In addition,
the visualization shows that its forecast closely follows the actual values
throughout the holdout period.
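The holdout measures named above can be computed as in the following sketch. It is an illustration under stated assumptions: errors are taken as actual minus forecast, MPE and MAPE are expressed in percent, and MASE is scaled by the in-sample naive forecast error at a lag given by the hypothetical `season` parameter. The numbers are made up and are not the produce sales figures.

```python
import math


def holdout_metrics(actual, forecast, training, season=1):
    """Compare forecasts with held-out actuals.

    Assumptions: error = actual - forecast; MASE divides the holdout MAE by the
    mean absolute error of a naive forecast (lag = season) on the training data.
    """
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mae = sum(abs(e) for e in errors) / n
    mpe = 100.0 * sum(e / a for e, a in zip(errors, actual)) / n
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    naive_errors = [
        abs(training[i] - training[i - season]) for i in range(season, len(training))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    return {"RMSE": rmse, "MAE": mae, "MPE": mpe, "MAPE": mape, "MASE": mae / scale}


# Made-up values: two holdout periods and a short training history.
metrics = holdout_metrics(
    actual=[100.0, 200.0],
    forecast=[110.0, 180.0],
    training=[100.0, 110.0, 120.0, 130.0],
)
```

MPE can be close to zero even when individual errors are large, because over- and under-forecasts cancel; this is why it is read alongside MAPE and RMSE rather than on its own.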

Cluster 2 Model Summary:
Cluster 2                        AIC    RMSE      MASE  MAPE
ETS(M,N,M)                       787.2  11997.14  0.41  3.31
ETS(M,N,A)                       822.1  19691.19  0.56  4.48
ARIMA(0,1,1)(0,1,1)[12]          478    10849.46  0.29  2.33
ARIMA(1,0,0)(0,0,0)[12] (auto)   779.7  20917.39  0.71  5.75
Holdout Sample Validation (12 months holdout sample)

Cluster 2: Final Model Selection: ETS(M,N,M) - this is based on lowest values for
Cluster 3                        AIC    RMSE      MASE  MAPE
ETS(M,N,M)                       787.1  11935.17  0.38  3.97
ETS(M,N,A)                       808.7  15394.64  0.46  4.91
ARIMA(0,1,1)(0,1,1)[12]          475.6  10210.29  0.23  2.48
ARIMA(0,1,1)(0,1,0)[12] (auto)   480.1  15837.35  0.36  3.92
Holdout Sample Validation (12 months holdout sample)

Cluster 3: Final Model Selection: ETS(M,N,A) - this is based on lowest values for
2. Please provide a Tableau Dashboard (saved as a Tableau Public file) that includes a
table and a plot of the three monthly forecasts; one for existing, one for new, and one for
all stores. Please name the tab in the Tableau file "Task 3".
https://public.tableau.com/views/ProduceSalesDashboard/ExistingandForecastProduceSales?:embed=y&:display_count=yes&publish=yes
Before you submit
Please check your answers against the requirements of the project dictated by the rubric.
Reviewers will use this rubric to grade your project.
