Complete each section. When you are ready, save your file as a PDF document and submit it
here:
https://coco.udacity.com/nanodegrees/nd008/locale/en-us/versions/1.0.0/parts/7271/project
The optimal number of store formats is 3. To arrive at this number the following steps were
taken:
a) Extract store sales data for 2015 and sum the values for each store by category
b) Use K-means clustering to determine the optimum number of store clusters
based on the percentage of sales per category for 2015
c) From the K-means analysis report, the following output was produced:
d) Based on the output, the optimal number of clusters is 3 because it shows:
i) The highest Adjusted Rand Index, indicating the best cluster stability
ii) The highest Calinski-Harabasz (CH) Index, indicating the best distinctness and compactness
iii) Minimal spread and a high median, with both index calculations pointing to
the same number of clusters
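Steps a) and b), and the CH Index criterion in d), can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the workflow that produced the report: the function names `percent_of_sales` and `calinski_harabasz` are hypothetical, the input layout (per-store category totals in a dict) is assumed, and the index is computed for a single labeling rather than over repeated clustering runs.

```python
from collections import defaultdict


def percent_of_sales(store_sales):
    """Convert each store's category totals into percentages of its total sales.

    store_sales is an assumed layout: {store_id: {category: yearly_total}}.
    """
    shares = {}
    for store, by_category in store_sales.items():
        total = sum(by_category.values())
        shares[store] = {cat: 100.0 * amount / total
                         for cat, amount in by_category.items()}
    return shares


def calinski_harabasz(points, labels):
    """CH index: between-cluster dispersion over within-cluster dispersion,
    each divided by its degrees of freedom (k - 1 and n - k).
    Higher values mean more distinct and more compact clusters."""
    n, dims = len(points), len(points[0])
    overall = [sum(p[d] for p in points) / n for d in range(dims)]

    members = defaultdict(list)
    for point, label in zip(points, labels):
        members[label].append(point)
    k = len(members)

    between = within = 0.0
    for group in members.values():
        centroid = [sum(p[d] for p in group) / len(group) for d in range(dims)]
        between += len(group) * sum((c - o) ** 2 for c, o in zip(centroid, overall))
        within += sum((p[d] - centroid[d]) ** 2 for p in group for d in range(dims))
    return (between / (k - 1)) / (within / (n - k))
```

Comparing this value across candidate cluster counts, together with a stability measure such as the Adjusted Rand Index, follows the reasoning in d).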
2. How many stores fall into each store format?
Number of stores that fall into each format (cluster):
Cluster No of Stores
1 23
2 29
3 33
3. Based on the results of the clustering model, what is one way that the clusters differ from
one another?
The image below shows a section from the output of K-Means Clustering:
Referring to Percent of Sales for Bakery (PCSales_Bakery) and Percent of Sales for
Dairy (PCSales_Dairy), we can observe that cluster 1 has a strongly negative value while
cluster 2 has a strongly positive value, indicating that they sit at opposite ends on these
variables. Cluster 3 has a value in between those of clusters 1 and 2.
4. Please provide a Tableau visualization (saved as a Tableau Public file) that shows the
location of the stores, uses color to show cluster, and size to show total sales.
https://public.tableau.com/views/Clustervisualizationv2/StoreLocationbyClusterandTotalSales?:embed=y&:display_count=yes&publish=yes
Task 2: Formats for New Stores
1. What methodology did you use to predict the best store format for the new stores? Why
did you choose that methodology? (Remember to Use a 20% validation sample with
Random Seed = 3 to test differences in models.)
Store demographic data was used to predict the best store format for the new stores.
The data contains 44 variables related to customer demographics, including:
Age, Education, Household Size, Household Income Range, Population Percentage,
Home Value, Population Density
The variables were used to develop a multinomial classification model for the store
format (cluster). Three models were compared:
i) Decision Tree Model
ii) Random Forest Model
iii) Boosted Model
The models were validated using a 20% validation sample with random seed = 3. The output of the
validation is as follows:
The 3 models show similar accuracy, but the boosted model shows a better F1 score at 0.8543.
Since the F1 score is the harmonic mean of precision and recall, the boosted model was selected
and used to predict the format of the new stores.
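To make the comparison concrete, the sketch below shows one way to hold out 20% of the stores with a fixed seed and to score a classifier with F1. It is an illustration only: the function names are hypothetical, a Python shuffle with seed 3 will not reproduce the split produced by the tool used in the report, and macro-averaging over the three formats is an assumption, since the report does not say how its F1 score was aggregated.

```python
import random


def split_validation(rows, fraction=0.2, seed=3):
    """Shuffle with a fixed seed and hold out `fraction` of the rows for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * fraction)
    return shuffled[cut:], shuffled[:cut]  # (training, validation)


def per_class_f1(actual, predicted):
    """Return {label: F1}, where F1 is the harmonic mean of precision and recall."""
    scores = {}
    for label in sorted(set(actual) | set(predicted)):
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[label] = (2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return scores


def macro_f1(actual, predicted):
    """Unweighted mean of the per-class F1 scores (an assumed aggregation)."""
    scores = per_class_f1(actual, predicted)
    return sum(scores.values()) / len(scores)
```

Because F1 penalizes a model that trades recall for precision (or the reverse), it can separate models whose overall accuracy looks alike.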
2. What format do each of the 10 new stores fall into? Please fill in the table below.
The model parameters were similar across the clusters because their time series
decomposition plots shared similar characteristics. These are shown in the coming pages.
Both ETS and ARIMA models were compared based on in-sample validation, AIC value,
and holdout sample validation to decide which model was used for the actual forecast.
Summary of Model Selection (explanation below):
Cluster (Store Format) Model
1 ETS(M,N,M)
2 ETS(M,N,M)
3 ETS(M,N,A)
The time series, ACF and PACF plots showed significant autocorrelation, indicating that the
data is not stationary.
The data was differenced once for the seasonal component and once for the
non-seasonal component (d(1) and D(1)), resulting in a stationary series.
For the non-seasonal component, the ACF and PACF plots show a significant value at lag 1.
As it is negative, an MA(1) term is applied.
For the seasonal component, the ACF and PACF plots show a significant value at lag 12. As it
is negative, a seasonal MA(1) term is applied.
The ARIMA model selected is ARIMA(0,1,1)(0,1,1)[12]. It was compared with a fully automatic
ARIMA model, ARIMA(0,1,0)(0,0,0)[12].
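The differencing and the lag check described above can be sketched as follows. This is a minimal illustration, assuming monthly data with a 12-month season; the function names and the toy series are hypothetical and are not the report's data.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def seasonal_then_first_difference(series, period=12):
    """Apply D(1) (lag = period), then d(1) (lag = 1)."""
    return difference(difference(series, lag=period), lag=1)


def autocorrelation(series, lag):
    """Sample autocorrelation at the given lag (the quantity plotted in an ACF)."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


# Toy series (made up): a linear trend plus a repeating 12-month pattern.
pattern = [5, 3, 0, -2, -4, -5, -3, 0, 2, 4, 6, 8]
toy = [2 * t + pattern[t % 12] for t in range(36)]

# Both the trend and the seasonal pattern cancel, so every value is 0.
print(seasonal_then_first_difference(toy))
```

On real data the differenced series is not flat. A significantly negative autocorrelation at lag 1, and at lag 12 for the seasonal part, is the pattern the text uses to justify the MA(1) terms.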
Cluster 2 after d(1) and D(1) - Time series, ACF and PACF:
Cluster 3 Time Series ACF/PACF:
Cluster 3 after d(1) and D(1) - Time series, ACF and PACF:
Cluster 1 Model Summary:
In-Sample Validation
Cluster 1 AIC RMSE MASE MAPE
Cluster 1: Final Model Selection: ETS(M,N,M). This is based on the lowest values for
holdout sample RMSE, MAE, MPE and MAPE compared to the other model. In addition,
the visualization of the models shows that it closely follows the actual values
over the holdout period.
Cluster 2: Final Model Selection: ETS(M,N,M). This is based on the lowest values for
holdout sample RMSE, MAE, MPE and MAPE compared to the other model. In addition,
the visualization of the models shows that it closely follows the actual values
over the holdout period.
Cluster 3 Model Summary:
In-Sample Validation
Cluster 3 AIC RMSE MASE MAPE
Cluster 3: Final Model Selection: ETS(M,N,A). This is based on the lowest values for
holdout sample RMSE, MAE, MPE and MAPE compared to the other model. In addition,
the visualization of the models shows that it closely follows the actual values
over the holdout period.
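The error measures used in the model selection above can be computed as in the sketch below. It is a minimal illustration with hypothetical function names; errors are taken as actual minus forecast (a sign convention assumed here), and MASE is scaled by the in-sample error of a naive forecast at the chosen lag.

```python
import math


def holdout_errors(actual, forecast):
    """RMSE, MAE, MPE and MAPE for a holdout sample (error = actual - forecast)."""
    errors = [a - f for a, f in zip(actual, forecast)]
    n = len(errors)
    return {
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MPE": 100.0 * sum(e / a for e, a in zip(errors, actual)) / n,
        "MAPE": 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
    }


def mase(actual, forecast, training, period=1):
    """MAE of the forecast divided by the in-sample MAE of a naive forecast
    that repeats the value observed `period` steps earlier."""
    naive_mae = sum(abs(training[t] - training[t - period])
                    for t in range(period, len(training))) / (len(training) - period)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / naive_mae
```

Running both candidate models through `holdout_errors` on the same holdout months and preferring the one with lower values matches the selection rule stated for each cluster. A MASE below 1 means the forecast beats the naive benchmark.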
2. Please provide a Tableau Dashboard (saved as a Tableau Public file) that includes a
table and a plot of the three monthly forecasts; one for existing, one for new, and one for
all stores. Please name the tab in the Tableau file "Task 3".
https://public.tableau.com/views/ProduceSalesDashboard/ExistingandForecastProduceSales?:embed=y&:display_count=yes&publish=yes
Before you submit
Please check your answers against the requirements of the project dictated by the rubric.
Reviewers will use this rubric to grade your project.