
Can you outline the data collection process, your view of how the data would end up in Spark, and how you would integrate with Spark's systems?

The data collection process largely depends on Spark's source systems (flat files, log files, APIs, RDBMS, NoSQL). The idea is to build a data ingestion pipeline that ensures data is pulled from the source systems on a scheduled basis.

The process of bringing data from the source systems into the data ingestion layer, and from there into the Data Warehouse, would be automated using the following:
– Sqoop: for relational data
– Batch jobs: scheduled MR/Spark-based jobs to pull data from sources
– Oozie: for scheduling and workflow management
– API(s): in case Spark can invoke APIs in real time
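The tool-per-source-type mapping above can be sketched as a small dispatch function. This is purely illustrative: the `pick_ingestion_tool` helper and the source-type labels are assumptions, not part of the proposed system.

```python
# Illustrative dispatch from source-system type to the ingestion mechanism
# named in the list above; names here are hypothetical.

def pick_ingestion_tool(source_type: str, realtime_api: bool = False) -> str:
    """Map a source-system type to an ingestion mechanism."""
    if realtime_api:
        return "API"  # Spark can invoke the API in real time
    mapping = {
        "rdbms": "Sqoop",          # relational data
        "flat_file": "Batch job",  # scheduled MR/Spark-based pull
        "log_file": "Batch job",
        "nosql": "Batch job",
    }
    tool = mapping.get(source_type.lower())
    if tool is None:
        raise ValueError(f"unknown source type: {source_type}")
    return tool

print(pick_ingestion_tool("rdbms"))      # Sqoop
print(pick_ingestion_tool("log_file"))   # Batch job
```

All jobs, whichever tool runs them, would be orchestrated through Oozie as per the list above.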
Feedback from merchant POS to the platform

Redemption Process
 The user goes to the merchant to redeem the offer; the platform sends the mobile user an offer and a QR code
 A version of the mobile app will be built for merchants
 Merchants will log in to the app using their merchant ID
 The user's offer will be redeemed by scanning the QR code, either in digital or printed form
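A minimal sketch of the redemption flow above, assuming an in-memory store and one-time-use QR tokens; the `OfferStore` class and its fields are illustrative assumptions, not the platform's actual design.

```python
# Toy model of issue-offer / scan-QR / redeem-once, per the flow above.
import uuid

class OfferStore:
    def __init__(self):
        self._offers = {}  # qr_token -> {"user": ..., "redeemed": bool}

    def issue_offer(self, user_id: str) -> str:
        """Platform sends the mobile user an offer with a QR token."""
        token = uuid.uuid4().hex
        self._offers[token] = {"user": user_id, "redeemed": False}
        return token

    def redeem(self, token: str, merchant_id: str) -> bool:
        """Merchant app scans the QR code; each token redeems only once."""
        offer = self._offers.get(token)
        if offer is None or offer["redeemed"]:
            return False  # unknown token or already used
        offer["redeemed"] = True
        offer["merchant"] = merchant_id
        return True

store = OfferStore()
qr = store.issue_offer("user-42")
print(store.redeem(qr, "merchant-7"))   # True  (first scan succeeds)
print(store.redeem(qr, "merchant-7"))   # False (second scan rejected)
```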
Proposed approach to Segmentation (2/4)

Various segmentation techniques

Two-Step Clustering
 A distance-based technique that divides the data into many sub-clusters and then creates super-clusters as segments.
 The first step is pre-clustering/sequential clustering, followed by the final cluster formation.

Expectation Maximization Algorithm (Latent Gold)
 A probabilistic method of dividing the data into different clusters based on the maximum likelihood of each case belonging to a particular cluster; classification is based on similarity in response patterns.

Hierarchical – Agglomerative & Divisive Methods
 Agglomerative: "bottom-up" approach where each observation starts in its own cluster and pairs of clusters are merged moving up the hierarchy.
 Divisive: "top-down" approach where all observations start in one cluster and splits occur recursively moving down the hierarchy.

CHAID – Chi-Square Automatic Interaction Detection
 Identifies relationships between a dependent variable and a series of predictor variables.
 Selects the set of predictors and their interactions that optimally predict the dependent measure.

K-Means Algorithm
 A non-hierarchical process following combined methods of parallel-threshold and optimization techniques.
 Aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
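As a sketch of the last technique above, a toy k-means (Lloyd's algorithm) that partitions n observations into k clusters by nearest mean. The data, the deterministic farthest-point initialization, and the `kmeans` helper are illustrative assumptions; a production run would use e.g. Spark MLlib or scikit-learn.

```python
# Toy k-means: assign to nearest mean, recompute means, repeat.
import numpy as np

def kmeans(points: np.ndarray, k: int, iters: int = 50):
    # Farthest-point initialisation keeps this toy example deterministic
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each observation to the cluster with the nearest mean
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each cluster mean (keep old center if a cluster empties)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if (labels == j).any() else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated blobs -> k-means recovers them
pts = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels, _ = kmeans(pts, k=2)
print(labels[:3], labels[3:])  # first three share one label, last three the other
```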
Proposed approach to Segmentation (3/4)

Illustration of multi-level hierarchical segmentation

[Tree diagram] Consumer Base → Level 1 Segments: Cross Trainer (30%), Yogi (25%), Uninvolved (29%) → Level 2 Segments: Cross Trainer, Yogi and Uninvolved sub-segments, plus two Casual Lifestyle segments (16% each)

 Final solution had 7 robust segments after iterations
 Deep-dive profiling was done for each segment to bring them to "life"
Proposed approach to Segmentation (4/4)

Integration with Spark’s current segmentation

We suggest the following two approaches to integrating Spark's current segments:

1. Understand Spark's current segments and integrate them into the hierarchy accordingly, or
2. If any overlaps are observed, clearly map the new segments to Spark's existing segments via cross-tabbing etc.
                       New Segments
                 NS1   NS2   NS3   NS4   ...   NSn
Existing  ES1    x11   x12   x13   x14   ...   x1n
Segments  ES2    ...   ...   ...   ...   ...   ...
          ES3    ...   ...   ...   ...   ...   ...
          ES4    ...   ...   ...   ...   ...   ...
          ...    ...   ...   ...   ...   ...   ...
          ESm    ...   ...   ...   ...   ...   xmn

* Illustrative mapping
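The mapping matrix above is a standard cross-tabulation: each cell x_ij counts customers who fall in existing segment i and new segment j. A minimal sketch with pandas, using made-up segment assignments:

```python
# Cross-tab of existing vs. new segment membership; data is synthetic.
import pandas as pd

df = pd.DataFrame({
    "existing_segment": ["ES1", "ES1", "ES2", "ES2", "ES2", "ES3"],
    "new_segment":      ["NS1", "NS1", "NS2", "NS3", "NS2", "NS1"],
})

# Each cell x_ij = number of customers in existing segment i and new segment j
mapping = pd.crosstab(df["existing_segment"], df["new_segment"])
print(mapping)
```

High counts concentrated in a few cells would indicate the overlaps mentioned in approach 2 above.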
Visibility of Algorithms

As part of the overall proposed methodology, we will share details of the various statistical algorithms, and their running sequence, that we will use for Customer Segmentation, Personalized Recommendation Generation, etc.
Further, there will be a mechanism to specify business constraints in these algorithms to ensure viability of results.
Indicative modeling techniques and Input / Output parameters

Customer Segmentation → Response → Recommendation Generator → ...

 Inputs: Demographics; Purchasing Behavior; Needs & Preferences; Campaign Response etc.
 Outputs: Multi-dimensional customer segments
 Model Outputs: Prioritized offers by segments; Key Drivers Analysis
 Final Outputs: Personalized recommendations
Indicative business constraints that either Spark or merchants can specify:
 An offer being valid only for males above 18 years of age
 Offers that can or cannot be combined together
 Maximum number of offer redemptions not to exceed 10,000
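The constraints above can be applied as simple filters on candidate recommendations. A minimal sketch, assuming hypothetical offer/customer field names (`min_age`, `gender`, `max_redemptions`) that are not part of the actual specification:

```python
# Illustrative business-constraint checks on a candidate offer.

def offer_is_eligible(offer: dict, customer: dict) -> bool:
    """Demographic constraint: e.g. males strictly above 18 years of age."""
    if "min_age" in offer and customer["age"] <= offer["min_age"]:
        return False  # "above 18" means strictly older than 18
    if "gender" in offer and customer["gender"] != offer["gender"]:
        return False
    return True

def within_redemption_cap(offer: dict) -> bool:
    """Volume constraint: redemptions must not exceed the cap (default 10,000)."""
    return offer.get("redemptions", 0) < offer.get("max_redemptions", 10_000)

offer = {"id": "O1", "gender": "M", "min_age": 18,
         "redemptions": 9_999, "max_redemptions": 10_000}
print(offer_is_eligible(offer, {"age": 25, "gender": "M"}))  # True
print(offer_is_eligible(offer, {"age": 16, "gender": "M"}))  # False
print(within_redemption_cap(offer))                          # True
```

Combinability constraints would be handled analogously, as a pairwise check over the offers already in a customer's recommendation set.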
Tracking performance of Machine Learning module (1/2)

We propose to deploy several mechanisms to track the performance of recommendations and the overall machine learning module. Given below is an indicative list:
1. Track recommended vs. actual conversions through a front-end UI screen
2. Advanced screens to track model accuracy (predicted vs. actual), including decile charts etc.
3. Conduct periodic control tests to measure the efficacy of recommendations (conversion, sales lift etc.)

[Indicative Decile Chart: conversion rate by score decile, 1st through 10th] [Indicative Test vs. Control Chart: Test (with recommendations) vs. Control (without); the gap between the lines is the lift attributed to recommendations]
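A decile chart like the one indicated above is built by ranking customers on their predicted score, splitting them into ten equal groups, and plotting the actual conversion rate per group. A minimal sketch on synthetic scores and outcomes (all numbers below are made up for illustration):

```python
# Build decile-level conversion rates from (score, actual) pairs.
import random

random.seed(1)
n = 1000
scores = [random.random() for _ in range(n)]
# Synthetic "actual" outcomes that loosely follow the score
actuals = [1 if random.random() < s else 0 for s in scores]

# Rank by predicted score, highest first, and split into 10 equal deciles
ranked = sorted(zip(scores, actuals), key=lambda t: t[0], reverse=True)
decile_size = n // 10
decile_rates = [
    sum(a for _, a in ranked[i * decile_size:(i + 1) * decile_size]) / decile_size
    for i in range(10)
]
print([round(r, 2) for r in decile_rates])  # rates should fall from 1st to 10th decile
```

A well-calibrated model shows monotonically decreasing rates across deciles; a flat chart would signal that the model's ranking adds little over random targeting.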