A view of the source systems: what they are and how they would integrate with Spark's systems.
The data collection process depends largely on Spark's source systems. The idea is to build a data ingestion pipeline
that pulls data from the source systems on a scheduled basis.
[Figure: Spark's source systems (Flat Files, Log Files, APIs, RDBMS, NoSQL) feed the data ingestion layer, which loads the Data Warehouse.]
The process of bringing data from the source systems would be automated using the following technologies (an illustrative batch-job sketch follows the list):
– Sqoop: for relational data transfer
– Batch Jobs: scheduled MR/Spark-based jobs to pull data from source systems
– Oozie: for scheduling and workflow management
– API(s): in case Spark can invoke APIs on a real-time basis
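For illustration, below is a minimal sketch of such a scheduled batch job in PySpark, pulling one relational table into the warehouse layer. The connection URL, credentials, table name, and warehouse path are hypothetical placeholders, not Spark's actual systems.

```python
from pyspark.sql import SparkSession

# Minimal sketch of a scheduled batch ingestion job (e.g. triggered by Oozie).
# All connection details below are hypothetical placeholders.
spark = SparkSession.builder.appName("source-system-ingest").getOrCreate()

# Pull a relational table over JDBC (the Sqoop-style path for RDBMS sources)
transactions = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://source-db:3306/pos")  # hypothetical host/db
    .option("dbtable", "transactions")                 # hypothetical table
    .option("user", "etl_user")
    .option("password", "etl_password")
    .load()
)

# Land the pull in the warehouse layer as Parquet, partitioned by date
(
    transactions.write.mode("append")
    .partitionBy("txn_date")                           # assumes a txn_date column
    .parquet("hdfs:///warehouse/raw/transactions")     # hypothetical path
)

spark.stop()
```

In production, a job of this kind would be wrapped in an Oozie coordinator so the pull runs on the agreed schedule.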
Feedback from merchant POS to the platform
Redemption Process
Two-Step Clustering: A distance-based technique that divides the data into many sub-clusters and then creates super-clusters as segments. The first step is pre-clustering/sequential clustering, followed by the final cluster formation.
Expectation Maximization Algorithm (Latent Gold): A probabilistic method of dividing the data into different clusters based on the maximum likelihood of each case belonging to a particular cluster; classification is based on similarity in response patterns.
Hierarchical Agglomerative & Divisive Methods: Agglomerative is a "bottom-up" approach where each observation starts in its own cluster and pairs of clusters are merged moving up the hierarchy (see the sketch after this table). Divisive is a "top-down" approach where all observations start in one cluster and splits occur recursively moving down the hierarchy.
CHAID (Chi-Square Automatic Interaction Detector): Identifies the relationship between a dependent variable and a series of predictor variables, and selects the set of predictors and their interactions that optimally predict the dependent measure.
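To make one of these techniques concrete, below is a minimal agglomerative clustering sketch using scikit-learn. The feature set, sample data, and four-segment choice are illustrative assumptions, not the actual segmentation model.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler

# Illustrative only: random stand-ins for real customer features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # e.g. spend, visit frequency, basket size

# Scale features so no single dimension dominates the distance metric
X_scaled = StandardScaler().fit_transform(X)

# Bottom-up (agglomerative) clustering: each observation starts in its
# own cluster and the closest pairs are merged until 4 segments remain
model = AgglomerativeClustering(n_clusters=4, linkage="ward")
segments = model.fit_predict(X_scaled)

print(np.bincount(segments))  # size of each resulting segment
```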
[Figure: Consumer Base (N) mapped into four Level 1 segments – Casual Lifestyle, Cross Trainer, Yogi, and Uninvolved (30%, 25%, 29%, and 16% of the base) – each of which carries through to the Level 2 segments.]
* Illustrative mapping
Visibility of Algorithms
As part of the overall proposed methodology, we will share details of the various statistical algorithms, and their running
sequence, that we will use for Customer Segmentation, Personalized Recommendation Generation, etc.
Further, there will be a mechanism to specify business constraints in these algorithms to ensure the viability of results.
Indicative modeling techniques and Input / Output parameters
[Figure: Model pipeline – Customer Segmentation → Offer Response → Recommendation Generator → …]
– Inputs: demographics, purchasing behavior, needs & preferences, campaign response, etc.
– Outputs: multi-dimensional customer segments
– Model Outputs: prioritized offers by segments; key drivers analysis
– Final Outputs: personalized recommendations (illustrated in the sketch below)
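To make the hand-offs between these stages concrete, below is a minimal Python sketch of the pipeline interfaces. Every type, function, and rule in it is a hypothetical stand-in for the actual models, included only to show how inputs flow to final outputs.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CustomerProfile:  # "Inputs": demographics, behavior, responses
    demographics: Dict[str, str]
    monthly_spend: float
    campaign_responses: int

def segment_customer(p: CustomerProfile) -> int:
    """Stand-in for the segmentation model ("Outputs": customer segments)."""
    return 0 if p.monthly_spend < 100 else 1  # toy two-segment split

def rank_offers(segment: int, offers: List[str]) -> List[str]:
    """Stand-in for the offer-response model ("Model Outputs")."""
    # A real model would score each offer per segment; we just reorder here
    return offers if segment == 0 else list(reversed(offers))

def recommend(p: CustomerProfile, offers: List[str]) -> List[str]:
    """Final Outputs: personalized recommendations for one customer."""
    return rank_offers(segment_customer(p), offers)

profile = CustomerProfile({"gender": "M"}, monthly_spend=250.0, campaign_responses=3)
print(recommend(profile, ["offer_a", "offer_b", "offer_c"]))
```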
Indicative business constraints that either Spark or Merchants can specify (an illustrative enforcement sketch follows the list):
1. An offer being valid only for males above 18 years of age
2. Offers that can or cannot be combined together
3. The maximum number of offer redemptions not exceeding 10,000
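A minimal sketch, assuming a simple constraint record, of how such rules might be enforced at recommendation time; the field names and eligibility logic are illustrative assumptions, not the platform's actual constraint mechanism.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class OfferConstraints:
    # Hypothetical constraint fields mirroring the examples above
    min_age: int = 0
    allowed_gender: Optional[str] = None  # None = no gender restriction
    max_redemptions: int = 10_000

def is_eligible(c: OfferConstraints, gender: str, age: int,
                redemptions_so_far: int) -> bool:
    """Check a customer/offer pair against merchant-specified constraints."""
    if c.allowed_gender is not None and gender != c.allowed_gender:
        return False
    if age < c.min_age:  # illustrative reading of "above 18"
        return False
    if redemptions_so_far >= c.max_redemptions:
        return False
    return True

# Constraints 1 and 3 above: males aged 18+, capped at 10,000 redemptions
c = OfferConstraints(min_age=18, allowed_gender="M")
print(is_eligible(c, gender="M", age=25, redemptions_so_far=9_500))  # True
print(is_eligible(c, gender="F", age=25, redemptions_so_far=9_500))  # False
```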
Tracking performance of Machine Learning module (1/2)
We propose to deploy several mechanisms to track the performance of recommendations and of the overall machine
learning module. Given below is an indicative list:
1. Track recommended vs. actual conversions through a front-end UI screen
2. Advanced screens to track model accuracy (predicted vs. actual), including decile charts, etc. (see the sketch below)
3. Conduct periodic control tests to measure the efficacy of recommendations (conversion, sales lift, etc.)
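As one concrete example of item 2, a decile (lift) chart compares actual conversion rates per score decile against the overall base rate. The pandas sketch below uses synthetic data; the column names and decile count are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for model scores and actual conversion outcomes
rng = np.random.default_rng(1)
df = pd.DataFrame({"score": rng.uniform(size=10_000)})  # predicted probability
df["converted"] = (rng.uniform(size=10_000) < df["score"]).astype(int)

# Rank customers into deciles by predicted score (decile 1 = highest scores)
df["decile"] = pd.qcut(
    df["score"].rank(method="first", ascending=False),
    10,
    labels=list(range(1, 11)),
)

# Actual conversion rate per decile divided by the base rate = lift
decile_rates = df.groupby("decile", observed=True)["converted"].mean()
lift = decile_rates / df["converted"].mean()
print(lift.round(2))  # a well-ranked model shows lift > 1 in the top deciles
```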