
Implementing a Product Recommendation Engine with Implicit Feedback: An Application to Grocery Shopping

Jason Knaut, University of Illinois at Urbana-Champaign, jknaut2@illinois.edu
Jiang Wu, University of Illinois at Urbana-Champaign, jiangwu2@illinois.edu
Yihua Wu, University of Illinois at Urbana-Champaign, yihuawu2@illinois.edu

ABSTRACT

E-commerce removes the physical shelf space constraints that historically limited the variety of items offered to customers by traditional brick-and-mortar retailers. From this explosion in customer choice emerged complexities of sifting through seemingly endless product options. Online merchants increasingly rely on sophisticated recommendation systems—also known as recommendation engines or, more simply, recommenders—to address this challenge by proactively surfacing the most relevant items for each customer. To provide these personalized recommendations, models can be learned from customer data including customer-provided ratings (e.g., 5-star scale for previously watched movies) or even prior observed behaviors (e.g., purchases, likes, clicks). Extensive research exists on algorithms and evaluation approaches for recommendation tasks, yet few studies have tackled an end-to-end practical application for learning, evaluating, and deploying a production recommendation engine. We propose such an approach for building a recommendation system for Instacart, an online grocery ordering and delivery service. Leveraging distributed computation on an Amazon EMR cluster running Apache Spark, we efficiently learn a recommendation model from over three million historical customer orders. We then architect, develop, and benchmark a lightweight recommendation engine microservice that could be used in production to conduct online evaluation experiments of our derived recommendation model.

Keywords

Recommendation System; Collaborative Filtering; Binary Feedback; Rank Evaluation Metrics; Apache Spark; Amazon EMR

1. INTRODUCTION

E-commerce sales topped $450 billion in the United States alone during 2017, growing 16% over the prior year and at a rate three times faster than the overall retail sales growth rate [12]. Of overall annual retail sales growth, nearly half resulted from consumers increasing their spending through online channels—a confirmation of the continued spending shift away from traditional brick-and-mortar retailers to online merchants.

However, the seemingly advantageous breadth of merchants and endless product options available across online channels actually creates challenges for consumers during the purchase process in the form of increased search costs and overwhelming cognitive load. That said, successful online merchants have developed a competitive advantage by alleviating this friction while simultaneously creating up-selling and cross-selling marketing opportunities through the implementation of recommendation systems.

Perhaps the most canonical example of such a successful merchant is Amazon. According to McKinsey, Amazon generates 35% of its revenue from its recommendation system [17]. Based on Amazon's estimated $190 billion retail sales revenue in 2017 [12], the company's recommendation system contributes over $60 billion to revenue in a single year.

With Amazon's acquisition of brick-and-mortar grocery chain Whole Foods Market, one would expect Amazon to apply its recommendation system expertise to capture an even greater share of the groceries market. But the competition isn't standing still. Instacart, an online grocery delivery startup valued at more than $4 billion [18], has made meaningful investments in data science and shares Amazon's view regarding the importance of well-designed and high-accuracy recommendation systems in delivering a frictionless online purchasing experience [9].

In this paper, we present a prototype implementation of a personalized top-N product recommendation system in the domain of grocery shopping. The remainder of this paper is organized as follows. Section 2 briefly reviews important recommendation system preliminaries, including underlying approaches and algorithms. Section 3 surveys related works from the academic and industry literature. Section 4 details the experimental setup methodology for training and tuning recommendation models using Apache Spark applied to a large online grocery shopping transaction dataset. Section 5 presents a comparison of evaluation criteria applied to the trained models. Section 6 proposes an end-to-end architecture for a production recommendation engine built on Amazon AWS cloud infrastructure that will serve grocery product recommendations based on the model discovered in Sections 4 and 5. Section 7 concludes by summarizing our findings and describing opportunities for future work.
2. BACKGROUND

2.1 Recommendation Systems

Recommendation systems incorporate feedback from an individual user to suggest additional items (e.g., products, songs, movies, news articles) that are most likely to be of interest to that user. More specifically, a top-N recommendation system surfaces N recommendations (where N is a prespecified constant), often by truncating a sorted superset of recommendations ranked in descending order of estimated relevance.

The increasing interest in developing sophisticated recommendation systems over the past decade is evidenced by the rapid growth of research covering diverse applications (e.g., movies, shopping, documents, books) and data mining techniques relevant to the domain [20]. Several prominent data mining techniques underlying recommendation systems include association rules, clustering, nearest neighbors, neural networks, and link analysis. These techniques may be applied to an overarching strategy for generating recommendations known as collaborative filtering.

2.2 Collaborative Filtering

Collaborative filtering is one of the most widely adopted recommendation system methods in industry [7, 13, 22]. Collaborative filtering is referred to as user-item recommendation because the approach generates item recommendations for a targeted user based on the item preferences of similar users. These preferences—or "feedback"—are collected in two ways: explicit feedback and implicit feedback. Explicit feedback requires a user to proactively provide input by explicitly stating preferences (e.g., rating movies, liking an article), whereas implicit feedback relies solely on prior observed behaviors (e.g., product purchase history, web clickstream data). Both explicit and implicit preferences may be specified by either binary (e.g., like/did not like, purchase/did not purchase) or multivalued (e.g., five-star ordinal rating scale, number of ad clicks, number of purchased items) data.

2.3 Matrix Factorization & ALS

Matrix factorization is a dimensionality reduction approach applied in collaborative filtering. Stated simply, the matrix factorization technique employs a latent factor model that deconstructs the massive yet sparse user-item preference matrix into compressed vector representations capturing the joint latent factor space relating users and items [15]. Learning those latent factors relies on the iterative Alternating Least Squares (ALS) algorithm. ALS is an alternative to stochastic gradient descent with two beneficial properties in the context of big data and collaborative filtering with implicit feedback [15]. First, empirical evidence has shown that the ALS with weighted-λ-regularization (ALS-WR) algorithm can obtain significant performance improvements through parallelization [24], thereby leveraging the benefits of cloud computing architectures when computing on large-scale datasets. Second, ALS is optimized to operate on non-sparse data, which is the case for the implicit feedback training data; looping over each training record as required by stochastic gradient descent would be far more computationally expensive. For these reasons, the machine learning library for Apache Spark implements ALS for its collaborative filtering functionality [2].

2.4 Evaluation Protocols: Offline & Online

With countless methods to model and generate top-N recommendations, the evaluation and selection of an optimal model is a critical step in building and deploying a recommendation system. Evaluation protocols for recommendation tasks can be broadly classified as online experimentation and offline evaluation [5].

In an online experiment, one or more proposed recommendation models are deployed and exposed to actual users. Observed user interactions with the recommendation systems produce test data that is subsequently used to calculate evaluation metrics relevant to the domain (e.g., purchase rate, click-through rate, profit) and to estimate expected recommender performance. The online evaluation approach often takes the form of an A/B test in which some fraction of users are presented a new proposed recommendation system while others interact with a different recommendation system or no recommendation system at all.

Offline evaluation is an alternative approach to online experiments. Offline evaluation does not require users to actually interact with a proposed recommendation system to estimate model accuracy. Instead, the protocol utilizes a historical dataset, a fraction of which is not used in training the recommendation model, to simulate potential user behavior.

3. RELATED WORK

Recommendation systems have been widely researched over the past several decades. The overwhelming majority of that research focuses on the underlying algorithms for learning models for recommendation tasks and for evaluating the "goodness" of the learned models. The literature historically skewed toward modeling with explicit feedback ratings, likely due in part to the availability of several large public datasets for experimentation—notably the Netflix Prize and MovieLens datasets. However, coverage of topics relevant to an implicit feedback context has expanded recently, prompted by the unique challenges posed by implicit feedback, widespread industry applicability (e.g., purchases, likes, clicks), and the availability of large open source datasets for experimentation.

3.1 Recommendation with Implicit Feedback

One of the seminal papers exploring the application of CF to recommendation tasks based on implicit feedback datasets is [8]. The contributions of [8] led to and underlie the CF implementation in the Apache Spark ML library. We briefly summarize the key concepts here to aid the understanding of our experimental approach presented later in this paper.

Data in the explicit feedback context consists of users, items, and ratings. Let u denote a user and i denote an item. The rating of item i by user u is represented as r_ui. Numerous algorithms, evaluation protocols, and software implementations exist to produce recommender models from (u, i, r_ui) input tuples without much effort.

Employing these algorithms and implementations in an implicit feedback context requires a bit more thought and work. First, for implicit feedback, r_ui proxies a rating by aggregating the implicit signals of u for i. The larger the value of r_ui, the greater the evidence we have that u prefers i. In our domain of grocery shopping, r_ui represents the number of times customer u purchased product i. These raw observations r_ui for all (u, i) pairs must then be transformed into preferences p_ui and confidence levels c_ui. The binary variable p_ui indicates whether or not u prefers i, where p_ui = 1 if r_ui > 0 and p_ui = 0 otherwise. The confidence c_ui measures how much trust we place in the observed preference p_ui, starting from some minimal level and increasing with the observed evidence r_ui. Interested readers should refer to [8] for a more detailed treatment of the approach.
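For reference, the preference/confidence transformation and the weighted least-squares objective from [8] can be written as follows. This is only a sketch of the standard formulation: x_u and y_i denote the user and item latent factor vectors, α scales the confidence attached to observed purchase counts, and λ is the regularization parameter. The linear confidence form shown here is the common choice and, to our understanding, the one behind Spark's implicit-feedback ALS; [8] also discusses a logarithmic variant.

\[
p_{ui} =
\begin{cases}
1 & \text{if } r_{ui} > 0\\
0 & \text{if } r_{ui} = 0
\end{cases}
\qquad
c_{ui} = 1 + \alpha\, r_{ui}
\]

\[
\min_{x_{*},\, y_{*}} \;
\sum_{u,i} c_{ui}\,\bigl(p_{ui} - x_{u}^{\top} y_{i}\bigr)^{2}
\;+\;
\lambda \Bigl(\sum_{u} \lVert x_{u} \rVert^{2} + \sum_{i} \lVert y_{i} \rVert^{2}\Bigr)
\]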
3.2 Evaluating Recommender Systems

A second well-researched area in the recommendation systems literature explores the appropriateness of model evaluation approaches and metrics [4, 5, 6, 19, 23]. Evaluation metrics for recommendation tasks can be broadly categorized into three major classes: 1) predictive accuracy metrics, 2) classification accuracy metrics, and 3) rank accuracy metrics [6].

Predictive accuracy metrics, such as MSE and RMSE, measure the deviation between predicted ratings and actual ratings. This approach is widely applied for evaluating CF models with explicit feedback, where a numerical rating will ultimately be presented to a user as part of the recommendation. Reasons for the wide adoption of predictive accuracy measures include computational ease and sound statistical properties. For recommender tasks with implicit feedback (especially when the feedback is binary), predictive accuracy metrics are not suitable: we cannot compare the predicted confidence values generated by an ALS algorithm against an unbounded purchase count or rank integer to produce any sensible evaluation.

Another category of metrics, classification accuracy metrics, is typically applied to evaluate binary preferences (e.g., a customer either buys or does not buy a product). These metrics are most appropriate for the "Find Good Items" recommendation task [6]. The two most commonly applied classification accuracy metrics are Precision and Recall. Precision represents the proportion of recommended items that the user actually selects. Recall represents the proportion of actually selected items that were also recommended. In [8], the authors argue that due to the absence of user reaction to recommendation results, Precision is not appropriate. For example, the user may not have consumed a recommended item because she was unaware of it; we cannot assume the absence of an implicit signal implies a negative preference.

Rank accuracy metrics are a third category of evaluation metrics. In the ranked list produced by the ALS model, the top-ranked items have higher confidence values supporting the predicted item preferences than items further down the recommendation list. Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG) are popular rank accuracy metrics [3] with origins in the field of Information Retrieval. Both NDCG and MAP may be applied to evaluate binary recommendation models, though these metrics are very sensitive to the order of recommendations, making them less appropriate in the "Find Good Items" scenario. Mean Percentage Rank (MPR) is a third rank accuracy method, proposed by [8]. MPR measures the average percentile rank of actually preferred items in a confidence-ranked recommendation list. Given that percentile rank has the nice property of ranging from 0 (most confident) to 1 (least confident), MPR has a natural baseline: a random recommendation model is expected to have an MPR of 0.5. However, MPR requires a complete recommendation ranking over every item for each user, which can be quite computationally intensive on large datasets. Applied to the Instacart dataset, we would need to compute the confidence of more than 10 billion user-item pairs.

4. EXPERIMENTAL SETUP

4.1 Data Description

In 2017, Instacart open sourced to the machine learning community a dataset containing a large sample of grocery orders transacted on its platform [10, 11]. Concurrently, the company launched a Kaggle competition seeking creative contributions from machine learning novices and experts alike to improve prediction accuracy of future customer orders. The anonymized relational dataset contains 3.4 million order transactions composed of approximately 50,000 distinct products. These sequential order transactions represent the repeat grocery purchasing behavior of over 200,000 North American customers, with between 4 and 100 orders per customer.

4.2 Generating Training & Test Sets

Prior to conducting our analysis, we divide the dataset into training and test sets. Commonly in machine learning tasks, a random sampling holdout approach would be applied to split the dataset. However, several challenges unique to training and evaluating a CF algorithm in a recommendation context necessitate a different approach.

First, we must address the cold start problem—producing recommendations for a particular user requires that user to have some item ratings present in the training data; otherwise, no latent factors will be learned for that user. Therefore, we cannot split the data by randomly assigning users between the training and test sets.

An alternative approach addressing the cold start problem relies on randomly "hiding" user-item pairs in the user-item matrix. In the Instacart dataset, this approach would require randomly assigning (customer_id, product_id, purchase_count) tuples to either the training set or the test set. However, this approach does not accurately simulate orders occurring sequentially over time. For example, assume a customer made three orders over the period of one month. The first order is randomly selected for inclusion in the test set and the second and third orders are added to the training set. When we apply the CF algorithm, we are effectively asking the model to predict prior purchases based on future purchases. This is the converse of the problem we attempt to solve with a recommendation model.

Fortunately, the orders table of the Instacart dataset includes a field eval_set that separates each customer's prior orders (eval_set = PRIOR) from the customer's most recent order (eval_set = TRAIN or TEST). (Instacart withheld the data for orders with eval_set = TEST to be used as the final evaluation set for its data science competition; therefore, only the eval_set = TRAIN records were used to construct our test set.) This information allows us to apply the prior/post sequential splitting approach presented by [8]. This approach ensures orders up to time t are used to train the model while orders after time t are used for evaluating the accuracy of those recommendations.

Applying the prior/post sequential splitting approach yields training and test sets containing 3,214,874 and 131,209 orders, respectively. While a typical random sampling approach would hold out 10%-30% of the data for testing, our approach sets aside approximately 4% of the orders. Given the absolute number of orders in the test set and the importance of more closely simulating the sequential dynamics of a real-world recommender system, we view the trade-off as reasonable.
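As an illustration of this split, the following PySpark sketch filters the orders table on the eval_set field. The DataFrame names and file locations are placeholders, column names follow the public Instacart schema, and the eval_set values are assumed to appear in lowercase as in the public release.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("instacart-split").getOrCreate()

# Placeholder paths; the Instacart CSV extracts are assumed to be staged on S3 or HDFS.
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)
order_products = spark.read.csv("order_products.csv", header=True, inferSchema=True)

# Prior orders form the training set; Instacart's "train" orders serve as our holdout test set.
train_orders = orders.filter(F.col("eval_set") == "prior")
test_orders = orders.filter(F.col("eval_set") == "train")

# Attach the products purchased in each order so downstream steps can work
# with (user_id, product_id) pairs.
train_purchases = train_orders.join(order_products, "order_id").select("user_id", "product_id")
test_purchases = test_orders.join(order_products, "order_id").select("user_id", "product_id")
```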
Rank  Product                   % Purchasing
1     Banana                    35.9%
2     Bag of Organic Bananas    30.8%
3     Organic Strawberries      28.5%
4     Organic Baby Spinach      26.7%
5     Large Lemon               22.5%
6     Limes                     21.8%
7     Organic Hass Avocado      21.1%
8     Strawberries              20.9%
9     Organic Avocado           20.7%
10    Organic Blueberries       18.0%
11    Organic Garlic            17.0%
12    Organic Yellow Onion      16.7%
13    Organic Zucchini          15.8%
14    Organic Raspberries       15.3%
15    Cucumber Kirby            14.5%
16    Yellow Onions             14.5%
17    Organic Grape Tomatoes    14.1%
18    Seedless Red Grapes       13.3%
19    Organic Lemon             13.2%
20    Organic Baby Carrots      12.8%

Table 1: Top-20 PopRank Recommendations

4.3 Baseline: Popularity Recommender

A universal performance benchmark for selecting or rejecting recommendation algorithms across domains does not exist. Therefore, determining the "goodness" of candidate recommendation models requires contrasting performance with some baseline model. Similar to the approach taken by [8, 16, 21], we start with a naïve non-personalized recommender that suggests the same list of top-N most popular products for all customers. Keeping with convention, we refer to this model as PopRank.

The process of learning the top-N PopRank recommendation model is straightforward. We first count the number of distinct customers purchasing each product in the training dataset and then sort the results in descending order of purchaser counts. Notice that we define popularity in terms of the number of distinct purchasers, not the number of individual purchases. Our definition of popularity avoids overweighting frequently repurchased products (e.g., milk, bread)—an important adjustment in the domain of grocery shopping.

The output of this process can be used to quickly produce a top-N recommendation model of any size N in {1, 2, ..., # of products}: simply truncate the ordered list of all items at the desired N. Table 1 shows the output of the top-20 PopRank product recommendations.

4.4 Collaborative Filtering Recommender

With the baseline PopRank model constructed, we turn our attention to developing a CF recommender. This section describes our approach for applying CF to the Instacart dataset to generate a personalized top-N recommendation model, which we will refer to as CFRank.

4.4.1 Transforming the Dataset

Modeling implicit feedback with a CF approach requires data containing a strength signal (i.e., the implicit "rating") for each customer-product pair in the Instacart dataset. To accomplish this, we transform the relational Instacart data to output (user_id, product_id, purchase_count) tuples stored in a Spark DataFrame, trainDF. This transformation over the training dataset produces over 13 million tuples on which to fit the CF model (i.e., compute the user and item latent factor matrices).

4.4.2 Fitting the Model

Next, we utilize the ALS algorithm to generate user and item latent factor matrices from the training tuples in trainDF. The ML library of Apache Spark provides an implementation of the ALS algorithm in the spark.ml.recommendation.ALS module. In addition to the transformed input dataset trainDF, we provide the following tunable hyperparameters: alpha, regularization, number of iterations, and rank. We cover each briefly here but recommend referring to the Apache Spark ML Collaborative Filtering API documentation (https://spark.apache.org/docs/2.2.0/mllib-collaborative-filtering.html) and [8] for additional details.

• Iterations = 10. ALS is an iterative algorithm that gradually improves the learned latent factors. The algorithm typically converges to a good solution in fewer than 20 iterations.

• Alpha = 0.1. The ALS with implicit feedback algorithm requires alpha to calibrate the baseline confidence (i.e., the weight of the observed strength r_ui). We also tested the values {0.01, 0.05}.

• Rank = 100. Rank captures the number of latent features to learn during the factorization of the user-item matrix. We also tested the values {10, 50}.

• Lambda = 0.01. Lambda is the regularization parameter used to prevent overfitting during the ALS training process.

We opted not to conduct an extensive grid search over hyperparameter permutations using k-fold cross validation but instead experimented with a few variations of these hyperparameters. As discussed above, offline evaluation measures serve a specific purpose in the context of recommendation tasks: to reduce the candidate set of models for online experimentation. Optimizing recommendation models beyond a certain point based on offline evaluation likely only yields hypothetical accuracy results that may not create (and perhaps may even destroy) value when deployed for real user interaction [6].

4.4.3 Producing Recommendations

With the model fit on the training data, the task of prediction is straightforward. The Apache Spark ML ALS API provides a function recommendForAllUsers(N) that generates a DataFrame containing the top-N product recommendations, along with estimated confidence scores, for every user in the training dataset. Table 2 presents an example output of the top-20 recommendations for user_id = 2000. Notice that several of the top-20 CFRank recommendations overlap with items in the PopRank recommendations; our simplified implementation does not exclude popular items from the CFRank candidate set.
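To make Sections 4.4.1-4.4.3 concrete, the following is a minimal PySpark sketch of the transformation, fitting, and prediction steps. It assumes train_purchases holds the (user_id, product_id) pairs from the training split shown earlier; the DataFrame names are placeholders, and the hyperparameter values mirror those listed above.

```python
from pyspark.sql import functions as F
from pyspark.ml.recommendation import ALS

# 4.4.1: aggregate purchases into (user_id, product_id, purchase_count) strength signals.
# This corresponds to the trainDF described above.
train_df = (
    train_purchases
    .groupBy("user_id", "product_id")
    .agg(F.count("*").alias("purchase_count"))
)

# 4.4.2: fit ALS in implicit-feedback mode with the hyperparameters described above.
als = ALS(
    userCol="user_id",
    itemCol="product_id",
    ratingCol="purchase_count",
    implicitPrefs=True,   # treat purchase counts as implicit strength signals
    rank=100,             # number of latent factors
    maxIter=10,           # ALS iterations
    regParam=0.01,        # lambda, the regularization parameter
    alpha=0.1,            # baseline confidence calibration
)
model = als.fit(train_df)

# 4.4.3: top-100 recommendations, with confidence scores, for every user in the training set.
user_recs = model.recommendForAllUsers(100)
```

The resulting user_recs DataFrame has one row per user, with a recommendations column containing an array of (product_id, rating) structs; the rating field holds the estimated confidence score of the kind shown in Table 2.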
4.4.4 Computational Performance

We performed the preceding computations on an Amazon EMR cluster composed of three c4.large nodes. Transforming the raw dataset with a few Spark SQL join and aggregation operations completed in well under a minute. Fitting the ALS model, which learned 100 latent factors for 200,000 users and 50,000 products, took approximately 10 minutes. Finally, the longest running task was the generation of top-100 recommendations for each of the roughly 200,000 users. The cluster computed over 2 million unique recommendations with confidence scores in just over fifteen minutes. The end-to-end pipeline from loading flat-file order data to recommendation output completed in less than thirty minutes. Given the matrix-intensive mathematical computations underlying the ALS algorithm, we suspect that replacing our toy test cluster with p3.2xlarge GPU instances could provide the computational power to utilize the proposed pipeline on the much larger Instacart production dataset.

Rank  Product                    Confidence
1     Banana                     0.2062
2     100% Whole Wheat Bread     0.1386
3     Strawberries               0.1274
4     Blueberries                0.1270
5     2% Reduced Fat Milk        0.1243
6     Yellow Onions              0.1109
7     Black Beans                0.1106
8     Fat Free Milk              0.1097
9     Fridge Pack Cola           0.1091
10    Creamy Peanut Butter       0.1071
11    Spinach                    0.1057
12    Cherubs Heavenly           0.1036
13    Large Lemon                0.1035
14    Red Onion                  0.1012
15    Cage Free Brown Egg        0.1007
16    Roma Tomato                0.0998
17    Original English Muffins   0.0997
18    Organic Fuji Apple         0.0984
19    Honey Nut Cheerios         0.0964
20    Orange Bell Pepper         0.0964

Table 2: Top-20 CFRank Recommendations for user_id = 2000

5. EVALUATION & RESULTS

5.1 Evaluation Approach & Metrics

After generating the top-100 ranked recommendations for each customer, we contrast the relative performance of our baseline naïve recommender, PopRank, and our personalized collaborative filtering recommender, CFRank, using the holdout test set. We also calculate evaluation metrics for these models at different values of N = {10, 20, 50, 100} to explore the potential impact of the size of the recommendation lists surfaced to customers.

We consider four classification and rank accuracy metrics: Precision@N, Recall@N, NDCG@N, and MAP@N. Of these metrics, Recall@N is the most appropriate for our evaluation. Recall@N captures the proportion of a customer's actually purchased products that overlap with the top-N recommendations. Recall@N is not impacted by the number of purchased products present in the test set; Recall@N takes a value in [0, 1] regardless of whether a customer purchased one product or N products in the test set. This also explains why Precision@N is inappropriate. Since Precision@N represents the proportion of recommended items actually purchased, it is skewed by the number of purchases each customer made. For example, if a customer purchased two products, and both are in a recommended top-20 list, the Precision@N would be 0.1. If another customer purchased 20 products, and only two overlapped with the top-N recommendations, Precision@N would also be 0.1.

The other metrics, NDCG@N and MAP@N, incorporate not only the "match rate" between actual and recommended purchases but also the predicted rank of the purchased products. For example, if a customer purchased only one product that happened to be the most relevant recommendation in our top-20 recommendations, the NDCG@N = 1. However, if another customer purchased only one product that happened to be the 10th most relevant recommendation in the top-20 recommendations, the NDCG@N = 0.29. We argue that in both of these cases a customer purchased a product that our model suggested was relevant, which is the ultimate purpose of a "Find Good Items" recommendation system. Therefore, NDCG@N and MAP@N were deemphasized in our evaluation. While Recall@N is the most appropriate evaluation metric for our case, we include the other metrics for completeness.

5.2 Results

Table 3 summarizes the comparative accuracy of top-N recommendations generated by the PopRank and CFRank models as measured against the holdout test set.

                        Prec@N   Recall@N   NDCG@N   MAP@N
Top-10 Recommender
  PopRank               0.0169   0.0343     0.0178   0.0601
  CFRank                0.0121   0.0274     0.0104   0.0274
Top-20 Recommender
  PopRank               0.0137   0.0530     0.0152   0.0702
  CFRank                0.0129   0.0567     0.0116   0.0398
Top-50 Recommender
  PopRank               0.0105   0.0988     0.0122   0.0852
  CFRank                0.0116   0.1223     0.0112   0.0569
Top-100 Recommender
  PopRank               0.0079   0.1473     0.0095   0.0944
  CFRank                0.0094   0.1949     0.0097   0.0684

Table 3: Summary of Evaluation Results

As we increase the number of recommendations N, we demonstrate through the Recall@N metric that the personalized CFRank model outperforms the naïve PopRank model (see Figure 1). The CFRank model shows similar accuracy to PopRank for the top-20 recommendations. CFRank increases accuracy (as measured by Recall@N) over PopRank by 24% and 32% for N = 50 and N = 100, respectively. These results offer directional support to suggest the CFRank model is an appropriate candidate to proceed with online experimental evaluation.
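As an illustration of the Recall@N calculation described in Section 5.1, the following PySpark sketch scores a top-N recommendation DataFrame against the holdout purchases. Here user_recs and test_purchases are the placeholder names used in the earlier sketches, user_recs is assumed to be truncated to the desired N (e.g., the output of recommendForAllUsers(N)), and test_purchases is assumed to contain distinct (user_id, product_id) pairs.

```python
from pyspark.sql import functions as F

# Flatten the per-user recommendation arrays into (user_id, product_id) rows.
pred = (
    user_recs
    .select("user_id", F.explode("recommendations").alias("rec"))
    .select("user_id", F.col("rec.product_id").alias("product_id"))
)

# A "hit" is a product that was both recommended and actually purchased in the test set.
hits_per_user = (
    pred.join(test_purchases, ["user_id", "product_id"])
    .groupBy("user_id")
    .agg(F.count("*").alias("n_hits"))
)

purchases_per_user = test_purchases.groupBy("user_id").agg(F.count("*").alias("n_purchased"))

# Mean Recall@N over all test users (users with no hits contribute a recall of 0).
recall_at_n = (
    purchases_per_user
    .join(hits_per_user, "user_id", "left")
    .withColumn("recall", F.coalesce(F.col("n_hits"), F.lit(0)) / F.col("n_purchased"))
    .agg(F.avg("recall").alias("mean_recall_at_n"))
)
recall_at_n.show()
```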
[Figure 1: Recall@N for PopRank & CFRank. Line chart of Recall@N versus N in {10, 20, 50, 100} for the two models.]

6. RECOMMENDER ARCHITECTURE

6.1 Design Considerations

6.1.1 Computation Resource Efficiency

The system should be designed to free up unused computation resources. For example, assuming a daily batch update frequency for the recommendations, the system only needs the Amazon EMR cluster for a limited period.

6.1.2 Scalability

As Instacart's customer base grows, so will the demands on the recommendation engine. First, more customers imply more transactions, creating a larger order transaction dataset which needs to be processed when retraining the CF model. Second, more customers imply increased traffic and touchpoints to serve the recommendations.

6.1.3 Client Agnostic

Instacart has numerous interfaces with its customers, including its website, mobile applications, email marketing campaigns, and so on. Therefore, a wide range of clients need to be able to retrieve the product recommendations.

6.1.4 Low Latency

An individual customer may log in to the Instacart website at any time. This means the recommendation system needs to serve recommendations to a client in real time, upon request. In addition, using order volumes as a proxy for traffic to the Instacart website, we observe large variations in average traffic per unit of time depending on the time of day and day of the week. Peak average hourly order volume on Sunday at 10AM is 85 times the lowest average hourly order volume on Tuesday at 3AM. The recommendation system needs to maintain low latency even during peak demand.

6.2 Overall Architecture

The end-to-end recommendation engine includes the following components: Dynamic Batch Scheduler, Recommendation Generator, Recommendation Repository, and Recommendation Server. The following describes each of these components along with the implementation using AWS infrastructure.

[Figure 2: Recommender System Architecture]

6.2.1 Dynamic Batch Scheduler

The Dynamic Batch Scheduler handles the process of starting and stopping an EMR cluster when the order transaction data is updated. Any time another process updates the S3 bucket containing the input data, a trigger in Amazon CloudWatch launches the cluster, starts the Recommendation Generator, and finally cleans up and terminates the cluster. This architecture decouples the recommendation engine from the production transactional systems. Further, it ensures efficient use of computational resources by avoiding idle EMR clusters between batch updates.

6.2.2 Recommendation Generator

The Recommendation Generator executes the CF model in Apache Spark. After retraining the CF model, the Recommendation Generator produces top-N recommendation lists for each customer. These recommendation lists are then persisted to the Recommendation Repository.

6.2.3 Recommendation Repository

The Recommendation Repository is a key-value datastore implemented with DynamoDB that persists the top-N recommendations for each customer. The key is set to the Customer ID and the value is a list of Product IDs for the top-N recommendations. The key-value implementation enables rapid constant-time lookups without requiring expensive computation when a client requests a customer's recommendations.

6.2.4 Recommendation Server

The Recommendation Server is implemented as a microservice with a limited set of REST API endpoints. Most importantly, a GET /recommendations/{customer_id} request will return a JSON response containing the list of top-N recommendations for the specified Customer ID. The microservice is implemented using Amazon API Gateway (which configures the endpoint) combined with AWS Lambda. The AWS Lambda code handles fetching the top-N list from the Recommendation Repository and returning the JSON response to the consumer. This implementation offloads the responsibility of scaling to AWS, which is beneficial for automatically scaling up and down in response to the swings in intraday order volumes.
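As a sketch of the Recommendation Server described in Section 6.2.4, the Lambda function behind the API Gateway endpoint could look roughly like the following. The DynamoDB table name, key attribute, and response shape are assumptions for illustration, not the production schema.

```python
import json
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("recommendations")  # hypothetical Recommendation Repository table


def handler(event, context):
    # API Gateway (proxy integration) passes the path parameter from
    # GET /recommendations/{customer_id}.
    customer_id = event["pathParameters"]["customer_id"]

    item = table.get_item(Key={"customer_id": customer_id}).get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"error": "unknown customer"})}

    # DynamoDB deserializes numbers as Decimal, so convert before JSON encoding.
    product_ids = [int(p) for p in item["product_ids"]]

    return {
        "statusCode": 200,
        "body": json.dumps({"customer_id": customer_id, "product_ids": product_ids}),
    }
```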
6.3 Performance Benchmarking

We conduct a performance benchmarking test on the recommendation system architecture to validate the low-latency design requirement outlined above. To test performance, we simulate several different traffic loads on the Recommendation Server using a separate EC2 instance located in the same region.

We configured the Recommendation Repository read capacity to handle 1,000 reads per second (i.e., the upper bound of our load test). The test server initiated calls to the Recommendation Server API endpoint to retrieve the recommendations for random user_id entries at rates of 10, 100, and 1,000 requests per second. For each request rate, we conducted 100 tests. Figure 3 summarizes the response times for these performance benchmarking tests.

Our test results indicate that our architecture satisfies the low-latency design requirement. On average, the Recommendation Server provides recommendations to the consumer in under 75 ms. Further, even the 90th percentile of response times is below 100 ms.

One interesting observation is the drop in response time when the load increases from 10 requests per second to 100 requests per second. After reviewing a description of AWS Lambda container provisioning [1], we suspect that the counterintuitive decline in response time under increased load is likely due to containers thawing instead of cold-starting, thereby amortizing the initial container startup cost over more requests.

[Figure 3: API Benchmarking Results. Response time in ms at 10, 100, and 1,000 API requests per second, showing the 10th percentile, average, and 90th percentile.]

7. CONCLUSION & FUTURE WORK

In this paper, we describe an end-to-end procedure for transforming prior purchasing behavior in the context of grocery shopping into personalized recommendations that can be quickly queried from a lightweight and scalable recommendation engine microservice. We demonstrated the power of Apache Spark running on a three-node Amazon EMR cluster to learn a recommendation model and subsequently produce and persist up to top-100 recommendations for over 200,000 customers—and all in less than thirty minutes. Benchmarking our personalized CF-based recommendation model, CFRank, against a non-personalized naïve popularity-based recommendation model, PopRank, we found meaningful improvement in the relevance of recommendations.

One area warranting further exploration is the inclusion of temporal attributes in the recommendation model. For example, a customer's purchasing behavior may differ between an impulse order she placed on Friday evening versus her typical weekly grocery order submitted on Sunday afternoon. Additionally, a customer's preferences may evolve over time (e.g., suffering taste bud burnout after consuming the same snack for several weeks). Accommodating for this "preference drift" has been demonstrated to improve recommendation quality in other domains [14]. Incorporating seasonality should also improve the performance of our recommendation system. For example, if a customer purchases a bottle of rum in July, we would reasonably expect a recommendation of piña colada ingredients to be better received than egg nog. However, the opposite would likely be true in December.

Next, others have demonstrated that accuracy metrics tend to favor recommendation algorithms that surface the obvious items [19]. We expect that adjusting the model to account for the diversity (e.g., don't recommend four varieties of apples) and popularity (e.g., suggesting less popular items) of recommendations would improve the effectiveness of the recommender system in production.

Finally, recommendation model selection should be informed by, but not limited to, the offline evaluation procedure applied in this paper. While offline evaluation is valuable in reducing the set of recommendation model candidates, evaluating the actual observed change in user behavior when exposed to various recommendation models in production (i.e., online evaluation) is of critical importance in incorporating complex interrelated factors such as user intent, context, and recommendation interface [5].

8. REFERENCES

[1] Amazon. Understanding container reuse in AWS Lambda, 2014. Last accessed 7 May 2018.
[2] Apache Software Foundation. Spark 2.2.0 documentation: Collaborative filtering - RDD-based API. Last accessed 21 February 2018.
[3] Apache Software Foundation. Spark 2.2.0 documentation: Evaluation metrics - RDD-based API. Last accessed 7 May 2018.
[4] P. Cremonesi, Y. Koren, and R. Turrin. Performance of recommender algorithms on top-N recommendation tasks. In Proceedings of the Fourth ACM Conference on Recommender Systems, pages 39–46. ACM, 2010.
[5] A. Gunawardana and G. Shani. A survey of accuracy evaluation metrics of recommendation tasks. J. Mach. Learn. Res., 10:2935–2962, Dec. 2009.
[6] J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst., 22(1):5–53, Jan. 2004.
[7] W. Hill, L. Stead, M. Rosenstein, and G. Furnas. Recommending and evaluating choices in a virtual community of use. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '95, pages 194–201, New York, NY, USA, 1995. ACM Press/Addison-Wesley Publishing Co.
[8] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining, pages 263–272, 2008.
[9] Instacart. Data science at Instacart, 2016. Last accessed 21 February 2018.
[10] Instacart. 3 million Instacart orders, open sourced, 2017. Last accessed 21 February 2018.
[11] Instacart. The Instacart online grocery shopping dataset, 2017. Last accessed 21 February 2018.
[12] Internet Retailer. U.S. e-commerce sales grow 16.0% in 2017, 2018. Last accessed 21 February 2018.
[13] J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl. GroupLens: Applying collaborative filtering to Usenet news. Commun. ACM, 40(3):77–87, Mar. 1997.
[14] Y. Koren. Collaborative filtering with temporal dynamics. Commun. ACM, 53(4):89–97, Apr. 2010.
[15] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, Aug. 2009.
[16] Y. Liu, P. Zhao, A. Sun, and C. Miao. A boosting algorithm for item recommendation with implicit feedback. In IJCAI, volume 15, pages 1792–1798, 2015.
[17] McKinsey. How retailers can keep up with consumers, 2013. Last accessed 21 February 2018.
[18] D. Olsen. Instacart ups valuation to $4.2B with latest funding, 2018. Last accessed 21 February 2018.
[19] H. J. C. Pampín, H. Jerbi, and M. P. O'Mahony. Evaluating the relative performance of collaborative filtering recommender systems. Journal of Universal Computer Science, 21(13):1849–1868, 2015.
[20] D. H. Park, H. K. Kim, I. Y. Choi, and J. K. Kim. A literature review and classification of recommender systems research. Expert Systems with Applications, 39(11):10059–10072, 2012.
[21] Y. Peng, Z. Peilin, and G. Xin. Efficient cost-sensitive learning for recommendation. 2017.
[22] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Analysis of recommendation algorithms for e-commerce. In Proceedings of the 2nd ACM Conference on Electronic Commerce, EC '00, pages 158–167, New York, NY, USA, 2000. ACM.
[23] H. Steck. Evaluation of recommendations: rating-prediction and ranking. In Proceedings of the 7th ACM Conference on Recommender Systems, pages 213–220. ACM, 2013.
[24] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan. Large-scale parallel collaborative filtering for the Netflix Prize. In R. Fleischer and J. Xu, editors, Algorithmic Aspects in Information and Management, pages 337–348, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg.
