
Data Processing Engine Using Kafka & K-Streams

Rajakrishna Reddy | 02 August 2019

[Diagram: Core Processing Engine. Stored data, categorized by store or by multiple stores, is pushed to Kafka grouped by store into one of the core topics for fault tolerance; the current processing state is saved to a state topic. Pipeline: Store A → Processor A → Semi Processed A → Processor A(2..n) → Semi Processed A (Level 2) → Finalizer A → Final A. The Finalizer grabs the analysis.]

Legend:
- Store-Data: stored in Kafka, identified by store; records can also be grouped by multiple stores and partitioned in the same way.
- Topic: a collection of records stored in Kafka (these could relate to a single store or to multiple stores).
- Processor: processes the stored data on the fly using K-Streams.
- Transformer: transforms the data to another format, or enriches the existing data as required.
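
To make the diagram concrete, here is a minimal K-Streams topology sketch in Java. The topic names (store-a, semi-processed-a, semi-processed-a-level2, final-a) and the pass-through transformation methods are hypothetical placeholders; the real processors would run the data scientists' forecasting algorithm at each level.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

import java.util.Properties;

public class CoreProcessingEngine {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Store-Data: records keyed by store id.
        KStream<String, String> storeData =
                builder.stream("store-a", Consumed.with(Serdes.String(), Serdes.String()));

        // Processor A: first-pass processing on the fly; the result goes to an
        // intermediate topic so partially processed work survives failures.
        storeData.mapValues(CoreProcessingEngine::firstPass)
                 .to("semi-processed-a", Produced.with(Serdes.String(), Serdes.String()));

        // Processor A(2..n): further levels read the intermediate topic and
        // write the level-2 results.
        builder.stream("semi-processed-a", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(CoreProcessingEngine::secondPass)
               .to("semi-processed-a-level2", Produced.with(Serdes.String(), Serdes.String()));

        // Finalizer A: grabs the analysis and stores it in the final topic.
        builder.stream("semi-processed-a-level2", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(CoreProcessingEngine::finalizeRecord)
               .to("final-a", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "core-processing-engine");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        new KafkaStreams(builder.build(), props).start();
    }

    // Pass-through placeholders for the per-level forecasting steps.
    private static String firstPass(String record)      { return record; }
    private static String secondPass(String record)     { return record; }
    private static String finalizeRecord(String record) { return record; }
}
```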

Problem:
A large fast-food chain wants you to generate forecasts for its 2,000 restaurants. Each restaurant sells close to 500 items.
The chain provides data for the last 3 years at a store, item, and day level. The ask is to provide a forecast for the following year.
Assume the data is static. Data scientists have looked at the problem and figured out a solution that provides the best forecast:
an algorithm that takes all data at a store level and produces forecasted output at the store level.
It takes 15 minutes to process each store.

Solution:

I plan to use Kafka and K-Streams for this solution, as we need both data storage and fast computing power. My solutions are furnished under each question below.

Questions:
1) Given the input data is static, what is the right technology to store the data, and what would be the partitioning strategy?
As we are using Kafka, the data is stored in Kafka itself. Partitioning could be per store, or we could group several stores' data into each partition (the better approach, since we have 2,000+ stores), sized to our computing requirement, because computation is parallelized across partitions. A sketch of the producer side follows.
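
A minimal producer-side sketch of that strategy, assuming a store-data topic and a hypothetical store id and payload: keying every record by store id means Kafka's default partitioner hashes the key, so all of one store's records land in the same partition, and a 100-partition topic groups the 2,000 stores at roughly 20 stores per partition.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class StoreDataLoader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying by store id: the default partitioner hashes the key, so
            // every record for one store lands in the same partition. With a
            // 100-partition topic and 2,000 stores, each partition carries
            // data for roughly 20 stores.
            String storeId = "store-0042";                      // hypothetical id
            String payload = "item-17,2019-08-01,unitsSold=35"; // hypothetical record
            producer.send(new ProducerRecord<>("store-data", storeId, payload));
        }
    }
}
```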

2) Each store takes 15 minutes; how would you design the system to orchestrate the compute faster, so the entire run finishes in under 5 hours?
Sequentially, 2,000 stores × 15 minutes is 30,000 minutes, i.e. 500 hours, so we need at least 100-way parallelism to stay under 5 hours. We reduce the computing time by running stores (or groups of stores) in parallel, one task per partition, since each partition's data can be processed independently. K-Streams transforms/analyzes the data on the fly and can run custom analytical algorithms on the data flow. As the design doc shows, each store's data passes through multiple levels and intermediate topics; those processors run our custom analytical algorithms and store the results in a final topic. The sizing is sketched below.
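
The sizing claim above is plain arithmetic; the sketch below just makes it explicit, using only the figures given in the problem statement. In K-Streams terms, the resulting number is the minimum partition count for the input topic, and the matching parallelism comes from running enough application instances (times num.stream.threads) to host one stream task per partition.

```java
public class CapacityPlan {
    public static void main(String[] args) {
        // Figures from the problem statement.
        int stores = 2000;
        double minutesPerStore = 15.0;
        double budgetHours = 5.0;

        // Total sequential compute: 2,000 stores * 15 min = 30,000 min = 500 h.
        double sequentialHours = stores * minutesPerStore / 60.0;

        // Minimum parallelism (partitions / stream tasks) for the 5 h budget.
        int minParallelism = (int) Math.ceil(sequentialHours / budgetHours); // 100

        System.out.printf("Sequential compute: %.0f hours%n", sequentialHours);
        System.out.printf("Minimum parallel tasks for < %.0f h: %d (about %d stores per partition)%n",
                budgetHours, minParallelism, stores / minParallelism);
    }
}
```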
