06/15/2015
Blagoy Kaloferov
Software Engineer
Copyright Edmunds.com, Inc. (the Company). Edmunds and the Edmunds.com logo are registered
trademarks of the Company. This document contains proprietary and/or confidential information of the
Company. No part of this document or the information it contains may be used, or disclosed to any
person or entity, for any purpose other than advancing the best interests of the Company, and any
such disclosure requires the express approval of the Company.
Today's talk
About me:
Blagoy Kaloferov
Big Data Software Engineer
About my company:
Edmunds.com is a car buying platform
Divisions:
DWH Engineers
Business Analysts
Statistics team
Spark SQL: Simplified ETL and enhanced Visualization tools
Architecture for Analytics
[Diagram, built up over several slides. Labels:]
o Hadoop Cluster: Raw Data (Clickstream, Inventory, Dealer, Lead, Transaction), Map Reduce jobs, HDFS aggregates
o Business Intelligence Tools: Platfora, Tableau
o Database: Redshift
o Databricks: Spark, Spark SQL (shown alongside Map Reduce in the later builds)
Exposing S3 data via Spark SQL
Our approach:
o Spark SQL tables similar to our existing Redshift tables
o Each Spark SQL table sits on top of S3 datasets
o S3 datasets are split into partitions such as 2015/05/31/01 and 2015/05/31/02
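The partition names on this slide look like hourly `YYYY/MM/DD/HH` prefixes. As a minimal sketch under that assumption (the function name is hypothetical and not from the talk), the prefixes for a time window could be enumerated like this:

```python
from datetime import datetime, timedelta


def hourly_partition_prefixes(start, end):
    """Yield 'YYYY/MM/DD/HH' prefixes for every hour from start to end, inclusive.

    Assumption: partitions are hourly and named after the hour they cover.
    """
    current = start.replace(minute=0, second=0, microsecond=0)
    while current <= end:
        yield current.strftime("%Y/%m/%d/%H")
        current += timedelta(hours=1)


prefixes = list(
    hourly_partition_prefixes(datetime(2015, 5, 31, 1), datetime(2015, 5, 31, 2))
)
print(prefixes)  # ['2015/05/31/01', '2015/05/31/02']
```

A scheduled job could then check which of these prefixes actually exist in S3 before exposing them to a table.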
Adding Latest Table Partitions
Scheduled jobs:
o Register valid partitions of Spark SQL tables with the Spark Hive Metastore
o Create a Last_X_Days copy of any Spark SQL table in memory
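The two scheduled jobs above can be sketched as plain SQL-string builders. This is only an illustration: the partition columns (`year`, `month`, `day`, `hour`), the table name `leads`, and the bucket path are assumptions, since the talk does not show the real schema. The statements rely on the standard HiveQL `ALTER TABLE ... ADD PARTITION` form and on Spark SQL's `CACHE TABLE ... AS SELECT`.

```python
from datetime import date, timedelta


def add_partition_statement(table, base_path, prefix):
    """Build a statement that registers one hourly partition in the metastore.

    Assumption: prefix is 'YYYY/MM/DD/HH' and the table is partitioned by
    string columns year, month, day, hour (hypothetical schema).
    """
    year, month, day, hour = prefix.split("/")
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{year}', month='{month}', day='{day}', hour='{hour}') "
        f"LOCATION '{base_path}/{prefix}'"
    )


def last_x_days_statement(table, x, today):
    """Build a statement that caches rows dated on or after (today - x days).

    Zero-padded year/month/day strings compare correctly as text.
    """
    cutoff = (today - timedelta(days=x)).strftime("%Y%m%d")
    return (
        f"CACHE TABLE {table}_last_{x}_days AS "
        f"SELECT * FROM {table} WHERE concat(year, month, day) >= '{cutoff}'"
    )


register_sql = add_partition_statement(
    "leads", "s3://example-bucket/leads", "2015/05/31/01"
)
cache_sql = last_x_days_statement("leads", 7, date(2015, 5, 31))
print(register_sql)
print(cache_sql)
```

In a real job each string would be passed to the Spark SQL context for execution; that step is left out here so the sketch runs without a cluster.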
It's about time!
o Faster Pipeline
o Better Insights
o Anyone can ETL on prefixed aggregates and create new Data Marts
Business Intelligence Tools
Platfora Dashboards Pipeline Optimization
[Diagram: Source Dataset, then Build / Update Lens, then Dashboards. Other labels: HDFS, S3, Joined Datasets, MapReduce jobs]
Limitations:
o We cannot optimize the Platfora Map Reduce jobs
o Defined Data Marts are not available elsewhere
Platfora Dealer Leads dataset Use Case
o Inputs: Region Info, Dealer Info, Lead Categorization, Lead Submitter Data, Vehicle Info, Transaction Data
o Join on lead_id, inventory_id, visitor_id, dealer_id
o Result: Dealer Leads Joined Dataset, used for Insights
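The Dealer Leads join can be illustrated with a small in-memory sketch. The column names other than the four join keys, the sample rows, and the choice of a left join are all assumptions made for illustration; the talk only names the keys.

```python
def left_join(rows, lookup_rows, key):
    """Left-join rows with lookup_rows on one key column.

    Rows without a match keep only their own columns.
    """
    index = {r[key]: r for r in lookup_rows}
    joined = []
    for row in rows:
        match = index.get(row[key], {})
        # Columns from the left row win if both sides share a name.
        joined.append({**match, **row})
    return joined


# Hypothetical sample data: one lead, one dealer, one vehicle.
leads = [{"lead_id": 1, "dealer_id": 10, "inventory_id": 100, "visitor_id": 1000}]
dealers = [{"dealer_id": 10, "region": "West"}]
vehicles = [{"inventory_id": 100, "make": "ExampleMake"}]

dealer_leads = left_join(
    left_join(leads, dealers, "dealer_id"), vehicles, "inventory_id"
)
print(dealer_leads)
```

In Spark SQL the same shape would be a chain of `LEFT JOIN ... ON` clauses, one per key, producing a single wide table that the dashboards can read directly.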
Optimizing Dealer Leads Dataset
Dashboards
10 minutes 10 minutes
ETL and Visualization takeaway
o New Data Marts using Spark SQL
[Diagram labels: ETL, S3, Platfora, Redshift load]
POC with Spark SQL vision
Ad Revenue Billing Use Case Introduction
Definitions:
o Impression, CPM
o Line Item, Order
Combine data
o Line Item groupings
o MERGED | CPM | NEW_impressions | attributes
o Cap impression!
o Adjust impression!
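As a rough sketch of the merge and cap steps, the snippet below sums impressions across line items that share a grouping and then caps the merged total. The field names, the contracted limit, and the rule that a cap is a simple upper bound are assumptions; the slides do not spell out the actual rules.

```python
def merge_line_items(rows):
    """Sum impressions of line items that share a grouping key.

    Assumption: every line item in a group has the same CPM.
    """
    merged = {}
    for row in rows:
        entry = merged.setdefault(
            row["group"], {"cpm": row["cpm"], "new_impressions": 0}
        )
        entry["new_impressions"] += row["impressions"]
    return merged


def cap_impressions(impressions, contracted):
    """Never bill for more impressions than the contracted amount."""
    return min(impressions, contracted)


# Hypothetical sample data.
rows = [
    {"group": "A", "cpm": 5.0, "impressions": 60000},
    {"group": "A", "cpm": 5.0, "impressions": 50000},
    {"group": "B", "cpm": 2.0, "impressions": 10000},
]
merged = merge_line_items(rows)
capped_a = cap_impressions(merged["A"]["new_impressions"], contracted=100000)
print(merged["A"]["new_impressions"], capped_a)  # 110000 100000
```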
Can we automate ad revenue calculation?
Process Vision:
o Each Line Item: 1_day_impr | CPM | attributes
o Billing Engine
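CPM is the price per thousand impressions, so a line item's revenue for one day is its impressions divided by 1000, times its CPM. The sketch below applies that formula to daily impression counts; the function names and sample numbers are hypothetical, and the real Billing Engine presumably applies further rules that the slides do not show.

```python
from decimal import Decimal


def daily_revenue(one_day_impr, cpm):
    """Revenue for one day: impressions / 1000 * CPM."""
    return (Decimal(one_day_impr) / Decimal(1000)) * Decimal(str(cpm))


def bill_line_item(daily_impressions, cpm):
    """Total revenue for a line item over a list of daily impression counts."""
    return sum((daily_revenue(i, cpm) for i in daily_impressions), Decimal("0"))


# Hypothetical line item: three days of impressions at a CPM of 4.50.
total = bill_line_item([12000, 8000, 5000], "4.50")
print(total)
```

`Decimal` is used instead of floats so that currency amounts do not pick up binary rounding errors.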
Automation Challenges
Project Architecture in Spark
[Diagram, built up over three slides. Labels:]
o Inputs: Line Item Dimensions, DFP / Other Impressions
o Engines: Line Item Merging Engine, Billing Rules Engine
o Spark SQL: aggregate, join
o Output: Ad Performance, viewed in Tableau by Business Analysts
Thank you!
Blagoy Kaloferov
bkaloferov@edmunds.com