Building Data Lakes Workshop Series q3 2018 Unnik

Data Lakes on AWS
Big Data Hands-on Workshop
Unni Pillai
Senior Solutions Architect – Big Data and Analytics
unni_k_pillai
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is a Data Lake?
“a single store for all of the raw data that anyone in an organization
might need to analyse” - Martin Fowler
“If you think of a datamart as a store of bottled water – cleansed and packaged and
structured for easy consumption – the data lake is a large body of water in a more natural
state. The contents of the data lake stream in from a source to fill the lake, and various
users of the lake can come to examine, dive in, or take samples.” - James Dixon
“The promise of a data lake is to provide a place to store the data…so it will be available for analytics and data science” – Alex Gorelik
Characteristics of a Data Lake
Collect Everything | Dive in Anywhere | Flexible Access
Traditionally, Analytics Used to Look Like This
• Relational data
• TBs–PBs scale
• Highly fragmented and siloed
• Schema defined prior to data load & locked in once loaded - inflexible
• Operational reporting and ad hoc
• Large initial CAPEX + $10K–$50K/TB/Year
Diagram labels: sources (OLTP, ERP, CRM, LOB) → Data Warehouse → Business Intelligence; Multiple → Silos
Data Lakes Extend the Traditional Approach
• Structured and unstructured data
• Complement or replace DW
• TBs–EBs scale
• Diverse analytical engines
• Schema on-read - Dynamic
• Low-cost storage & analytics
• Extensible and scalable
Diagram labels: sources (OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social); stores (Data Warehouse, Data Lake, Catalog); engines (DW Queries, Big data processing, Interactive, Real-time); consumers (Business Intelligence, Machine Learning)
Data Lakes Extend the Traditional Approach
A data lake is an architectural approach that allows you to store massive amounts of data in a central location, so it's readily available to be categorized, processed, analyzed and consumed by diverse groups of users within an organization.
Diagram labels: sources (OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social); stores (Data Warehouse, Data Lake, Catalog); engines (DW Queries, Big data processing, Interactive, Real-time); consumers (Business Intelligence, Machine Learning)
Components of a Data Lake
Layers (bottom to top): Ingestion, Storage, Catalog & Search, Process, Security
Ingestion
• Different data types
• Streaming data
• Log file data
• Database data
• Flexible ingestion mechanisms
Storage
• High durability
• High scalability
• Stores raw data
• Support for any type of data
• Low cost
Catalog & Search
• Metadata lake
• Classification
• Search for simplified access
• Internal users
• External parties
• Expose through an API
Process
• Transformation: data, data formats
• Analysis: SQL, exploratory, machine learning
Security
• Encryption
• Authentication
• Authorisation
• Auditing
• Internal users
• External parties
Data Lake technology?
Hadoop seems perfect: Scalable, Extensible, Flexible
BUT…….
Diagram: many COMPUTE nodes, one STORAGE layer
Ingest
Data Movement From On-premises Datacenters
• AWS Direct Connect: Establish a dedicated 10G network connection from your premises to AWS; reduces your network costs, increases bandwidth throughput, and provides a more consistent network experience than Internet-based connections
• AWS Snowball, Snowball Edge and Snowmobile: Petabyte and Exabyte-scale data transport solution that uses secure appliances to transfer large amounts of data into and out of the AWS cloud
• AWS Database Migration Service: Migrate databases from the most widely-used commercial and open-source offerings to AWS quickly and securely with minimal downtime to applications
• AWS Storage Gateway: Lets your on-premises applications use AWS for storage; includes a highly-optimized data transfer mechanism, bandwidth management, along with local cache
Data Movement From Real-time Sources
• AWS IoT Core: Supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely
• Amazon Kinesis Data Firehose: Capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools. Integration with Kinesis Analytics for real-time IoT predictive maintenance.
• Amazon Kinesis Data Streams: Build custom, real-time applications that process data streams using popular stream processing frameworks
• Amazon Kinesis Video Streams: Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing
Storage
Amazon S3 - Object Storage
• Durability, Availability & Scalability: Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
• Security and Compliance: Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail, use ML to discover and protect sensitive data with Macie
• Query in Place: Run analytics & ML on data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by up to 400%
• Flexible Management: Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
Amazon Glacier—Backup and Archive
• Durability, Availability & Scalability: Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
• Retrieves data in minutes: Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes
• Secure: Log and monitor with CloudTrail, Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements
• Inexpensive: Lowest cost AWS object storage class, allowing you to archive large amounts of data at a very low cost
Data Driven Enterprise View – S3 is the Data Lake
Enterprise Data Driven View and Reporting

Functional Reporting and Analysis
Functional Use Cases | AI | ML | Analytics | Insights Ent Apps Agile Apps
Network Operations | Regional Operations | Network Engineering | Customer Care | IT | Marketing | Retail | Enterprise B2B | DevTest DevOps
S3 Data Lake on AWS
Clickstream Text IoT Device Network OTT Customer Region Organization Orders BSS OSS Hadoop
Web Sentiment Edge xDR Media Relationship Shadow IT Catalog
- Common Enterprise KPIs and Metrics
- Extensible to applications and visualization tools
- Democratization of Data – Data Driven
- Democratization of AI and ML capabilities
Store Data in the Format You Want
Open and comprehensive
• Data is being generated in many formats.
• Easily land all your data in native formats to the S3 Data Lake “landing zone”
• Store data in the format you want – ask the questions you want from the data:
  • Text files like CSV
  • Columnar like Apache Parquet, and Apache ORC
  • Logstash like Grok
  • JSON (simple, nested), AVRO
  • And more…
• Transform data based on your use case:
  • For example – for performance and efficiency transform JSON into Parquet and store in a new bucket.
Diagram labels: CSV, ORC, Grok, Avro, Parquet, JSON; Amazon S3, Amazon Glacier, AWS Glue
Catalog
Data Preparation Accounts for ~80% of the Work
Building training sets
Cleaning and organizing data
Collecting data sets
Mining data for patterns
Refining algorithms
Other
Storing is Not Enough, Data Needs to Be Discoverable
“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data
Diagram labels: CRM, ERP, Data warehouse, Mainframe data, Web, Social, Log files, Machine data, Semi-structured, Unstructured
Data Lake Challenges
Data variety and data volumes are increasing rapidly
• Ingest
• Discover
• Catalog
• Understand
• Curate
… all kinds of data
Complex Orgs - Multiple Consumers and Applications
Quickly drive new insights
AWS Glue - Data Catalog
Make Data Discoverable
• Automatically discovers data and stores schema
• Catalog makes data searchable, and available for ETL
• Catalog contains table and job definitions
• Computes statistics to make queries efficient
Diagram: Glue Data Catalog (discover data and extract schema) → ETL (auto-generate customizable code in Python and Spark)
Compliance
AWS Glue—ETL Service
Make ETL scripting and deployment easy
• Automatically generates ETL code
• Code is customizable with Python and Spark
• Endpoints provided to edit, debug, test code
• Jobs are scheduled or event-based
• Serverless
Security
Level of Data Lake Security
Data Security | Infra Security | Access Control | Compliance / Governance
AWS Provides Highest Levels of Security
Multiple levels of security, identity and access management, encryption, and compliance to secure your data lake.
Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, VPC
Identity: AWS IAM, AWS SSO, Amazon Cloud Directory, AWS Directory Service, AWS Organizations
Encryption: AWS Certificate Manager, AWS Key Management Service, encryption at rest, encryption in transit, bring your own keys, HSM support
Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail
Compliance: Log and Audit all AWS Activity
• Log and continuously monitor every account activity and API call with CloudTrail
• Increase visibility into your user and resource activity
• Enables governance, compliance, and operational and risk auditing (i.e. separation of concerns)
Flow: Store data in S3 → Account event occurs, generating API activity → CloudTrail captures and records the API activity → A log of API calls is delivered to an S3 bucket and optionally delivered to CloudWatch Logs and CloudWatch Events
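As a rough illustration of what auditing the delivered logs can look like, the sketch below summarizes API activity from one CloudTrail log file. It assumes the usual layout of a delivered log file (a JSON object with a `Records` array whose entries carry `eventSource` and `eventName`); the sample records and the helper name `summarize_api_calls` are made up for the example.

```python
import json
from collections import Counter

def summarize_api_calls(log_text):
    """Count API calls per (eventSource, eventName) in one CloudTrail log file."""
    records = json.loads(log_text).get("Records", [])
    return Counter((r["eventSource"], r["eventName"]) for r in records)

# Hypothetical sample, trimmed to the fields this sketch reads.
sample_log = json.dumps({"Records": [
    {"eventTime": "2018-09-01T10:00:00Z", "eventSource": "glue.amazonaws.com", "eventName": "StartJobRun"},
    {"eventTime": "2018-09-01T10:05:00Z", "eventSource": "glue.amazonaws.com", "eventName": "StartJobRun"},
    {"eventTime": "2018-09-01T10:06:00Z", "eventSource": "s3.amazonaws.com", "eventName": "CreateBucket"},
]})

summary = summarize_api_calls(sample_log)
for (source, name), count in summary.most_common():
    print(source, name, count)
```

In practice the log files land gzip-compressed in the S3 bucket, so a real audit job would decompress each object before parsing it this way.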
Questions / Comments ?
Dive Deep Into AWS Glue
What is AWS Glue?
Fully-managed, serverless extract-transform-load (ETL) service for developers, built by developers
1000s of Developers and jobs
There are many tools already in AWS Ecosystem
Amazon Redshift Partner Page for Data Integration
Fivetran
Still ETL Developers Hand-Code
• Canvas based tools are hard to extend
• Code is flexible, powerful, and easy to share
• Familiar tools and development pipelines
• IDEs, version control, testing, continuous integration
• Highly customizable tasks can be achieved
Hand-coding is laborious
• schemas change
• data formats change
• add or change sources
• data volume grows
All of this makes hand-coding error-prone & brittle
AWS Glue does the undifferentiated heavy lifting so developers can easily customize
AWS Glue Components
Data Catalog (Discover)
• Automatic crawling
• Apache Hive Metastore compatible
• Integrated with AWS analytic services
Job Authoring (Develop)
• Auto-generates ETL code
• Python and Apache Spark
• Edit, Debug, and Explore
Job Execution (Deploy)
• Serverless execution
• Flexible scheduling
• Monitoring and alerting
Common use-cases
Load data warehouses
Build a data lake on Amazon S3
Example Architecture
Diagram components: Archive, two Amazon S3 buckets, AWS Glue Crawlers, AWS Glue Data Catalog, AWS Glue ETL, and the analytics services Amazon Athena and Amazon QuickSight
AWS Glue - data catalog
Make data discoverable
Automatically discovers data and stores schema
Catalog makes data searchable, and available for ETL
Catalog contains table and job definitions
Computes statistics to make queries efficient
Diagram: Glue Data Catalog discovers data and extracts schema from RDS, S3, Redshift
ETL example (cont'd)
Organize data in Apache Hive-style partitions
Diagram: JSON data partitioned by year (2017) / month (11, 12, …) / day (27, 28, …) / hour → filter & transform → CSV data partitioned by year / month / day / hour
Step 1: Run crawler
200+ fields
Groups files into Apache Hive-style partitions
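To make the partition layout concrete, here is a minimal sketch of building and parsing Hive-style `key=value` prefixes for the year / month / day / hour hierarchy shown above. The function names and the `events.json` file name are hypothetical, chosen only for this illustration.

```python
from datetime import datetime

def partition_prefix(ts):
    """Build a Hive-style prefix such as year=2017/month=11/day=27/hour=05/."""
    return f"year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}/"

def parse_partitions(key):
    """Recover partition columns from the directory part of an object key."""
    directories = key.split("/")[:-1]  # drop the file name
    return dict(seg.split("=", 1) for seg in directories if "=" in seg)

prefix = partition_prefix(datetime(2017, 11, 27, 5))
key = prefix + "events.json"
print(key)
print(parse_partitions(key))
```

Because the partition values are encoded in the path, a query that filters on year, month, or day only needs to read the matching prefixes.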
AWS Glue - ETL service
Make ETL scripting and deployment easy
Serverless Transformations
Based on Apache Spark
Automatically generates ETL code
Code is customizable with PySpark and Scala

Endpoints provided to edit, debug, test code
Jobs are scheduled or event-based
Step 2: Specify mappings
Anatomy of a generated script
Initialize job bookmark
Annotations for graphical DAG
Read Dynamic Frame from source
Data transformation + data cleaning functions
Write Dynamic Frame to sink
Commit job bookmark
Step 3: Edit + Test with Dev-Endpoints
Diagram labels: AWS Glue Spark environment, Interpreter Server, Remote Interpreter
Connect your IDE to an AWS Glue development endpoint.
Environment to interactively develop, debug, and test ETL code.
Step 3: Explore and experiment with data
Connect your notebook (e.g. Zeppelin) to an AWS Glue development endpoint.
Interactively experiment and explore datasets and data sources
Deploy to production
Push scripts to S3
Create or register with ETL job
Step 4: Schedule a job
several event types pass parameters
Serverless job execution
No need to provision, configure, or manage servers
Auto-configure VPC & role-based access
Security & isolation preserved
Customers can specify job capacity (DPU)
Automatically scale resources
Only pay for the resources you consume: per-second billing (10-minute min)
Diagram labels: Compute instances, Customer VPC, Customer VPC
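The billing rule above (per-second billing with a 10-minute minimum, scaled by job capacity in DPUs) can be sketched as a small calculation. The function name is hypothetical and the price per DPU-hour is deliberately left as a parameter; the 0.44 used in the example is an illustrative assumption, not a quoted price.

```python
def glue_job_cost(dpus, runtime_seconds, price_per_dpu_hour, minimum_seconds=600):
    """Estimate job cost: billed per second, but never for less than the minimum."""
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * (billed_seconds / 3600) * price_per_dpu_hour

# Assumed rate for illustration only.
ASSUMED_PRICE = 0.44

short_job = glue_job_cost(dpus=10, runtime_seconds=120, price_per_dpu_hour=ASSUMED_PRICE)
long_job = glue_job_cost(dpus=10, runtime_seconds=900, price_per_dpu_hour=ASSUMED_PRICE)
print(f"2-minute job billed as 10 minutes: {short_job:.4f}")
print(f"15-minute job billed per second:  {long_job:.4f}")
```

The point of the example is that a 2-minute job and a 10-minute job cost the same under this rule, so very short jobs are relatively expensive per unit of work.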
Lab - Architecture
Diagram components: Archive, two Amazon S3 buckets, AWS Glue Crawlers, AWS Glue Data Catalog, AWS Glue ETL, Amazon Athena, Amazon QuickSight
WiFi Password
Wifi Name: AWSWORKSHOP

Password: AWSworkshop2018
Lab Environment
1. Open new ‘Incognito / Private Browsing’ tab

2. Go to : https://unniklabs.awsapps.com/start
3. Login with username / password provided
4. Open your ‘Management Console’
>> Start the lab
Lab Guide
http://bit.ly/aws-innovate-2018-glue-demo
Under the hood ?
Public GitHub timeline is …
semi-structured
35+ event types
payload structure and size varies by event type
Apache Spark and AWS Glue ETL
What is Apache Spark?
Parallel, scale-out data processing engine
Fault-tolerance built-in
Flexible interface: Python scripting, SQL
Rich eco-system: ML, Graph, analytics, …
AWS Glue ETL libraries
Integration: Data Catalog, job orchestration, code-generation, job bookmarks, S3, RDS
ETL transforms, more connectors & formats
New data structure: Dynamic Frames
Diagram: SparkSQL (Dataframes) and AWS Glue ETL (Dynamic Frames) sit on top of Spark core: RDDs
Dataframes and Dynamic Frames
Dataframes
Core data structure for SparkSQL
Like structured tables
Need schema up-front
Each row has same structure
Suited for SQL-like analytics
Dynamic Frames
Like dataframes for ETL
Designed for processing semi-structured data, e.g. JSON, Avro, Apache logs ...
Dynamic Frame internals
Dynamic Records
{"id":"2489", "type": "CreateEvent", "payload": {"creator":…}, …}
{"id":"6510", "type": "PushEvent", "payload": {"pusher":…}, …}
{"id":4391, "type": "PullEvent", "payload": {"assets":…}, …}
Dynamic Frame Schema
Schema per-record, no up-front schema needed
• Easy to restructure, tag, modify
• Can be more compact than dataframe rows
• Many flows can be done in single-pass
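The idea of a per-record schema can be illustrated in plain Python, without the Glue libraries. The sketch below derives a frame-level schema as the union of the types seen per field, which exposes that `id` arrives both as a string and as a number in the records above, and then resolves that ambiguity with a cast. This is only an analogy for what Dynamic Frames and `ResolveChoice()` do, not their implementation; the helper names are hypothetical and the elided payload contents are kept as `"..."` placeholders.

```python
def frame_schema(records):
    """Union of the value types observed for each field across all records."""
    schema = {}
    for record in records:
        for name, value in record.items():
            schema.setdefault(name, set()).add(type(value).__name__)
    return schema

def cast_field(records, name, cast):
    """Return new records with one field cast to a single type."""
    return [{**r, name: cast(r[name])} if name in r else dict(r) for r in records]

# Records modelled on the slide; payload contents are elided placeholders.
records = [
    {"id": "2489", "type": "CreateEvent", "payload": {"creator": "..."}},
    {"id": "6510", "type": "PushEvent", "payload": {"pusher": "..."}},
    {"id": 4391, "type": "PullEvent", "payload": {"assets": "..."}},
]

schema = frame_schema(records)          # "id" maps to both str and int
resolved = cast_field(records, "id", str)
print(sorted(schema["id"]))
print(sorted(frame_schema(resolved)["id"]))
```

A DataFrame would need this conflict settled before loading; carrying the schema per record lets the decision be deferred until the data has been inspected.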
Dynamic Frame transforms
15+ transforms out-of-the box
ResolveChoice(): project, cast, or separate into cols
ApplyMapping()
Relationalize() transform
Semi-structured schema → Relational schema
Diagram: a record with fields A, B, B, C {X, Y}, D[ ] becomes a main table (A, B, B, C.X, C.Y, FK) plus a child table (PK, Offset, Value)
Transforms and adds new columns, types, and tables on-the-fly
Tracks keys and foreign keys across runs
SQL on the relational schema is orders of magnitude faster than JSON processing
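The semi-structured-to-relational step can be sketched in plain Python. The function below flattens one level of nested structs into dotted column names and moves one array field into a child table linked by a generated key. It is an illustration of the idea only, not the Glue `Relationalize()` implementation, and the column names `id`, `index` and `val` in the child table are assumptions made for this example.

```python
def relationalize(records, array_field):
    """Split records into a flat root table and a child table for one array field."""
    root, child = [], []
    for key, record in enumerate(records, start=1):
        row = {}
        for name, value in record.items():
            if name == array_field:
                row[name] = key  # foreign key pointing at the child rows
                child.extend({"id": key, "index": i, "val": v} for i, v in enumerate(value))
            elif isinstance(value, dict):
                for sub, subvalue in value.items():
                    row[f"{name}.{sub}"] = subvalue  # flatten one struct level
            else:
                row[name] = value
        root.append(row)
    return root, child

records = [{"A": 1, "B": "x", "C": {"X": 10, "Y": 20}, "D": [7, 8]}]
root, child = relationalize(records, "D")
print(root)
print(child)
```

Once the data is in this shape, the root and child tables can be loaded into a relational store and joined on the generated key.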
Useful AWS Glue transforms
toDF(): Convert to a Dataframe

Spigot(): Sample data of any Dynamic Frame to S3
Unbox(): Parse string column as given format into Dynamic Frame
Filter(), Map(): Apply Python UDFs to Dynamic Frames
Join(): Join two Dynamic Frames
And more ….
Performance: AWS Glue ETL
[Chart: GitHub Timeline ETL Performance. Time (sec), lower is better, for DynamicFrames vs. DataFrames; data size (# files): Day 24, Month 744, Year 8699]
Configuration: 10 DPUs, Apache Spark 2.1.1
Workload: JSON to CSV, filter for Pull events
On average: 2x performance improvement
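For readers who want a feel for the benchmark workload (JSON to CSV, keeping only Pull events), here is a minimal single-machine sketch using the standard library. It is not the Glue job that was measured; the function name is hypothetical and the event type string follows the `PullEvent` label used in this deck.

```python
import csv
import io
import json

def json_lines_to_csv(lines, event_type, columns):
    """Keep events of one type and write the chosen columns as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for line in lines:
        event = json.loads(line)
        if event.get("type") == event_type:
            writer.writerow([event.get(c, "") for c in columns])
    return out.getvalue()

# Hypothetical newline-delimited JSON input.
lines = [
    '{"id": "2489", "type": "CreateEvent"}',
    '{"id": "6510", "type": "PushEvent"}',
    '{"id": 4391, "type": "PullEvent"}',
]
csv_text = json_lines_to_csv(lines, "PullEvent", ["id", "type"])
print(csv_text)
```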
Performance: Lots of small files
Lots of small files, e.g. Kinesis Firehose
[Chart: AWS Glue ETL small file scalability. Time (sec) for Spark vs. Glue by # partitions : # files, from 1:2K to 640:1280K. Spark runs out of memory at >= 320 partitions : 640K files; Glue reaches 1.2 million files.]
Vanilla Apache Spark (2.1.1) overheads
• Must reconstruct partitions (2-pass)
• Too many tasks: task per file
• Scheduling & memory overheads
AWS Glue Dynamic Frames
• Integration with Data Catalog
• Grouping: automatically group files per task
• Rely on crawler statistics
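The grouping idea (several small files per task instead of one task per file) can be sketched as a simple size-based packing. This is an illustration under assumed inputs, not how Glue implements grouping; the function name, file names and sizes are hypothetical.

```python
def group_files(files, target_bytes):
    """Pack (key, size) pairs into groups of roughly target_bytes each, in order."""
    groups, current, current_size = [], [], 0
    for key, size in files:
        # Start a new group when adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        groups.append(current)
    return groups

# Five hypothetical 40-byte files, grouped into tasks of at most about 100 bytes.
files = [(f"part-{i}.json", 40) for i in range(5)]
groups = group_files(files, target_bytes=100)
print(len(files), "files ->", len(groups), "tasks")
```

With one task per file the scheduler would see five tasks here; grouping cuts that to three, and the saving grows with the number of small files.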
Job bookmark example
Diagram: input table and output table, both partitioned by year (2017) / month (11, 12, …) / day (27, 28, …) / hour, with run 1 and run 2 marked
Periodically run a job
• avoid reprocessing previous input
• avoid generating duplicate output
Job bookmarks
Bookmarks are per-job checkpoints that track the work done in previous runs.
They persist the state of sources, transforms, and sinks on each run.
Example uses:
• Process githubarchive files daily
• Process Firehose files hourly
• Track timestamps or primary keys in DBs
• Track generated foreign keys for normalization
Job bookmark options
Option: Behavior
Enable: Pick up from where you left off
Disable: Ignore and process the entire dataset every time
Pause: Temporarily disable advancing the bookmark
Examples:
Enable: Process the newest githubarchive partition
Disable: Process the entire githubarchive table
Pause: Process the previous githubarchive partition
Job bookmark internals
Bookmark state, example run 3: …
How do we avoid space blowup?
Use timestamps to filter already processed input: process files created after T2
But S3 is eventually consistent?
Maintain an exclusion list of files created in the inconsistency window (size d) prior to start: process files created after T2 - d, skipping the files on the exclusion list
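The timestamp-plus-exclusion-list scheme can be sketched as a small simulation. It assumes that T2 is the start time of the previous run and that the bookmark stores that time together with the files already handled inside the window of size d before it. This is a sketch of the idea described above, not Glue's actual bookmark code; all names and numbers are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Bookmark:
    last_start: float = float("-inf")   # start time of the previous run
    exclusion: frozenset = frozenset()  # files already handled inside the window

def select_new_files(listing, bookmark, run_start, window):
    """Pick unprocessed files and return the bookmark for the next run.

    listing maps object key -> creation time, as visible when this run starts.
    """
    cutoff = bookmark.last_start - window
    to_process = sorted(
        key for key, created in listing.items()
        if created > cutoff and key not in bookmark.exclusion
    )
    # Remember handled files that fall inside the next run's look-back window.
    handled = set(to_process) | set(bookmark.exclusion)
    next_exclusion = frozenset(
        k for k in handled if k in listing and listing[k] > run_start - window
    )
    return to_process, Bookmark(run_start, next_exclusion)

# Run 1 starts at t=100 and sees files a and b.
run1, bm1 = select_new_files({"a": 50, "b": 95}, Bookmark(), run_start=100, window=10)
# Run 2 starts at t=200; file c (created at 97) only became visible after run 1.
run2, bm2 = select_new_files({"a": 50, "b": 95, "c": 97, "d": 150}, bm1, run_start=200, window=10)
print(run1, run2)
```

Filtering on "created after T2" alone would miss file c, while looking back by d without the exclusion list would process file b twice; the bookmark only has to store the files inside the window, which keeps its state small.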
Training: Big Data on AWS
Make your data driven decisions count, and make a career in Big Data on AWS. Follow the Big Data Specialty learning path and become a specialist in Big Data:
• Implement core AWS Big Data services according to best practices
• Design and maintain Big Data
• Leverage tools to automate data analysis
Who should follow this learning path?
• Enterprise solutions architects
• Big Data solutions architects
• Data scientists
• Data analysts
Learning path:
• AWS Certified Big Data - Specialty
• Big Data on AWS – 3-day Classroom Training
• Free AWS digital training: Big Data Technology Fundamentals
• Associate-level Certification
• Certified Cloud Practitioner
• Free AWS digital training: Foundational knowledge
Feedback
