Data Lakes on AWS

Big Data Hands-on Workshop

Unni Pillai
Senior Solutions Architect – Big Data and Analytics

unni_k_pillai

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is a Data Lake?

“a single store for all of the raw data that anyone in an organization
might need to analyse” - Martin Fowler

“If you think of a datamart as a store of bottled water – cleansed and packaged and
structured for easy consumption – the data lake is a large body of water in a more natural
state. The contents of the data lake stream in from a source to fill the lake, and various
users of the lake can come to examine, dive in, or take samples.” - James Dixon

“The promise of a data lake is to provide a place to store the data…so it will be available for analytics and data science” – Alex Gorelik

Characteristics of a Data Lake

• Collect everything
• Dive in anywhere
• Flexible access

Traditionally, Analytics Used to Look Like This

• Relational data
• TBs–PBs scale
• Highly fragmented and siloed
• Schema defined prior to data load and locked in once loaded (inflexible)
• Operational reporting and ad hoc
• Large initial CAPEX + $10K–$50K/TB/year

[Diagram: OLTP, ERP, CRM, and LOB sources feed multiple siloed data warehouses, which serve business intelligence]

Data Lakes Extend the Traditional Approach

• Structured and unstructured data
• Complement or replace DW
• TBs–EBs scale
• Diverse analytical engines
• Schema on-read (dynamic)
• Low-cost storage & analytics
• Extensible and scalable

[Diagram: OLTP, ERP, CRM, LOB, devices, web, sensors, and social sources feed a data warehouse and a data lake; a catalog sits above them and serves DW queries, big data processing, interactive, and real-time workloads for business intelligence and machine learning]

Data Lakes Extend the Traditional Approach

A data lake is an architectural approach that allows you to store massive amounts of data in a central location, so it's readily available to be categorized, processed, analyzed, and consumed by diverse groups of users within an organization.

Components of a Data Lake

Layers shown on each slide: Ingestion, Storage, Catalog & Search, Process, Security.

Ingestion
• Different data types
  • Streaming data
  • Log file data
  • Database data
• Flexible ingestion mechanisms

Storage
• High durability
• High scalability
• Stores raw data
• Support for any type of data
• Low cost

Catalog & Search
• Metadata lake
• Classification
• Search for simplified access
• Internal users
• External parties
• Expose through an API

Process
• Transformation
  • Data
  • Data formats
• Analysis
  • SQL
  • Exploratory
  • Machine learning

Security
• Encryption
• Authentication
• Authorisation
• Auditing
• Internal users
• External parties

Data Lake technology?

• Scalable
• Extensible
• Flexible

Hadoop seems perfect, BUT…

[Diagram: multiple COMPUTE blocks and a single STORAGE block]

Ingest

Data Movement From On-premises Datacenters

• AWS Direct Connect: Establish a dedicated 10G network connection from your premises to AWS; reduces your network costs, increases bandwidth throughput, and provides a more consistent network experience than Internet-based connections.
• AWS Snowball, Snowball Edge and Snowmobile: Petabyte- and exabyte-scale data transport solution that uses secure appliances to transfer large amounts of data into and out of the AWS cloud.
• AWS Database Migration Service: Migrate databases from the most widely used commercial and open-source offerings to AWS quickly and securely, with minimal downtime to applications.
• AWS Storage Gateway: Lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism and bandwidth management, along with a local cache.

Data Movement From Real-time Sources

• AWS IoT Core: Supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely.
• Amazon Kinesis Data Firehose: Capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools.
• Amazon Kinesis Data Streams: Build custom, real-time applications that process data streams using popular stream processing frameworks.
• Amazon Kinesis Video Streams: Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.

Integration with Kinesis Analytics for real-time IoT predictive maintenance.

Storage

Amazon S3 - Object Storage

• Durability, Availability & Scalability: Built for eleven nines (99.999999999%) of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region.
• Security and Compliance: Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie.
• Query in Place: Run analytics & ML on the data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by up to 400%.
• Flexible Management: Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention.

Amazon Glacier—Backup and Archive

• Durability, Availability & Scalability: Built for eleven nines (99.999999999%) of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region.
• Retrieves data in minutes: Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes.
• Secure: Log and monitor with CloudTrail; Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements.
• Inexpensive: Lowest cost AWS object storage class, allowing you to archive large amounts of data at a very low cost.

Data Driven Enterprise View – S3 is the Data Lake

[Diagram: enterprise data sources (clickstream, text, IoT, network, customer, orders, BSS, OSS, Hadoop, and others) land in the S3 Data Lake on AWS, which feeds functional use cases (AI, ML, analytics, insights), enterprise and agile apps, functional reporting and analysis, and an enterprise data driven view and reporting for teams such as network operations, customer care, IT, marketing, and retail]

- Common Enterprise KPIs and Metrics
- Democratization of Data – Data Driven
- Extensible to applications and visualization tools
- Democratization of AI and ML capabilities

Store Data in the Format You Want
Open and comprehensive

• Data is being generated in many formats.
• Easily land all your data in native formats to the S3 Data Lake “landing zone”
• Store data in the format you want, and ask the questions you want from the data:
  • Text files like CSV
  • Columnar like Apache Parquet and Apache ORC
  • Logstash like Grok
  • JSON (simple, nested), Avro
  • And more…
• Transform data based on your use case:
  • For example, for performance and efficiency, transform JSON into Parquet and store it in a new bucket.

Services shown: Amazon S3, Amazon Glacier, AWS Glue.

Catalog

Data Preparation Accounts for ~80% of the Work

[Chart: share of work by task: building training sets, cleaning and organizing data, collecting data sets, mining data for patterns, refining algorithms, other]

Storing is Not Enough, Data Needs to Be Discoverable

“Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing).”
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data

Sources shown: CRM, ERP, data warehouse, mainframe data, web, social, log files, machine data, semi-structured, unstructured.

Data Lake Challenges
Data variety and data volumes are increasing rapidly
• Ingest
• Discover
• Catalog
• Understand
• Curate
… all kinds of data

Complex orgs: multiple consumers and applications
Quickly drive new insights

AWS Glue - Data Catalog
Make Data Discoverable

• Automatically discovers data and stores schema
• Catalog makes data searchable, and available for ETL
• Catalog contains table and job definitions
• Computes statistics to make queries efficient

[Diagram: Glue Data Catalog (discover data and extract schema), ETL (auto-generate customizable code in Python and Spark), Compliance]

AWS Glue—ETL Service
Make ETL scripting and deployment easy

• Automatically generates ETL code
• Code is customizable with Python and Spark
• Endpoints provided to edit, debug, test code
• Jobs are scheduled or event-based
• Serverless

Security

Level of Data Lake Security

• Data security
• Infra security
• Access control
• Compliance / governance

AWS Provides Highest Levels of Security

Multiple levels of security, identity and access management, encryption, and compliance to secure your data lake.

• Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, VPC
• Identity: AWS IAM, AWS SSO, Amazon Cloud Directory, AWS Directory Service, AWS Organizations
• Encryption: AWS Certificate Manager, AWS Key Management Service, encryption at rest, encryption in transit, bring your own keys, HSM support
• Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail

Compliance: Log and Audit all AWS Activity

• Log and continuously monitor every account activity and API call with CloudTrail
• Increase visibility into your user and resource activity
• Enables governance, compliance, and operational and risk auditing (i.e. separation of concerns)

How it works:
1. Store data in S3
2. An account event occurs, generating API activity
3. CloudTrail captures and records the API activity
4. A log of API calls is delivered to an S3 bucket and optionally delivered to CloudWatch Logs and CloudWatch Events

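As a rough illustration of auditing those delivered logs, the sketch below counts API calls per service and action in one CloudTrail log file. It assumes the file has already been downloaded and decompressed, and it relies only on the top-level `Records` array and the `eventSource` and `eventName` fields of each record. The function name `count_api_calls` and the sample records are made up for this example.

```python
import json
from collections import Counter


def count_api_calls(log_text):
    """Count events per (eventSource, eventName) in one CloudTrail log file."""
    records = json.loads(log_text).get("Records", [])
    return Counter((r.get("eventSource"), r.get("eventName")) for r in records)


# Made-up sample in the shape of a CloudTrail log file (not real account data).
sample_log = json.dumps({
    "Records": [
        {"eventTime": "2018-01-01T00:00:00Z", "eventSource": "s3.amazonaws.com", "eventName": "CreateBucket"},
        {"eventTime": "2018-01-01T00:05:00Z", "eventSource": "s3.amazonaws.com", "eventName": "CreateBucket"},
        {"eventTime": "2018-01-01T00:06:00Z", "eventSource": "glue.amazonaws.com", "eventName": "StartJobRun"},
    ]
})

calls = count_api_calls(sample_log)
for (source, name), count in sorted(calls.items()):
    print(f"{source} {name}: {count}")
```

The same counting could be done with a SQL query in Athena once the log bucket is cataloged; the plain-Python version only shows the record shape.
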
Questions / Comments ?

Dive Deep Into AWS Glue

What is AWS Glue?

Fully-managed, serverless
extract-transform-load (ETL) service
for developers, built by developers

1000s of Developers and jobs

There are already many tools in the AWS ecosystem

Amazon Redshift Partner Page for Data Integration

Fivetran

ETL Developers Still Hand-Code

• Canvas-based tools are hard to extend
• Code is flexible, powerful, and easy to share
• Familiar tools and development pipelines
  • IDEs, version control, testing, continuous integration
• Highly customizable tasks can be achieved

Hand-coding is laborious

What makes hand-coding error-prone & brittle:
• Schemas change
• Data formats change
• Add or change sources
• Data volume grows

AWS Glue does the undifferentiated heavy lifting so developers can easily customize.

AWS Glue Components

Data Catalog (Discover)
• Automatic crawling
• Apache Hive Metastore compatible
• Integrated with AWS analytic services

Job Authoring (Develop)
• Auto-generates ETL code
• Python and Apache Spark
• Edit, debug, and explore

Job Execution (Deploy)
• Serverless execution
• Flexible scheduling
• Monitoring and alerting

Common use-cases

Load data warehouses

Build a data lake on Amazon S3

Example Architecture
[Diagram: AWS Glue crawlers, the AWS Glue Data Catalog, and AWS Glue ETL sit between an Amazon S3 archive bucket and a second Amazon S3 bucket; the analytics services Amazon Athena and Amazon QuickSight consume the data]

AWS Glue - data catalog
Make data discoverable

• Automatically discovers data and stores schema
• Catalog makes data searchable, and available for ETL
• Catalog contains table and job definitions
• Computes statistics to make queries efficient

[Diagram: the Glue Data Catalog discovers data and extracts schema from RDS, S3, and Redshift]

ETL example (cont'd)
Organize data in Apache Hive-style partitions

[Diagram: JSON input partitioned by year (2017), month (11, 12, …), day (27, 28, …), and hour goes through filter & transform and is written as CSV with the same partition layout]

Step 1: Run crawler
• 200+ fields
• Groups files into Apache Hive-style partitions

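To make the partition idea concrete, here is a small plain-Python sketch that groups object keys by the `name=value` path segments of a Hive-style layout. It is not how the crawler is implemented; the helper names (`partition_of`, `group_by_partition`) and the sample keys are hypothetical.

```python
from collections import defaultdict


def partition_of(key):
    """Return the (name, value) pairs encoded as name=value path segments."""
    pairs = []
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name:
            pairs.append((name, value))
    return tuple(pairs)


def group_by_partition(keys):
    """Group object keys that share the same partition values."""
    groups = defaultdict(list)
    for key in keys:
        groups[partition_of(key)].append(key)
    return dict(groups)


# Hypothetical keys following a year/month/day/hour layout.
keys = [
    "events/year=2017/month=11/day=27/hour=00/a.json",
    "events/year=2017/month=11/day=27/hour=00/b.json",
    "events/year=2017/month=11/day=28/hour=13/c.json",
]
groups = group_by_partition(keys)
for partition, members in groups.items():
    print(dict(partition), len(members))
```

Because the partition values live in the path, a query that filters on `year`, `month`, or `day` can skip whole prefixes instead of reading every file.
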
AWS Glue - ETL service
Make ETL scripting and deployment easy

• Serverless transformations based on Apache Spark
• Automatically generates ETL code
• Code is customizable with PySpark and Scala
• Endpoints provided to edit, debug, test code
• Jobs are scheduled or event-based

Step 2: Specify mappings

Anatomy of a generated script

1. Initialize job bookmark
2. Annotations for graphical DAG
3. Read Dynamic Frame from source
4. Data transformation + data cleaning functions
5. Write Dynamic Frame to sink
6. Commit job bookmark

Step 3: Edit + Test with Dev-Endpoints
• Connect your IDE to an AWS Glue development endpoint.
• Environment to interactively develop, debug, and test ETL code.

[Diagram: an interpreter server connected to a remote interpreter in the AWS Glue Spark environment]

Step 3: Explore and experiment with data

• Connect your notebook (e.g. Zeppelin) to an AWS Glue development endpoint.
• Interactively experiment and explore datasets and data sources

Deploy to production
• Push scripts to S3
• Create or register with ETL job

Step 4: Schedule a job

• Several event types
• Pass parameters

Serverless job execution
• No need to provision, configure, or manage servers
• Auto-configure VPC & role-based access
  • Security & isolation preserved
• Customers can specify job capacity (DPU)
• Automatically scale resources
• Only pay for the resources you consume
  • Per-second billing (10-minute min)

[Diagram: compute instances connected to customer VPCs]

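The billing rule above (per-second billing with a 10-minute minimum) can be sketched as simple arithmetic. The function name `billed_cost` is made up, and the price per DPU-hour is a placeholder passed in by the caller, not a quoted rate; check current pricing before relying on any number.

```python
import math


def billed_cost(dpus, runtime_seconds, price_per_dpu_hour, minimum_seconds=600):
    """Cost of one job run: whole seconds, with a floor of minimum_seconds."""
    billed_seconds = max(math.ceil(runtime_seconds), minimum_seconds)
    return dpus * (billed_seconds / 3600) * price_per_dpu_hour


# Placeholder rate, used only to show the shape of the calculation.
RATE = 0.44

short_job = billed_cost(dpus=10, runtime_seconds=120, price_per_dpu_hour=RATE)
long_job = billed_cost(dpus=10, runtime_seconds=1800, price_per_dpu_hour=RATE)
print(f"2-minute job billed as 10 minutes: {short_job:.4f}")
print(f"30-minute job: {long_job:.4f}")
```

Under this rule a 2-minute job costs the same as a 10-minute job, so very short jobs are worth batching together.
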
Lab - Architecture

[Diagram: same components as the example architecture: AWS Glue crawlers, the AWS Glue Data Catalog, and AWS Glue ETL between an Amazon S3 archive bucket and a second Amazon S3 bucket, with Amazon Athena and Amazon QuickSight on top]

WiFi Password

Wifi Name: AWSWORKSHOP


Password: AWSworkshop2018

Lab Environment

1. Open a new ‘Incognito / Private Browsing’ tab
2. Go to: https://unniklabs.awsapps.com/start
3. Log in with the username / password provided
4. Open your ‘Management Console’
>> Start the lab

Lab Guide

http://bit.ly/aws-innovate-2018-glue-demo
Under the hood ?

Public GitHub timeline is …

• Semi-structured
• 35+ event types
• Payload structure and size varies by event type

Apache Spark and AWS Glue ETL

What is Apache Spark?
• Parallel, scale-out data processing engine
• Fault-tolerance built-in
• Flexible interface: Python scripting, SQL
• Rich eco-system: ML, Graph, analytics, …

AWS Glue ETL libraries
• Integration: Data Catalog, job orchestration, code-generation, job bookmarks, S3, RDS
• ETL transforms, more connectors & formats
• New data structure: Dynamic Frames

[Diagram: SparkSQL Dataframes and AWS Glue ETL Dynamic Frames both sit on top of Spark core: RDDs]

Dataframes and Dynamic Frames

Dataframes
• Core data structure for SparkSQL
• Like structured tables
• Need schema up-front
• Each row has same structure
• Suited for SQL-like analytics

Dynamic Frames
• Like dataframes for ETL
• Designed for processing semi-structured data, e.g. JSON, Avro, Apache logs ...

Dynamic Frame internals

Dynamic Records
{"id":"2489", "type":"CreateEvent", "payload": {"creator":…}, …}
{"id":"6510", "type":"PushEvent", "payload": {"pusher":…}, …}
{"id":4391, "type":"PullEvent", "payload": {"assets":…}, …}

Dynamic Frame Schema
Schema per-record, no up-front schema needed
• Easy to restructure, tag, modify
• Can be more compact than dataframe rows
• Many flows can be done in single-pass

[Diagram: each record carries its own id and type fields; the Dynamic Frame schema lists id twice alongside type]

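The per-record schema idea can be illustrated without Glue. The sketch below computes a schema for each record and merges them, so a field that appears with more than one type (here `id`, a string in two records and a number in the third) shows up as a set of candidate types. This is a plain-Python stand-in, not the DynamicFrame implementation; the function names and the payload values are invented for the example.

```python
def record_schema(record):
    """Schema of a single record: field name -> Python type name."""
    return {field: type(value).__name__ for field, value in record.items()}


def merge_schemas(records):
    """Merge per-record schemas; a field seen with several types keeps them all."""
    merged = {}
    for record in records:
        for field, type_name in record_schema(record).items():
            merged.setdefault(field, set()).add(type_name)
    return merged


# Records shaped like the slide's example; payload contents are placeholders.
records = [
    {"id": "2489", "type": "CreateEvent", "payload": {"creator": "alice"}},
    {"id": "6510", "type": "PushEvent", "payload": {"pusher": "bob"}},
    {"id": 4391, "type": "PullEvent", "payload": {"assets": []}},
]

schema = merge_schemas(records)
ambiguous = sorted(field for field, types in schema.items() if len(types) > 1)
print(schema)
print("fields needing a type decision:", ambiguous)
```

Fields that end up with more than one type are the ones a later step has to resolve, for example by casting to one type or splitting into separate columns.
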
Dynamic Frame transforms

15+ transforms out of the box

• ResolveChoice(): project, cast, or separate into columns
• ApplyMapping()

[Diagram: ResolveChoice() applied to a column B with mixed types; ApplyMapping() rearranging fields A, C, X, Y]

Relationalize() transform

[Diagram: a semi-structured schema with fields A, B, B, a struct C {X, Y}, and an array D[ ] becomes a relational schema: a main table with columns A, B, B, C.X, C.Y, and a foreign key (FK), plus a child table with columns PK, Offset, and Value for the array elements]

• Transforms and adds new columns, types, and tables on-the-fly
• Tracks keys and foreign keys across runs
• SQL on the relational schema is orders of magnitude faster than JSON processing

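A minimal sketch of the flattening idea, in plain Python rather than with the Glue transform itself: nested structs become dotted columns, and each array is moved to a child table whose rows carry a key, an offset, and a value, while the parent row keeps the key as a foreign key. The function name, the `root` table name, and the `PK`/`Offset`/`Value` column names follow the slide's diagram and are assumptions, not the library's output format. Arrays of nested objects are stored as-is rather than flattened further.

```python
def relationalize(records, root="root"):
    """Flatten nested dicts into dotted columns; pivot lists into child tables."""
    tables = {root: []}
    next_key = 0
    for record in records:
        row = {}
        stack = [("", record)]
        while stack:
            prefix, obj = stack.pop()
            for name, value in obj.items():
                column = prefix + name
                if isinstance(value, dict):
                    # Struct: visit its fields later with a dotted prefix.
                    stack.append((column + ".", value))
                elif isinstance(value, list):
                    # Array: store a foreign key in the parent row and
                    # one child row per element.
                    next_key += 1
                    row[column] = next_key
                    child = tables.setdefault(f"{root}_{column}", [])
                    for offset, item in enumerate(value):
                        child.append({"PK": next_key, "Offset": offset, "Value": item})
                else:
                    row[column] = value
        tables[root].append(row)
    return tables


tables = relationalize([{"A": 1, "B": 2, "C": {"X": 3, "Y": 4}, "D": [10, 20]}])
for name, rows in tables.items():
    print(name, rows)
```

Joining `root.D` to `root_D.PK` recovers the original array, which is what lets plain SQL run over data that started out nested.
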
Useful AWS Glue transforms

• toDF(): Convert to a Dataframe
• Spigot(): Sample data of any Dynamic Frame to S3
• Unbox(): Parse string column as given format into Dynamic Frame
• Filter(), Map(): Apply Python UDFs to Dynamic Frames
• Join(): Join two Dynamic Frames
• And more…

Performance: AWS Glue ETL
GitHub Timeline ETL Performance

[Bar chart: time in seconds (lower is better), DynamicFrames vs. DataFrames; x-axis is data size in # files: Day (24), Month (744), Year (8699)]

Configuration
• 10 DPUs
• Apache Spark 2.1.1

Workload
• JSON to CSV
• Filter for Pull events

On average: 2x performance improvement

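For readers who want to see what the benchmark workload does, independent of Glue or Spark, here is a single-process sketch of "JSON to CSV, filter for Pull events" using only the standard library. The function name, the column list, and the sample events are made up for illustration; it says nothing about the performance figures above.

```python
import csv
import io
import json


def json_lines_to_csv(lines, event_type, columns):
    """Keep events of one type and write the chosen columns as CSV text."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for line in lines:
        event = json.loads(line)
        if event.get("type") == event_type:
            writer.writerow(event)
    return out.getvalue()


# Made-up events, one JSON document per line.
lines = [
    '{"id": "1", "type": "PushEvent", "actor": "alice", "extra": 1}',
    '{"id": "2", "type": "PullEvent", "actor": "bob"}',
]
csv_text = json_lines_to_csv(lines, "PullEvent", ["id", "type", "actor"])
print(csv_text)
```
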
Performance: Lots of small files
Lots of small files, e.g. Kinesis Firehose

[Chart: AWS Glue ETL small file scalability; time in seconds for Spark vs. Glue; x-axis is # partitions : # files, from 1:2K to 640:1280K. Spark runs out of memory at >= 320:640K files; Glue reaches 1.2 million files]

Vanilla Apache Spark (2.1.1) overheads
• Must reconstruct partitions (2-pass)
• Too many tasks: task per file
• Scheduling & memory overheads

AWS Glue Dynamic Frames
• Integration with Data Catalog
• Grouping: automatically group files per task
• Rely on crawler statistics

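The grouping idea (several small files per task instead of one task per file) can be sketched as a simple packing loop. This is only an illustration of the concept under an assumed size target; the function name `group_files` and the `target_bytes` parameter are hypothetical and do not describe how Glue chooses its groups.

```python
def group_files(files, target_bytes):
    """Pack (key, size) pairs, in order, into groups of roughly target_bytes."""
    groups, current, current_size = [], [], 0
    for key, size in files:
        # Start a new group once adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Ten hypothetical 1,000-byte files packed toward a 4,000-byte target.
files = [(f"part-{i:04d}.json", 1_000) for i in range(10)]
groups = group_files(files, target_bytes=4_000)
print([len(group) for group in groups])
```

With one task per group, ten files become three tasks here, which is the kind of reduction in scheduling overhead the slide describes.
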
Job bookmark example

[Diagram: an input table and an output table, both partitioned by year (2017), month (11, 12, …), day (27, 28, …), and hour; run 1 and run 2 each cover different partitions]

Periodically run a job:
• Avoid reprocessing previous input
• Avoid generating duplicate output

Job bookmarks
Bookmarks are per-job checkpoints that track the work done in previous runs. They persist the state of sources, transforms, and sinks on each run.

Example uses:
• Process githubarchive files daily
• Process Firehose files hourly
• Track timestamps or primary keys in DBs
• Track generated foreign keys for normalization

[Diagram: run 1, run 2, run 3]

Job bookmark options

Option    Behavior
Enable    Pick up from where you left off
Disable   Ignore and process the entire dataset every time
Pause     Temporarily disable advancing the bookmark

Examples:
• Enable: Process the newest githubarchive partition
• Disable: Process the entire githubarchive table
• Pause: Process the previous githubarchive partition

[Diagram: run 1, run 2, and run 3 on a timeline, annotated with enable, disable, and pause]

Job bookmark internals
Bookmark state

How do we avoid space blowup?
• Use timestamps to filter already processed input
• Example, run 3: process files created after T2 (the start of run 2)

But S3 is eventually consistent?
• Maintain an exclusion list of files created in the inconsistency window (size d) prior to start
• Example, run 3: process files created after T2 - d, skipping files on the exclusion list

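The timestamp filter and exclusion list described above can be sketched in a few lines. This is an illustrative model, not the Glue implementation: timestamps are plain numbers, `visible_files` stands for whatever the listing returns at run time, and the names `run_job`, `last_start`, and `excluded` are invented for the example.

```python
def run_job(visible_files, bookmark, run_start, window):
    """Pick files to process and return them with the updated bookmark.

    visible_files: dict of key -> creation time, as listed at run_start.
    bookmark: {"last_start": time of previous run, "excluded": set of keys}.
    window: size d of the assumed inconsistency window.
    """
    cutoff = bookmark["last_start"] - window
    to_process = sorted(
        key for key, created in visible_files.items()
        if created > cutoff and key not in bookmark["excluded"]
    )
    # Files already seen inside this run's own window must be skipped next
    # time, because the next run looks back to run_start - window.
    new_bookmark = {
        "last_start": run_start,
        "excluded": {k for k, created in visible_files.items() if created > run_start - window},
    }
    return to_process, new_bookmark


d = 5
bookmark = {"last_start": 50, "excluded": set()}  # state left by run 1

# Run 2 at T2 = 100: file "c" (created at 97) exists but is not listed yet.
processed2, bookmark = run_job({"a": 90, "b": 98}, bookmark, run_start=100, window=d)

# Run 3 at T3 = 200: "c" is now visible and "d" is new.
processed3, bookmark = run_job({"a": 90, "b": 98, "c": 97, "d": 150}, bookmark, run_start=200, window=d)

print(processed2, processed3)
```

Only files inside the window need to be remembered by name, which keeps the bookmark small while still catching late-appearing files such as `c`.
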
Training: Big Data on AWS
Make your data driven decisions count, and make a career in Big Data on AWS. Follow the Big Data Specialty learning path and become a specialist in Big Data:
• Implement core AWS Big Data services according to best practices
• Design and maintain Big Data
• Leverage tools to automate data analysis

Who should follow this learning path?
• Enterprise solutions architects
• Big Data solutions architects
• Data scientists
• Data analysts

Learning path:
• Free AWS digital training: Foundational knowledge
• Certified Cloud Practitioner
• Associate-level Certification
• Free AWS digital training: Big Data Technology Fundamentals
• Big Data on AWS – 3-day Classroom Training
• AWS Certified Big Data - Specialty

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.