Unni Pillai
Senior Solutions Architect – Big Data and Analytics
unni_k_pillai
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
What is a Data Lake?
“a single store for all of the raw data that anyone in an organization
might need to analyse” - Martin Fowler
“If you think of a datamart as a store of bottled water – cleansed and packaged and
structured for easy consumption – the data lake is a large body of water in a more natural
state. The contents of the data lake stream in from a source to fill the lake, and various
users of the lake can come to examine, dive in, or take samples.” - James Dixon
Characteristics of a Data Lake
Traditionally, Analytics Used to Look Like This
• Relational data
Business Intelligence
• TBs–PBs scale
Data Lakes Extend the Traditional Approach
• TBs–EBs scale
• Diverse analytical engines
• Schema-on-read, dynamic
[Diagram labels: DW queries, big data processing, interactive, real-time; catalog]
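Schema-on-read can be illustrated with a minimal sketch: raw records are stored untouched, and a schema is applied only when the data is read. The sample records, the field names, and the helper `read_with_schema` are hypothetical and are not taken from the slides.

```python
import json

# Raw data is stored as-is; no schema is enforced when it is written.
RAW = '{"user": "ann", "amount": "12.50", "extra": 1}\n{"user": "bob", "amount": "3"}\n'

def read_with_schema(raw_lines, schema):
    """Project each raw JSON line onto `schema` ({field: caster}) at read time."""
    for line in raw_lines.splitlines():
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }

# Two consumers read the same raw data with two different schemas.
billing_rows = list(read_with_schema(RAW, {"user": str, "amount": float}))
audit_rows = list(read_with_schema(RAW, {"user": str, "extra": int}))
print(billing_rows)
print(audit_rows)
```

Each consumer chooses its own projection and types, which is what lets diverse analytical engines share one copy of the raw data.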
Data Lakes Extend the Traditional Approach
A data lake is an architectural approach that allows you to store massive amounts of data in a central location, so it's readily available to be categorized, processed, analyzed, and consumed by diverse groups of users within an organization.
[Diagram: Data Warehouse and Data Lake. Sources: OLTP, ERP, CRM, LOB, Devices, Web, Sensors, Social. Workloads: Business Intelligence, Machine Learning, DW queries, big data processing, interactive, real-time. Catalog.]
Components of a Data Lake
Layers: Storage, Ingestion
Ingestion:
• Flexible ingestion mechanisms
Components of a Data Lake
Layers: Security, Process, Catalog & Search, Storage, Ingestion
Storage:
• High durability
• High scalability
• Stores raw data
• Support for any type of data
• Low cost
Components of a Data Lake
Components of a Data Lake
Layers: Security, Process, Catalog & Search, Storage, Ingestion
Process:
• Transformation
• Data
• Data Formats
• Analysis
• SQL
• Exploratory
• Machine Learning
Components of a Data Lake
Layers: Security, Process, Catalog & Search, Storage, Ingestion
Security:
• Encryption
• Authentication
• Authorisation
• Auditing
• Internal Users
• External Parties
Data Lake technology?
Hadoop seems perfect:
• Scalable
• Extensible
• Flexible
BUT…
[Figure: nine COMPUTE blocks and one STORAGE block]
Ingest
Data Movement From On-premises Datacenters
• Establish a dedicated 10G network connection from your premises to AWS; reduces your network costs, increases bandwidth throughput, and provides a more consistent network experience than Internet-based connections
• Petabyte- and exabyte-scale data transport solution that uses secure appliances to transfer large amounts of data into and out of the AWS cloud
• Migrate databases from the most widely used commercial and open-source offerings to AWS quickly and securely, with minimal downtime to applications
• Lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism and bandwidth management, along with a local cache
Data Movement From Real-time Sources
• AWS IoT Core: Supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely
• Amazon Kinesis Data Firehose: Capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools
• Amazon Kinesis Data Streams: Build custom, real-time applications that process data streams using popular stream processing frameworks
• Amazon Kinesis Video Streams: Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing
Integration with Kinesis Analytics for real-time IoT predictive maintenance.
Storage
Amazon S3 - Object Storage
• Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
• Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie
• Run analytics & ML on the data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by up to 400%
• Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
Amazon Glacier—Backup and Archive
• Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; automatically replicated to any other AWS region
• Three retrieval options to fit your use case; expedited retrievals with Glacier Select can return data in minutes
• Log and monitor with CloudTrail; Vault Lock enables WORM storage capabilities, helping satisfy compliance requirements
• Lowest-cost AWS object storage class, allowing you to archive large amounts of data at a very low cost
Data Driven Enterprise View – S3 is the Data Lake
S3 Data Lake
on AWS
Clickstream Text IoT Device Network OTT Customer Region Organization Orders BSS OSS Hadoop
Web Sentiment Edge xDR Media Relationship Shadow IT Catalog
- Common Enterprise KPIs and Metrics
- Democratization of Data – Data Driven
- Extensible to applications and visualization tools
- Democratization of AI and ML capabilities
Store Data in the Format You Want
Open and comprehensive
• Data is being generated in many formats.
Catalog
Data Preparation Accounts for ~80% of the Work
[Chart: share of work by task; recovered legend entries: Refining algorithms, Other]
Storing is Not Enough, Data Needs to Be Discoverable
“… direct monetizing).”
Gartner IT Glossary, 2018
https://www.gartner.com/it-glossary/dark-data
[Data sources: CRM, ERP, data warehouse, mainframe data, web, social, log files, machine data, semi-structured, unstructured]
Data Lake Challenges
Data variety and data volumes are increasing rapidly
• Ingest
• Discover
• Catalog
• Understand
• Curate
• … all kinds of data
Complex Orgs - Multiple Consumers and Applications
Quickly drive
new insights
AWS Glue - Data Catalog
Make Data Discoverable
• Discover data and extract schema
• Auto-generate customizable code in Python and Spark
• Catalog contains table and job definitions
• Computes statistics to make queries efficient
Compliance
AWS Glue—ETL Service
Make ETL scripting and deployment easy
• Serverless
Security
Level of Data Lake Security
AWS Provides Highest Levels of Security
CloudTrail flow:
1. Store data in S3
2. Account event occurs, generating API activity
3. CloudTrail captures and records the API activity
4. A log of API calls is delivered to an S3 bucket and optionally delivered to CloudWatch Logs and CloudWatch Events
• Increase visibility into your user and resource activity
• Enables governance, compliance, and operational and risk auditing (i.e. separation of concerns)
Questions / Comments ?
Dive Deep Into AWS Glue
What is AWS Glue?
Fully-managed, serverless
extract-transform-load (ETL) service
for developers, built by developers
There are many tools already in AWS Ecosystem
Fivetran
Still ETL Developers Hand-Code
• Schemas change
• Data formats change
• Sources are added or changed
• Data volume grows
All of this makes hand-coding error-prone & brittle.
AWS Glue Components
Common use-cases
Load data warehouses
Build a data lake on Amazon S3
Example Architecture
[Architecture diagram: Amazon S3 bucket, Amazon S3 archive bucket, AWS Glue ETL, and analytics services (Amazon Athena, Amazon QuickSight)]
AWS Glue - data catalog
Make data discoverable
• Automatically discovers data and stores schema
• Catalog makes data searchable and available for ETL
[Diagram: discover data and extract schema from RDS, S3, and Redshift into the Glue Data Catalog]
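To make "discovers data and stores schema" concrete, here is a toy sketch of schema inference over a small CSV sample. It is not how the AWS Glue crawler is implemented; the function names, the sample data, and the type names (`bigint`, `double`, `string`) are illustrative assumptions.

```python
import csv
import io

def infer_type(values):
    """Pick the narrowest type name that fits every sample value."""
    for caster, name in ((int, "bigint"), (float, "double")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue  # at least one value did not fit; try the next, wider type
    return "string"

def infer_schema(csv_text):
    """Return {column: inferred type} for a CSV sample with a header row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {column: infer_type([row[column] for row in rows]) for column in rows[0]}

sample = "id,price,city\n1,9.5,Oslo\n2,12,Lima\n"
schema = infer_schema(sample)
print(schema)
```

A catalog entry would then hold this column-to-type mapping alongside the data's location, so that query engines and ETL jobs do not have to rediscover it.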
ETL example (cont'd)
Organize data in Apache Hive-style partitions
[Diagram: JSON input partitioned by … / month (11, 12, …) / day (27, 28, …) / hour goes through filter & transform into CSV output partitioned by month (11, 12) / day (27, 28) / hour]
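A Hive-style partition is a path made of `key=value` segments. The sketch below builds and parses such a prefix from an event timestamp. The `year` level and both helper names are assumptions for illustration; the slide shows only the month, day, and hour levels.

```python
from datetime import datetime, timezone

def partition_prefix(ts):
    """Return a Hive-style partition prefix (key=value path segments)."""
    return f"year={ts.year}/month={ts.month:02d}/day={ts.day:02d}/hour={ts.hour:02d}"

def parse_partition_prefix(prefix):
    """Recover the partition keys and values from a Hive-style prefix."""
    return dict(segment.split("=", 1) for segment in prefix.strip("/").split("/"))

event_time = datetime(2017, 11, 27, 14, 5, tzinfo=timezone.utc)
prefix = partition_prefix(event_time)
print(prefix)
print(parse_partition_prefix(prefix))
```

Because the keys are encoded in the path, a query that filters on month or day can skip whole prefixes instead of scanning every object.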
Step 1: Run crawler
200+ fields
AWS Glue - ETL service
Make ETL scripting and deployment easy
Serverless Transformations
Based on Apache Spark
Step 2: Specify mappings
Anatomy of a generated script
Data transformation +
data cleaning functions
[Diagram labels: Interpreter Server, Remote Interpreter]
Step 3: Explore and experiment with data
Deploy to production
Push scripts to S3
Create or register with ETL job
Step 4: Schedule a job
Serverless job execution
Compute instances
No need to provision, configure, or
manage servers
Lab - Architecture
[Architecture diagram: Amazon S3 bucket, Amazon S3 archive bucket, AWS Glue ETL, Amazon Athena, Amazon QuickSight]
Lab Environment
Lab Guide
http://bit.ly/aws-innovate-2018-glue-demo
Under the hood ?
Public GitHub timeline is …
• semi-structured
• payload structure and size vary by event type
Apache Spark and AWS Glue ETL
Dataframes and Dynamic Frames
Dataframes
Core data structure for SparkSQL
Like structured tables
Need schema up-front
Each row has same structure
Suited for SQL-like analytics
Dynamic Frames
Like dataframes for ETL
Designed for processing semi-structured data,
e.g. JSON, Avro, Apache logs ...
Dynamic Frame internals
Dynamic Records
{"id": "2489", "type": "CreateEvent", "payload": {"creator": …}, …}
{"id": "6510", "type": "PushEvent", "payload": {"pusher": …}, …}
{"id": 4391, "type": "PullEvent", "payload": {"assets": …}, …}
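Because each dynamic record describes itself, a schema for the whole frame can be computed by merging the per-record field types. The sketch below does that in plain Python. It is an analogy rather than the DynamicFrame implementation, and the payload values (`"a"`, `"b"`, `"c"`) are invented placeholders for the elided content.

```python
from collections import defaultdict

# Payload values are hypothetical stand-ins; the slide elides them.
records = [
    {"id": "2489", "type": "CreateEvent", "payload": {"creator": "a"}},
    {"id": "6510", "type": "PushEvent", "payload": {"pusher": "b"}},
    {"id": 4391, "type": "PullEvent", "payload": {"assets": "c"}},
]

def field_types(record, prefix=""):
    """Yield (dotted_path, type_name) for every leaf field of one record."""
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from field_types(value, path + ".")
        else:
            yield path, type(value).__name__

def merged_schema(recs):
    """Union the per-record schemas; a field seen with two types keeps both."""
    schema = defaultdict(set)
    for rec in recs:
        for path, type_name in field_types(rec):
            schema[path].add(type_name)
    return {path: sorted(types) for path, types in schema.items()}

schema = merged_schema(records)
print(schema)
```

The `id` field ends up with two candidate types (`int` and `str`), which is the kind of ambiguity a choice-resolving transform has to settle.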
Dynamic Frame transforms
[Diagram: ResolveChoice() shown on column B; ApplyMapping() shown on column A and nested fields X, Y of column C]
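The two transforms named on this slide can be approximated on plain Python dictionaries: one settles a field that appears with more than one type, the other renames and casts fields. In AWS Glue these operate on DynamicFrames; the functions below are hypothetical stand-ins and are not the Glue API.

```python
def resolve_choice_cast(records, field, caster):
    """Cast one ambiguously typed field to a single type in every record."""
    return [
        {**rec, field: caster(rec[field])} if field in rec else dict(rec)
        for rec in records
    ]

def apply_mapping(records, mappings):
    """Keep only mapped fields; each mapping is (source, target, caster)."""
    return [
        {target: caster(rec[source]) for source, target, caster in mappings if source in rec}
        for rec in records
    ]

records = [
    {"id": "2489", "type": "CreateEvent"},  # id arrives as a string
    {"id": 4391, "type": "PullEvent"},      # id arrives as an integer
]
resolved = resolve_choice_cast(records, "id", int)
mapped = apply_mapping(resolved, [("id", "event_id", int), ("type", "event_type", str)])
print(mapped)
```

Casting is only one way to resolve a choice; keeping both variants in separate columns is another, and which is appropriate depends on the downstream target.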
Relationalize() transform
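[Diagram: a record with columns A, B, B, nested C {X, Y}, and array D[ ] becomes a root table (A, B, B, C.X, C.Y, FK) and a child table (PK, Offset, Value)]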
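The idea behind Relationalize() can be sketched on plain dictionaries: nested structs are flattened into dotted column names, and an array column is pivoted into a child table linked by a key. This handles only one level of nesting and one array field; the function and the `id`/`offset`/`value` column names are illustrative assumptions, not the Glue implementation.

```python
def relationalize(records, array_field):
    """Split records into a flat root table and a child table for one array field."""
    root, child = [], []
    for key, rec in enumerate(records):
        row = {}
        for name, value in rec.items():
            if name == array_field:
                row[name] = key  # foreign key pointing at the child rows
                for offset, item in enumerate(value):
                    child.append({"id": key, "offset": offset, "value": item})
            elif isinstance(value, dict):
                for sub_name, sub_value in value.items():
                    row[f"{name}.{sub_name}"] = sub_value  # flatten struct C into C.X, C.Y
            else:
                row[name] = value
        root.append(row)
    return root, child

records = [{"A": 1, "B": "x", "C": {"X": 10, "Y": 20}, "D": [7, 8]}]
root, child = relationalize(records, "D")
print(root)
print(child)
```

The two resulting tables can be loaded into a relational target and joined back on the key when the array values are needed.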
Performance: AWS Glue ETL
GitHub Timeline ETL Performance
[Bar chart: Time (sec), lower is better; DynamicFrames vs. DataFrames]
Configuration
• 10 DPUs
• Apache Spark 2.1.1
Workload
• JSON to CSV
• Filter for Pull events
Performance: Lots of small files
AWS Glue ETL small file scalability
Lots of small files, e.g. Kinesis Firehose
[Line chart: Time (sec) vs. # partitions : # files (1:2K, 20:40K, 40:80K, 80:160K, 160:320K, 320:640K, 640:1280K); series: Spark, Glue; Spark is Out-Of-Memory at >= 320:640K files]
Vanilla Apache Spark (2.1.1) overheads
• Must reconstruct partitions (2-pass)
• Too many tasks: task per file
AWS Glue Dynamic Frames
• Integration with Data Catalog
• Grouping: automatically group files per task
• Rely on crawler statistics
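Grouping files per task can be pictured as packing many small files into batches of roughly a target size, so that one task reads a batch rather than a single file. The greedy packer below is a hypothetical sketch; the file names, sizes, and `target_bytes` value are made up, and it does not describe how Glue chooses its groups.

```python
def group_files(files, target_bytes):
    """Greedily pack (name, size) pairs into groups of roughly target_bytes."""
    groups, current, current_size = [], [], 0
    for name, size in files:
        # Close the current group once adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups

files = [(f"part-{i:04d}.json", 40) for i in range(10)]  # ten small 40-byte files
groups = group_files(files, target_bytes=100)
print(len(groups), groups[0])
```

With one task per group instead of one per file, the task count falls from ten to five in this toy case; at hundreds of thousands of files the same effect is what keeps scheduling overhead in check.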
Job bookmark example
[Diagram: input and output partitioned by month (11, 12, …) / hour; run 1 and run 2 cover successive partitions]
• avoid reprocessing previous input
• avoid generating duplicate output
Job bookmarks
Bookmarks are per-job checkpoints that track the work done in previous runs.
They persist the state of sources, transforms, and sinks on each run.
Example uses:
• Process githubarchive files daily
• Process Firehose files hourly
• Track timestamps or primary keys in DBs
• Track generated foreign keys for normalization
Job bookmark options
Job bookmark internals
Bookmark state
How do we avoid space blowup?
• Use timestamps to filter already processed input
• Example run 3: process files created after T2
But S3 is eventually consistent?
• Maintain exclusion list of files created in inconsistency window (size d) prior to start
• Example run 3: process files created after T2 - d, excluding those on the exclusion list
[Diagram: timelines for run 2 and run 3, with the window from T2 - d to T2 marked as excluded]
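The timestamp filter and exclusion list described on this slide can be sketched as follows, under the slide's assumption that a file listing may lag by up to d seconds. The `Bookmark` class, `plan_run`, and all timestamps are hypothetical; this is not the Glue bookmark implementation.

```python
from dataclasses import dataclass

@dataclass
class Bookmark:
    last_start: float = float("-inf")  # start time of the previous run (T2 on the slide)
    exclusion: frozenset = frozenset()  # files already handled inside the window before it

def plan_run(listing, bookmark, now, window):
    """listing maps file name -> creation time; returns (files to process, new bookmark)."""
    cutoff = bookmark.last_start - window  # T2 - d
    to_process = sorted(
        name for name, created in listing.items()
        if created >= cutoff and name not in bookmark.exclusion
    )
    # Everything visible now and created within the last `window` seconds has been
    # handled by this run or an earlier one, so the next run must skip it.
    recent = frozenset(name for name, created in listing.items() if created >= now - window)
    return to_process, Bookmark(last_start=now, exclusion=recent)

# Run 2 starts at T2 = 100 with d = 10; file "c" (created at 97) is not yet visible.
run2, bookmark2 = plan_run({"a": 50, "b": 95}, Bookmark(), now=100, window=10)
# Run 3 starts at 200; "c" has appeared and "d" is new.
run3, bookmark3 = plan_run({"a": 50, "b": 95, "c": 97, "d": 150}, bookmark2, now=200, window=10)
print(run2, run3)
```

Only the start time and the short exclusion list are persisted between runs, rather than the full set of processed files, which is how the state avoids growing without bound.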
Training: Big Data on AWS
Make your data driven decisions count, and make a career in Big Data on AWS. Follow the Big Data Specialty learning path and become a specialist in Big Data:
• Implement core AWS Big Data services according to best practices
• Design and maintain Big Data
• Leverage tools to automate data analysis
• Data scientists
• Data analysts
AWS Certified Big Data - Specialty
Big Data on AWS – 3-day Classroom Training
Free AWS digital training: Big Data Technology Fundamentals
Free AWS digital training: Foundational knowledge