
Research on AWS Glue

What is Glue?
A cloud-optimized Extract, Transform and Load (ETL) service. Glue differs from other ETL tools in
three ways.
(1) Glue is serverless
- No need to provision, configure, manage and maintain servers for the ETL
processes/jobs
(2) Glue provides automatic schema inference through crawlers
- Crawlers automatically discover all your data sets and file types and define the schema
of both structured and semi-structured data sets.
(3) Glue provides auto-generation of ETL scripts
- Glue does the heavy coding so developers can focus on customizations.

AWS Glue Main Components


Data Catalog (Discover)
- Helps you discover and understand the data sources you’re working with. It is directly
associated with crawlers, which store all the data sets, file types, schemas and
structures, including statistics, in the Data Catalog.
- It is also Hive Metastore compatible.
- Integrated with AWS Analytics Services.
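
As a rough sketch, the metadata a crawler puts in the Data Catalog can be inspected with the AWS SDK (boto3); the database name "sales_db" below is a hypothetical example, not something defined in this document.

    import boto3

    # Hypothetical catalog database that a crawler would have populated beforehand.
    glue = boto3.client("glue", region_name="us-east-1")
    response = glue.get_tables(DatabaseName="sales_db")

    for table in response["TableList"]:
        columns = [c["Name"] for c in table["StorageDescriptor"]["Columns"]]
        print(table["Name"], table["StorageDescriptor"]["Location"], columns)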

Job Authoring (Develop)


- It lets you get started quickly when developing the ETL flow. It generates the ETL
code for you if you point it to tables stored in the Data Catalog.
- The code it generates is Python code running on Apache Spark.
- There are also tools offered to edit, debug and explore the data you’re working with.
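
A minimal sketch of what a generated PySpark script typically looks like, assuming a catalog database "sales_db", a source table "orders_json" and an S3 output path (all hypothetical names):

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext.getOrCreate())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the source table that a crawler registered in the Data Catalog.
    source = glueContext.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders_json")

    # Map and cast source columns to the target layout.
    mapped = ApplyMapping.apply(frame=source, mappings=[
        ("order_id", "string", "order_id", "long"),
        ("amount", "string", "amount", "double"),
    ])

    # Write the result back to S3 as Parquet.
    glueContext.write_dynamic_frame.from_options(
        frame=mapped, connection_type="s3",
        connection_options={"path": "s3://example-bucket/curated/orders/"},
        format="parquet")

    job.commit()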

Job Execution (Deploy)


- It turns the ETL code into a job and runs it through serverless execution.
- Flexible scheduling.
- Monitoring and alerting.
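
For example, a registered job can be started and monitored from the AWS SDK; the job name "orders-etl" is a hypothetical placeholder.

    import boto3

    glue = boto3.client("glue")

    # Start a run of an already-registered job (serverless: no cluster to manage).
    run = glue.start_job_run(JobName="orders-etl")

    # Poll the run state for monitoring/alerting purposes.
    status = glue.get_job_run(JobName="orders-etl", RunId=run["JobRunId"])
    print(status["JobRun"]["JobRunState"])  # e.g. RUNNING, SUCCEEDED, FAILED
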
Where is it commonly used?
Loading Data Warehouses

Used as the main integration tool for loading structured (OLTP/relational database) and/or
semi-structured (Amazon S3/JSON) data into the data warehouse (Amazon
Redshift).
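
A hedged sketch of the Redshift load step inside a Glue job; the catalog names, the Glue connection "redshift-conn", the target table and the temp directory are all assumptions for illustration.

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    # Read a catalogued source (structured or semi-structured) ...
    orders = glueContext.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders_json")

    # ... and load it into Amazon Redshift through a pre-defined Glue connection.
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=orders,
        catalog_connection="redshift-conn",           # Glue connection to the cluster
        connection_options={"dbtable": "public.orders", "database": "dw"},
        redshift_tmp_dir="s3://example-bucket/tmp/")  # staging area for Redshift COPY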

Building Data Lake on Amazon S3

Glue crawlers crawl all the data, index the information and store it in the Data Catalog,
making the data available and ready for analysis by any of the analytics services Amazon
currently offers, including BI tools that work on top of those services.
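
A sketch of setting up such a crawler over an S3 prefix with the AWS SDK; the crawler name, IAM role, database and path are hypothetical.

    import boto3

    glue = boto3.client("glue")

    # Point a crawler at the raw S3 data and have it populate the Data Catalog.
    glue.create_crawler(
        Name="raw-orders-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
        DatabaseName="datalake_raw",
        Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/orders/"}]})

    glue.start_crawler(Name="raw-orders-crawler")
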
How to build an ETL Flow?
Crawl and Catalogue your data

The crawler automatically discovers the schema of your data source and its partitioning, and adds
additional fields for that partitioning (in this case, year, month and day).
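
Once the partition columns (year, month, day) are in the catalog, a job can read only the partitions it needs; the database, table and predicate below are illustrative assumptions.

    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    # Read only one day's partitions, using the year/month/day partition
    # columns the crawler added to the catalogued table.
    june_first = glueContext.create_dynamic_frame.from_catalog(
        database="sales_db",
        table_name="orders_json",
        push_down_predicate="year == '2019' and month == '06' and day == '01'")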

Specify mappings and Generate your scripts

This is where you can convert the data, cast columns into different data types, change the order
of the columns and map each source column to its target column.
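
As a sketch, these mappings end up as (source column, source type, target column, target type) tuples in the generated ApplyMapping step; the column names here are hypothetical.

    from awsglue.transforms import ApplyMapping
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())
    source = glueContext.create_dynamic_frame.from_catalog(
        database="sales_db", table_name="orders_json")

    # Each mapping is (source column, source type, target column, target type):
    # columns are renamed, cast and reordered; columns left out are dropped.
    mapped = ApplyMapping.apply(frame=source, mappings=[
        ("order_dt", "string", "order_date", "timestamp"),
        ("order_id", "string", "order_id", "long"),
        ("amount", "string", "amount", "double"),
    ])
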
Interactively Edit and Explore with Dev-Endpoints

A development endpoint is a Glue-based Spark environment that is constantly up and running, used to
develop, debug and test your ETL scripts and get answers back very quickly. It supports an
interface that many IDEs and notebooks already understand, so it can easily be attached to a
supported IDE that you have.

It can also be connected to a notebook (e.g. Zeppelin) to interactively explore and experiment
with your data.
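
A development endpoint can be provisioned through the AWS SDK as sketched below; the endpoint name, IAM role and node count are assumptions for illustration.

    import boto3

    glue = boto3.client("glue")

    # Provision a long-running Spark environment for interactive development.
    glue.create_dev_endpoint(
        EndpointName="etl-dev",
        RoleArn="arn:aws:iam::123456789012:role/GlueDevEndpointRole",
        NumberOfNodes=2)

    # Once READY, attach an IDE or a Zeppelin notebook to it.
    print(glue.get_dev_endpoint(EndpointName="etl-dev")["DevEndpoint"]["Status"])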

Schedule a Job for Running in Production

Once the ETL job is registered with the system, it can be triggered in several ways: on a schedule,
by an event, or by another trigger (for example, the completion of a previous job).
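
For example, a time-based trigger can be registered for the job with the AWS SDK; the trigger name, job name and cron expression are hypothetical.

    import boto3

    glue = boto3.client("glue")

    # Run the registered job every day at 03:00 UTC.
    glue.create_trigger(
        Name="orders-etl-nightly",
        Type="SCHEDULED",
        Schedule="cron(0 3 * * ? *)",
        Actions=[{"JobName": "orders-etl"}],
        StartOnCreation=True)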

Benefits
No server maintenance, cost savings by eliminating over-provisioning or under-provisioning of
resources, support for many data sources including easy integration with Oracle and MS SQL
Server, and AWS Lambda integration.
As an AWS product, it works well with other AWS services such as Amazon Athena, Amazon
Redshift Spectrum, and AWS Identity and Access Management.

Many of its automated and intelligent features (e.g. crawlers, auto-generation of ETL code) help
developers lay the foundation of the data sources and their structures, schemas, mappings and
the ETL flow, which gives developers more time to focus on customizations and on the
architecture of the whole ETL process, up to the analysis and reporting.
