With AWS Glue you can run Apache Spark serverless. To get a grasp of basic usage, we will extract data from S3 and RDS and load it into Redshift with an ETL (Extract, Transform, and Load) job. AWS Glue became available in the Tokyo region on December 22, 2017. In addition, although this page uses Python, Scala is now also supported.
• ETL moves data from a Data Source to a Data Target. On this page, the Data Sources are S3 and RDS, and the Data Target is Redshift.
• ETL runs as a Job. A Job can be run periodically like cron, triggered by events from Lambda etc., or executed manually at an arbitrary time.
• An ETL Job is processed by AWS Glue without servers. Internally, Apache Spark is running, and the Job is a PySpark script written in Python.
• To define an ETL Job, select the Data Source from the Data Catalog. The Data Target is either created when the job is defined or selected from the Data Catalog.
• The Data Catalog is metadata created from information collected by a Crawler. It is a catalog of the data, not the actual data.
• The Data Catalog is managed as tables, and multiple tables are grouped into a database.
VPC
Note that in order to access RDS and Redshift from Glue as described below, the following VPC settings are required.
In the route table of the private subnet, set the default route to the NAT Gateway belonging to the public subnet.
S3
Create a new bucket my-bucket-20171124 and upload the following JSON file.
s3://my-bucket-20171124/s3.json
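The contents of the JSON file are not reproduced here; judging from the fields and values used in the later examples (pstr and pint), it is presumably a JSON Lines file along these lines (an assumed reconstruction, not the original file):

{"pstr": "aaa", "pint": 1}
{"pstr": "bbb", "pint": 2}
{"pstr": "ccc", "pint": 3}
{"pstr": "ddd", "pint": 4}
{"pstr": "eee", "pint": 5}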
RDS
Create a new MySQL DB and user as follows. We will add a security group configured to "allow access if the access source has the same security group attached" so that it can be accessed later from Glue within the VPC. Since there is no need to access it from the Internet, Publicly Accessible can be either No or Yes: host names such as mydw.xxxx.us-east-1.redshift.amazonaws.com and myrdsdb.xxxx.us-east-1.rds.amazonaws.com are resolved to private IPs within the VPC.
• DB name: myrdsdb
• User name: myuser
• Password: mypass
Redshift settings
Create the Redshift cluster that will be the ETL destination in the same VPC public subnet as Glue. In practice, it is safer to place Redshift in the private subnet and prepare a bastion server in the public subnet. Please refer to this page for a simple cheat sheet of psql commands.
psql -h mydw.xxxx.us-east-1.redshift.amazonaws.com -p 5439 mydw username
When deleted
We will add a security group configured to "allow access if the access source has the same security group attached" so that it can be accessed later from Glue within the VPC. Since there is no need to access it from the Internet, Publicly Accessible can be either No or Yes: host names such as mydw.xxxx.us-east-1.redshift.amazonaws.com and myrdsdb.xxxx.us-east-1.rds.amazonaws.com are resolved to private IPs within the VPC.
Create Database
Create Connections
Register the authentication information used for the JDBC connections. Specifying RDS or Redshift as the Connection type simplifies the JDBC settings somewhat, but here we set the Connection type to JDBC. After saving, you can check whether the connection works with the "Test connection" button.
RDS
Redshift
Create Crawlers
Run each Crawler manually and make sure its log is output to CloudWatch. It is not mandatory to create a separate Crawler per data store as we do here; similar data stores can be registered together in the same Crawler, in which case multiple tables are generated from one Crawler. Crawlers are not something you create in large numbers, one per table.
Create Job
• Name: myjob-20171124
• IAM role: my-glue-role-20171124
• This job runs: A proposed script generated by AWS Glue (the script is generated automatically; you can also provide your own script from scratch here)
• Script file name: myjob-20171124
• S3 path where the script is stored: s3://aws-glue-scripts-xxxx-us-west-2/admin (the script is generated on S3)
• Temporary directory: s3://aws-glue-scripts-xxxx-us-west-2/tmp (you need to specify an S3 folder because temporary files etc. are required for Job execution)
• Choose your data sources: mys3prefix_s3_json (we specify S3 here, but you can also add RDS as a Source by editing the script later)
• Choose your data targets: Create tables in your data target (the ETL destination is Redshift)
o Data store: JDBC
o Connection: my-redshift-connection
o Database name: mydw
• Map the source columns to target columns: This is a hint for automatic PySpark script
generation. Exclude unused Source information from Target.
• After creating the Job, add my-rds-connection to Required connections with Edit job.
Create an endpoint
Create an endpoint, effectively an EC2 instance that serves as a stepping stone to Apache Spark, to use from the development environment. You can create it from the endpoints screen of the Glue console. An endpoint in the READY state is charged, and it is not cheap, so to reduce cost create it with, for example, 2 DPUs; creation fails with 1 DPU.
• SSH to the endpoint with the private key corresponding to the registered public key and use the REPL shell
• SSH to the endpoint with the private key corresponding to the registered public key with port forwarding using the -L option, start Apache Zeppelin on the local PC and connect to the endpoint
• Build an EC2 instance that runs Apache Zeppelin with the CloudFormation template provided by the Glue service
The third option is convenient, but it costs money. The second option takes a bit of effort but reduces costs. The first option is useful for simple verification. You can copy the SSH command from the details screen of the created endpoint. Also, when using Docker with the second option, you need to add the -g option to the SSH port forwarding, because the host-side service is used from inside the container; when setting up Zeppelin, note that you need to specify the host's IP instead of localhost.
S3 Output example
When executing as ETL Job, delete %pyspark and add Job initialization and termination
processing as follows.
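The script itself is not reproduced here; as a rough sketch, a Glue ETL Job typically wraps the processing in the following skeleton (the actual generated script depends on your Job settings):

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Initialize the Glue job (replaces the %pyspark paragraph used on the endpoint)
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... ETL processing (e.g. reading the source data and writing to Redshift) goes here ...

# Terminate the job
job.commit()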
Select field
srcS3.select_fields(['pstr']).toDF().show()

+----+
|pstr|
+----+
| aaa|
| bbb|
| ccc|
| ddd|
| eee|
+----+
Exclude field
srcS3.drop_fields(['pstr']).toDF().show()

+----+
|pint|
+----+
|   1|
|   2|
|   3|
|   4|
|   5|
+----+
Change the name of the field
srcS3.rename_field('pstr', 'pstr2').toDF().show()

+----+-----+
|pint|pstr2|
+----+-----+
|   1|  aaa|
|   2|  bbb|
|   3|  ccc|
|   4|  ddd|
|   5|  eee|
+----+-----+
Type conversion
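The original example for this section is not shown; as a minimal sketch assuming the same srcS3 DynamicFrame as above, a field type can be converted with apply_mapping, for example casting pint from int to long:

# Assumed example: keep pstr as string and cast pint from int to long
srcS3.apply_mapping([
    ('pstr', 'string', 'pstr', 'string'),
    ('pint', 'int', 'pint', 'long')
]).toDF().printSchema()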
SplitFields
Returns a DynamicFrameCollection containing two DynamicFrames: one consisting only of the specified fields and one consisting of the remaining fields. A DynamicFrame can be extracted from the DynamicFrameCollection with select.
map
def my_f(dyr):
    dyr['pint'] = dyr['pint'] + 10
    return dyr

srcS3.map(my_f).toDF().show()

+----+----+
|pint|pstr|
+----+----+
|  11| aaa|
|  12| bbb|
|  13| ccc|
|  14| ddd|
|  15| eee|
+----+----+
unnest
dyf = srcS3.apply_mapping([
    ('pstr', 'string', 'proot.str', 'string'),
    ('pint', 'int', 'proot.int', 'int')
])
dyf.printSchema()
dyf2 = dyf.unnest()
dyf2.printSchema()

root
|-- proot: struct
|    |-- str: string
|    |-- int: int

root
|-- proot.str: string
|-- proot.int: int
collection / select
dfc = SplitFields.apply(srcS3, ['pstr'], 'split_off', 'remaining')
dfc.select('split_off').toDF().show()
dfc.select('remaining').toDF().show()

+----+
|pstr|
+----+
| aaa|
| bbb|
| ccc|
| ddd|
| eee|
+----+

+----+
|pint|
+----+
|   1|
|   2|
|   3|
|   4|
|   5|
+----+
collection / map
def my_f(dyf, ctx):
    df = dyf.toDF()
    return DynamicFrame.fromDF(df.union(df), glueContext, 'dyf')

dfc = SplitFields.apply(srcS3, ['pstr'])
dfc2 = dfc.map(my_f)
for dyf_name in dfc2.keys():
    dfc2.select(dyf_name).toDF().show()

+----+
|pstr|
+----+
| aaa|
| bbb|
| ccc|
| ddd|
| eee|
| aaa|
| bbb|
| ccc|
| ddd|
| eee|
+----+

+----+
|pint|
+----+
|   1|
|   2|
|   3|
|   4|
|   5|
|   1|
|   2|
|   3|
|   4|
|   5|
+----+
Special processing that is difficult to express in SQL can be handled by converting the data to Python objects using the following procedure. Please refer to the following pages about processing Python objects.
• About tuples
• List manipulation
• Loop processing
If the result of the processing is, for example, a list of tuples, it can be converted back to a DataFrame as follows.
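As a minimal sketch (assumed, not the original code): collect the DataFrame to the Driver as Python tuples, process them as ordinary Python objects, and convert the resulting list of tuples back to a DataFrame with createDataFrame.

# Collect the rows to the Driver as a list of Python tuples
rows = [(r['pstr'], r['pint']) for r in srcS3.toDF().collect()]
# Process them as ordinary Python objects (illustrative transformation)
processed = [(s.upper(), n * 10) for s, n in rows]
# Convert the list of tuples back to a DataFrame
df = spark.createDataFrame(processed, ['pstr', 'pint'])
df.show()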
Basically, Apache Spark performs all processing in memory. When Apache Spark operates in cluster mode, as in AWS Glue, there is one Driver and multiple Executors. The Driver executes the PySpark script. A Resilient Distributed Dataset (RDD), including the one backing a DataFrame, is divided and held in the memory of multiple Executors. The number of partitions a DataFrame is divided into across the Executors can be adjusted with repartition(). When an RDD is processed, the Executors divide up the task, so even if you print from inside an RDD task, the result cannot be confirmed on the Driver's standard output. The collect() used above gathers the RDD partitions that exist on the Executors to the Driver; therefore, when collecting data with collect(), beware of the Driver running out of memory if the data size is huge. In addition, the following error may occur.
Container killed by YARN for exceeding memory limits. Xxx GB of yyy GB physical memory
used. Consider boosting spark.yarn.executor.memoryOverhead.
When processing an RDD on the Executors, rather than sharing Python objects on the Driver with the Executors every time, it may be possible to solve the problem by doing all of the Python object processing on the Driver, converting the result completely to a DataFrame (RDD), and distributing it to the Executors. For example, avoid referring to Python objects on the Driver from inside udf() during RDD transformation processing. Also, if the RDD being handled is simply too large relative to the memory capacity of the Executors, you can cope with it by increasing the DPUs.
In order to generate a special RDD called a pair RDD from a DataFrame, use keyBy.
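For example (a sketch assuming the srcS3 data above), a pair RDD keyed by pstr can be created from the DataFrame's underlying RDD:

# keyBy turns each Row into a (key, Row) pair
pair_rdd = srcS3.toDF().rdd.keyBy(lambda row: row['pstr'])
print(pair_rdd.take(2))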
Even if you have not created catalog tables with a Crawler, you can read data directly from S3 if the schema information is known in advance. For JSON, it looks like this.
df = spark.read.json('s3://my-bucket-20171124/s3.json')
Add field
Similar processing can be described with select below, but you can know that you can add fields
withColumn . When dynamically setting the value of a field using a function, it becomes as
follows. For details on how to define functions in Python please refer to this page . udf() careful
not to refer to objects on Driver in udf() above.
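A minimal sketch of both patterns, assuming the srcS3 data above (the function name double_pint is illustrative, not from the original):

from pyspark.sql.functions import udf, lit
from pyspark.sql.types import IntegerType

# Assumed field-generating function; it does not refer to Driver-side objects
double_pint = udf(lambda n: n * 2, IntegerType())

df = srcS3.toDF()
df.withColumn('pconst', lit(123)) \
  .withColumn('pint2', double_pint(df['pint'])) \
  .show()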
select
df = srcS3.toDF()
df.select(df['pstr'],
          df['pstr'].substr(1, 3).alias('pstr2'),
          (df['pint'] % 2).alias('peo')).show()

+----+-----+---+
|pstr|pstr2|peo|
+----+-----+---+
| aaa|  aaa|  1|
| bbb|  bbb|  0|
| ccc|  ccc|  1|
| ddd|  ddd|  0|
| eee|  eee|  1|
+----+-----+---+
distinct, dropDuplicates
from pyspark.sql.functions import lit

df = srcS3.toDF().limit(2)
df2 = df.union(df).withColumn('pint2', lit(123))
df2.show()

+----+----+-----+
|pstr|pint|pint2|
+----+----+-----+
| aaa|   1|  123|
| bbb|   2|  123|
| aaa|   1|  123|
| bbb|   2|  123|
+----+----+-----+
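The calls themselves are not shown in the extracted example above; as a sketch, duplicates can then be removed like this (assumed usage):

# distinct removes fully identical rows; dropDuplicates can restrict the comparison to given columns
df2.distinct().show()
df2.dropDuplicates(['pstr', 'pint']).show()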
where
df = srcS3.toDF()
df.where(df['pint'] > 1).show()

+----+----+
|pstr|pint|
+----+----+
| bbb|   2|
| ccc|   3|
| ddd|   4|
| eee|   5|
+----+----+
count
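The example for this section is not reproduced; a minimal sketch assuming the srcS3 data above:

df = srcS3.toDF()
# With the sample data, the odd group (1) has 3 rows and the even group (0) has 2 rows
df.groupBy(df['pint'] % 2).count().show()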
sum
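Likewise, a minimal sketch for sum (assumed, not from the original):

df = srcS3.toDF()
# With the sample data, sum(pint) is 9 for the odd group and 6 for the even group
df.groupBy(df['pint'] % 2).sum('pint').show()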
min / max
df = srcS3.toDF()
df.groupBy(df['pint'] % 2).min('pint').show()

+----------+---------+
|(pint % 2)|min(pint)|
+----------+---------+
|         1|        1|
|         0|        2|
+----------+---------+

df.groupBy(df['pint'] % 2).max('pint').show()

+----------+---------+
|(pint % 2)|max(pint)|
+----------+---------+
|         1|        5|
|         0|        4|
+----------+---------+
avg
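The example for this section is not reproduced; a minimal sketch assuming the srcS3 data above:

df = srcS3.toDF()
# With the sample data, avg(pint) is 3.0 for both groups
df.groupBy(df['pint'] % 2).avg('pint').show()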
orderBy
from pyspark.sql.functions import desc

df = srcS3.toDF()
df.orderBy(desc('pstr'), 'pint').show()

+----+----+
|pstr|pint|
+----+----+
| eee|   5|
| ddd|   4|
| ccc|   3|
| bbb|   2|
| aaa|   1|
+----+----+
• "Conversion" that does not need to use all of the RDDs that are divided into cluster-wide
Executors like the DynamicFrame filter and the DataFrame where.
• An "action" that must use all of the RDDs that are divided into cluster-wide Executors,
such as DynamicFrame count and DataFrame collect
There are two. The operation result of "conversion" is RDD. "Conversion" will not be executed
until it is needed in "Action". This is called delay evaluation. Also, even if the same "conversion"
is repeatedly delayed every time it becomes necessary in "action".
When cache() is used, the result of the RDD that is lazily transformed for an action is saved in the Executors' memory.
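A minimal sketch of the usage (assumed; the original code block is not reproduced):

# df stands for the result of some possibly expensive transformation chain
df = srcS3.toDF().where('pint > 1')
# Persist the lazily computed result in the Executors' memory
df.cache()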
While the result of a complicated RDD transformation is persisted, even if an action that uses that RDD, such as the following, is executed again,
df.count()
the complicated RDD transformations are not executed again. If you execute the above action without caching, the complicated RDD transformations will be lazily executed again. Since this includes reading data from HDFS and is time-consuming, try to persist RDDs that are required by multiple actions.
src.xxx.xxx.xxx.xxx.count()
cache() is shorthand for persist(), with which you can specify the storageLevel in detail. When the persisted RDD becomes unnecessary, explicitly remove it from the Executors with unpersist().
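A minimal sketch (assumed): persist with an explicit storage level instead of cache(), then release it when it is no longer needed.

from pyspark import StorageLevel

# Equivalent to cache() except that the storage level is specified explicitly
df.persist(StorageLevel.MEMORY_AND_DISK)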
df.unpersist()
Visualization
When using Apache Zeppelin
df = srcS3.toDF()
df.createOrReplaceTempView('df')
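After registering the temporary view, the data can be queried from a %sql paragraph and displayed as a table or chart by Zeppelin (an assumed usage example, not shown in the original):

%sql
select * from df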
Python
pyspark
Scala
spark-shell