
Overview

AWS Glue lets you run Apache Spark serverlessly. To get a feel for the basic usage, we will try an ETL (Extract, Transform, and Load) flow that extracts data from S3 and RDS and loads it into Redshift. The service also became available in the Tokyo region on December 22, 2017. This page uses Python, but Scala is now supported as well.

AWS Glue ETL Schematic


The system diagram for using AWS Glue as an ETL service is as follows.

• ETL runs from a Data Source to a Data Target. On this page, the Data Sources are S3 and RDS and the Data Target is Redshift.
• ETL runs as a Job. A Job can be run periodically like cron, triggered by an event from Lambda etc., or executed manually at any time (a scripted trigger example follows this list).
• An ETL Job is processed inside AWS Glue without any servers to manage. Internally Apache Spark is running, and the Job itself is a PySpark script written in Python.
• To define an ETL Job, select a Data Source from the Data Catalog. The Data Target is either created when the Job is defined or selected from the Data Catalog.
• The Data Catalog is metadata created from information collected by a Crawler. It is a catalog of the data, not the data itself.
• The Data Catalog is managed as tables, and multiple tables are grouped into a database.
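As noted in the list above, a Job can also be started programmatically, for example from Lambda. A minimal sketch using boto3, assuming default credentials and that the job myjob-20171124 created later on this page already exists:

import boto3

glue = boto3.client('glue')

def handler(event, context):
    # Minimal Lambda-style handler: start the Glue ETL Job when an event arrives.
    run = glue.start_job_run(JobName='myjob-20171124')
    return {'JobRunId': run['JobRunId']}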

Creating an IAM role to be used by AWS Glue
In the IAM console, create a new role with Glue as the AWS service. In this example the role is named my-glue-role-20171124, but if you plan to work with restricted IAM permissions, choose a name that starts with the string "AWSGlueServiceRole". Attach the policies needed to operate Glue, S3, RDS, and Redshift (a scripted equivalent follows the list below).

• Role name: my-glue-role-20171124


• AWS service: Glue
• Policies
o AWSGlueServiceRole
o AmazonS3FullAccess
o AmazonRDSFullAccess
o AmazonRedshiftFullAccess
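A minimal boto3 sketch of the same role setup, assuming credentials that are allowed to manage IAM; the policy ARNs are the standard AWS-managed ones listed above:

import json
import boto3

iam = boto3.client('iam')

# Trust policy that lets the Glue service assume this role.
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'glue.amazonaws.com'},
        'Action': 'sts:AssumeRole',
    }],
}

iam.create_role(
    RoleName='my-glue-role-20171124',
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the managed policies listed above.
for arn in [
    'arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole',
    'arn:aws:iam::aws:policy/AmazonS3FullAccess',
    'arn:aws:iam::aws:policy/AmazonRDSFullAccess',
    'arn:aws:iam::aws:policy/AmazonRedshiftFullAccess',
]:
    iam.attach_role_policy(RoleName='my-glue-role-20171124', PolicyArn=arn)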

Data for verification is prepared in S3 and RDS
Create the S3 bucket and the RDS instance in the same region as Glue and store the verification data there. This time RDS is placed in a public subnet within the same VPC as Glue, but in practice it is safer to place RDS in a private subnet and prepare a bastion server in the public subnet. You could also create a VPC Endpoint so that access to S3 does not go through the Internet, but we do not set one up this time; you can add it whenever you need it.

VPC

Note that in order for Glue to access RDS and Redshift as described below, the following VPC settings are required.

• RDS and Redshift must be in the same VPC as Glue.
• enableDnsHostnames and enableDnsSupport must be true (see the boto3 sketch after this list).
• Glue does not have a public IP. Therefore, create a private subnet in the VPC for Glue to use, and set its route table so that a NAT Gateway is the default gateway. Note that the NAT Gateway itself must belong to a public subnet. This does not mean that Glue accesses RDS / Redshift via the Internet, so it does not matter whether the Publicly Accessible setting of RDS / Redshift is No or Yes. For simplicity, this time RDS / Redshift are placed in the public subnet with Publicly Accessible set to Yes.

In the route table of the private subnet, set the default gateway to the NAT Gateway that belongs to the public subnet.
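A minimal boto3 sketch of turning on the two DNS attributes mentioned in the list above; vpc-xxxxxxxx is a placeholder for your VPC ID:

import boto3

ec2 = boto3.client('ec2')

# Both attributes must be true so that RDS / Redshift endpoint names
# resolve to private IPs inside the VPC. The API accepts one attribute per call.
ec2.modify_vpc_attribute(VpcId='vpc-xxxxxxxx', EnableDnsSupport={'Value': True})
ec2.modify_vpc_attribute(VpcId='vpc-xxxxxxxx', EnableDnsHostnames={'Value': True})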
S3

Create a new bucket my-bucket-20171124 and upload the following JSON file.

s3://my-bucket-20171124/s3.json

{"pstr":"aaa","pint":1} {"pstr":"bbb","pint":2} {"pstr":"ccc","pint":3}


{"pstr":"ddd","pint":4} {"pstr":"eee","pint":5}

RDS

Create a new MySQL DB and user as follows. Attach a security group whose inbound rule is "allow access when the access source has this same security group" so that the database can later be reached from Glue inside the VPC. Access from the Internet is not needed, so Publicly Accessible can be either No or Yes. Host names such as mydw.xxxx.us-east-1.redshift.amazonaws.com and myrdsdb.xxxx.us-east-1.rds.amazonaws.com resolve to private IPs within the VPC.

• DB name: myrdsdb
• User name: myuser
• Password: mypass

Create the following table and records in myrdsdb.

GRANT ALL ON myrdsdb.* TO 'myuser'@'%' IDENTIFIED BY 'mypass';

CREATE TABLE myrdsdb.myrdstable (
  id INT PRIMARY KEY AUTO_INCREMENT,
  cstr VARCHAR(32),
  cint INT
);

INSERT INTO myrdsdb.myrdstable (cstr, cint)
VALUES ('aaa', -1), ('bbb', -2), ('ccc', -3), ('ddd', -4), ('eee', -5);

Redshift settings
Create the Redshift cluster that will be the ETL destination in the same public subnet of the same VPC as Glue. In practice it is safer to place Redshift in a private subnet and prepare a bastion server in the public subnet. Refer to this page for a simple psql command cheat sheet.

psql -h mydw.xxxx.us-east-1.redshift.amazonaws.com -p 5439 mydw username

Create a user and a table in the database mydw.

CREATE USER myuser WITH PASSWORD 'myPassword20171124';
CREATE TABLE mytable (cstr VARCHAR(32), cint INTEGER);

To delete them:

DROP USER myuser; DROP TABLE mytable;

Attach a security group whose inbound rule is "allow access when the access source has this same security group" so that Redshift can later be reached from Glue inside the VPC. Access from the Internet is not needed, so Publicly Accessible can be either No or Yes. Host names such as mydw.xxxx.us-east-1.redshift.amazonaws.com and myrdsdb.xxxx.us-east-1.rds.amazonaws.com resolve to private IPs within the VPC.

Set up Crawlers and create catalog tables


Set up Crawlers from the Glue console and create catalog tables for S3 and RDS respectively. Simply follow the GUI setup described on this official blog.

Create Database

Create a database in Glue to group the tables that hold the collected metadata.

• Database name: mygluedb

Create Connections

Register the authentication information used for the JDBC connections. Specifying RDS or Redshift as the Connection type simplifies the JDBC settings somewhat, but here we use Connection type JDBC. After saving, you can check that the connection works with the "Test connection" button. A scripted equivalent is sketched below.
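A minimal boto3 sketch of registering the RDS connection described in the next list; the subnet ID, security group ID, and availability zone are placeholders for the values in your VPC:

import boto3

glue = boto3.client('glue')

# Equivalent of registering "my-rds-connection" in the console.
glue.create_connection(ConnectionInput={
    'Name': 'my-rds-connection',
    'ConnectionType': 'JDBC',
    'ConnectionProperties': {
        'JDBC_CONNECTION_URL': 'jdbc:mysql://myrdsdb.xxxx.us-east-1.rds.amazonaws.com:3306/myrdsdb',
        'USERNAME': 'myuser',
        'PASSWORD': 'mypass',
    },
    'PhysicalConnectionRequirements': {
        'SubnetId': 'subnet-xxxxxxxx',           # the private subnet routed to the NAT Gateway
        'SecurityGroupIdList': ['sg-xxxxxxxx'],  # the self-referencing security group
        'AvailabilityZone': 'us-east-1a',        # assumption: the AZ of that subnet
    },
})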

RDS

• Connection name: my-rds-connection


• Connection type: JDBC
• JDBC URL: Set the endpoint that Glue accesses inside the VPC. A sample JDBC URL is here: jdbc:mysql://myrdsdb.xxxx.us-east-1.rds.amazonaws.com:3306/myrdsdb
• Username: myuser
• Password: mypass
• VPC: The VPC to which Glue belongs. Specify the same VPC as RDS.
• Subnet: The subnet to which Glue belongs. Specify one that can communicate with the subnet where RDS resides. Since Glue does not have a public IP as described above, select the private subnet routed to the NAT Gateway.
• Security groups: The security groups to set for Glue. Add the same group that was attached to RDS, the one that allows access when the access source has the same security group. Because Glue then shares that security group with RDS, access to RDS is allowed.

Redshift

• Connection name: my-redshift-connection


• Connection type: JDBC
• JDBC URL: Set the endpoint that Glue accesses inside the VPC. A sample JDBC URL is here: jdbc:redshift://mydw.xxxx.us-east-1.redshift.amazonaws.com:5439/mydw
• Username: myuser
• Password: myPassword20171124
• VPC: The VPC to which Glue belongs. Specify the same VPC as Redshift.
• Subnet: The subnet to which Glue belongs. Specify one that can communicate with the subnet where Redshift resides. Since Glue does not have a public IP as described above, select the private subnet routed to the NAT Gateway.
• Security groups: The security groups to set for Glue. Add the same group that was attached to Redshift, the one that allows access when the access source has the same security group. Because Glue then shares that security group with Redshift, access to Redshift is allowed.

Create Crawlers

Register a Crawler that catalogs the S3 data.

• Crawler name: my-s3-crawler-20171124


• Data store: S3
o Crawl data in: Specified path in my account
o Include path: s3://my-bucket-20171124/s3.json
• Choose an existing IAM role: my-glue-role-20171124
• Frequency: Run on demand
• Database: mygluedb
• Prefix added to tables: mys3prefix_

Register a Crawler that catalogs the RDS data.

• Crawler name: my-rds-crawler-20171124


• Data store: JDBC
o Connection: my-rds-connection
o Include path: myrdsdb/myrdstable
• Choose an existing IAM role: my-glue-role-20171124
• Frequency: Run on demand
• Database: mygluedb
• Prefix added to tables: myrdsprefix_

Run each Crawler manually and confirm that its log is output to CloudWatch (a scripted run is sketched below). It is not essential to create a separate Crawler for each data store as we did here; similar data stores can be registered together in a single Crawler, in which case multiple tables are generated from one Crawler. Crawlers are not something to create in large numbers, one per table.
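A minimal boto3 sketch of running one of the Crawlers from code instead of the console, assuming default credentials:

import boto3

glue = boto3.client('glue')

# Kick off the on-demand crawler and check its state.
glue.start_crawler(Name='my-s3-crawler-20171124')
print(glue.get_crawler(Name='my-s3-crawler-20171124')['Crawler']['State'])  # RUNNING / STOPPING / READY

# Once it finishes, the catalog table should be visible in mygluedb.
tables = glue.get_tables(DatabaseName='mygluedb')['TableList']
print([t['Name'] for t in tables])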

Create ETL Job


Register the Job from the Glue console. Essentially this sets the source and target of the ETL, and a PySpark script is generated automatically from those settings. A scripted scheduling example follows the list below.

• Name: myjob-20171124
• IAM role: my-glue-role-20171124
• This job runs: A proposed script generated by AWS Glue (the script is generated automatically; you can also write your own script from scratch here)
• Script file name: myjob-20171124
• S3 path where the script is stored: s3://aws-glue-scripts-xxxx-us-west-2/admin (the script is generated on S3)
• Temporary directory: s3://aws-glue-scripts-xxxx-us-west-2/tmp (an S3 folder must be specified because Job execution needs temporary files)
• Choose your data sources: mys3prefix_s3_json (we specify S3 here, but you can also use RDS as a source later by editing the script)
• Choose your data targets: Create tables in your data target (specify Redshift as the ETL destination)
o Data store: JDBC
o Connection: my-redshift-connection
o Database name: mydw
• Map the source columns to target columns: This is a hint for the automatic PySpark script generation. Exclude source fields that are not used in the target.
• After creating the Job, add my-rds-connection to Required connections with Edit job.
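A minimal boto3 sketch of scheduling and monitoring the job created above; the trigger name and cron expression are examples chosen for illustration:

import boto3

glue = boto3.client('glue')

# Register a cron-like schedule for the job (every day at 03:00 UTC).
glue.create_trigger(
    Name='myjob-trigger-20171124',   # hypothetical trigger name
    Type='SCHEDULED',
    Schedule='cron(0 3 * * ? *)',
    Actions=[{'JobName': 'myjob-20171124'}],
    StartOnCreation=True,
)

# List recent runs of the job to check how they finished.
for run in glue.get_job_runs(JobName='myjob-20171124')['JobRuns']:
    print('%s %s' % (run['Id'], run['JobRunState']))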

Construction of a PySpark development environment
The generated PySpark script can be viewed by opening the editor from the Job's "Edit script" button. You can edit and save it there directly, but running the Job each time you want to verify a change takes a while. It is more efficient to build the following development environment, verify the script thoroughly there, and only then run the Job.

Create an endpoint
As the endpoint used from the development environment, create an EC2 instance that acts as a gateway to Apache Spark. You can create it from the Dev endpoints screen of the Glue console. An endpoint in the READY state is charged, and it is not cheap, so keep the cost down by creating it with, for example, 2 DPUs. Creation fails with 1 DPU.

• Development endpoint name: myendpoint-20171124


• IAM role: my-glue-role-20171124 (the IAM role for Glue created at the top of this page)
• Data processing units (DPUs): 2
• Networking: "Skip networking information" is fine when only S3 is used via the Internet. To use RDS or Redshift, select an existing connection from "Choose a connection" so that the EC2 instance created for the endpoint belongs to the VPC.
• Public key contents: Register the public key used to SSH into the EC2 instance that will be created as the endpoint and act as the gateway to Apache Spark. Register the key you normally use, or create a new one here.

It takes a while for the endpoint to be created; a status check is sketched below.
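A minimal boto3 sketch of checking the endpoint while waiting, assuming default credentials:

import boto3

glue = boto3.client('glue')

# PROVISIONING while it is being created, READY once it can be used.
ep = glue.get_dev_endpoint(EndpointName='myendpoint-20171124')['DevEndpoint']
print(ep['Status'])
print(ep.get('PublicAddress') or ep.get('PrivateAddress'))  # address to SSH into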

Selection of PySpark development environment


There are three ways to build a development environment around the EC2 instance created as the endpoint, which acts as the gateway to Apache Spark.

• SSH into the endpoint with the private key corresponding to the registered public key and use the REPL shell.
• SSH into the endpoint with the private key corresponding to the registered public key and forward a port with the -L option, then start Apache Zeppelin on the local PC and connect it to the endpoint.
• Build an EC2 instance that runs Apache Zeppelin with the CloudFormation template provided by the Glue service.

The third option is convenient, but it costs money. The second takes a bit of effort but keeps costs down. The first is useful for simple verification. The SSH command can be copied from the details screen of the created endpoint. When using Docker for the second option, note that you need to add the -g option to the SSH port forwarding, because the service on the host side is used from inside the container, and that when configuring Zeppelin you must specify the host's IP rather than localhost.

Create an IAM role

The third option runs Apache Zeppelin on an EC2 instance. To launch it with CloudFormation, you first need to create an IAM role for that EC2 instance. When using S3, RDS, and Redshift, it looks like this.

• Role name: my-ec2-glue-notebook-role-20171124


• AWS service: EC2
• Policies
o AWSGlueServiceNotebookRole
o AmazonS3FullAccess
o AmazonRDSFullAccess
o AmazonRedshiftFullAccess

Edit PySpark script and run ETL Job


Edit the PySpark script in the development environment you built. Once verification is complete, run it as the ETL Job. Editing examples are given below. Note that while working in Zeppelin, the paragraph must start with %pyspark.

Example of acquiring and displaying information from the Data Source
%pyspark
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import udf
from pyspark.sql.functions import desc

# Objects for operating AWS Glue
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Get the data from S3
srcS3 = glueContext.create_dynamic_frame.from_catalog(
    database = 'mygluedb',
    table_name = 'mys3prefix_s3_json')

# Display information
print 'Count:', srcS3.count()
srcS3.printSchema()

# Write the data retrieved from S3 back to S3 in CSV format, without any transformation
glueContext.write_dynamic_frame.from_options(
    frame = srcS3,
    connection_type = 's3',
    connection_options = { 'path': 's3://my-bucket-20171124/target' },
    format = 'csv')

S3 Output example

$ aws s3 ls s3://my-bucket-20171124/target/
2017-11-26 15:59:44         40 run-1511679582568-part-r-00000

$ aws s3 cp s3://my-bucket-20171124/target/run-1511679582568-part-r-00000 -
pstr,pint
aaa,1
bbb,2
ccc,3
ddd,4
eee,5

Standard output example

Count: 5
root
|-- pstr: string
|-- pint: int

When executing as ETL Job, delete %pyspark and add Job initialization and termination
processing as follows.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import udf
from pyspark.sql.functions import desc

# Objects for operating AWS Glue
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Initialize the Job  <- added
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Get the data from S3
srcS3 = glueContext.create_dynamic_frame.from_catalog(
    database = 'mygluedb',
    table_name = 'mys3prefix_s3_json')

# Display information
print 'Count:', srcS3.count()
srcS3.printSchema()

# Write the data retrieved from S3 back to S3 in CSV format, without any transformation
glueContext.write_dynamic_frame.from_options(
    frame = srcS3,
    connection_type = 's3',
    connection_options = { 'path': 's3://my-bucket-20171124/target' },
    format = 'csv')

# End the Job  <- added
job.commit()

S3 Output example

$ aws s3 ls s3://my-bucket-20171124/target/
2017-11-26 15:59:44         40 run-1511679582568-part-r-00000

$ aws s3 cp s3://my-bucket-20171124/target/run-1511679582568-part-r-00000 -
pstr,pint
aaa,1
bbb,2
ccc,3
ddd,4
eee,5

Some examples of CloudWatch output

LogType:stdout
Log Upload Time:Sun Nov 26 06:59:44 +0000 2017
LogLength:46
Log Contents:
Count: 5
root
|-- pstr: string
|-- pint: int
End of LogType:stdout

PySpark cheat sheet


An AWS Glue DynamicFrame and an Apache Spark DataFrame can be converted into each other with toDF() and fromDF(). When some processing cannot be expressed with the DynamicFrame that AWS Glue provides, convert the data into an Apache Spark DataFrame for the steps between "casting the data obtained from the data source and dropping unneeded fields" and "converting to the final structure and writing the output", and describe the processing there. When digging through the DataFrame documentation, it also helps to know Python's dir and help functions; a small example follows.
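For example, a quick way to explore the available methods and docstrings interactively, assuming the srcS3 DynamicFrame created earlier on this page:

# List the methods available on the DynamicFrame and on the converted DataFrame.
print dir(srcS3)
print dir(srcS3.toDF())

# Show the documentation of a specific method.
help(srcS3.toDF().where)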

DynamicFrame / formatting input data

filter
srcS3.filter(lambda r: r['pint'] > 1).toDF().show()

+----+----+
|pint|pstr|
+----+----+
|   2| bbb|
|   3| ccc|
|   4| ddd|
|   5| eee|
+----+----+

Select field
srcS3.select_fields(['pstr']).toDF().show()

+----+
|pstr|
+----+
| aaa|
| bbb|
| ccc|
| ddd|
| eee|
+----+

Exclude field
srcS3.drop_fields(['pstr']).toDF().show()

+----+
|pint|
+----+
|   1|
|   2|
|   3|
|   4|
|   5|
+----+
Change the name of the field
srcS3.rename_field('pstr', 'pstr2').toDF().show()

+----+-----+
|pint|pstr2|
+----+-----+
|   1|  aaa|
|   2|  bbb|
|   3|  ccc|
|   4|  ddd|
|   5|  eee|
+----+-----+

Type conversion

The types available for casting are described here.

srcS3.resolveChoice(specs = [('pstr', 'cast:int')]).toDF().show()

+----+----+
|pstr|pint|
+----+----+
|null|   1|
|null|   2|
|null|   3|
|null|   4|
|null|   5|
+----+----+

DynamicFrame / Data conversion

Join
Join.apply(srcS3, srcS3, 'pstr', 'pstr').toDF().show()

+----+----+-----+-----+
|pint|pstr|.pint|.pstr|
+----+----+-----+-----+
|   2| bbb|    2|  bbb|
|   4| ddd|    4|  ddd|
|   5| eee|    5|  eee|
|   3| ccc|    3|  ccc|
|   1| aaa|    1|  aaa|
+----+----+-----+-----+

It can also be described as follows.

srcS3.join('pstr', 'pstr', srcS3).toDF().show()

+----+----+-----+-----+
|pint|pstr|.pint|.pstr|
+----+----+-----+-----+
|   2| bbb|    2|  bbb|
|   4| ddd|    4|  ddd|
|   5| eee|    5|  eee|
|   3| ccc|    3|  ccc|
|   1| aaa|    1|  aaa|
+----+----+-----+-----+

SplitFields

Returns a DynamicFrameCollection containing two frames: one consisting only of the specified fields and one consisting of the remaining fields. A DynamicFrame can be extracted from the DynamicFrameCollection with select.

dfc = SplitFields.apply(srcS3, ['pstr'])
for dyf_name in dfc.keys():
    dfc.select(dyf_name).toDF().show()

+----+
|pstr|
+----+
| aaa|
| bbb|
| ccc|
| ddd|
| eee|
+----+

+----+
|pint|
+----+
|   1|
|   2|
|   3|
|   4|
|   5|
+----+

map
def my_f(dyr):
    dyr['pint'] = dyr['pint'] + 10
    return dyr

srcS3.map(my_f).toDF().show()

+----+----+
|pint|pstr|
+----+----+
|  11| aaa|
|  12| bbb|
|  13| ccc|
|  14| ddd|
|  15| eee|
+----+----+

unnest
dyf = srcS3.apply_mapping([
    ('pstr', 'string', 'proot.str', 'string'),
    ('pint', 'int', 'proot.int', 'int')
])
dyf.printSchema()

dyf2 = dyf.unnest()
dyf2.printSchema()

root
|-- proot: struct
|    |-- str: string
|    |-- int: int

root
|-- proot.str: string
|-- proot.int: int

collection / select
dfc = SplitFields.apply(srcS3, ['pstr'], 'split_off', 'remaining')
dfc.select('split_off').toDF().show()
dfc.select('remaining').toDF().show()

+----+
|pstr|
+----+
| aaa|
| bbb|
| ccc|
| ddd|
| eee|
+----+

+----+
|pint|
+----+
|   1|
|   2|
|   3|
|   4|
|   5|
+----+

collection / map
def my_f(dyf, ctx):
    df = dyf.toDF()
    return DynamicFrame.fromDF(df.union(df), glueContext, 'dyf')

dfc = SplitFields.apply(srcS3, ['pstr'])
dfc2 = dfc.map(my_f)
for dyf_name in dfc2.keys():
    dfc2.select(dyf_name).toDF().show()

+----+
|pstr|
+----+
| aaa|
| bbb|
| ccc|
| ddd|
| eee|
| aaa|
| bbb|
| ccc|
| ddd|
| eee|
+----+

+----+
|pint|
+----+
|   1|
|   2|
|   3|
|   4|
|   5|
|   1|
|   2|
|   3|
|   4|
|   5|
+----+

DynamicFrame / formatting and outputting post-conversion data
Converting to and from a DynamicFrame, changing the frame structure

# Convert back and forth
df = srcS3.toDF()
dyf = DynamicFrame.fromDF(df, glueContext, 'dyf')
dyf.printSchema()

# Change the structure
dyf2 = dyf.apply_mapping([
    ('pstr', 'string', 'proot.str', 'string'),
    ('pint', 'int', 'proot.int', 'int')
])
dyf2.printSchema()

root
|-- pstr: string
|-- pint: int

root
|-- proot: struct
|    |-- str: string
|    |-- int: int

Split into multiple frames that can be stored in RDB

A diagram of how the frame is split is here.

from pyspark.sql.functions import array

# Add a list-type field for verification.
df = srcS3.toDF()
df2 = df.withColumn('plist', array(df.pstr, df.pint))
df2.show()
dyf = DynamicFrame.fromDF(df2, glueContext, 'dyf')

# Split into multiple frames linked by relational keys.
dfc = dyf.relationalize('my_dyf_root', 's3://my-bucket-20171124/tmp')
for dyf_name in dfc.keys():
    print dyf_name
    dfc.select(dyf_name).toDF().show()

+----+----+--------+
|pstr|pint|   plist|
+----+----+--------+
| aaa|   1|[aaa, 1]|
| bbb|   2|[bbb, 2]|
| ccc|   3|[ccc, 3]|
| ddd|   4|[ddd, 4]|
| eee|   5|[eee, 5]|
+----+----+--------+

my_dyf_root
+----+----+-----+
|pstr|pint|plist|
+----+----+-----+
| aaa|   1|    1|
| bbb|   2|    2|
| ccc|   3|    3|
| ddd|   4|    4|
| eee|   5|    5|
+----+----+-----+

my_dyf_root_plist
+---+-----+---------+
| id|index|plist.val|
+---+-----+---------+
|  1|    0|      aaa|
|  1|    1|        1|
|  2|    0|      bbb|
|  2|    1|        2|
|  3|    0|      ccc|
|  3|    1|        3|
|  4|    0|      ddd|
|  4|    1|        4|
|  5|    0|      eee|
|  5|    1|        5|
+---+-----+---------+

DataFrame / data conversion

Write direct SQL

You can write SQL directly instead of chaining methods.

df = srcS3.toDF()
df.createOrReplaceTempView('temptable')
sql_df = spark.sql('SELECT * FROM temptable')
sql_df.show()
print spark.sql('SELECT * FROM temptable LIMIT 1').first().pstr

+----+----+
|pstr|pint|
+----+----+
| aaa|   1|
| bbb|   2|
| ccc|   3|
| ddd|   4|
| eee|   5|
+----+----+

aaa

Convert to Python object

Special processing that is hard to express in SQL can be handled by converting the data to Python objects with the following procedure. For how to work with Python objects, refer to the following pages.

• About tuples
• List manipulation
• Loop processing

Convert to list of tuples

df = srcS3.toDF()
tuples = df.rdd.map(lambda row: (row.pstr, row.pint)).collect()
print tuples

[(u'aaa', 1), (u'bbb', 2), (u'ccc', 3), (u'ddd', 4), (u'eee', 5)]

If the result of the processing is, for example, a list of tuples, it can be converted back to a DataFrame as follows.

from pyspark.sql import Row

spark.createDataFrame(map(lambda tup: Row(pstr2=tup[0], pint2=tup[1]), tuples)).show()

+-----+-----+
|pint2|pstr2|
+-----+-----+
|    1|  aaa|
|    2|  bbb|
|    3|  ccc|
|    4|  ddd|
|    5|  eee|
+-----+-----+

Apache Spark basically performs all processing in memory. When Apache Spark runs in cluster mode, as it does in AWS Glue, there is one Driver and multiple Executors. The Driver executes the PySpark script. A Resilient Distributed Dataset (RDD), which includes DataFrames, is partitioned and held in the memory of the Executors. The number of partitions a DataFrame is split into across Executors can be adjusted with repartition(). When an RDD is processed, the Executors share the tasks, so even if you print from within an RDD operation, the result cannot be seen on the Driver's standard output. The collect() used above gathers the RDD partitions spread across the Executors onto the Driver, so be careful about exhausting the Driver's memory when the collected data is huge. The following error may also occur.

Container killed by YARN for exceeding memory limits. Xxx GB of yyy GB physical memory
used. Consider boosting spark.yarn.executor.memoryOverhead.

For RDD processing on the Executors, do not repeatedly ship Python objects that live on the Driver to the Executors; the issue can often be resolved by doing all Python-object processing on the Driver first, converting the result completely into a DataFrame (RDD), and distributing that to the Executors. For example, avoid referencing Python objects on the Driver from inside udf() during RDD transformations. If the RDD being handled is simply too large for the Executors' memory capacity, increasing the DPU count can help. A small sketch of checking and adjusting the partition count follows.
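A minimal sketch of inspecting and adjusting partitioning, assuming the srcS3 DynamicFrame from earlier:

df = srcS3.toDF()

# How many partitions the DataFrame is currently split into across Executors.
print df.rdd.getNumPartitions()

# Repartition so the data is spread over a different number of Executor tasks.
df10 = df.repartition(10)
print df10.rdd.getNumPartitions()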

Create an empty DataFrame

Use emptyRDD, StructType, StructField.

from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import LongType

df = spark.createDataFrame(sc.emptyRDD(), StructType([
    StructField("pint", LongType(), False)
]))

Generate pair RDD

To generate a special kind of RDD called a pair RDD from a DataFrame, use keyBy.

rdd = df.rdd.keyBy(lambda row: row.pint)

This makes processing such as subtractByKey possible.

rdd2 = df2.rdd.keyBy(lambda row: row.pint)
df3 = spark.createDataFrame(rdd.subtractByKey(rdd2).values())

Read directly from S3

Even if you have not created catalog tables with a Crawler, you can read data directly from S3 when the schema is known in advance. For JSON, it looks like this.

df = spark.read.json('s3://my-bucket-20171124/s3.json')

Add field

Similar processing can be written with select, described below, but note that fields can also be added with withColumn. Setting the value of a field dynamically with a function looks like this. For details on defining functions in Python, refer to this page. As noted above, be careful not to reference objects on the Driver inside udf().

from pyspark.sql.types import IntegerType

def my_f(x):
    return x * 2

df = srcS3.toDF()
df.withColumn('pint2', udf(my_f, IntegerType())(df['pint'])).show()

+----+----+-----+
|pstr|pint|pint2|
+----+----+-----+
| aaa|   1|    2|
| bbb|   2|    4|
| ccc|   3|    6|
| ddd|   4|    8|
| eee|   5|   10|
+----+----+-----+

To add a constant field, use lit.

from pyspark.sql.functions import lit

df = srcS3.toDF()
df.withColumn('pint2', lit(123)).show()

+----+----+-----+
|pstr|pint|pint2|
+----+----+-----+
| aaa|   1|  123|
| bbb|   2|  123|
| ccc|   3|  123|
| ddd|   4|  123|
| eee|   5|  123|
+----+----+-----+

select
df = srcS3.toDF()
df.select(df['pstr'], df['pstr'].substr(1,3).alias('pstr2'), (df['pint'] % 2).alias('peo')).show()

+----+-----+---+
|pstr|pstr2|peo|
+----+-----+---+
| aaa|  aaa|  1|
| bbb|  bbb|  0|
| ccc|  ccc|  1|
| ddd|  ddd|  0|
| eee|  eee|  1|
+----+-----+---+

distinct, dropDuplicates
from pyspark.sql.functions import lit

df = srcS3.toDF().limit(2)
df2 = df.union(df).withColumn('pint2', lit(123))
df2.show()

+----+----+-----+
|pstr|pint|pint2|
+----+----+-----+
| aaa|   1|  123|
| bbb|   2|  123|
| aaa|   1|  123|
| bbb|   2|  123|
+----+----+-----+

To eliminate duplicates across all fields, use distinct.

df2.distinct().show()

+----+----+-----+
|pstr|pint|pint2|
+----+----+-----+
| bbb|   2|  123|
| aaa|   1|  123|
+----+----+-----+

To eliminate duplicates based on specific fields, use dropDuplicates.

df2.dropDuplicates(['pint2']).show()

+----+----+-----+
|pstr|pint|pint2|
+----+----+-----+
| aaa|   1|  123|
+----+----+-----+

where
df = srcS3.toDF()
df.where(df['pint'] > 1).show()

+----+----+
|pstr|pint|
+----+----+
| bbb|   2|
| ccc|   3|
| ddd|   4|
| eee|   5|
+----+----+

groupBy / aggregate function

count

df = srcS3.toDF()
df.groupBy(df['pint'] % 2).count().show()

+----------+-----+
|(pint % 2)|count|
+----------+-----+
|         1|    3|
|         0|    2|
+----------+-----+

df.groupBy(df['pstr'], (df['pint'] % 2).alias('peo')).count().show()

+----+---+-----+
|pstr|peo|count|
+----+---+-----+
| ccc|  1|    1|
| eee|  1|    1|
| ddd|  0|    1|
| aaa|  1|    1|
| bbb|  0|    1|
+----+---+-----+

sum

df = srcS3.toDF()
df.groupBy(df['pint'] % 2).sum('pint').show()

+----------+---------+
|(pint % 2)|sum(pint)|
+----------+---------+
|         1|        9|
|         0|        6|
+----------+---------+

min / max
df = srcS3.toDF()
df.groupBy(df['pint'] % 2).min('pint').show()

+----------+---------+
|(pint % 2)|min(pint)|
+----------+---------+
|         1|        1|
|         0|        2|
+----------+---------+

df.groupBy(df['pint'] % 2).max('pint').show()

+----------+---------+
|(pint % 2)|max(pint)|
+----------+---------+
|         1|        5|
|         0|        4|
+----------+---------+

avg

df = srcS3.toDF()
df.groupBy(df['pint'] % 2).avg('pint').show()

+----------+---------+
|(pint % 2)|avg(pint)|
+----------+---------+
|         1|      3.0|
|         0|      3.0|
+----------+---------+

orderBy
df = srcS3.toDF()
df.orderBy(desc('pstr'), 'pint').show()

+----+----+
|pstr|pint|
+----+----+
| eee|   5|
| ddd|   4|
| ccc|   3|
| bbb|   2|
| aaa|   1|
+----+----+

About RDD Persistence


Operations on an RDD fall into two kinds:

• "Transformations", which do not need to touch the whole RDD partitioned across the cluster's Executors, such as DynamicFrame's filter and DataFrame's where.
• "Actions", which must use the whole RDD partitioned across the cluster's Executors, such as DynamicFrame's count and DataFrame's collect.

The result of a transformation is another RDD. A transformation is not executed until an action needs it; this is called lazy evaluation. Moreover, the same transformation is lazily re-executed every time an action needs it again, as the short example below illustrates.
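A minimal illustration of lazy evaluation, assuming the srcS3 DynamicFrame from earlier:

df = srcS3.toDF()

# Transformation: nothing runs yet, only the execution plan is recorded.
filtered = df.where(df['pint'] > 1)

# Action: the filter actually runs here, on the Executors.
print filtered.count()

# Another action re-runs the same transformation unless the result is cached.
print filtered.count()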

Accelerated by RDD persistence

When cache() is used, the result of the lazily evaluated RDD transformations is saved in the Executors' memory at the point an action triggers their execution.

df = src.xxx.xxx.xxx.xxx   # complex chain of RDD transformations
df.cache()                  # persistence happens at the next action
df.count()                  # the RDD is persisted here

While the result of the complicated chain of RDD transformations is persisted, even if an action that uses the RDD is executed again, as in

df.count()

the complex transformations are not executed again. If you run such an action without caching, the complicated chain of transformations is lazily evaluated again, as below. Because this includes time-consuming work such as reading the data from HDFS, persist RDDs that are needed by multiple actions.

src.xxx.xxx.xxx.xxx.count()

cache() is shorthand for persist(), which lets you specify the storageLevel in detail. When a persisted RDD is no longer needed, remove it from the Executors explicitly with unpersist().

df.unpersist()
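Putting it together, a minimal sketch of persist() with an explicit storage level; MEMORY_AND_DISK is just one example level:

from pyspark import StorageLevel

df = srcS3.toDF()

# Keep partitions in memory and spill to disk when memory runs short.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()        # the persistence takes effect at this action

df.unpersist()    # release the Executors' memory when done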

Visualization
When using Apache Zeppelin

df = srcS3.toDF()
df.createOrReplaceTempView('df')

Once the result is registered as a temporary view, it can be graphed as follows.

%sql select * from df

Reference material for writing PySpark scripts

References that you will consult constantly, like a dictionary, during actual development include the following.

• Spark SQL, DataFrames and Datasets Guide


• pyspark.sql API Docs
• Sample code 1
• Sample code 2
If you can use Homebrew on macOS, you can install Apache Spark locally and try out what the references describe while running it.

brew install apache-spark

Python

pyspark

Scala

spark-shell
