Академический Документы
Профессиональный Документы
Культура Документы
lake, in-place querying, schema on read and data mashup have become a critical
and integral part of an organization’s analytics capabilities. Consequently, new
tools and features like AWS Redshift spectrum, Athena, Presto, Hive etc. are now
playing a game changing role for data analysts to explore and mine data and get
value out of existing data assets lying unused within the organization. Although,
with so many options to choose from, it becomes utterly necessary to understand
the subtle differences between each of them and advantages of each, to select
the right tool of choice for a particular use case. With this blog, we will explore
some capabilities of Redshift spectrum and advantages of using it in some of the
use cases, while keeping the comparison and differences between different tool
of choice to a separate blog post.
With Amazon Redshift Spectrum, you now have a fast, cost-effective engine that
minimizes data processed with dynamic partition pruning. You can further
improve query performance by reducing the data scanned. You do this by
partitioning and compressing data and by using a columnar format for storage.”
Pretty clear from the above that Redshift spectrum is a service that can be
enabled with AWS Redshift database to query on the data stored in AWS S3
buckets as files. This would sound similar to external tables in Oracle or SQL
server database, but with a lot more flexibility and features which I would discuss
later in the post.
AWS Redshift spectrum gives a huge upper hand for any organization trying to be
data driven and get immediate value out of their data.
Create a S3 bucket to store the data files within the bucket. For our example, we
are going to create a bucket “s3-kpi-monthly-datafile-gl”. A bucket in AWS can be
considered to be a master folder to store file objects within it. A bucket name is a
global name across the AWS platform and hence needs to be unique. If a name is
already used by any one else using AWS platform, AWS will not allow you to
create the bucket with same name.
Upload the data file to the S3 bucket
Create an IAM role having privileges to access S3 buckets, data catalog and
Redshift. For demo purpose, we will create role named
“RedshiftAccessForSpectrum” add Full access privilege for S3 and Redshift.
These access would enable Redshift spectrum to access data folder in S3 and
Athena data catalog for maintaining the metadata.
Create external schema in Redshift
All the external tables within Redshift has to be created inside an external
schema. The external schema “ext_Redshift_spectrum” created can either use a
data catalog or hive meta store to internally manage the metadata pertaining to
the external tables like table definitions and datafile locations.
Here,
athenadatacatalog – The data catalog that would be created if it doesn’t exist
that would maintain the spectrum metadata related to external tables.
External table to be created using a csv file uploaded in the S3 bucket “s3-kpi-
monthly-datafile-gl”.
External tables can refer to any number of files uploaded inside the bucket or
bucket folder provided the file structures remain the same. This would make the
table more dynamic and would not need to recreate the table every time a new
file is uploaded to the bucket.
The lines below are used to specify the file format to be accessed. The sample
file uses csv format.
AWS Redshift is a highly scalable MPP database, which can support a hugely
complex business intelligence and analytics workload. However, with capabilities
comes cost which needs to be kept in mind while designing a data warehouse on
top of a Redshift database. Whether it is addition of nodes to existing cluster or
spinning out a new cluster, it is always a costly affair for any organization. So, for
any data that needs to be stored in Redshift needs a careful consideration. AWS
Redshift is spawned as a cluster using either a DC or a DS node depending on the
query performance and sizing needs (less than or greater than 500 GB
recommended). As data in Redshift grows, need for adding new nodes becomes a
point to consider. With new nodes the cost of ownership also increases. However,
we all know from a data warehouse perspective, that not all data sitting in a data
warehouse are actively queried. So, it is reasonable to say that any cold data can
be kept in S3 storage (Standard, IA or glacier) and can be referenced whenever
the need arises. Redshift spectrum comes to rescue here as it provides a way to
reference data sitting in files inside S3 to be directly queried within Redshift with
other hot data stored in Redshift tables. Use case of this can be any data not
actively needed for day to day operational and analytical reporting but needs to
be stored in query able format for compliance purpose or any data that is mostly
needed for monthly/ quarterly / yearly reporting.
As an example, consider a use case when a customer has a total data warehouse
size of 5 TB with only 2 TB as hot data while the other 3 TB needs to be only
queried quarterly.
Scenario 2 - A one node Redshift cluster with an on-demand pricing ds2.xlarge
nodes and located in us-east-1 region would initially give the capability of store 2
TB worth of data in Redshift. While rest of 3 TB data can be kept in S3-IA storage.
Below table depicts the estimated cost of ownership in both the scenarios:
Athena is a Serverless querying service provided by AWS which can also be used
to query data stored in S3. There is no need to spin up a Redshift cluster for using
Athena. However, Athena is good for initial exploratory analysis to be done on any
data stored in S3. For any bigger workloads such as those to be used for business
intelligence analytics and dealing with huge volume of data to be combined and
joined with multiple Redshift tables, it is always recommended to use Redshift
spectrum.
Any user not invested in Redshift would still like to use Athena for querying the S3
data with TCO same or even lesser. However, any customer who are already
invested in Redshift will always like to use Redshift spectrum and its engine
capabilities to query huge volume of S3 data to get decisions out of both S3 and
Redshift data.
UDF stands for user defined function, which can be considered to be similar to
RDBMS functions. Redshift as well as Redshift spectrum supports UDF’s which
can be used in SQL queries written to access Redshift spectrum tables for
carrying out complex logic calculation. UDF’s in Redshift can be written in simple
SQL or can use language like python for carrying out complex computation like
model calculation in machine learning.
For the purpose of example, we tried creating a UDF using python logger module
to log messages to log file in S3, saving valuable dollars if it would have been
written in Redshift tables.
logger = logging.getLogger()
logger.setLevel(logging.INFO)
logger.info($1)
return 0
$$ language plpythonu;
The above UDF can be called by any ETL process using Redshift spectrum tables
and log errors and warning messages to log files in S3.
Security
As with Redshift, spectrum also comes with various security features to handle
data security requirements for data at rest and data in-transit. Data encryption
options are available at various places like AWS SSE (S3 or KMS) for files stored
in S3 which spectrum can decrypt transparently, while reading in Redshift.
Further security is also enforced by having Redshift cluster launched within a VPC
to restrict data flow inside a private cloud. A point to note here is that, Redshift
spectrum currently doesn’t support AWS S3 client-side encryption. This would
mean any data already stored in S3 with client-side encryption has to be stored
with AWS SSE in S3.
Conclusion
Redshift spectrum is a great tool to have in any organization’s bucket using AWS
or wanting to get maximum value out of their data in least costly way. It is a
smart way to add data lake features inside a data warehouse already hosted in
Redshift. For any customer not invested in Redshift already, and who are just
trying to explore data can always use AWS Athena and then extend to Redshift
and Redshift spectrum for their data warehouse, analytics and business
intelligence workload.