
Managing input data and data sources:

1. Avoid SELECT * by projecting only the columns you need. Applying a LIMIT clause to a SELECT * query does
not affect the amount of data read. You are billed for reading all bytes in the entire table, and the query
counts against your free tier quota.
Use SELECT * EXCEPT to exclude one or more columns from the results (see the sketch after this list).
2. If you do require queries against every column in a table, but only against a subset of data, consider:
• Materializing results in a destination table and querying that table instead
• Partitioning your tables by date and querying the relevant partition; for example, WHERE
_PARTITIONDATE = "2017-01-01" scans only the January 1, 2017 partition
3. Prune partitioned queries: when querying a time-partitioned table, use the _PARTITIONTIME pseudo
column to filter the partitions. Use time-partitioned tables instead of sharded tables if your data allows
it.
4. Denormalize data whenever possible.
Avoid denormalization in these use cases:
• You have a star schema with frequently changing dimensions.
• BigQuery complements an Online Transaction Processing (OLTP) system with row-level
mutation, but can't replace it.
Use nested (STRUCT) and repeated (ARRAY) fields to maintain relationships.
5. Use of external data sources: If query performance is a top priority, do not use an external data
source.
Querying tables in BigQuery managed storage is typically much faster than querying external tables in
Google Cloud Storage, Google Drive, or Google Cloud Bigtable.
6. When querying wildcard tables, use the most granular prefix possible.
Wildcard tables are useful if your dataset contains:
• Multiple, similarly named tables with compatible schemas
• Sharded tables
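
A minimal sketch combining these practices, assuming a hypothetical ingestion-time-partitioned table
mydataset.transactions with columns internal_notes and raw_payload that are not needed:

-- Project only the columns you need; EXCEPT drops the named columns.
SELECT
  * EXCEPT (internal_notes, raw_payload)
FROM `mydataset.transactions`
-- Prune to a single partition so only that partition's bytes are scanned.
WHERE _PARTITIONDATE = DATE "2017-01-01";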

Optimizing communication between slots:


1. Reduce your data before joins: apply filters and aggregations as early as possible so that less data is
shuffled between slots.
2. WITH clauses are used primarily for readability because they are not materialized. For example,
placing all your queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause. If a
query appears in more than one WITH clause, it executes in each clause (see the sketch after this list).
3. Do not use tables sharded by date (also called date-named tables) in place of time-partitioned tables.
Also, avoid oversharding.
Partitioned tables perform better than date-named tables. When you create tables sharded by date,
BigQuery must maintain a copy of the schema and metadata for each date-named table, and it might also
be required to verify permissions for each queried table.
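
Returning to item 2: a minimal sketch (hypothetical table name) of the WITH pitfall, where a CTE
referenced twice is executed twice:

-- The WITH body is not materialized: each reference below re-executes it.
WITH expensive AS (
  SELECT user_id, COUNT(*) AS events
  FROM `mydataset.events`
  GROUP BY user_id
)
SELECT * FROM expensive WHERE events > 100
UNION ALL
SELECT * FROM expensive WHERE events <= 100;  -- second reference, second execution

If the WITH body is expensive, it may be cheaper to materialize it into a table and query that table instead.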

Optimizing Query Computation


While optimizing query performance, it is also necessary to look at the CPU utilization of a query. To
make sure a query runs in optimal time and with optimal resources, the following are some best
practices you can follow:

Avoid repeatedly transforming data via SQL queries:

• If you are using SQL to perform ETL operations, avoid situations where you are
repeatedly transforming the same data.

• For example, if you are using SQL to trim strings or extract data by using regular
expressions, it is more performant to materialize the transformed results in a
destination table. Functions like regular expressions require additional computation.
Querying the destination table without the added transformation overhead is much
more efficient.
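
A sketch of this pattern, with hypothetical table and column names:

-- Pay the regex/trim cost once by materializing the result.
CREATE OR REPLACE TABLE `mydataset.users_clean` AS
SELECT
  user_id,
  TRIM(name) AS name,
  REGEXP_EXTRACT(email, r'@(.+)$') AS email_domain
FROM `mydataset.users_raw`;

-- Downstream queries read the clean table with no transformation overhead.
SELECT email_domain, COUNT(*) AS n
FROM `mydataset.users_clean`
GROUP BY email_domain;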

Avoid JavaScript user-defined functions:

• Avoid using JavaScript user-defined functions. Use native UDFs instead.

• Calling a JavaScript UDF requires the instantiation of a subprocess. Spinning up this process and
running the UDF directly impacts query performance. If possible, use a native (SQL) UDF instead.
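
For illustration, a trivial function written both ways (the function names are hypothetical):

-- JavaScript UDF: calling it requires a JavaScript subprocess.
CREATE TEMP FUNCTION MultiplyJs(x FLOAT64, y FLOAT64)
RETURNS FLOAT64
LANGUAGE js AS r"""
  return x * y;
""";

-- Native SQL UDF: evaluated inline by the engine, no subprocess.
CREATE TEMP FUNCTION MultiplySql(x FLOAT64, y FLOAT64)
RETURNS FLOAT64 AS (x * y);

SELECT MultiplySql(3.0, 4.0) AS product;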

Use approximate aggregation functions:

• If your use case supports it, use an approximate aggregation function.

• If the SQL aggregation function you're using has an equivalent approximation function,
the approximation function will yield faster query performance. For example, instead of
using COUNT(DISTINCT), use APPROX_COUNT_DISTINCT().

• You can also use HyperLogLog++ functions to do approximations (including custom approximate
aggregations).
Order query operations to maximize performance:

• Use ORDER BY only in the outermost query or within window clauses (analytic
functions). Push complex operations to the end of the query.

• If you need to sort data, filter first to reduce the number of values that you need to sort.
If you sort your data first, you sort much more data than is necessary. It is preferable to
sort on a subset of data than to sort all the data and apply a LIMIT clause.

• When you use an ORDER BY clause, it should appear only in the outermost query. Placing
an ORDER BY clause in the middle of a query greatly impacts performance unless it is
being used in a window (analytic) function.

• Another technique for ordering your query is to push complex operations, such as regular
expressions and mathematical functions, to the end of the query. Again, this technique allows the
data to be pruned as much as possible before the complex operations are performed.
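
A sketch of filtering before sorting, with hypothetical names:

SELECT user_id, total_spend
FROM `mydataset.purchases`
WHERE purchase_date >= DATE "2017-01-01"  -- reduce the rows first
ORDER BY total_spend DESC                 -- sort only the surviving subset
LIMIT 100;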

Optimize your join patterns:

• For queries that join data from multiple tables, optimize your join patterns. Start with
the largest table.

• When you create a query by using a JOIN, consider the order in which you are merging
the data. The standard SQL query optimizer can determine which table should be on
which side of the join, but it is still recommended to order your joined tables
appropriately. The best practice is to place the largest table first, followed by the
smallest, and then by decreasing size.

• When you have a large table as the left side of the JOIN and a small one on the right side
of the JOIN, a broadcast join is created. A broadcast join sends all the data in the smaller
table to each slot that processes the larger table. It is advisable to perform the
broadcast join first.
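
A sketch of the recommended ordering, with hypothetical names:

-- Largest table first; the small table on the right can be broadcast
-- to every slot that processes the large one.
SELECT o.order_id, c.region
FROM `mydataset.orders` AS o        -- largest table
JOIN `mydataset.countries` AS c     -- smallest table
  ON o.country_code = c.country_code;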

Prune partitioned queries:

• When querying a time-partitioned table, use the _PARTITIONTIME pseudo column to filter the
partitions. Filtering the data using _PARTITIONTIME allows you to specify a date or a range of dates.

• For example, the following WHERE clause uses the _PARTITIONTIME pseudo column to specify
partitions between January 1, 2016 and January 31, 2016:

WHERE _PARTITIONTIME
BETWEEN TIMESTAMP("2016-01-01")
AND TIMESTAMP("2016-01-31")

• The query processes data only in the partitions that are indicated by the date range.
Filtering your partitions improves query performance and reduces costs.

Managing Query Outputs

When evaluating your output data, consider the number of bytes written by your query.
How many bytes are written for your result set? Are you properly limiting the amount of data
written? Are you repeatedly writing the same data? The amount of data written by a query
impacts query performance (I/O). If you are writing results to a permanent (destination) table,
the amount of data written also has a cost.

The following best practices provide guidance on controlling your output data.

Avoid repeated joins and subqueries:

• Avoid repeatedly joining the same tables and using the same subqueries.

• If you are repeatedly joining the same tables, consider revisiting your schema. Instead of
repeatedly joining the data, it might be more performant for you to use nested repeated
data to represent the relationships.

• Nested repeated data saves you the performance impact of the communication
bandwidth that is required by a join. It also saves you the I/O costs that are incurred by
repeatedly reading and writing the same data. Similarly, repeating the same subqueries
impacts performance through repetitive query processing.

• If you are using the same subqueries in multiple queries, consider materializing the
subquery results in a table. Then consume the materialized data in your queries.
• Materializing your subquery results improves performance and reduces the overall
amount of data that is read and written by BigQuery. The small cost of storing the
materialized data outweighs the performance impact of repeated I/O and query
processing.
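
A sketch of materializing a shared subquery, with hypothetical names:

-- Run the expensive aggregation once and store the result.
CREATE OR REPLACE TABLE `mydataset.daily_revenue` AS
SELECT order_date, SUM(amount) AS revenue
FROM `mydataset.orders`
GROUP BY order_date;

-- All downstream queries read the small materialized table instead of
-- repeating the aggregation over the raw data.
SELECT * FROM `mydataset.daily_revenue` WHERE revenue > 10000;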

Carefully consider materializing large result sets:

• Carefully consider materializing large result sets to a destination table. Writing large
result sets has performance and cost impacts.

• BigQuery limits cached results to approximately 128 MB compressed. Queries that return larger
results exceed this limit and frequently result in the following error: Response too large.

• This error often occurs when you select a large number of fields from a table with a
considerable amount of data. Issues writing cached results can also occur in ETL-style
queries that normalize data without reduction or aggregation.

• You can overcome the limitation on cached result size by:

• Using filters to limit the result set


• Using a LIMIT clause to reduce the result set, especially if you are using an ORDER BY
clause
• Writing the output data to a destination table

• Be aware that writing very large result sets to destination tables impacts query
performance (I/O). In addition, you will incur a small cost for storing the destination
table.

• You can automatically delete a large destination table by using the dataset's default
table expiration.
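
A sketch of a self-expiring destination table (the three-day window and the names are assumptions):

-- expiration_timestamp overrides the dataset's default table expiration.
CREATE TABLE `mydataset.big_results`
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 3 DAY)
) AS
SELECT * EXCEPT (raw_payload)
FROM `mydataset.events`
WHERE event_date >= DATE "2017-01-01";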

Query Timeline and the Execution Plan


• When BigQuery executes a query job, it converts the declarative SQL statement into a
graph of execution stages.

• You can see details of the query plan for a completed query by clicking the Details button.

• For long-running queries, you can view the query plan as it progresses by clicking the link
within the query status line below the query composer pane.

• This information shows which columns were read and from which table, along with row counts
and any aggregations that were performed.

• Embedded within query jobs, BigQuery includes diagnostic query plan and timing
information.

• This is similar to the information provided by statements such as EXPLAIN in other database
and analytical systems.

• This information can be retrieved from the API responses of methods such as jobs.get

Error reporting

It is possible for query jobs to fail mid-execution. Because plan information is updated
periodically, you can observe where within the execution graph the failure occurred.
Within the UI, successful and failed stages are labelled with a check mark or an exclamation
point, respectively, next to the stage name.

Using execution information:

BigQuery query plans provide information about how the service executes queries, but the
managed nature of the service limits whether some details are directly actionable. Many
optimizations happen automatically simply by using the service, which may differ from other
environments where tuning, provisioning, and monitoring may require dedicated,
knowledgeable staff.

Avoiding SQL Anti-Patterns

Self-joins
Best practice: Avoid self-joins. Use a window function instead.

• Typically, self-joins are used to compute row-dependent relationships. The result of using a
self-join is that it potentially doubles the number of output rows. This increase in output data
can cause poor performance.

• Instead of using a self-join, use a window (analytic) function to reduce the number of
additional bytes that are generated by the query, as in the sketch below.
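
A sketch of replacing a self-join with a window function (hypothetical names): to pair each row
with the previous row per user, LAG avoids joining the table to itself.

SELECT
  user_id,
  event_time,
  LAG(event_time) OVER (PARTITION BY user_id ORDER BY event_time)
    AS previous_event_time
FROM `mydataset.events`;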

Data skew

Best practice: If your query processes keys that are heavily skewed to a few values, filter your
data as early as possible.

Partition skew, sometimes called data skew, is when data is partitioned into very unequally
sized partitions. This creates an imbalance in the amount of data sent between slots. You can't
share partitions between slots, so if one partition is especially large, it can slow down, or even
crash the slot that processes the oversized partition.

Partitions become large when your partition key has a value that occurs more often than any
other value. For example, grouping by a user_id field where there are many entries for "guest" or
NULL.

To avoid performance issues that result from data skew:

• Use an approximate aggregate function such as APPROX_TOP_COUNT to determine whether the
data is skewed.
• Filter your data as early as possible.
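
A sketch of both steps, with hypothetical names:

-- Step 1: inspect the heaviest key values to confirm the skew.
SELECT APPROX_TOP_COUNT(user_id, 10) AS top_ids
FROM `mydataset.events`;

-- Step 2: filter the dominant values (e.g. guest/NULL) before grouping.
SELECT user_id, COUNT(*) AS events
FROM `mydataset.events`
WHERE user_id IS NOT NULL AND user_id != "guest"
GROUP BY user_id;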

Cross joins (Cartesian product)

Best practice: Avoid joins that generate more outputs than inputs. When a CROSS JOIN is
required, pre-aggregate your data.

Cross joins are queries where each row from the first table is joined to every row in the second
table (there are non-unique keys on both sides).

The worst case output is the number of rows in the left table multiplied by the number of rows
in the right table. In extreme cases, the query might not finish.
To avoid performance issues associated with joins that generate more outputs than inputs:

• Use a GROUP BY clause to pre-aggregate the data.


• Use a window function. Window functions are often more efficient than using a cross
join. For more information, see analytic functions.
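
A sketch of pre-aggregating before a CROSS JOIN, with hypothetical names:

-- Aggregate first so the Cartesian product multiplies small inputs.
WITH daily AS (
  SELECT order_date, SUM(amount) AS revenue
  FROM `mydataset.orders`
  GROUP BY order_date
)
SELECT d.order_date, d.revenue, r.region
FROM daily AS d
CROSS JOIN `mydataset.regions` AS r;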

Case Scenarios:
1. Too many aggregations:
On a non-partitioned table, if we try to run a query with a high number of aggregations, let's
say around 150, BigQuery might not execute the query and will return an error.
Solution:
a) If we cannot partition the table, we need to find the grain of the table and split our
aggregations into parts; e.g., in the case of 150 aggregations, split them into 3 to 4 parts of
around 40 aggregations each and then join those temporary results.
b) Partition the table on a date or timestamp column.
2. Order by:
On a non-partitioned table, when we try to use ORDER BY, BigQuery sends the complete data to one
node and tries to sort it, resulting in a memory failure error.

Solution:
a) After partitioning on a timestamp or date column, we can use ORDER BY within each partition.
b) Compute your query in Spark.
3. Windowing functions:
Since we are unable to use the ORDER BY clause, we will also be unable to use windowing functions
such as ROW_NUMBER(), RANK(), NTILE(), etc. while executing the query.
Solution: Compute your query in Spark.
4. Selecting only required columns before joining:

In BQ, if a query runs for more than 6 hours, it gets aborted automatically.
Suppose we are performing an inner join between ITEMS and PERSONA_DIMENSION and taking all the
columns from both tables, whereas we only need INDIVIDUAL_KEY and TRANSACTION_DATETIME from ITEMS,
and INDIVIDUAL_KEY and SIGN_UP_DATE_LPS from PERSONA_DIMENSION.
Also, sometimes the source data is such that an INNER JOIN behaves like a CROSS JOIN, so limiting
our columns also helps us avoid such cases.
So, our optimized query will look like:
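
A sketch; the dataset qualifier and the join key (INDIVIDUAL_KEY) are assumptions:

SELECT
  i.INDIVIDUAL_KEY,
  i.TRANSACTION_DATETIME,
  p.SIGN_UP_DATE_LPS
FROM (
  -- Take only the needed columns into the join.
  SELECT INDIVIDUAL_KEY, TRANSACTION_DATETIME
  FROM `mydataset.ITEMS`
) AS i
INNER JOIN (
  SELECT INDIVIDUAL_KEY, SIGN_UP_DATE_LPS
  FROM `mydataset.PERSONA_DIMENSION`
) AS p
ON i.INDIVIDUAL_KEY = p.INDIVIDUAL_KEY;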

5. Multiple subqueries:
a) There are many scenarios where you do a UNION of many queries to get the final result, or where
many queries share the same subquery across scripts.
Most of the time, this will impact your query performance and sometimes result in query failure
if the execution time reaches the 6-hour limit.
Solution: materialize the results.
Store your subquery result in one table and use that table for further processing.

b) This is also helpful when the outer query performs many aggregations on columns retrieved from
an inner subquery.
We have seen a major improvement in query execution time when we take subquery results into a
table and run aggregations over the new table.
Note: follow this approach only when the outer query has a high number of aggregates, probably
greater than 5.
To decrease cost, delete the created tables after use, as in the sketch below.
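
A sketch of the create-use-drop pattern (hypothetical names):

-- Materialize the shared subquery once.
CREATE TABLE `mydataset.subquery_result` AS
SELECT INDIVIDUAL_KEY, COUNT(*) AS txn_count
FROM `mydataset.ITEMS`
GROUP BY INDIVIDUAL_KEY;

-- ...run the UNIONs / outer aggregations against mydataset.subquery_result...

-- Drop the intermediate table to avoid ongoing storage cost.
DROP TABLE `mydataset.subquery_result`;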
