
LAB 7 - Getting Started with Google BigQuery

1. Overview
In this lab, you load a CSV file into a BigQuery table. After loading the data, you query it using the
BigQuery web user interface, the CLI, and the BigQuery shell.

What you need

To complete this lab, you need:

- Access to a supported Internet browser:
  - The latest version of Google Chrome, Firefox, or Microsoft Edge
  - Microsoft Internet Explorer 11+
  - Safari 8+ (Safari private mode is not supported)
- A Google Cloud Platform project
A Google Cloud Platform project

What you learn

In this lab, you:

- Load a CSV file into a BigQuery table using the web UI
- Query the data using the BigQuery web UI
- Query the data using the CLI and the BigQuery shell

2. Introduction
Using BigQuery involves interacting with a number of Google Cloud Platform resources, including
projects, datasets, tables, and jobs. This lab introduces you to some of these resources and this brief
introduction summarizes their role in interacting with BigQuery.

Projects. A project contains information such as subscribed service API(s), authentication
information, billing information, and Access Control Lists (ACLs) that determine access to the
datasets and the jobs. Projects are created and managed using the APIs Console. For information
about the related API type, see Projects.
Datasets. A dataset is a grouping mechanism that holds zero or more tables. A dataset is the lowest
level unit of access control; you cannot control access at the table level. Datasets are owned by
projects, which control billing and serve as a global namespace root: all of the object names in
BigQuery are relative to the project. Each dataset can be shared with individual users. Datasets are
also referenced in SQL statements when interacting with BigQuery. For information about the
related API type, see Datasets.

Tables. Row-column structures that contain actual data. They belong to a dataset. You cannot
control access at the table level; access is controlled at the dataset level. For information about the
related API type, see Tables.

- Collection of columns and rows
- Data stored in managed storage
- Have a schema
  - Describes strongly-typed columns of values
- Views are supported
  - Virtual tables defined by a SQL query
- Tables can be external (federated)
  - Query Google Cloud Storage directly (discussed later)

Before you can query your data, you must first load it into BigQuery or set up the data as a federated
data source. You can bulk load the data by using a job, or stream records individually. Alternatively,
you can skip the loading process by setting up a table as a federated data source.
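
For illustration only (not a lab step), streaming individual records from the CLI might look like the
following sketch; the rows.json file name and its contents are hypothetical, and the command
assumes a table like the cp100.namedata table created later in this lab:

# Stream records from a newline-delimited JSON file, one record per line.
bq insert cp100.namedata rows.json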

Load jobs support three data sources:

1. Objects in Google Cloud Storage
2. Data sent with the job or streaming insert
3. A Google Cloud Datastore backup

Loaded data can be added to a new table, appended to a table, or can overwrite a table. Data can
be represented as a flat or nested/repeated schema.
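
As an illustration of these write modes with the bq CLI (the bucket and file names below are
hypothetical placeholders):

# Append to an existing table (the default write behavior).
bq load --source_format=CSV cp100.namedata gs://my-bucket/more_names.csv

# Overwrite the table contents instead of appending.
bq load --replace --source_format=CSV cp100.namedata gs://my-bucket/more_names.csv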

Jobs. Jobs are used to start all potentially long-running actions, such as queries, table import, and
export requests. Shorter actions, such as list or get requests, are not managed by a job resource.
For information about the related API type, see Jobs. Each job has a job ID. Job IDs are particularly
useful when you load a large dataset: BigQuery rejects a load job that reuses an existing job ID,
which guarantees that the same data is not loaded twice.
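
A sketch of that idempotency pattern using the bq tool's global --job_id flag (the job ID, bucket,
and file names are hypothetical placeholders):

# The first run creates the load job; a retry with the same job ID
# is rejected, so the data cannot be loaded twice.
bq --job_id=load_names_2014_v1 load --source_format=CSV \
    cp100.namedata gs://my-bucket/yob2014.csv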

You do not need to duplicate data across many clusters to achieve resource separation. Just use
ACLs.
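
You can explore this resource hierarchy from the command line with the bq tool; a brief sketch (the
dataset and table names assume the cp100.namedata table you create later in this lab):

bq ls                    # list datasets in the current project
bq ls cp100              # list tables in the cp100 dataset
bq show cp100.namedata   # show the schema and metadata of one table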

The following diagram illustrates the relationship between projects, datasets, tables, and jobs.

3. Load the data
BigQuery supports the following data formats when loading data into tables: CSV, JSON, AVRO, or
Cloud Datastore backups. Data can be loaded into BigQuery using a job or by streaming records
individually. Load jobs support three data sources:

1. Objects in Google Cloud Storage
2. Data sent with the job or streaming insert
3. A Google Cloud Datastore backup

In this lab, you load the contents of a CSV file (from Google Cloud Storage) into a BigQuery table
using the BigQuery web UI. After loading the data, you query it using the web UI, the CLI, and the
BigQuery shell.

The CSV file used in the lab contains the top 1,000 most popular baby names from 2014. You can
download an archive of the data compiled for each year by the US government from the Social
Security Online website.

To load the data in the CSV file into a BigQuery table:

Step 1
Open the Google Cloud Platform Console, and if necessary, select the cp100 project.
Step 2
Click Big Data > BigQuery.

Step 3
Click the blue arrow to the right of your project name and choose Create new dataset.

Step 4
In the Create Dataset dialog, for Dataset ID, type cp100 and then click OK.

Step 5
When the dataset is created, to the right of cp100, click the add table icon (the blue + icon).

Step 6
In the Source Data section:

- For Location, select Google Cloud Storage, and in the field, type
  gs://cloud-training/CP100/Lab12/yob2014.txt.
- Verify File Format is set to CSV.

Step 7
In the Destination Table section:

- For Table name, type namedata.
- For Table type, verify Native table is selected.

Step 8
In the Specify schema section:

- For Name, type name.
- For Type, choose STRING.
- For Mode, choose NULLABLE.

Click Add field.

- For Name, type gender.
- For Type, choose STRING.
- For Mode, choose NULLABLE.

Click Add field.

- For Name, type count.
- For Type, choose INTEGER.
- For Mode, choose NULLABLE.

Step 9
In the Options section:

- For Field delimiter, verify Comma is selected.

Accept the remaining default values and click Create Table. BigQuery creates a load job to
create the table and upload data into the table (this may take a few seconds). You can track
job progress by clicking Job History.

Step 10
Once the load job is complete, click cp100 > namedata.

Step 11
On the Table Details page, click Details to view the table properties and then click Preview to view
the table data.
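
For reference only (the lab uses the web UI), the equivalent load could be issued from the bq CLI; a
sketch assuming the same Cloud Storage object and the schema defined in Step 8:

bq load --source_format=CSV \
    cp100.namedata \
    gs://cloud-training/CP100/Lab12/yob2014.txt \
    name:STRING,gender:STRING,count:INTEGER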

4. Query processing
Dremel turns your SQL query into a dynamic execution tree consisting of a Master and a number of
shards (groupings of compute resources).

The Master determines the number of shards needed to retrieve data and performs record
assembly, ordering, and final filtering. The shards read data and perform aggregate and scalar
functions against the data.

BigQuery relies on Colossus, Google's latest-generation distributed file system. Each Google
datacenter has its own Colossus cluster, and each Colossus cluster has enough disks to give every
BigQuery user thousands of dedicated disks at a time. Colossus also handles replication, recovery
(when disks crash), and distributed management (so there is no single point of failure). Colossus is
fast enough to allow BigQuery to provide performance similar to many in-memory databases, while
leveraging much cheaper, yet highly parallelized, scalable, durable, and performant infrastructure.
BigQuery leverages a columnar storage format and compression algorithm to store data in Colossus
in the most optimal way for reading large amounts of structured data. Colossus allows BigQuery
users to scale to dozens of petabytes of storage seamlessly, without paying the penalty of attaching
the much more expensive compute resources typical of most traditional databases.

To give you thousands of CPU cores dedicated to processing your task, BigQuery takes advantage
of Borg, Google's large-scale cluster management system. Borg clusters run on tens of thousands
of machines and hundreds of thousands of cores.

Besides the obvious needs for resource coordination and compute resources, Big Data workloads
are often throttled by networking throughput. Google's Jupiter network can deliver 1 Petabit/sec of
total bisection bandwidth, allowing large workloads to be distributed quickly and efficiently. The
Jupiter networking infrastructure might be the single biggest differentiator in Google Cloud
Platform: it provides enough bandwidth to allow 100,000 machines to communicate with any other
machine at 10 Gb/s.

To get started querying data, BigQuery includes a number of public datasets.
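
As a quick experiment before loading your own data, you can query one of those public sample
datasets from the CLI; a sketch against the shakespeare sample table, using the same legacy SQL
dialect as the queries in this lab:

# Five most frequent words across Shakespeare's works.
bq query "SELECT word, SUM(word_count) AS total FROM [publicdata:samples.shakespeare] GROUP BY word ORDER BY total DESC LIMIT 5"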

5. Query the data

In this section of the lab, you query the namedata table you created previously using the BigQuery
web UI, the CLI, and the BigQuery shell.

To query your data:

Step 1
In the BigQuery web UI, click Compose Query.

Step 2
In the New Query window, type the following query to display the 5 most popular girls' names in
2014.

SELECT
  name,
  count
FROM
  cp100.namedata
WHERE
  gender = 'F'
ORDER BY
  count DESC
LIMIT
  5

Notice the Query Validator tells you the query syntax is valid (indicated by the green check mark)
and indicates how much data the query will process. The amount of data processed allows you to
estimate the price of the query using the Cloud Platform Pricing Calculator.
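
You can obtain the same bytes-processed estimate from the CLI without running the query by
adding the --dry_run flag; a minimal sketch:

bq query --dry_run "SELECT name, count FROM cp100.namedata WHERE gender = 'F' ORDER BY count DESC LIMIT 5"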

Step 3
Click Run Query and examine the results.

Step 4
Open the Google Cloud Platform Console.

Step 5
In the top right corner of the console window, click the Activate Google Cloud Shell button.

Step 6
Type the following query using the bq CLI. This query displays the 5 least popular boys' names in
2014.

bq query \
"SELECT name, count FROM cp100.namedata WHERE gender = 'M' ORDER BY count ASC LIMIT 5"

Step 7
Type the following command to launch the BigQuery shell.

bq shell

Step 8
Type the following command to display the 5 most popular boys' names in 2014.

SELECT name, count FROM cp100.namedata WHERE gender = 'M' ORDER BY count DESC LIMIT 5

Notice that when you use the BigQuery shell, you do not need to preface the query with bq
query and you do not need to place the query in quotes.

Step 9
Type exit to close the BigQuery shell.

Step 10
Type exit to close the Cloud Shell window.

Step 11
Leave the BigQuery web UI and the Cloud Platform Console open.

6. Clean up
To delete the resources used in the lab (and the remaining resources used in the course):

Step 1
Switch to the BigQuery web UI.

Step 2
To the right of the cp100 dataset, click the blue, drop-down arrow icon and choose Delete dataset.

Step 3
In the Delete Dataset dialog, type the dataset ID in the field (cp100) and then click OK.
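
If you prefer the command line and still have Cloud Shell open, the same cleanup can be done with
the bq tool; a sketch (-r removes the dataset together with all of its tables, -f skips the
confirmation prompt):

bq rm -r -f cp100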

Step 4
Close the BigQuery web UI.

Step 5
(Optional) Follow the remaining steps to delete the project you created for this course.

In the upper right corner of the Cloud Platform Console, click the Utilities and more icon and then
choose Project settings.

Step 6
Click Delete Project.

Step 7
In the Shut down project... dialog, type the project ID and then click Shut down.
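
Alternatively, a project can be shut down from the command line; a sketch, where PROJECT_ID is a
placeholder for your own project ID:

gcloud projects delete PROJECT_ID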

Step 8
(Optional) Follow the remaining steps if you wish to deactivate your billing account.

In the Cloud Platform Console navigator, click Billing.

Step 9
If you have more than one billing account, click the billing account linked to your project.

Step 10
On the Overview page, at the top of the window, click Close Billing Account.

Step 11
Click Close billing account to confirm the action.

Step 12
Close the Cloud Platform Console.

Step 13
(Optional) You may also delete the zip file containing the Bookshelf cover images from your
computer.

© Google, Inc. or its affiliates. All rights reserved. Do not distribute.
