Saturday, 23 March 2019

Google BigQuery - bq Command-Line Tool Quickstart

This page provides a 15-minute exercise in using the bq command-line tool to run queries, load data, and export data.

Prerequisites

Before you can start this tutorial, you must complete the following prerequisites:
  1. Activate the BigQuery service with a Google APIs Console project
    If you are a member of an existing Google APIs Console project that already has BigQuery enabled, you can use that project, or you can create a new project. If you create a new project, you must also turn on billing for that project before you can use BigQuery with full capabilities, as described in the sign up process.
  2. Install the Google Cloud SDK
    If you haven't installed the SDK, download the Cloud SDK now.

Running bq in Interactive Mode

You can run bq in an interactive mode, where you don't need to prefix the commands with "bq". To start interactive mode, call bq shell. The prompt is the ID of the default project. To exit interactive mode, type "exit".
$ bq shell
Welcome to BigQuery! (Type help for more information.)
10000000021> ls
   datasetId
 -------------
  mydataset
10000000021> exit
Goodbye.

Examine a Table

If you're signed up to use BigQuery, you can run queries against or examine schemas of any of our sample tables. We'll try running some queries against the Shakespeare table, which contains an entry for every word in every play.
To examine the schema of a specific table, run bq show projectId:datasetId.tableId, where the project and dataset IDs can optionally be omitted if they are the default values for your bq tool. The following example examines the shakespeare table in the samples dataset:
$ bq show publicdata:samples.shakespeare
    tableId      Last modified                  Schema
 ------------- ----------------- ------------------------------------
  shakespeare   01 Sep 13:46:28   |- word: string (required)
                                  |- word_count: integer (required)
                                  |- corpus: string (required)
                                  |- corpus_date: integer (required)

Run a Query

Let's see how many times the substring "raisin" appears in Shakespeare's works.
To run a query, run the command bq query "query_string", where the query string must be quoted, and follow the BigQuery SQL syntax. Note that any quotation marks inside the query string must be escaped with a \ mark, or else use a different quotation mark type than the surrounding marks (" versus ').
The following example does a word count for the number of times that the substring "raisin" appears in all of Shakespeare's works. The query shown is case-sensitive; string comparisons are case-sensitive, unless you use the SQL feature IGNORE CASE.
$ bq query "SELECT word, COUNT(word) as count FROM publicdata:samples.shakespeare WHERE word CONTAINS 'raisin' GROUP BY word"
Waiting on job_dcda37c0bbed4c669b04dfd567859b90 ... (0s) Current status: DONE
+---------------+-------+
|     word      | count |
+---------------+-------+
| Praising      |   4   |
| raising       |   5   |
| raisins       |   1   |
| praising      |   7   |
| dispraising   |   2   |
| dispraisingly |   1   |
+---------------+-------+
In the next set of query results, we try to select "huzzah" from the "word" column, but since there are no matches, no results are returned.
$ bq query "SELECT word FROM publicdata:samples.shakespeare WHERE word = 'huzzah' IGNORE CASE"
Waiting on job_e19 ... (4s) Current status: DONE$

Create a New Table

Now try creating your own table. Every table must live inside a dataset, which is simply a group of tables. A dataset is assigned to a single project. For full details on managing and populating tables, see Tables.

Step 1: Create a data file

We'll use a file of the top 1,000 most popular baby names compiled each year by the US government. You can get a zipped set of files by year from the Social Security Online website. Download the national data file. This zipped collection includes a comma-separated file (CSV) of names for each year for the past 120 years. These files have the proper formatting for BigQuery (see BigQuery's CSV requirements). For our example, we'll use the file with names for 2010.
Open the file in a text editor to see what it looks like; the file is a comma-separated value (CSV) file with the following three columns: name, sex ("M" or "F"), and number of children with that name. The file has no header row.

(Optional) Upload the data file to Google Cloud Storage

If you would like to load data from Google Cloud Storage rather than directly from your computer, you can upload your data to Google Cloud Storage as follows:
  1. Enable Google Cloud Storage for your project. You will upload your data file to Google Cloud Storage, and load it from there. To enable Cloud Storage:
    1. Open the Google APIs Console to your BigQuery project
    2. Enable Cloud Storage on the Services pane.
    3. Enable billing on the Billing pane. The example table in this exercise is small enough that if you disable billing in the project after running the exercise, you shouldn't be billed anything.
  2. Open the Cloud Storage Manager. From the Google APIs Console for your project, open the Google Cloud Storage pane, and click the "Google Cloud Storage Manager" link.
  3. Upload your file to Google Cloud Storage:
    1. In the Cloud Storage Manager, create a universally unique bucket name, for example: bigquerybucket
    2. Select the new bucket, and click Upload to upload your file.
Follow the rest of the steps to continue loading your data.

Step 2: Create a new dataset

You can see whether the project has any existing datasets that you can use, or you can create a new one. Try out the following bq commands to list datasets, and optionally create a new dataset:
List Datasets
# Without modifiers, ls lists datasets in the default project:
$ bq ls
   datasetId
 -------------
  olddataset

# Provide the project ID to list datasets in a specific project
$ bq ls 292244645228:
  datasetId
 -----------
  workspace

# To get a list of all your projects, use the -p flag
$ bq ls -p
            projectId                  friendlyName
 ------------------------------- ------------------------
  588503209435                    Main Project
  38995755354                     Example Project
  292244645228                    My Project
  1099142187650                   Project with billing
Create a new dataset
A dataset name can be up to 1,024 characters long, and consist of A-Z, a-z, 0-9, and the underscore, but it cannot start with a number or underscore, or have spaces.
$ bq mk mydataset
Dataset dataset '1099142187650:mydataset' successfully created.
$ bq ls
   datasetId
 -------------
  olddataset
  mydataset

Step 3: Upload the table

The bq load command creates or updates a table and loads data in a single step. The command has the following syntax:
bq load [--source_format=NEWLINE_DELIMITED_JSON|CSV] destination_table data_source_uri table_schema
destination_table
The fully-qualified table name of the table to create.
--source_format
BigQuery accepts both CSV and JSON files. By default, if you do not explicitly specify the type of file, BigQuery expects a CSV file. If you are uploading a JSON file, you must provide the --source_format=NEWLINE_DELIMITED_JSON flag. Your source file and schema must also follow the proper JSON structure.
data_source
The source CSV data file used to populate the table. Note that this can be one uncompressed local file or, alternatively, one or more fully-qualified Google Cloud Storage URIs referring to uncompressed or gzippedfiles, in the format gs://bucket/file. For example, the following are all valid file types: my_file.csvgs://my_file.csv.gz, or my_file.csv.gz. You can specify multiple source URIs using a comma-delimited list like so:
gs://my_file2.csv,gs://my_file3.csv.gz
Each file must have the same schema in order for them to be uploaded into one destination table.
Note: Currently, it is not possible to load more than one local file as a data source. You can only load multiple source URIs.
table_schema
A description of the table schema to use. This can be either a file name or a comma-separated list of column_name:datatype pairs. We will use the comma-separated list for this example. Try using the following schema descriptor for your table: name:string,gender:string,count:integer where "name", "gender", and "count" are labels that are assigned to the columns in the new table.
If you would like, you can also specify your schema in a separate file and provide that file as the table schema. Your schema file must contain a single array object with entries that provide the following properties:
  • "name": Name of the column
  • "type": Type of data, e.g. string
  • "mode" (optional): Whether this field can be null
A sample schema file might look like the following:
[
  {"name": "name", "type": "string", "mode": "required"},
  {"name": "gender", "type": "string", "mode": "nullable"},
  {"name": "count", "type": "integer", "mode": "required"}
]
Run the following command to load your source file into a new table called "names2010" in the dataset we created above. By default, this runs synchronously, and will take a few seconds for the load job to complete. To learn how to run bq commands asynchronously, see Running Asynchronous Operations.
$ bq load mydataset.names2010 yob2010.txt name:string,gender:string,count:integer
Waiting on job_4f0c0878f6184119abfdae05f5194e65 ... (35s) Current status: DONE
$ bq ls mydataset
     tableId
 ---------------
  names2010

$ bq show mydataset.names2010
   tableId     Last modified         Schema
 ----------- ----------------- -------------------
  names2010   25 Oct 17:40:53   |- name: string
                                |- gender: string
                                |- count: integer
By default, when you load data, BigQuery expects UTF-8 encoded data. If you have data that is in ISO-8859-1 (or Latin-1) encoding and are having problems with your loaded data, you can tell BigQuery to treat your data as Latin-1 explicitly, using the -E flag:
bq load -E ISO-8859-1 mydataset.names2010 gs://bigquerybucket/yob2010.txt name:string,gender:string,count:integer
For more information, see Character Encodings.

Step 5: Run queries

Now you can start querying your new table:
# Most popular girls' names
$ bq query "SELECT name,count FROM mydataset.names2010 WHERE gender = 'F' ORDER BY count DESC LIMIT 5"
Waiting on job_58c0f5ca52764ef1902eba611b71c651 ... (0s) Current status: DONE
+----------+-------+
|   name   | COUNT |
+----------+-------+
| Isabella | 22731 |
| Sophia   | 20477 |
| Emma     | 17179 |
| Olivia   | 16860 |
| Ava      | 15300 |
+----------+-------+

# Most unusual boys' names
# Data does not include names with < 4 occurrences
$ bq query "SELECT name,count FROM mydataset.names2010 WHERE gender = 'M' ORDER BY count ASC LIMIT 5"
Waiting on job_556ba2e5aad340a7b2818c3e3280b7a3 ... (1s) Current status: DONE
+----------+-------+
|   name   | COUNT |
+----------+-------+
| Aarian   |     5 |
| Aaidan   |     5 |
| Aamarion |     5 |
| Aadhavan |     5 |
| Aaqib    |     5 |
+----------+-------+

Character Encodings

By default, the BigQuery service expects all source data to be UTF-8 encoded. Optionally, if you have CSV files with data encoded in ISO-8859-1 format, you should explicitly specify the encoding when you import your data so that BigQuery can properly convert your data to UTF-8 during the import process. Currently, it is only possible to import data that is ISO-8859-1 or UTF-8 encoded. Keep in mind the following when you specify the character encoding of your data:
  • If you don't specify an encoding, or explicitly specify that your data is UTF-8 but then provide a CSV file that is not UTF-8 encoded, BigQuery attempts to convert your CSV file to UTF-8.
    Generally, your data will be imported successfully but may not match byte-for-byte what you expect. To avoid this, specify the correct encoding and try your import again.
  • Delimiters must be encoded as ISO-8859-1.
    Generally, it is best practice to use a standard delimiter, such as a tab, pipe, or comma.
  • If BigQuery cannot convert a character, it is converted to the standard Unicode replacement character: �.
JSON files must always be encoded in UTF-8.
To specify encoding when you load data using the command-line tool, use the -E flag followed by the encoding type. For example, the following command specifies that the file is ISO-8859-1 encoded:
bq load -E ISO-8859-1 mydataset.mytable mydata.csv field1:string,field2:string

Next Steps