Prerequisites
Before you can start this tutorial, you must complete the following prerequisites:
- Activate the BigQuery service with a Google APIs Console project
If you are a member of an existing
Google APIs Console project that already has BigQuery enabled, you can use that project, or you can create a new project. If you create a new project, you must also turn on billing for that project before you can use BigQuery with full capabilities, as described in the
sign up process.
- Install the Google Cloud SDK
Running bq in Interactive Mode
You can run bq in an interactive mode, where you don't need to prefix the commands with "bq". To start interactive mode, call bq shell
. The prompt is the ID of the default project. To exit interactive mode, type "exit".
Hide Copy Code
$ bq shell
Welcome to BigQuery! (Type help for more information.)
10000000021> ls
datasetId
-------------
mydataset
10000000021> exit
Goodbye.
Examine a Table
If you're signed up to use BigQuery, you can run queries against or examine schemas of any of our
sample tables. We'll try running some queries against the
Shakespeare table, which contains an entry for every word in every play.
To examine the schema of a specific table, run bq show projectId:datasetId.tableId
, where the project and dataset IDs can optionally be omitted if they are the default values for your bq tool. The following example examines the shakespeare
table in the samples
dataset:
Hide Copy Code
$ bq show publicdata:samples.shakespeare
tableId Last modified Schema
------------- ----------------- ------------------------------------
shakespeare 01 Sep 13:46:28 |- word: string (required)
|- word_count: integer (required)
|- corpus: string (required)
|- corpus_date: integer (required)
Run a Query
Let's see how many times the substring "raisin" appears in Shakespeare's works.
To run a query, run the command bq query "query_string"
, where the query string must be quoted, and follow the BigQuery SQL syntax. Note that any quotation marks inside the query string must be escaped with a \ mark, or else use a different quotation mark type than the surrounding marks (" versus ').
The following example does a word count for the number of times that the substring "raisin" appears in all of Shakespeare's works. The query shown is case-sensitive; string comparisons are case-sensitive, unless you use the SQL feature IGNORE CASE
.
Hide Copy Code
$ bq query "SELECT word, COUNT(word) as count FROM publicdata:samples.shakespeare WHERE word CONTAINS 'raisin' GROUP BY word"
Waiting on job_dcda37c0bbed4c669b04dfd567859b90 ... (0s) Current status: DONE
+---------------+-------+
| word | count |
+---------------+-------+
| Praising | 4 |
| raising | 5 |
| raisins | 1 |
| praising | 7 |
| dispraising | 2 |
| dispraisingly | 1 |
+---------------+-------+
In the next set of query results, we try to select "huzzah" from the "word" column, but since there are no matches, no results are returned.
Hide Copy Code
$ bq query "SELECT word FROM publicdata:samples.shakespeare WHERE word = 'huzzah' IGNORE CASE"
Waiting on job_e19 ... (4s) Current status: DONE$
Create a New Table
Now try creating your own table. Every table must live inside a dataset, which is simply a group of tables. A dataset is assigned to a single project. For full details on managing and populating tables, see
Tables.
Step 1: Create a data file
We'll use a file of the top 1,000 most popular baby names compiled each year by the US government. You can get a zipped set of files by year from the
Social Security Online website. Download the national data file. This zipped collection includes a comma-separated file (CSV) of names for each year for the past 120 years. These files have the proper formatting for BigQuery (see BigQuery's
CSV requirements). For our example, we'll use the file with names for 2010.
Open the file in a text editor to see what it looks like; the file is a comma-separated value (CSV) file with the following three columns: name, sex ("M" or "F"), and number of children with that name. The file has no header row.
(Optional) Upload the data file to Google Cloud Storage
If you would like to load data from Google Cloud Storage rather than directly from your computer, you can upload your data to Google Cloud Storage as follows:
- Enable Google Cloud Storage for your project. You will upload your data file to Google Cloud Storage, and load it from there. To enable Cloud Storage:
- Open the Google APIs Console to your BigQuery project
- Enable Cloud Storage on the Services pane.
- Enable billing on the Billing pane. The example table in this exercise is small enough that if you disable billing in the project after running the exercise, you shouldn't be billed anything.
- Open the Cloud Storage Manager. From the Google APIs Console for your project, open the Google Cloud Storage pane, and click the "Google Cloud Storage Manager" link.
- Upload your file to Google Cloud Storage:
- In the Cloud Storage Manager, create a universally unique bucket name, for example: bigquerybucket
- Select the new bucket, and click Upload to upload your file.
Follow the rest of the steps to continue loading your data.
Step 2: Create a new dataset
You can see whether the project has any existing datasets that you can use, or you can create a new one. Try out the following bq commands to list datasets, and optionally create a new dataset:
List Datasets
Hide Copy Code
# Without modifiers, ls lists datasets in the default project:
$ bq ls
datasetId
-------------
olddataset
# Provide the project ID to list datasets in a specific project
$ bq ls 292244645228:
datasetId
-----------
workspace
# To get a list of all your projects, use the -p flag
$ bq ls -p
projectId friendlyName
------------------------------- ------------------------
588503209435 Main Project
38995755354 Example Project
292244645228 My Project
1099142187650 Project with billing
Create a new dataset
A dataset name can be up to 1,024 characters long, and consist of A-Z, a-z, 0-9, and the underscore, but it cannot start with a number or underscore, or have spaces.
Hide Copy Code
$ bq mk mydataset
Dataset dataset '1099142187650:mydataset' successfully created.
$ bq ls
datasetId
-------------
olddataset
mydataset
Step 3: Upload the table
The bq load
command creates or updates a table and loads data in a single step. The command has the following syntax:
Hide Copy Code
bq load [--source_format=NEWLINE_DELIMITED_JSON|CSV] destination_table data_source_uri table_schema
destination_table
- The fully-qualified table name of the table to create.
--source_format
- BigQuery accepts both CSV and JSON files. By default, if you do not explicitly specify the type of file, BigQuery expects a CSV file. If you are uploading a JSON file, you must provide the
--source_format=NEWLINE_DELIMITED_JSON
flag. Your source file and schema must also follow the proper JSON structure.
data_source
- The source CSV data file used to populate the table. Note that this can be one uncompressed local file or, alternatively, one or more fully-qualified Google Cloud Storage URIs referring to uncompressed or gzippedfiles, in the format
gs://bucket/file
. For example, the following are all valid file types: my_file.csv
, gs://my_file.csv.gz
, or my_file.csv.gz
. You can specify multiple source URIs using a comma-delimited list like so:
Hide Copy Code
gs:
Each file must have the same schema in order for them to be uploaded into one destination table.
Note: Currently, it is not possible to load more than one local file as a data source. You can only load multiple source URIs.
table_schema
- A description of the table schema to use. This can be either a file name or a comma-separated list of column_name:datatype pairs. We will use the comma-separated list for this example. Try using the following schema descriptor for your table:
name:string,gender:string,count:integer
where "name", "gender", and "count" are labels that are assigned to the columns in the new table.
If you would like, you can also specify your schema in a separate file and provide that file as the table schema. Your schema file must contain a single array object with entries that provide the following properties:
"name"
: Name of the column
"type"
: Type of data, e.g. string
"mode"
(optional): Whether this field can be null
A sample schema file might look like the following:
Hide Copy Code
[
{"name": "name", "type": "string", "mode": "required"},
{"name": "gender", "type": "string", "mode": "nullable"},
{"name": "count", "type": "integer", "mode": "required"}
]
Run the following command to load your source file into a new table called "names2010" in the dataset we created above. By default, this runs synchronously, and will take a few seconds for the load job to complete. To learn how to run bq commands asynchronously, see
Running Asynchronous Operations.
Hide Copy Code
$ bq load mydataset.names2010 yob2010.txt name:string,gender:string,count:integer
Waiting on job_4f0c0878f6184119abfdae05f5194e65 ... (35s) Current status: DONE
$ bq ls mydataset
tableId
---------------
names2010
$ bq show mydataset.names2010
tableId Last modified Schema
----------- ----------------- -------------------
names2010 25 Oct 17:40:53 |- name: string
|- gender: string
|- count: integer
By default, when you load data, BigQuery expects UTF-8 encoded data. If you have data that is in ISO-8859-1 (or Latin-1) encoding and are having problems with your loaded data, you can tell BigQuery to treat your data as Latin-1 explicitly, using the -E
flag:
bq load -E ISO-8859-1 mydataset.names2010 gs://bigquerybucket/yob2010.txt name:string,gender:string,count:integer
Step 5: Run queries
Now you can start querying your new table:
Hide Copy Code
# Most popular girls' names
$ bq query "SELECT name,count FROM mydataset.names2010 WHERE gender = 'F' ORDER BY count DESC LIMIT 5"
Waiting on job_58c0f5ca52764ef1902eba611b71c651 ... (0s) Current status: DONE
+----------+-------+
| name | COUNT |
+----------+-------+
| Isabella | 22731 |
| Sophia | 20477 |
| Emma | 17179 |
| Olivia | 16860 |
| Ava | 15300 |
+----------+-------+
# Most unusual boys' names
# Data does not include names with < 4 occurrences
$ bq query "SELECT name,count FROM mydataset.names2010 WHERE gender = 'M' ORDER BY count ASC LIMIT 5"
Waiting on job_556ba2e5aad340a7b2818c3e3280b7a3 ... (1s) Current status: DONE
+----------+-------+
| name | COUNT |
+----------+-------+
| Aarian | 5 |
| Aaidan | 5 |
| Aamarion | 5 |
| Aadhavan | 5 |
| Aaqib | 5 |
+----------+-------+
Character Encodings
By default, the BigQuery service expects all source data to be UTF-8 encoded. Optionally, if you have CSV files with data encoded in ISO-8859-1 format, you should explicitly specify the encoding when you import your data so that BigQuery can properly convert your data to UTF-8 during the import process. Currently, it is only possible to import data that is ISO-8859-1 or UTF-8 encoded. Keep in mind the following when you specify the character encoding of your data:
- If you don't specify an encoding, or explicitly specify that your data is UTF-8 but then provide a CSV file that is not UTF-8 encoded, BigQuery attempts to convert your CSV file to UTF-8.
Generally, your data will be imported successfully but may not match byte-for-byte what you expect. To avoid this, specify the correct encoding and try your import again.
- Delimiters must be encoded as ISO-8859-1.
Generally, it is best practice to use a standard delimiter, such as a tab, pipe, or comma.
- If BigQuery cannot convert a character, it is converted to the standard Unicode replacement character: �.
JSON files must always be encoded in UTF-8.
To specify encoding when you load data using the command-line tool, use the -E
flag followed by the encoding type. For example, the following command specifies that the file is ISO-8859-1 encoded:
Hide Copy Code
bq load -E ISO-8859-1 mydataset.mytable mydata.csv field1:string,field2:string
Next Steps