Loading Data into BigQuery

Loading data into BigQuery is pretty easy to do.
I like poking around the web looking for data. A couple of days ago, I found a trademark data set. It contains all trademark registrations from the 1950s through 2014.

BigQuery is a great, basically free place to analyze data. With their current model, you can analyze about 5 terabytes a day before you start incurring any charges. Most data that I look at is small relative to that, so playing with BigQuery is essentially free.
In this article, I’ll take you through the steps I took to load the trademark data into BigQuery. You can find the loaded tables here (if you have an account on cloud.google.com).

Download the CSV files

The first step was to download and unzip the data. It took a few hours to download, but after unzipping, I had these files:
-rw-rw-r--@ 1 ltabb staff 1.8G Mar 24 2015 case_file.csv
-rw-rw-r--@ 1 ltabb staff 514M Mar 19 2015 classification.csv
-rw-rw-r--@ 1 ltabb staff 781M Mar 19 2015 correspondent_domrep_attorney.csv
-rw-rw-r--@ 1 ltabb staff 71M Mar 19 2015 design_search.csv
-rw-rw-r--@ 1 ltabb staff 2.9G Apr 2 2015 event.csv
-rw-rw-r--@ 1 ltabb staff 38K Mar 19 2015 event_description.csv
-rw-rw-r--@ 1 ltabb staff 20M Mar 19 2015 foreign_app.csv
-rw-rw-r--@ 1 ltabb staff 180M Mar 19 2015 intl_class.csv
-rw-rw-r--@ 1 ltabb staff 14M Mar 19 2015 madrid_event.csv
-rw-rw-r--@ 1 ltabb staff 3.6K Mar 19 2015 madrid_event_description.csv
-rw-rw-r--@ 1 ltabb staff 4.3M Mar 19 2015 madrid_intl_file.csv
-rw-rw-r--@ 1 ltabb staff 1.7G Apr 2 2015 owner.csv
-rw-rw-r--@ 1 ltabb staff 20M Mar 19 2015 owner_name_change.csv
-rw-rw-r--@ 1 ltabb staff 40M Mar 19 2015 prior_mark.csv
-rw-rw-r--@ 1 ltabb staff 2.1G Mar 19 2015 statement.csv
-rw-rw-r--@ 1 ltabb staff 1.5K Mar 19 2015 tm_app_daily.csv
-rw-rw-r--@ 1 ltabb staff 592M Apr 2 2015 us_class.csv

Set Up For Using Google Cloud

First, you need to set up a project, install the Google Cloud SDK, and copy the files into a bucket.

Make a Project (if you don’t have one)

In Google Cloud, machines, databases, and services are contained in projects. Create a new project, or use an existing one, to hold your data. You can create a project at:
https://cloud.google.com/

Install the Google Cloud SDK on Your Local Machine

Google provides an SDK that runs on your local machine to move resources to and from the cloud and to control services.
I used the installation process described in the Google Cloud SDK documentation to install the SDK.
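Once the SDK is installed, you need to authenticate it and point it at your project. Something like the following should do it (the exact prompts vary by SDK version, and my-trademark-project is just a placeholder project ID):
gcloud init
Or, if you prefer the individual steps:
gcloud auth login
gcloud config set project my-trademark-project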

Create a Bucket in Google Storage and Use gsutil to Copy Files

Google Storage is basically an infinite disk drive in the cloud. When we installed the SDK, we also installed a program called gsutil that lets us copy files to and from Google Storage and manipulate them while they are there.
In Google Cloud Storage, directories are called ‘buckets’. We’re simply going to make a bucket and copy our files into it.
Make a bucket to store the trademark CSVs.
gsutil mb gs://lloyd_trademark
Copy all the csv files into the bucket.
gsutil cp *.csv gs://lloyd_trademark/
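A few of these files are several gigabytes, so if the upload is slow, gsutil’s -m flag should help by running the copies in parallel:
gsutil -m cp *.csv gs://lloyd_trademark/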

Creating a Dataset and Table Schemas

In BigQuery, a dataset is a set of tables. Next, we need to create a dataset and then create the tables inside it.
BigQuery has a great web-based console, which you can use to create tables from files in Google Storage. Unfortunately, it is an entirely manual process, loading the files one-by-one into BigQuery. If you are only loading a small amount of data, use the web interface. It is pretty easy.


We’ve got a bunch of CSV files, and we might get updates in the future, so I’m going to load the data using the Google SDK’s bq tool. bq lets us manipulate BigQuery from the command line.
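If you haven’t used bq before, its built-in help lists the available commands and the flags each one takes:
bq help
bq help load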

Make a Dataset

We’re going to use bq to make a dataset called ‘trademark’ to hold all the data.
bq mk trademark
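You can confirm the dataset was created by listing the datasets in your project:
bq ls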

Generate Schemas for the Tables from the Comma-Separated Value (CSV) Files

A ‘schema’ for a table is a description of its field names and types. BigQuery accepts schemas in a couple of formats; we are going to use JSON.
In most of the CSV files, the first line of the CSV contains the field names for each of the columns of data.
For example, the file owner.csv looks like:

own_addr_1,own_addr_2,own_addr_city,own_composed_of,own_addr_country_cd,own_altn_name,own_entity_desc,own_seq,own_entity_cd,own_nalty_country_cd,own_nalty_other_cd,own_nalty_state_cd,own_addr_other_cd,own_name,own_type_cd,own_addr_postal,own_addr_state_cd,own_id,serial_no
,501 FIFTH AVENUE,NEW YORK,,,,,1,3,,,NY,,AUTO ORDNANCE CORPORATION,10,90015,NY,1,60146682
,302 Broadway,New York,,,,,1,3,,,NY,,AUTO-ORDNANCE CORPORATION,10,90015,NY,2,60149828
2090 THORNTON ROAD,,FERNDALE,,,,,1,3,,,WA,,"SAMSON ROPE TECHNOLOGIES, INC.",43,98248,WA,3,70011210
...

I wrote a simple script that reads the first line of the CSV file, and creates a JSON schema that we can pass to BigQuery. The script makes sure that the field names follow BigQuery’s field naming rules.

bq_schema.rb

#!/usr/bin/env ruby
require 'csv'

# Read just the header line from the CSV file given on the command line.
fn = ARGV[0]
line = File.open(fn, &:readline)

# Write a BigQuery JSON schema next to the CSV, typing every field as STRING.
open("#{fn}.json", 'w') do |f|
  CSV.parse(line) do |row|
    names = []
    row.each do |column|
      # Field names must follow BigQuery's naming rules (letters, digits, underscores).
      name = column.gsub(/[^A-Za-z0-9_]/, '_').downcase
      names.push name
    end
    f.puts "["
    f.puts names.collect { |name| " {\"name\":\"#{name}\", \"type\":\"STRING\" }" }.join(",\n")
    f.puts "]"
  end
end
Make the script executable, then run it to create a .json schema file for each of the CSV files.
chmod +x bq_schema.rb
ls -w1 *.csv | xargs -n1 ./bq_schema.rb
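For example, the generated owner.csv.json should look something like this (truncated here; note that every field is typed as STRING):
[
 {"name":"own_addr_1", "type":"STRING" },
 {"name":"own_addr_2", "type":"STRING" },
 {"name":"own_addr_city", "type":"STRING" },
 ...
 {"name":"serial_no", "type":"STRING" }
]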

Loading Data

At this point, we have everything we need to load data. We’ve got the CSV files pushed to Google Storage. We also have schema definitions for each of the CSV files and a dataset to receive all the tables.
To load a single file, we have to tell BigQuery the target table, where the data is, and which schema to use. We’ll also want to skip the first line of each file. (That line is the header that we used to generate the schema definitions.) For example, to load the owner table:
bq load --source_format=CSV --skip_leading_rows=1 trademark.owner gs://lloyd_trademark/owner.csv owner.csv.json
Building on that bq command, we can load all of the tables at once:
ls -w1 *.csv | sed -e 's/\.csv$//' | xargs -IXX bq load --source_format=CSV --skip_leading_rows=1 trademark.XX gs://lloyd_trademark/XX.csv XX.csv.json
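To spot-check a table after loading, bq show prints its schema, row count, and size (owner here is just one example):
bq show trademark.owner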

And the Data is Loaded

Some final thoughts. We took some shortcuts here. We loaded all the data as ‘strings’ instead of trying to figure out what type the fields actually were. That isn’t as bad as it might sound on the surface. If you get a type failure when loading data, the loading stops. Rather than having to solve these problems on the load, we’ll fix all the data problems once we have the data loaded.
BigQuery is a distributed, scanning database, so converting between data types only costs compute resources, and probably not very much of them. We’re going to need to transform this data to make it useful anyway, so the first step is to just get it loaded. And it is.
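Since every column is a string, later queries can simply cast on the fly. Here’s a minimal sketch against the owner table (bq defaults to BigQuery’s legacy SQL dialect, which writes table references in square brackets):
bq query "SELECT own_name, CAST(serial_no AS INTEGER) AS serial_no FROM [trademark.owner] LIMIT 10"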
