Loading Data into BigQuery

Loading data into BigQuery is pretty easy to do.
I like poking around the web looking for data. A couple of days ago, I found a trademark data set. It contains all trademark registrations from the 1950s through 2014.

BigQuery is a great, basically free place to analyze data. With their current model, you can analyze about 5 terabytes a day before you start incurring any charges. Most data that I look at is small relative to that, so playing with BigQuery is essentially free.
In this article, I’ll take you through the steps I took to load the trademark data into BigQuery. You can find the loaded tables here (if you have an account on cloud.google.com).

Download the CSV files

The first step was to download and unzip the data. It took a few hours to download, but after unzipping, I had these files:
-rw-rw-r--@ 1 ltabb staff 1.8G Mar 24 2015 case_file.csv
-rw-rw-r--@ 1 ltabb staff 514M Mar 19 2015 classification.csv
-rw-rw-r--@ 1 ltabb staff 781M Mar 19 2015 correspondent_domrep_attorney.csv
-rw-rw-r--@ 1 ltabb staff 71M Mar 19 2015 design_search.csv
-rw-rw-r--@ 1 ltabb staff 2.9G Apr 2 2015 event.csv
-rw-rw-r--@ 1 ltabb staff 38K Mar 19 2015 event_description.csv
-rw-rw-r--@ 1 ltabb staff 20M Mar 19 2015 foreign_app.csv
-rw-rw-r--@ 1 ltabb staff 180M Mar 19 2015 intl_class.csv
-rw-rw-r--@ 1 ltabb staff 14M Mar 19 2015 madrid_event.csv
-rw-rw-r--@ 1 ltabb staff 3.6K Mar 19 2015 madrid_event_description.csv
-rw-rw-r--@ 1 ltabb staff 4.3M Mar 19 2015 madrid_intl_file.csv
-rw-rw-r--@ 1 ltabb staff 1.7G Apr 2 2015 owner.csv
-rw-rw-r--@ 1 ltabb staff 20M Mar 19 2015 owner_name_change.csv
-rw-rw-r--@ 1 ltabb staff 40M Mar 19 2015 prior_mark.csv
-rw-rw-r--@ 1 ltabb staff 2.1G Mar 19 2015 statement.csv
-rw-rw-r--@ 1 ltabb staff 1.5K Mar 19 2015 tm_app_daily.csv
-rw-rw-r--@ 1 ltabb staff 592M Apr 2 2015 us_class.csv

Set Up For Using Google Cloud

First, you need to set up a project, install the Google Cloud SDK, and copy the files into a bucket.

Make a Project (if you don’t have one)

In Google Cloud, machines, databases, and services are contained in projects. Create a new project, or use an existing one, to hold your data. You can create a project at:
https://cloud.google.com/

Install the Google Cloud SDK on Your Local Machine

Google provides an SDK that runs on your local machine to move resources to and from the cloud and to control services.
I used the installation process described in the Google Cloud SDK documentation to install the SDK.
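Once the SDK is installed, you need to authenticate it and point it at your project. Something like the following should do it (the exact prompts vary by SDK version, and my-trademark-project is just a placeholder project ID):
gcloud init
Or, if you prefer the individual steps:
gcloud auth login
gcloud config set project my-trademark-project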

Create a Bucket in Google Storage and Use gsutil to Copy Files

Google Storage is basically an infinite disk drive in the cloud. When we installed the SDK, we also installed a program called gsutil that lets us copy files to and from Google Storage and manipulate them while they are there.
In Google Cloud Storage, directories are called ‘buckets’. We’re simply going to make a bucket and copy our files into it.
Make a bucket to store the trademark CSVs.
gsutil mb gs://lloyd_trademark
Copy all the csv files into the bucket.
gsutil cp *.csv gs://lloyd_trademark/
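A few of these files are several gigabytes, so if the upload is slow, gsutil’s -m flag should help by running the copies in parallel:
gsutil -m cp *.csv gs://lloyd_trademark/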

Creating a Dataset and Table Schemas

In BigQuery, a dataset is a set of tables. Next, we need to create a dataset and then create the tables inside it.
BigQuery has a great web-based console, which you can use to create tables from files in Google Storage. Unfortunately, it is an entirely manual process, loading the files one-by-one into BigQuery. If you are only loading a small amount of data, use the web interface. It is pretty easy.


We’ve got a bunch of CSV files, and we might get updates in the future, so I’m going to load the data using the Google SDK’s bq tool. bq lets us manipulate BigQuery from the command line.
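If you haven’t used bq before, its built-in help lists the available commands and the flags each one takes:
bq help
bq help load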

Make a Dataset

We’re going to use bq to make a dataset called ‘trademark’ to hold all the data.
bq mk trademark
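You can confirm the dataset was created by listing the datasets in your project:
bq ls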

Generate Schemas for the Tables from the Comma-Separated Value (CSV) Files

A ‘schema’ for a table is a description of its field names and types. BigQuery accepts schemas in a couple of formats; we are going to use JSON.
In most of the CSV files, the first line of the CSV contains the field names for each of the columns of data.
For example, the file owner.csv looks like:

own_addr_1,own_addr_2,own_addr_city,own_composed_of,own_addr_country_cd,own_altn_name,own_entity_desc,own_seq,own_entity_cd,own_nalty_country_cd,own_nalty_other_cd,own_nalty_state_cd,own_addr_other_cd,own_name,own_type_cd,own_addr_postal,own_addr_state_cd,own_id,serial_no
,501 FIFTH AVENUE,NEW YORK,,,,,1,3,,,NY,,AUTO ORDNANCE CORPORATION,10,90015,NY,1,60146682
,302 Broadway,New York,,,,,1,3,,,NY,,AUTO-ORDNANCE CORPORATION,10,90015,NY,2,60149828
2090 THORNTON ROAD,,FERNDALE,,,,,1,3,,,WA,,"SAMSON ROPE TECHNOLOGIES, INC.",43,98248,WA,3,70011210
...

I wrote a simple script that reads the first line of the CSV file, and creates a JSON schema that we can pass to BigQuery. The script makes sure that the field names follow BigQuery’s field naming rules.

bq_schema.rb

#!/usr/bin/env ruby
require 'csv'

# Read just the header line from the CSV file given on the command line.
fn = ARGV[0]
line = File.open(fn, &:readline)

# Write a BigQuery JSON schema next to the CSV, typing every field as STRING.
open("#{fn}.json", 'w') do |f|
  CSV.parse(line) do |row|
    names = []
    row.each do |column|
      # Field names must follow BigQuery's naming rules (letters, digits, underscores).
      name = column.gsub(/[^A-Za-z0-9_]/, '_').downcase
      names.push name
    end
    f.puts "["
    f.puts names.collect { |name| " {\"name\":\"#{name}\", \"type\":\"STRING\" }" }.join(",\n")
    f.puts "]"
  end
end
Make the script executable, then run it to create a .json schema file for each of the CSV files.
chmod +x bq_schema.rb
ls -w1 *.csv | xargs -n1 ./bq_schema.rb
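For example, the generated owner.csv.json should look something like this (truncated here; note that every field is typed as STRING):
[
 {"name":"own_addr_1", "type":"STRING" },
 {"name":"own_addr_2", "type":"STRING" },
 {"name":"own_addr_city", "type":"STRING" },
 ...
 {"name":"serial_no", "type":"STRING" }
]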

Loading Data

At this point, we have everything we need to load data. We’ve got the CSV files pushed to Google Storage. We also have schema definitions for each of the CSV files and a dataset to receive all the tables.
To load a single file, we have to tell BigQuery the target table, where the data is, and which schema to use. We’ll also want to skip the first line of each file. (That line is the header that we used to generate the schema definitions.) For example, to load the owner table:
bq load --source_format=CSV --skip_leading_rows=1 trademark.owner gs://lloyd_trademark/owner.csv owner.csv.json
Building on that bq command, we can load all of the tables at once:
ls -w1 *.csv | sed -e 's/\.csv$//' | xargs -IXX bq load --source_format=CSV --skip_leading_rows=1 trademark.XX gs://lloyd_trademark/XX.csv XX.csv.json
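To spot-check a table after loading, bq show prints its schema, row count, and size (owner here is just one example):
bq show trademark.owner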

And the Data is Loaded

Some final thoughts. We took some shortcuts here. We loaded all the data as ‘strings’ instead of trying to figure out what type the fields actually were. That isn’t as bad as it might sound on the surface. If you get a type failure when loading data, the loading stops. Rather than having to solve these problems on the load, we’ll fix all the data problems once we have the data loaded.
BigQuery is a distributed, scanning database, so converting between data types only costs compute resources, and probably not very much of them. We’re going to need to transform this data to make it useful anyway, so the first step is to just get it loaded. And it is.
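Since every column is a string, later queries can simply cast on the fly. Here’s a minimal sketch against the owner table (bq defaults to BigQuery’s legacy SQL dialect, which writes table references in square brackets):
bq query "SELECT own_name, CAST(serial_no AS INTEGER) AS serial_no FROM [trademark.owner] LIMIT 10"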
