
Monday, 5 November 2018

Optimizing BigQuery: Cluster your tables

BigQuery just announced the ability to cluster tables — which I’ll describe here. If you are looking for massive savings in costs and querying times, this post is for you.

Warning: This post plays with several terabytes of data. Set up your BigQuery cost controls, buckle up, keep calm, and query on.
Can you spot the huge difference between the query on the left and the one on the right?

Same query over almost the same tables — one is clustered, the other is not. Faster and more efficient.
What we can see here:
  • Kubernetes was a popular topic during 2017 in Wikipedia — with more than 400,000 views among the English Wikipedia sites.
  • Kubernetes gets far fewer pageviews on the mobile Wikipedia site than on the traditional desktop one (90k vs. 343k).
But what’s really interesting:
  • The query on the left was able to process 2.2 TB in 20 seconds. Impressive, but costly.
  • The query on the right went over a similar table — but it got the results in only 5.4 seconds, and for one tenth of the cost (227 GB).
What’s the secret?

Clustered tables!

Now you can tell BigQuery to store your data “sorted” by certain fields — and when your queries filter over these fields, BigQuery will be smart enough to only open the matching clusters. In other words, you’ll get faster and cheaper results.
Previously I explained my lazy loading of Wikipedia views into partitioned yearly BigQuery tables. As a reminder, just for 2017 we will be querying more than 190 billion pageviews across 3,927 Wikipedia sites, stored in more than 54 billion rows (2.2 TB of data for the year).

To move my existing data into a newly clustered table I can use some DDL:
CREATE TABLE `fh-bigquery.wikipedia_v3.pageviews_2017`
PARTITION BY DATE(datehour)
CLUSTER BY wiki, title
OPTIONS(
   description="Wikipedia pageviews - partitioned by day, clustered by (wiki, title). Contact https://twitter.com/felipehoffa"
   , require_partition_filter=true
)
AS SELECT * FROM `fh-bigquery.wikipedia_v2.pageviews_2017`
WHERE datehour > '1990-01-01' # nag: always true, but satisfies the partition filter requirement
-- 4724.8s elapsed, 2.20 TB processed
Note these options:
  • CLUSTER BY wiki, title: Whenever people query using the wiki column, BigQuery will optimize these queries. These queries will be optimized even further if the user also filters by title. If the user only filters by title, clustering won’t work, as the order is important (think boxes inside boxes).
  • require_partition_filter=true: This option reminds my users to always add a date filtering clause to their queries. That’s how I remind them that their queries could be cheaper if they only query through a fraction of the year.
  • 4724.8s elapsed, 2.20 TB processed: It took a while to re-materialize and cluster 2.2 TB of data, but the future savings in querying make up for it. Note that you could run the same operation for free with a bq export/load combination, but I like the flexibility that DDL provides. Within the SELECT * FROM I could have filtered and transformed my data, or added more columns just for clustering purposes (a quick sketch of that idea follows this list).
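For example, a minimal sketch of that last idea, with a hypothetical extra clustering column (both the table name pageviews_2017_lower and the column title_lower are made up for illustration):
CREATE TABLE `fh-bigquery.wikipedia_v3.pageviews_2017_lower`
PARTITION BY DATE(datehour)
CLUSTER BY wiki, title_lower
AS
SELECT *, LOWER(title) AS title_lower  -- extra column added only for clustering
FROM `fh-bigquery.wikipedia_v2.pageviews_2017`
WHERE datehour > '1990-01-01'  -- same partition filter "nag" as above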

Better clustered: Let’s test some queries

SELECT * LIMIT 1

SELECT * LIMIT 1 is a known BigQuery anti-pattern. You get charged for a full table scan, while you could have received the same results for free with a table preview. Let’s try it on our v2 table:
SELECT * 
FROM `fh-bigquery.wikipedia_v2.pageviews_2017`
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
LIMIT 1
1.7s elapsed, 180 GB processed
Ouch. 180 GB processed — but at least the date partitioning limited the scan to June. Now the same query on v3, our clustered table:
SELECT * 
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
LIMIT 1
1.8s elapsed, 112 MB processed
Yes!!! With a clustered table we can now do a SELECT * LIMIT 1, and it only scanned 0.06% of the data (in this case).

Base query: Barcelona, English, 180 GB -> 10 GB

SELECT wiki, SUM(views) views
FROM `fh-bigquery.wikipedia_v2.pageviews_2017` 
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki LIKE 'en'
AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki
18.1s elapsed, 180 GB processed
Any query similar to this will incur costs of 180 GB on our v2 table. Meanwhile in v3:
SELECT wiki, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017` 
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki LIKE 'en'
AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki
3.5s elapsed, 10.3 GB processed
On a clustered table we get the results in about 1/5th of the time, for roughly 6% of the cost.

Don’t forget the cluster order

SELECT wiki, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017` 
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
-- AND wiki = 'en'
AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki
3.7s elapsed, 114 GB processed
If I don’t make use of my clustering strategy (in this case, filtering first by wiki), I won’t get all the possible efficiencies — but it’s still better than before.

Smaller clusters, less data

The English Wikipedia has way more data than the Albanian one. Check how the query behaves when we look only at the Albanian one (wiki = 'sq'):
SELECT wiki, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017` 
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki = 'sq'
AND title = 'Barcelona'
GROUP BY wiki ORDER BY wiki
2.6s elapsed, 3.83 GB processed

Clusters can LIKE and REGEX

Let’s query for all the wikis that start with r, and for the titles that match Bar.*na:
SELECT wiki, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017` 
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki LIKE 'r%'
AND REGEXP_CONTAINS(title, '^Bar.*na')
GROUP BY wiki ORDER BY wiki
4.8s elapsed, 14.3 GB processed
That’s nice behavior — clusters can work with LIKE and REGEX expressions, but only when looking for prefixes.
If I look for suffixes instead:
SELECT wiki, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017` 
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
AND wiki LIKE '%m'
AND REGEXP_CONTAINS(title, '.*celona')
GROUP BY wiki ORDER BY wiki
5.1s elapsed, 180 GB processed
That takes us back to 180 GB — this query is not using clusters at all.

JOINs and GROUP BYs

This query will scan all data within the time range, regardless of what table I use:
SELECT wiki, title, SUM(views) views
FROM `fh-bigquery.wikipedia_v2.pageviews_2017` 
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
GROUP BY wiki, title
ORDER BY views DESC
LIMIT 10
64.8s elapsed, 180 GB processed
That’s a pretty impressive query — we are grouping through 185 million combinations of (wiki, title) and picking the top 10. Not a simple task for most databases, but here it looks easy. Can clustered tables improve on this?
SELECT wiki, title, SUM(views) views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017` 
WHERE DATE(datehour) BETWEEN '2017-06-01' AND '2017-06-30'
GROUP BY wiki, title
ORDER BY views DESC
LIMIT 10
22.1s elapsed, 180 GB processed
As you can see here, with v3 the query took one third of the time: because the data is clustered, way less shuffling is needed to group all the combinations before ranking them.

FAQ

My data can’t be date partitioned, how do I use clustering?

Two alternatives:
1. Use an ingestion-time partitioned table with clustering on the fields of interest. This is the preferred mechanism if you have more than ~10 GB of data per day.
2. If you have smaller amounts of data per day, use a column-partitioned table with clustering, partitioned on a “fake” nullable DATE column. Just use the value NULL for it (or leave it unspecified, and BigQuery will assume it is NULL), and specify the clustering columns of interest. Minimal sketches of both options follow below.
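These sketches assume made-up project, dataset, table, and column names (my_project, my_dataset, events, user_id, event_type); adapt them to your own schema:
-- Alternative 1: ingestion-time partitioned table, clustered on the fields of interest
CREATE TABLE `my_project.my_dataset.events_by_ingestion`
(
  user_id STRING,
  event_type STRING,
  payload STRING
)
PARTITION BY DATE(_PARTITIONTIME)
CLUSTER BY user_id, event_type;

-- Alternative 2: a "fake" DATE column that is always NULL, used only so the table
-- can be partitioned; every row lands in the NULL partition, and clustering still applies
CREATE TABLE `my_project.my_dataset.events_clustered`
PARTITION BY fake_date
CLUSTER BY user_id, event_type
AS
SELECT *, CAST(NULL AS DATE) AS fake_date
FROM `my_project.my_dataset.events_source`;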

What we learned today

  • Cluster your tables — it’s free and it can optimize many of your queries.
  • The main cost of clustering is longer load times, but it’s worth it.
  • Make a sensible decision about which columns you’ll cluster by and in what order. Think about how people will query your tables.
  • Clustered tables are in beta — check the docs for limitations and updates.
  • Engineer’s tip: I’ve heard that for now it’s better if you load data into your clustered tables once per day. Following that tip, I’ll keep loading the hourly updates into v2, and insert the results into v3 once the day is over (a minimal sketch of that daily copy follows this list). From the docs: “Over time, as more and more operations modify a table, the degree to which the data is sorted begins to weaken, and the table becomes partially sorted”.
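Something like this, assuming the completed day is appended in a single statement (the date below is only an example):
-- Append one completed day from the hourly-loaded v2 table into the clustered v3 table
INSERT INTO `fh-bigquery.wikipedia_v3.pageviews_2017`
SELECT *
FROM `fh-bigquery.wikipedia_v2.pageviews_2017`
WHERE DATE(datehour) = '2017-12-31'  -- example: the day that just finished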

Next steps

Clustered tables are only one of many new BigQuery capabilities: check out GIS support (geographical queries) and BQML (machine learning within BigQuery).
Want more stories? Check my Medium, follow me on Twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.

Tuesday, 9 October 2018

Getting started with Google BigQuery by Kiran Vasadi


  • Fully managed structured data store, queryable with SQL
  • Very easy to use
  • Fast: almost no slowdown even with big data
  • Cost-effective
  • Built on Google's infrastructure components
  • Basic operations are available in the web UI
  • You can "dry run" a query from the web UI to check the amount of data it would scan (see the example query below)
  • You can use the command line interface (bq) to integrate BigQuery into your workflow
  • APIs are provided for Python and Java
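For instance, a dry run of a query like the following (reusing the public Wikipedia pageviews table from the post above; any query of your own works the same way) reports the bytes it would scan without actually running it or incurring cost:
SELECT wiki, SUM(views) AS views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE DATE(datehour) = '2017-06-01'
GROUP BY wiki
ORDER BY views DESC
LIMIT 10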




What you pay for


  • Storage: $0.020 per GB per month
  • Queries: $5 per TB processed (scanned)
  • Streaming inserts: $0.01 per 100,000 rows until July 20, 2015; after July 20, 2015, $0.01 per 200 MB, with individual rows calculated using a 1 KB minimum size

A Simple Example

       Load 1 TB of data into a new table every day, and keep each table for a month.

       Query the daily data 5 times every day for aggregation.

       For storage:

       30 tables × 1 TB = 30 TB stored; 30,000 GB × $0.020/GB = $600 per month

       For queries:

       1 TB × 5 queries × 30 days = 150 TB scanned; 150 TB × $5/TB = $750 per month

How your data is stored

  • Your data is stored:
           1. across thousands of disks (depending on its size)

           2. in columnar format (see the query sketch below)

           3. compressed
               (however, the cost is based on the uncompressed size)
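Because storage is columnar, a query is billed only for the columns it actually reads. As a rough illustration (reusing the Wikipedia table from the post above; exact byte counts will differ), this query touches only the datehour, title, and views columns, so it scans far fewer bytes than a SELECT * over the same rows:
SELECT title, SUM(views) AS views
FROM `fh-bigquery.wikipedia_v3.pageviews_2017`
WHERE DATE(datehour) = '2017-06-01'
GROUP BY title
ORDER BY views DESC
LIMIT 10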


  • Record: a collection of one or more other fields
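In Standard SQL a record surfaces as a STRUCT, and its sub-fields are read with dot notation. A tiny self-contained sketch (the person, name, and age fields are made up; no table is needed):
WITH people AS (
  SELECT STRUCT('Alice' AS name, 30 AS age) AS person
  UNION ALL
  SELECT STRUCT('Bob' AS name, 25 AS age)
)
SELECT person.name, person.age  -- dot notation reaches into the record
FROM people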

Regards,
Kiran Kumar Vasadi
Architect, Big Data & Cloud
Mobile: +918977081119
Skype ID: kiranvasadi

Google Cloud Certified Professional Data Engineer

Saturday, 22 September 2018

Google Cloud Architect Exam Study Materials


After recently completing the Google Cloud Architect certification, I wanted to share the preparation materials that I used. Because the exam is so new, one challenge is that there is not the same abundance of preparation material as there is for other cloud exams. The official Exam Guide leaves a bit to be desired and there is no official Practice Exam at the current time, so I hope this material is helpful for folks preparing.

I prepared through a combination of methods outlined below (in addition to real-world GCP usage):
  • Official GCP Documentation
  • Google Cloud Next ’17 Conference Videos
  • Coursera
  • Linux Academy
This is probably a little overkill, but none of the aforementioned alone went into the depth I had hoped for in all areas. The answer was to use the combination, and to skip areas in each that were redundant.
Official Documentation
This is just the official Google Cloud Platform documentation: https://cloud.google.com/docs/.
Side-note: One thing I liked about the exam is that I didn’t feel like any of the questions were point-in-time questions (i.e. what did this feature do when the exam was released versus what it may do now, as of July 2017). As a result, you can prepare well by reviewing the most up-to-date documentation without fear that your knowledge will be too up to date. That’s kind of a silly thing to say, but other vendors’ exams do have questions you must answer based on a previous point in time, even if the version of the exam and the product still match.
Most of these documents are overviews or FAQs. You may want to branch off of them into deeper areas, but I felt the product overviews and FAQs were solid. I reviewed the product documentation last of all the study material I used, though, and mostly used it to fill gaps in my knowledge; if you review this material first, you may want to go deeper.

  • Overview
  • Compute Engine
  • App Engine
  • Container Engine
  • Storage Decisions
  • Cloud Storage
  • Spanner
  • Cloud SQL
  • Bigtable
  • Datastore
  • Transfer Service
  • Networking
  • Stackdriver Logging
  • Stackdriver Monitoring
  • Stackdriver Error Reporting
  • Stackdriver Trace
  • Stackdriver Debugger
  • Endpoints
  • Security Scanner
  • Identity and Security
  • Identity-Aware Proxy
  • KMS
  • Developer Tools
  • Resource Management
  • Deployment Manager
  • DataProc
  • DataFlow
  • Pub/Sub

Google Cloud Next Sessions

The sessions from Google Cloud Next are on YouTube now (217 of them), and if there are any areas where you feel you’d like a little more depth (e.g. App Engine, Cloud Storage, Datastore, Stackdriver), these sessions can be helpful. They’re also just all really good sessions in general. Even if you’re not preparing for the exam, I’d recommend watching as many of the videos as you can. I like to watch them at 2.0x speed for maximum productivity.

Coursera

The Coursera courses are made available by Google to partners but are open to anyone. They have a couple of course options; if you are a current AWS Certified Architect Professional there is a course based on that, which is what I took. It compares GCP products with AWS products and is slimmed down from the other option.
I thought it was a valuable course, but the lecture videos were fairly short. Total lecture time was 121 minutes, though there are also quizzes and labs. The non-AWS-specific course appears to be a little longer.

Linux Academy

I subscribe to Linux Academy because they have a ton of great courses in general even outside of GCP, but they also have several Google Cloud Platform courses available. Honestly, I only skimmed through the course because I had done all of the other preparation prior, but the material seems solid.

Final Thoughts

Overall, I thought the exam was done well. It’s not too long, and the questions are good Architect-level questions: it’s about being able to architect solutions, not necessarily memorizing every low-level command. With that said, a few final thoughts as you prepare for the exam:
  • As the official exam guide notes, there are Case Studies as part of the exam. I recommend preparing yourself by reading those and figuring out how to define the requirements in each and how those requirements match up to GCP services.
  • Know this decision tree well, and understand when to use what:
[Image: decision tree]
  • Don’t forget about services like Dataflow, Dataproc, Pub/Sub, and other Big Data concepts.
  • As with similar exams, there will be several good options and one best option. Make sure to really dig into each question to understand the specific requirements that will help you determine which is best. You really need to know when to use X and when to use Y, even if Y would kind of work.
  • Get into Google Cloud Platform and build something! There’s nothing better than real-world experience, and Google makes it really easy to get in and use the platform for free.