Kiran Kumar Vasadi Google Cloud Certified Professional Data Engineer & Cloud Architect: HDInsight Hadoop Hive

Let’s explore Hadoop Hive, shall we?

Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.

It was originally build by Facebook as an abstraction on top of Hadoop Map Reduce and now is an open source (Apache) project.

As with other Hadoop main components, you have a few options to work with it:

You can download it and install the cluster on a server farm on premise ; good luck with that, I’m sure it’s going to be a nice learning experience.
You can use HDInsight on-premise which is a port on Windows done in collaboration between Hortonworks & Microsoft.
You can use Azure HDInsight

Yes, if you read anything in this blog, you know I’m gona go with the Azure option!

I’ll give two major things to Azure HDInsight:

It performs all the cluster installation / maintenance for you
It (mostly) shields you from all the Unixy / 3000 open projects mix that is Hadoop

And for that I am grateful! When I want to use Hive I don’t want to fight with Bash shells, I want to run some SQL over Big Data.

Hadoop

Just before we start… what is Hive in a nutshell?

Hive allows you to query massive data sets (Big Data) without hand coding Java Map Reduce into Hadoop.

So it’s an abstraction layer and a productivity enhancer.

Setting it up

Alright, Azure Portal (preview) everyone. Yeah! Give the preview portal some love!

SQL Database

What? Yeah… there’s the thing…

Here I’ll show you how to manage an Hadoop / Hive cluster on the cheap. One of the nice feature of Azure HDInsight is to externalize the Hive metastore (where the schemas of your tables will be stored) from the cluster.

This will allow you to tear down the cluster without losing your Hive Metadata.

By default the metastore is persisted on a Derby database (yes, chances are you never heard about that, remember what I told you about Azure HDInsight shielding you from all this Open Source Techs?) which is hosted on the cluster.

The alternative is to host it on an Azure SQL Database, outside the cluster.

So let’s create a DB.

Top left corner, (+ NEW), Data + Storage, SQL Database: this opens up the New SQL Database blade.

Go for it. Just be careful of a few things, namely:

Make sure it’s in the right subscription
Make sure it’s in the region (e.g. East US) you want your cluster to be
Make sure you tick “Allow azure services to access server” to let Hadoop access it

HDInsight

Now the HDInsight cluster!

Top left corner, (+ NEW), Data + Analytics, HDInsight: this opens up the New HDInsight Cluster blade.

Cluster Name: choose one, needs to be unique within the universe of HDInsight, hence I won’t give you mine, you would clash with it.

Cluster Type: Hadoop.

Cluster Operating System: leave it to Windows Server.

Make sure it’s in the right subscription and the right resource group.

Click Credentials, that’s going to pull the credentials blade.

Those are the credentials to authenticate to the cluster. We’re just toying around, so don’t go overboard. You do not need to enable the remote desktop for this.

Data Source

Ok, Data Source. This will open the… Data Source blade!

One of the differentiator of Azure HDInsight compared to other Hadoop implementation is that the distributed file system (HDFS) is implemented on top of Azure Blob Storage. This is commonly known as WASB: Windows Azure Storage Blob.

To make a story short:

Hadoop comes with HDFS, Hadoop File System, a distributed File System
Files on HDFS are spread and replicated on many data nodes of your clusters
This means files in HDFS belongs to your cluster and can’t exist without it
WASB is an implementation of HDFS that simply forwards to Blob Storage which is already a distributed / replicated file system
WASB allows your files to exist without your cluster

The last point is key. It means you can tear your cluster down and you keep your file. You can later on recreate a cluster and continue where you were.

So… I suggest you create a storage account just for your Hive discovery.

Again, give it a unique name. You can choose your default container which will itself default to the name of your cluster.

Node Pricing Tier

Again, we’re in discovery. Let’s go easy on it: one worker node and you can drop the type of VMs for the worker & header nodes to A3 (the cheapest).

Optional Configuration

In optional configuration, let’s select external metastores and select the SQL Database we created earlier.

Only do that for the Hive metastore (we won’t use Oozie today). You can give your server admin password for today but DO NOT DO THAT in production. In production, of course, create a separate user that has rights only on the specific db.

Create

Finally, hit the create button! This will churn away for a little while and you’ll have your cluster ready soon!

Key take aways

As key take aways, there are two pieces of your Azure HDInsight configuration that can survive the deletion of your cluster:

HDFS content (via a storage account)
Hive Metastore (via an Azure SQL Database)

This makes Azure HDInsight a component you can prop up to do heavy computation and tear down until next time.

You could even go a step further and backup your database into a .bacpac and tear down the SQL Database until next use. This way you would end up paying only for storage which is dirt cheap compare to both SQL and Hadoop.

Kiran Kumar Vasadi Google Cloud Certified Professional Data Engineer & Cloud Architect

HDInsight Hadoop Hive – Setup