Why was Hive introduced?
A few years ago, Hadoop came into existence to solve queries on huge structured, semi-structured, and unstructured datasets. Hadoop is considered one of the best solutions for storing and processing such datasets because of its scalability, low infrastructure cost (commodity hardware), data reliability (replication factor), and the MapReduce programming model.
In the beginning, big organizations like Facebook, which received huge volumes of raw data on a daily basis, wrote Python scripts to ingest that data into Oracle databases. As the amount of data kept growing, it was no longer practical to keep writing lengthy and complex programs. To handle and process these huge datasets, Facebook developed Hive, which offers a language similar to SQL, so programmers no longer need to write complex programs.
What is Hive?
Hive is a data warehouse solution built on top of Hadoop for managing huge amounts of data stored in HDFS. It provides a SQL-like query language called HiveQL. Hive is designed to allow easy data summarization, ad-hoc querying, and analysis of structured Big Data in a Hadoop cluster.
Hive uses HDFS for storage but keeps all the metadata about databases and tables in a separate metastore database, such as Derby or MySQL.
Hive vs SQL

| HIVE | SQL |
| --- | --- |
| Hive provides a SQL-like query language (HiveQL) built on top of MapReduce | According to ANSI, SQL is the standard language for RDBMS, used to communicate with databases |
| Used for analytical processing (OLAP) | Used for transactional processing (OLTP) as well as analytics |
| Data per query typically in PBs | Data per query typically in GBs |
| Faster when running analytics on huge datasets | Slower when running analytics on huge datasets |
| No normalization required | Supports normalization |
How to install Hive?
To install Hive, follow the procedure in the link below:
Hive Shell
Hive provides a default interface that allows users to run Hive commands. The CLI (Command Line Interface) is the default Hive shell service, which lets users work on Hive programs interactively.
The following are the commands used to start the Hive shell:
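For example, assuming Hive is installed and its bin directory is on your PATH (the exact path depends on your installation), the shell can be started like this:

$ hive
hive>

Once the hive> prompt appears, HiveQL commands can be typed directly into the shell.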
Creating a Database
A database is created before creating a table because a database in Hive is a namespace in which a table or a collection of tables is placed for Hive queries. In Hive, the terms database and schema mean the same thing.
The command to create a database is as follows:
hive> CREATE DATABASE EMPdb;
Listing Databases
The SHOW DATABASES command displays the list of databases so that the user can select the one to work with.
hive> SHOW DATABASES;
Using a Database
The command to select a database is USE. The USE command sets the mentioned database as the current one, so new tables are created inside it and its existing tables can be queried or altered.
hive> USE EMPdb;
Creating a Table
The command for creating a table is CREATE TABLE.
Before creating a table, a user must know the following details:
- The CREATE TABLE command allows the user to create a new table with user-defined attributes/columns.
- ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' – this clause informs Hive that the columns in the data file are separated by a tab character.
hive> CREATE TABLE emp(id INT, name STRING, sal FLOAT)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t';
List Tables
The SHOW TABLES command displays the list of tables present in the current database.

hive> SHOW TABLES;
Once the table names are displayed, the user can select a particular table and work on it. The user can also create a new table and start processing the required query.
Describe Schema of the Table
The DESCRIBE command is similar to the DESC command in SQL. It displays the table definition/metadata of the mentioned table, so users can see the column names and the data type associated with each column.

hive> DESCRIBE emp;
Load a File from the Local File System
We know that Hadoop is designed to process entire huge datasets rather than a few selected rows or columns. So the user loads the whole file and processes it using Hive commands.
The command used to load a file from the local file system is as follows:

hive> LOAD DATA LOCAL INPATH '<filename>' INTO TABLE <tablename>;
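As a concrete sketch, assuming a tab-separated file at /home/user/emp.txt (a hypothetical path) with one row per employee, the emp table created earlier could be loaded like this:

hive> LOAD DATA LOCAL INPATH '/home/user/emp.txt' INTO TABLE emp;

With the LOCAL keyword, Hive copies the file from the local file system into the table's directory in HDFS.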
Load File from HDFS
The command used for loading a file that is already in HDFS is LOAD DATA, without the LOCAL keyword. Note that in this case Hive moves the file from its original HDFS location into the table's directory.

hive> LOAD DATA INPATH '<filename>' INTO TABLE <tablename>;
Show Table Contents
Once the selected file is loaded, the contents of the table can be checked using the SELECT statement. The SELECT * statement displays all the columns present in that table.
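For example, the contents of the emp table created earlier could be checked like this:

hive> SELECT * FROM emp;

To display only selected columns, list them in place of *, for example SELECT id, name FROM emp;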
Commonly Used Alter Commands
Some of the commonly used alter commands in Hive are as follows:
Renaming the Current Table:
hive> ALTER TABLE EMP RENAME TO EMP_TABLE;
Adding New Columns to an Existing Table:
hive> ALTER TABLE EMP_TABLE ADD COLUMNS (YOJ DATE);
Truncating a Table
The TRUNCATE TABLE command allows users to delete only the rows in the table, not the table schema. After running the command, the table schema is unaffected but the contents of the table have been deleted. This is how you can truncate table contents without affecting the table schema.
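For example, the EMP_TABLE from the rename example above could be truncated like this:

hive> TRUNCATE TABLE EMP_TABLE;

A subsequent SELECT * on the table would return no rows, while DESCRIBE would still show its columns.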
Dropping a Database
The command to drop a database is DROP DATABASE. The DROP DATABASE database_name statement allows users to remove a particular database.
hive> DROP DATABASE EMPdb;
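Note that, by default, Hive refuses to drop a database that still contains tables (RESTRICT behavior). Adding the CASCADE keyword drops the tables first:

hive> DROP DATABASE EMPdb CASCADE;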
You can send us your queries and feedback in the comment section below