Apache Hive Architecture & Components
by KiranVasadi

Contents
1. Objective
2. What is Hive?
3. Hadoop Hive Architecture and its Components
3.1. Hive Clients
3.2. Hive Services
4. How to process data with Apache Hive?
5. Conclusion

1. Objective

In our previous blog, we discussed what Apache Hive is in detail. Now we are going to discuss the architecture of Apache Hive and cover the different components that make it up. Finally, this Hive Architecture tutorial will walk through the steps a query follows when data is processed in Apache Hive.

2. What is Hive?
Apache Hive is an ETL and data warehousing tool built on top of Hadoop. It makes it easy to perform operations such as:
  • Analysis of huge datasets
  • Ad-hoc queries
  • Data encapsulation

3. Hadoop Hive Architecture and its Components
The diagram below describes the architecture of Hive and its components. It also shows the flow in which a query is submitted to Hive and finally processed using the MapReduce framework:

[Diagram: Apache Hive architecture and query flow]
The above diagram shows the major components of Apache Hive:

  • Hive Clients – Apache Hive supports applications written in languages such as C++, Java, and Python through its JDBC, Thrift, and ODBC drivers. One can therefore easily write a Hive client application in the language of their choice.
  • Hive Services – Hive provides various services, such as the web interface and the CLI, for submitting queries.
  • Processing Framework and Resource Management – Hive internally uses the Hadoop MapReduce framework to execute queries.
  • Distributed Storage – Since Hive is built on top of Hadoop, it uses the underlying HDFS for distributed storage.

Now let us discuss the Hive clients and Hive services in detail.

3.1. Hive Clients
Hive supports different types of client applications for performing queries. These clients fall into three types:

Thrift Clients – Since the Apache Hive server is based on Thrift, it can serve requests from any language that supports Thrift.

JDBC Clients – Apache Hive allows Java applications to connect to it using the JDBC driver, which is defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.

ODBC Clients – The ODBC driver allows applications that support the ODBC protocol to connect to Hive. Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
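As a concrete illustration of a JDBC client, here is a minimal sketch. It assumes a HiveServer2 instance on localhost:10000, a hiveuser account, and a hypothetical employees table; with HiveServer2 the driver class is org.apache.hive.jdbc.HiveDriver and the URL scheme is jdbc:hive2://.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveJdbcClient {
        public static void main(String[] args) throws Exception {
            // Register the Hive JDBC driver (the hive-jdbc jar must be on the classpath).
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            // Connect to a hypothetical HiveServer2 instance on its default port.
            try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hiveuser", "");
                 Statement stmt = con.createStatement();
                 // Run an ad-hoc HiveQL query; the server compiles it into MapReduce jobs.
                 ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM employees")) {
                while (rs.next()) {
                    System.out.println("Row count: " + rs.getLong(1));
                }
            }
        }
    }

Because the driver speaks Thrift to the Hive server underneath, the same request path is shared with Thrift and ODBC clients; only the client-side API differs.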


3.2. Hive Services
Apache Hive provides various services, as shown in the above diagram. Let us look at each in detail:

a) CLI (Command Line Interface) – This is the default shell that Hive provides, in which you can execute Hive queries and commands directly.

b) Web Interface – Hive also provides a web-based GUI for executing Hive queries and commands.

c) Hive Server – It is built on Apache Thrift and is therefore also called the Thrift Server. It allows different clients to submit requests to Hive and retrieve the final result.

d) Hive Driver – The driver receives the queries submitted by a Hive client through the Thrift, JDBC, ODBC, CLI, or Web UI interfaces. It then hands each query onward:


  • Compiler – The driver passes the query to the compiler, where parsing, type checking, and semantic analysis take place with the help of the schema present in the metastore.
  • Optimizer – It generates an optimized logical plan in the form of a DAG (Directed Acyclic Graph) of MapReduce and HDFS tasks.
  • Executor – Once compilation and optimization are complete, the execution engine executes these tasks in the order of their dependencies using Hadoop. You can inspect the compiled stage plan yourself with HiveQL's EXPLAIN statement, as the sketch after this list shows.
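The following is a hedged sketch of how to view the output of the compiler and optimizer from a client. It reuses the assumed HiveServer2 endpoint and hypothetical employees table from the JDBC example above.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveExplainPlan {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hiveuser", "");
                 Statement stmt = con.createStatement();
                 // EXPLAIN returns the plan as rows of text: a DAG of map/reduce,
                 // metadata, and HDFS stages with their dependencies.
                 ResultSet plan = stmt.executeQuery(
                     "EXPLAIN SELECT dept, COUNT(*) FROM employees GROUP BY dept")) {
                while (plan.next()) {
                    System.out.println(plan.getString(1));
                }
            }
        }
    }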

e) Metastore – The metastore is the central repository of Apache Hive metadata in the Hive architecture. It stores the metadata for Hive tables (such as their schema and location) and partitions in a relational database, and it provides clients access to this information through the metastore service API. The Hive metastore consists of two fundamental units:

  • A service that provides metastore access to other Apache Hive services.
  • Disk storage for the Hive metadata, which is separate from HDFS storage.
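To make the metastore service API concrete, here is a hedged sketch using the HiveMetaStoreClient class; the thrift://localhost:9083 URI and the employees table are assumptions for illustration, not values from this article.

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class MetastoreLookup {
        public static void main(String[] args) throws Exception {
            HiveConf conf = new HiveConf();
            // Point at a hypothetical standalone metastore service.
            conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://localhost:9083");

            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
            // Fetch a table's metadata (schema, HDFS location) directly from the
            // metastore, without submitting any query to the Hive server.
            Table t = client.getTable("default", "employees");
            System.out.println("Table location: " + t.getSd().getLocation());
            client.close();
        }
    }

This is the same service the compiler consults during semantic analysis; client tools such as Spark and Presto use it in the same way to discover Hive tables.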

4. How to process data with Apache Hive?

Now we will discuss how a typical query flows through the system:

  • The user interface (UI) calls the execute interface of the Driver.
  • The driver creates a session handle for the query. Then it sends the query to the compiler to generate an execution plan.
  • The compiler needs metadata, so it sends a getMetaData request to the metastore and receives the metadata in the sendMetaData response.
  • The compiler uses this metadata to type-check the expressions in the query. It then generates the plan, which is a DAG of stages, each stage being either a map/reduce job, a metadata operation, or an operation on HDFS. For map/reduce stages, the plan contains map operator trees and a reduce operator tree.
  • The execution engine submits these stages to the appropriate components. In each task, the deserializer associated with the table or intermediate output reads the rows from HDFS files, and the rows are passed through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializer. These temporary files feed the subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table's location.
  • For queries, the execution engine reads the contents of the temporary file directly from HDFS as part of the fetch call from the Driver. The end-to-end sketch below shows this flow from a client's point of view.
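From the client's side, this whole flow is triggered by a single submission. The sketch below (a hedged example with the same assumed HiveServer2 endpoint as earlier and a hypothetical logs table) first issues a DDL statement, which results only in a metastore operation, and then a query whose results are fetched from the engine's temporary output via the Driver's fetch call.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQueryFlowDemo {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hiveuser", "");
                 Statement stmt = con.createStatement()) {

                // DDL: the driver and compiler turn this into a metastore operation only.
                stmt.execute("CREATE TABLE IF NOT EXISTS logs (ts STRING, msg STRING) "
                        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'");

                // Query: compiled into a DAG of map/reduce stages; iterating the
                // ResultSet drives the fetch call that reads the temporary HDFS file.
                try (ResultSet rs = stmt.executeQuery(
                        "SELECT msg, COUNT(*) AS n FROM logs GROUP BY msg")) {
                    while (rs.next()) {
                        System.out.println(rs.getString("msg") + " -> " + rs.getLong("n"));
                    }
                }
            }
        }
    }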


5. Conclusion
Hence, Hive is a data warehousing package built on top of Hadoop, used for the analysis and processing of structured and semi-structured data. It provides a flexible SQL-like query language, HiveQL (HQL), for querying and processing data, and it offers many features for scale that a traditional RDBMS, with its limitations, cannot match. Now that you have learned about Apache Hive and its architecture, let us learn Apache Hive installation on Ubuntu to try out the functionality of Apache Hive.

