This blog focuses on the concepts involved in indexing in Hive. This post includes the following topics:
- When to use indexing.
- How indexing is helpful.
- How to create indexes for your tables.
- Perform some operations regarding the indexing in Hive.
What is an Index?
An Index acts as a reference to the records. Instead of searching all the records, we can refer to the index to search for a particular record. Indexes maintain the reference of the records. So that it is easy to search for a record with minimum overhead. Indexes also speed up the searching of data.
Why to use indexing in Hive?
Hive is a data warehousing tool present on the top of Hadoop, which provides the SQL kind of interface to perform queries on large data sets. Since Hive deals with Big Data, the size of files is naturally large and can span up to Terabytes and Petabytes. Now if we want to perform any operation or a query on this huge amount of data it will take large amount of time.
In a Hive table, there are many numbers of rows and columns. If we want to perform queries only on some columns without indexing, it will take large amount of time because queries will be executed on all the columns present in the table.
The major advantage of using indexing is; whenever we perform a query on a table that has an index, there is no need for the query to scan all the rows in the table. Further, it checks the index first and then goes to the particular column and performs the operation.
So if we maintain indexes, it will be easier for Hive query to look into the indexes first and then perform the needed operations within less amount of time.
Eventually, time is the only factor that everyone focuses on.
When to use Indexing?
Indexing can be use under the following circumstances:
- If the dataset is very large.
- If the query execution is more amount of time than you expected.
- If a speedy query execution is required.
- When building a data model.
Indexes are maintained in a separate table in Hive so that it won’t affect the data inside the table, which contains the data. Another major advantage for indexing in Hive is that indexes can also be partitioned depending on the size of the data we have.
Types of Indexes in Hive
- Compact Indexing
- Bitmap Indexing
Bit map indexing was introduced in Hive 0.8 and is commonly used for columns with distinct values.
Differences between Compact and Bitmap Indexing
The main difference is the storing of the mapped values of the rows in the different blocks. When the data inside a Hive table is stored by default in the HDFS, they are distributed across the nodes in a cluster. There needs to be a proper identification of the data, like the data in block indexing. This data will be able to identity which row is present in which block, so that when a query is triggered it can go directly into that block. So, while performing a query, it will first check the index and then go directly into that block.
Compact indexing stores the pair of indexed column’s value and its blockid.
Bitmap indexing stores the combination of indexed column value and list of rows as a bitmap.
Let’s now understand what is bitmap?
A bitmap is is a type of memory organization or image file format used to store digital images so with this meaning of bitmap, we can redefine bitmap indexing as given below.
“Bitmap index stores the combination of value and list of rows as a digital image.”
The following are the different operations that can be performed on Hive indexes:
- Creating index
- Showing index
- Alter index
- Dropping index
Creating Index in Hive
Syntax for creating a compact index in Hive is as follows:
CREATE INDEX index_nameON TABLE table_name (columns,....)AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'WITH DEFERRED REBUILD;
- Here, in the place of index_name we can give any name of our choice, which will be the table’s INDEX NAME.
- In the ON TABLE line, we can give the table_name for which we are creating the index and the names of the columns in brackets for which the indexes are to be created. We should specify the columns which are available only in the table.
- The org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler’ line specifies that a built in CompactIndexHandler will act on the created index, which means we are creating a compact index for the table.
- The WITH DEFERRED REBUILD statement should be present in the created index because we need to alter the index in later stages using this statement.
This syntax will create an index for our table, but to complete the creation, we need to complete the REBUILD statement. For this to happen, we need to add one more alter statement. A MapReduce job will be launched and the index creation is now completed.
ALTER INDEX index_nam on table_name REBUILD;
This ALTER statement will complete our REBUILDED index creation for the table.
Examples – Creating Index
In this section we will first execute the hive query on non-indexed table and will note down the time taken by query to fetch the result.
In the second part, we will be performing the same query on indexed table and then will compare the time taken by query to fetch the result with the earlier case.
We will be demonstrating this difference of time with practical examples.
In first scenario we are performing operations on non-indexed table.
Let’s create a normal managed table to contain the olympic dataset first.
Table Creation
create table olympic(athelete STRING,age INT,country STRING,year STRING,closing STRING,sport STRING,gold INT,silver INT,bronze INT,total INT) row format delimited fields terminated by '\t' stored as textfile;
Here we are creating a table with name ‘olympic’. The schema of the table is as specified and the data inside the input file is delimited by tab space.
At the end of the line, we have specified ‘stored as textfile’, which means we are using a TEXTFILE format.
You can check the schema of your created table using the command ‘describe olympic;’
We can load data into the created table as follows:
load data local inpath ‘path of your file‘into table olympic;
We have successfully loaded the input file data into the table which is in the TEXTFILE format.
Let’s perform an Average operation on this ‘olympic’ data. Let’s calculate the average age of the athletes using the following command:
SELECT AVG(age) from olympic;
Here we can see the average age of the athletes to be 26.405433646812956 and the time for performing this operation is 21.08 seconds.
Now, let’s create the index for this table:
CREATE INDEX olympic_indexON TABLE olympic (age)AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'WITH DEFERRED REBUILD;
ALTER INDEX olympic_index on olympic REBUILD;
Here we have created an index for the ‘olympic’ table on the age column. We can view the indexes created for the table by using the below command:
show formatted index on olympic;
We can see the indexes available for the ‘olympic’ table in the above image.
Now, let’s perform the same Average operation on the same table.
We have now got the average age as 26.405433646812956, which is same as the above, but now the time taken for performing this operation is 17.26 seconds, which is less than the above case.
Now we know that by using indexes we can reduce the time of performing the queries.
Can we have different indexes for the same table?
Yes! We can have any number of indexes for a particular table and any type of indexes as well.
Let’s create a Bitmap index for the same table:
CREATE INDEX olympic_index_bitmapON TABLE olympic (age)WITH DEFERRED REBUILD;AS 'BITMAP'ALTER INDEX olympic_index_bitmap on olympic REBUILD;
Here, As ‘BITMAP’ defines the type of index as BITMAP.
We have successfully created the Bitmap index for the table.
We can check the available indexes for the table using the below command:
show formatted index on olympic;
We can now see that we have two indexes available for our table.
Average Operation with Two Indexes
Now, let’s perform the same Average operation having the two indexes.
This time, we have got the same result in 17.614 seconds which is same as in the case of compact index.
Note: With different types (compact,bitmap) of indexes on the same columns, for the same table, the index which is created first is taken as the index for that table on the specified columns.
Now let’s delete one index using the following command:
DROP INDEX IF EXISTS olympic_index ON olympic;
We can check the available indexes on the table to verify whether the index is deleted or not.
We have successfully deleted one index i.e., olympic_index ,which is a compact index.
We now have only one index available for our table, which is a bitmap index.
Average Operation with Bitmap Index
Let’s perform the same Average age operation on the same table with bitmap index.
We have got the average age as 26.105433646812956, which is same as the above cases but the operation was done in just 16.47 seconds, which is less than the above two cases.
Through the above examples, we have proved the following:
- Indexes decrease the time for executing the query.
- We can have any number of indexes on the same table.
- We can use the type of index depending on the data we have.
- In some cases, Bitmap indexes work faster than the Compact indexes and vice versa.
When not to use indexing?
It is essential to know when and where indexing shouldn’t be used. They should not be used in the following scenarios:
- Indexes are advised to build on the columns on which you frequently perform operations.
- Building more number of indexes also degrade the performance of your query.
- Type of index to be created should be identified prior to its creation (if your data requires bitmap you should not create compact). This leads to an increase in time for executing your query.
We hope this blog helped you in understanding Indexing in Hive.
No comments:
Post a Comment