Map-Side Joins in Hive

Image result for map join in hive

In this blog, we will be showing a demo on Map-side joins in Hive. Map side join is a process where joins between two tables are performed in the Map phase without the involvement of Reduce phase.
Map-side Joins allows a table to get loaded into memory ensuring a very fast join operation, performed entirely within a mapper and that too without having to use both map and reduce phases.
In case your queries frequently run with small table joins , you might see a very substantial decrease in the time taken to compute the queries after usage of map-side joins.
You can download the datasets that are used in this demo from the link presented below.
Here is the metadata of the datasets.
Dataset description of the first dataset is as follows:
id,first_name,last_name,email,gender,ip_address
Dataset description of the second dataset is as follows:
id,first_name,last_name
Now, let us create two tables to store these two datasets. Please refer to the following:
CREATE TABLE IF NOT EXISTS dataset1 ( eid int, first_name String, last_name String, email String, gender String, ip_address String) row format delimited fields terminated BY ',' tblproperties("skip.header.line.count"="1");
CREATE TABLE IF NOT EXISTS dataset2 ( eid int, first_name String, last_name String) row format delimited fields terminated BY ',' tblproperties("skip.header.line.count"="1");
Here, we have successfully created tables with names dataset1 & dataset2. Here you can see the same in the screenshot given below.
Now, let’s load the data into two tables, as can be seen in the following screenshot:
Now, we will perform the Map-side join on these two datasets. Map-side joins can be enabled in two ways.
Map Join
1. By specifying the keyword, /*+ MAPJOIN(b) */ in the join statement.
2. By setting the following property to true.
 hive.auto.convert.join=true
For performing Map-side joins, there should be two files, one is of larger size and the other is of smaller size. You can set the small file size by using the following property:
 hive.mapjoin.smalltable.filesize=(default it will be 25MB)
Now, let us perform Map-side joins and join the two datasets based on their IDs.
SELECT /*+ MAPJOIN(dataset2) */ dataset1.first_name, dataset1.eid,dataset2.eid FROM dataset1 JOIN dataset2 ON dataset1.first_name = dataset2.first_name;
As it is a Map-side join, the number of reducers will be set to 0 automatically.
In the following screenshot, you can see the efficiency of this join.
It has taken 16.69 seconds to run the query. Now we will test the same using Map-Bucket join.
Bucket-Map join
The constraint for performing Bucket-Map join is:
If tables being joined are bucketed on the join columns, and the number of buckets in one table is a multiple of the number of buckets in the other table, the buckets can be joined with each other.
To perform bucketing, we need to have bucketed tables. Let’s create them.
You can refer to our blog on Bucketing in Hive 
CREATE TABLE IF NOT EXISTS dataset1_bucketed ( eid int,first_name String, last_name String, email String, gender String, ip_address String) clustered by(first_name) into 4 buckets row format delimited fields terminated BY ',';
CREATE TABLE IF NOT EXISTS dataset2_bucketed (eid int,first_name String, last_name String) clustered by(first_name) into 8 buckets row format delimited fields terminated BY ',' ;
Now we will insert the data into the dataset1_bucketed table.
insert into dataset1_bucketed select * from dataset1;
We will now insert the data into dataset2_bucketed table.
insert into dataset2_bucketed select * from dataset2;
Here, we have two tables that are bucketed. We can now perform Bucket-map join between these two datasets.
Here, for the first table we have created 4 buckets and for the second table we have created 8 buckets on the same column. Now, we can perform Bucket-map join on these two tables.>
For performing Bucket-Map join, we need to set this property in the Hive shell.
set hive.optimize.bucketmapjoin = true
SELECT /*+ MAPJOIN(dataset2_bucketed) */ dataset1_bucketed.first_name,dataset1_bucketed.eid, dataset2_bucketed.eid FROM dataset1_bucketed JOIN dataset2_bucketed ON dataset1_bucketed.first_name = dataset2_bucketed.first_name ;
This query took 16.378 seconds to complete. You can see the same in the screenshot shown below:
Now, let us try with SMBMJ>
Sort Merge Bucket(SMB) Map Join
If the tables being joined are sorted and bucketized on the join columns and have the same number of buckets, a sort-merge join can be performed. The corresponding buckets are joined with each other at the mapper.
Here we have 4 buckets for dataset1 and 8 buckets for dataset2. Now, we will create another table with 4 buckets for dataset2.
For performing the SMB-Map join, we need to set the following properties:
Set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
To perform this join, we need to have the data in the bucketed tables sorted by the join column. Now, we will re-insert the data into the bucketed tables by using sorting the records.
insert overwrite table dataset1_bucketed select * from dataset1 sort by first_name;
The above command will overwrite the data in the old table and insert the data as per the query.
So now the data in the dataset1_bucketed table is sorted by first_name. You can see the same in the following screenshot:
Now, let us perform the SMB-Map join on the two tables with 4 buckets in one table and 8 buckets in one table.
We will now overwrite the data into the dataset2_bucketed table, using the following command:
insert overwrite table dataset2_bucketed select * from dataset2 sort by first_name;
Now, let us perform the join between tables having 4 buckets and 8 buckets.
SELECT /*+ MAPJOIN(dataset2_bucketed) */ dataset1_bucketed.first_name,dataset1_bucketed.eid, dataset2_bucketed.eid FROM dataset1_bucketed JOIN dataset2_bucketed ON dataset1_bucketed.first_name = dataset2_bucketed.first_name ;
In the above screenshot, you can see that the query took 23.844 seconds.
To perform SMB-Map join, we need to have the same number of buckets in both the tables with the bucketed column sorted.
Now, we will create another table for dataset2 having 4 buckets and will insert the data that is sorted by first_name.
CREATE TABLE IF NOT EXISTS dataset2_bucketed1 (eid int,first_name String, last_name String) clustered by(first_name) into 4 buckets row format delimited fields terminated BY ',' ;
insert overwrite table dataset2_bucketed1 select * from dataset2 sort by first_name;
Now, we have two tables with 4 buckets and the joined column sorted. Let us perform the join query again.
SELECT /*+ MAPJOIN(dataset2_sbucketed1) */dataset1_bucketed.first_name, dataset1_bucketed.eid, dataset2_bucketed1.eid FROM dataset1_bucketed JOIN dataset2_bucketed1 ON dataset1_bucketed.first_name = dataset2_bucketed1.first_name ;
In the above screen shot, you can see that it has taken 22.69 seconds. As the dataset sizes are too small, this analysis results may vary based on the dataset size.
This analysis is performed under ideal conditions for all the joins. So in this case, in order to perform a Map join, there are no constraints. But for performing a Bucket-Map join, we need to have the tables bucketed on the join column and the number of buckets can be multiples of each other.
To perform the SortMergeBucket Map join, we need to have two tables with the same number of buckets on the join column and the records are to be sorted on the join column.
We hope this instructional blog post helped you in understanding how to perform Map-side joins in Hive

No comments:

Post a Comment