Querying HBase using Apache Spark

In this blog, we will see how to access and query HBase tables using Apache Spark.
Spark can work on data present in many sources, such as the local filesystem, HDFS, Cassandra, HBase, MongoDB, etc.
Now let us walk through the steps for accessing HBase tables through Spark.
First, start the HMaster.
Next, create an HBASE_PATH environment variable to store the HBase classpath.
Then start the Spark shell, passing the HBASE_PATH variable so that all the HBase jars are available.
Now that HBase and Spark are running, we can create a connection to HBase from the Spark shell.
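As a sketch, the setup steps above might look like the following from a terminal. The HBase install location (/usr/local/hbase) is an assumption; adjust the paths to your own installation:

```shell
# Start the HBase daemons, including the HMaster
# (assumes HBase's bin directory is on the PATH)
start-hbase.sh

# Collect the HBase jars into an environment variable
# (the install path /usr/local/hbase is an assumption)
export HBASE_PATH=$(echo /usr/local/hbase/lib/*.jar | tr ' ' ':')

# Start the Spark shell with the HBase jars on the classpath
spark-shell --driver-class-path "$HBASE_PATH"
```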
Import the required libraries as given below:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HTableDescriptor,HColumnDescriptor}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.{Put,HTable}

// create hbase configuration object
val conf = HBaseConfiguration.create()
val tablename = "Acadgild_spark_Hbase"

// create Admin instance and set input format
conf.set(TableInputFormat.INPUT_TABLE,tablename)
val admin = new HBaseAdmin(conf)
//Create the table if it does not already exist
if (!admin.isTableAvailable(tablename)) {
  print("Creating table: " + tablename)
  val tableDescription = new HTableDescriptor(tablename)
  tableDescription.addFamily(new HColumnDescriptor("cf".getBytes()))
  admin.createTable(tableDescription)
} else {
  print("Table already exists")
}

//Check whether the table exists
admin.isTableAvailable(tablename)
If the table exists, this call returns true.
Now we will put some data into the table:

val table = new HTable(conf, tablename)
for (x <- 1 to 10) {
  val p = new Put(new String("row" + x).getBytes())
  // use the "cf" column family created above
  p.add("cf".getBytes(), "column1".getBytes(), new String("value" + x).getBytes())
  table.put(p)
}

Now we can create a Hadoop RDD from the data present in HBase using newAPIHadoopRDD, passing the InputFormat together with the key and value classes.
We can then perform any transformations and actions on the resulting RDD.
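As a sketch, reading the table created above into an RDD from the Spark shell could look like this. Here sc is the SparkContext that the shell provides, and the table and column names match the earlier example; this assumes HBase is still running:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes

// configuration pointing at the table we created earlier
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "Acadgild_spark_Hbase")

// sc is the SparkContext provided by the Spark shell
val hBaseRDD = sc.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// a simple action: count the rows we inserted
println("Row count: " + hBaseRDD.count())

// a simple transformation: extract the row key and the cf:column1 value
val values = hBaseRDD.map { case (key, result) =>
  (Bytes.toString(key.get()),
   Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("column1"))))
}
values.collect().foreach(println)
```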
We hope this blog helped you understand Spark and HBase integration.
