Querying HBase using Apache Spark

In this blog, we will see how to access and query HBase tables using Apache Spark.
Spark can work on data present in many sources, such as the local filesystem, HDFS, Cassandra, HBase, MongoDB, etc.
Now let us walk through the steps for accessing HBase tables through Spark.
First, start the HMaster.
Next, create an HBASE_PATH environment variable to store the HBase classpath.
Then start the Spark shell, passing the HBASE_PATH variable so that all the HBase jars are available.
Now that HBase and Spark are running, we can create a connection to HBase from the Spark shell.
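As a sketch, the setup steps above might look like the following from a terminal. The HBase install location (/usr/local/hbase) is an assumption; adjust the paths to your own installation:

```shell
# Start the HBase daemons, including the HMaster
# (assumes HBase's bin directory is on the PATH)
start-hbase.sh

# Collect the HBase jars into an environment variable
# (the install path /usr/local/hbase is an assumption)
export HBASE_PATH=$(echo /usr/local/hbase/lib/*.jar | tr ' ' ':')

# Start the Spark shell with the HBase jars on the classpath
spark-shell --driver-class-path "$HBASE_PATH"
```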
Import the required libraries as given below:
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.{HTableDescriptor,HColumnDescriptor}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.{Put,HTable}

// create hbase configuration object
val conf = HBaseConfiguration.create()
val tablename = "Acadgild_spark_Hbase"

// create Admin instance and set input format
conf.set(TableInputFormat.INPUT_TABLE,tablename)
val admin = new HBaseAdmin(conf)
//Create the table if it does not already exist
if (!admin.isTableAvailable(tablename)) {
  print("Creating table: " + tablename)
  val tableDescription = new HTableDescriptor(tablename)
  tableDescription.addFamily(new HColumnDescriptor("cf".getBytes()))
  admin.createTable(tableDescription)
} else {
  print("Table already exists")
}

//Check whether the table exists
admin.isTableAvailable(tablename)
If the table exists, this call returns true.
Now we will put some data into the table:

val table = new HTable(conf, tablename)
for (x <- 1 to 10) {
  val p = new Put(new String("row" + x).getBytes())
  // use the "cf" column family created above
  p.add("cf".getBytes(), "column1".getBytes(), new String("value" + x).getBytes())
  table.put(p)
}

Now we can create a Hadoop RDD from the data present in HBase using newAPIHadoopRDD, passing the InputFormat together with the key and value classes.
We can then perform any transformations and actions on the resulting RDD.
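As a sketch, reading the table created above into an RDD from the Spark shell could look like this. Here sc is the SparkContext that the shell provides, and the table and column names match the earlier example; this assumes HBase is still running:

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.util.Bytes

// configuration pointing at the table we created earlier
val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "Acadgild_spark_Hbase")

// sc is the SparkContext provided by the Spark shell
val hBaseRDD = sc.newAPIHadoopRDD(
  conf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// a simple action: count the rows we inserted
println("Row count: " + hBaseRDD.count())

// a simple transformation: extract the row key and the cf:column1 value
val values = hBaseRDD.map { case (key, result) =>
  (Bytes.toString(key.get()),
   Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("column1"))))
}
values.collect().foreach(println)
```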
We hope this blog helped you understand Spark and HBase integration.
