Working With Hive Complex Data Types

What are Complex Data Types in Hive?

Complex Data types are also known as nested types which allow storing multiple values within a single row/column position in a Hive table. Generally, in Hive and other databases, we have more experience on working with primitive data types like:
Numeric Types
  • TINYINT : 1-byte signed integer, from -128 to 127
  • SMALLINT : 2-byte signed integer, from -32,768 to 32,767
  • INT : 4-byte signed integer, from -2,147,483,648 to 2,147,483,647
  • BIGINT : 8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
  • FLOAT : 4-byte single precision floating point number
  • DOUBLE : 8-byte double precision floating point number
  • DECIMAL : Hive 0.13.0 introduced user definable precision and scale
String Types
  • STRING
  • VARCHAR
  • CHAR
Date/Time Types
  • TIMESTAMP
  • DATE
Misc Types
  • BOOLEAN
  • BINARY
Apart from these primitive data types, Hive offers some complex data types which are as listed below:

Complex Data Types
  • arrays: ARRAY<data_type>
  • maps: MAP<primitive_type, data_type>
  • structs: STRUCT<col_name : data_type [COMMENT col_comment], …>
Thus, let us know what are these complex data types and their working in Hive.
So, in our example, we will be using our Hive default database to store the complex data type tables.  

Array:
The first complex type is an array. It is nothing but a collection of items of similar data type. i.e, an array can contain one or more values of the same data type.
In our Array example, we will be using the dataset Temperature.txt where the fields are delimited by tab and the complex data type Array values are delimited by the comma.
Dataset_Temperature

Dataset Description
Date
State
Temperature [Depending on their district level wise]
So, let us create a table to store the above values of dataset Temperature using below code.
create table Temperature(date string,city string,MyTemp array<double>) row format delimited fields terminated by ‘\t’ collection items terminated by ‘,’;

We can observe, in the above code, we are creating an array named Myarray which will hold only the double type values.
To check the schema of the table Temperature we can use the command describe Temperature;

Now, let us load our input dataset Temperature.txt using the below command.
Load data local inpath ‘/home/acadgild/Desktop/Temperature.txt’ into table Temperature;
We can view the contents of the table using the command select * from Temperature;

We can observe, from the above image all the array values are collected and stored in a single column.
To select a column and a value from the table we can use the below command.
select city,MyTemp[0] from Temperature;

We can observe from the above image; we are selecting the column state and the array MyTemp Zeroth position values.
select date,city,MyTemp[3] from Temperature;

We can observe from the above image; we are selecting the column state and the array MyTemp 4th position values.
So, we can follow the above steps to work with complex data type array values in Hive.
Now, let us know what is Map and its working. 
Map:
Map is a collection of key-value pairs where fields are accessed using array notation of keys
Eg: [‘Key’]
In our Map example we will be using the dataset Schools.txt where the fields are delimited by tab, the complex type Map values are delimited by the comma. 
Dataset_School_Data

Dataset Description
School Type
State
Gender 
Total
So, let us create a table to store the above values of dataset Schools using the below code.
create table MySchools(schooltype string,state string,gender string, total map<int,int>) row format delimited fields terminated by ‘\t’ collection items terminated by ‘,’ map keys terminated by ‘:’;

We can observe, in the above code, we are creating a collection named total which will hold the Values of type int and int.
To check the schema of the table MySchools we can use the command describe MySchools;

Now, let us load our input dataset School_Data.txt using the below command.
load data local inpath ‘/home/acadgild/Desktop/School_Data.txt’ into table MySchools;

We can view the contents of the table using the command select * from MySchools;

We can observe, from the above image each Map total column contains to data i.e, key and value.
Here, year represents the key and the subsequent data represents the value.
To select a column and its key value from the table we can use the below command.
select total[2016] from MySchools where state=’Assam’;

We can observe from the above image, we are selecting the column state = ‘Assam’ to find the male and female strength in the year 2016.
select total[2017] from MySchools where state=’Chhattisgarh’ and gender=’Female’;

We can observe from the above image; we are selecting the column state = ‘chhattisgarh’ to find the female strength in the year 2017.
So, we can follow the above steps to work on a collection Map Key value pairs in Hive.
Now, let us know what is Struct and its working. 
Struct:
Struct is a record type which encapsulates a set of named fields that can be any primitive data type. An element in STRUCT type can be accessed using the DOT (.) notation.
In our Struct example, we will be using the dataset Bikes.txt where the fields are delimited by tab and the complex type Array values are delimited by the comma.
Dataset_Bikes

Dataset Description
Name 
BikeFeatures struct<EngineType,cc,power,gears>
So, let us create a table to store the above values of dataset Bikes using below code.
create table MyBikes(name string, BikeFeatures struct<EngineType:string,cc:float,power:float,gears:int>) row format delimited fields terminated by ‘\t’ collection items terminated by ‘,’;

We can observe, in the above code, we are creating a collection Struct named BikeFeatures which will hold string, float, float, and int value types.
To check the schema of the table MySchools we can use the command describe MyBikes;

Now, let us load our input dataset Bikes.txt using the below command.
load data local inpath ‘/home/acadgild/Desktop/Bikes.txt’ into table MyBikes;

We can view the contents of the table using the command select * from MyBikes;

We can observe, from the above image the Struct BikeFeatures column contains multiple values of different types.
To select a Struct column values from the table we can use the below command.
select BikeFeatures.EngineType from MyBikes;

To select a column and its Struct column values from the table we can use the below command.
We can observe from the above image we are using Dot (.) operator to access Struct Bike Features Engine
Types column values from the table.
select BikeFeatures.EngineType from MyBikes where name=’Suzuki Swish’;

We can observe from the above image; we are selecting the column name where Bike name = ‘Suzuki Swish’ to find its EngineType.
So, we can follow the above steps to work on collection Struct values in Hive & other Hive data types.

No comments:

Post a Comment