Kiran Kumar Vasadi Google Cloud Certified Professional Data Engineer & Cloud Architect: Working With Hive Complex Data Types

What are Complex Data Types in Hive?

Complex Data types are also known as nested types which allow storing multiple values within a single row/column position in a Hive table. Generally, in Hive and other databases, we have more experience on working with primitive data types like:
Numeric Types

TINYINT : 1-byte signed integer, from -128 to 127
SMALLINT : 2-byte signed integer, from -32,768 to 32,767
INT : 4-byte signed integer, from -2,147,483,648 to 2,147,483,647
BIGINT : 8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
FLOAT : 4-byte single precision floating point number
DOUBLE : 8-byte double precision floating point number
DECIMAL : Hive 0.13.0 introduced user definable precision and scale

String Types

STRING
VARCHAR
CHAR

Date/Time Types

TIMESTAMP
DATE

Misc Types

BOOLEAN
BINARY

Apart from these primitive data types, Hive offers some complex data types which are as listed below:

Complex Data Types

arrays: ARRAY<data_type>
maps: MAP<primitive_type, data_type>
structs: STRUCT<col_name : data_type [COMMENT col_comment], …>

Thus, let us know what are these complex data types and their working in Hive.
So, in our example, we will be using our Hive default database to store the complex data type tables.

Array:

The first complex type is an array. It is nothing but a collection of items of similar data type. i.e, an array can contain one or more values of the same data type.
In our Array example, we will be using the dataset Temperature.txt where the fields are delimited by tab and the complex data type Array values are delimited by the comma.

Dataset_Temperature

Dataset Description

Date
State
Temperature [Depending on their district level wise]
So, let us create a table to store the above values of dataset Temperature using below code.
create table Temperature(date string,city string,MyTemp array<double>) row format delimited fields terminated by ‘\t’ collection items terminated by ‘,’;

We can observe, in the above code, we are creating an array named Myarray which will hold only the double type values.

To check the schema of the table Temperature we can use the command describe Temperature;

Now, let us load our input dataset Temperature.txt using the below command.
Load data local inpath ‘/home/acadgild/Desktop/Temperature.txt’ into table Temperature;

We can view the contents of the table using the command select * from Temperature;

We can observe, from the above image all the array values are collected and stored in a single column.

To select a column and a value from the table we can use the below command.
select city,MyTemp[0] from Temperature;

We can observe from the above image; we are selecting the column state and the array MyTemp Zeroth position values.
select date,city,MyTemp[3] from Temperature;

We can observe from the above image; we are selecting the column state and the array MyTemp 4th position values.
So, we can follow the above steps to work with complex data type array values in Hive.

Now, let us know what is Map and its working.

Map:

Map is a collection of key-value pairs where fields are accessed using array notation of keys
Eg: [‘Key’]

In our Map example we will be using the dataset Schools.txt where the fields are delimited by tab, the complex type Map values are delimited by the comma.

Dataset_School_Data

Dataset Description
School Type
State
Gender
Total

So, let us create a table to store the above values of dataset Schools using the below code.
create table MySchools(schooltype string,state string,gender string, total map<int,int>) row format delimited fields terminated by ‘\t’ collection items terminated by ‘,’ map keys terminated by ‘:’;

We can observe, in the above code, we are creating a collection named total which will hold the Values of type int and int.

To check the schema of the table MySchools we can use the command describe MySchools;

Now, let us load our input dataset School_Data.txt using the below command.
load data local inpath ‘/home/acadgild/Desktop/School_Data.txt’ into table MySchools;

We can view the contents of the table using the command select * from MySchools;

We can observe, from the above image each Map total column contains to data i.e, key and value.
Here, year represents the key and the subsequent data represents the value.

To select a column and its key value from the table we can use the below command.
select total[2016] from MySchools where state=’Assam’;

We can observe from the above image, we are selecting the column state = ‘Assam’ to find the male and female strength in the year 2016.
select total[2017] from MySchools where state=’Chhattisgarh’ and gender=’Female’;

We can observe from the above image; we are selecting the column state = ‘chhattisgarh’ to find the female strength in the year 2017.

So, we can follow the above steps to work on a collection Map Key value pairs in Hive.

Now, let us know what is Struct and its working.

Struct:

Struct is a record type which encapsulates a set of named fields that can be any primitive data type. An element in STRUCT type can be accessed using the DOT (.) notation.
In our Struct example, we will be using the dataset Bikes.txt where the fields are delimited by tab and the complex type Array values are delimited by the comma.

Dataset_Bikes

Dataset Description
Name
BikeFeatures struct<EngineType,cc,power,gears>

So, let us create a table to store the above values of dataset Bikes using below code.
create table MyBikes(name string, BikeFeatures struct<EngineType:string,cc:float,power:float,gears:int>) row format delimited fields terminated by ‘\t’ collection items terminated by ‘,’;

We can observe, in the above code, we are creating a collection Struct named BikeFeatures which will hold string, float, float, and int value types.

To check the schema of the table MySchools we can use the command describe MyBikes;

Now, let us load our input dataset Bikes.txt using the below command.
load data local inpath ‘/home/acadgild/Desktop/Bikes.txt’ into table MyBikes;

We can view the contents of the table using the command select * from MyBikes;

We can observe, from the above image the Struct BikeFeatures column contains multiple values of different types.

To select a Struct column values from the table we can use the below command.
select BikeFeatures.EngineType from MyBikes;

To select a column and its Struct column values from the table we can use the below command.
We can observe from the above image we are using Dot (.) operator to access Struct Bike Features Engine

Types column values from the table.
select BikeFeatures.EngineType from MyBikes where name=’Suzuki Swish’;

We can observe from the above image; we are selecting the column name where Bike name = ‘Suzuki Swish’ to find its EngineType.

So, we can follow the above steps to work on collection Struct values in Hive & other Hive data types.

Kiran Kumar Vasadi Google Cloud Certified Professional Data Engineer & Cloud Architect

Working With Hive Complex Data Types

What are Complex Data Types in Hive?

No comments:

Post a Comment