In this instructional post, we will explore and understand few important relational operators in Pig which is widely used in big data industry. Before we understand relational operators, let us see what Pig is.
Apache Pig, developed by Yahoo! helps in analyzing large datasets and spend less time in writing mapper and reducer programs. Pig enables users to write complex data analysis code without prior knowledge of Java. Pig’s simple SQL-like scripting language is called Pig Latin and has its own Pig runtime environment where PigLatin programs are executed. For more details, I would suggest you to go through this blog.
Once you complete this blog, I would suggest you to get your hands dirty with a POC from this blog.
Below are two datasets that will be used in this post.
Employee_details.txt
This data set have 4 columns i.e.
Emp_id: unique id for each employee
Name: name of the employee
Salary: salary of an employee
Ratings: Rating of an employee.
Employee_expenses.txt
This data set have 2 columns i.e.
Emp_id: id of an employee
Expense: expenses made by an employee
As you have got the idea of datasets, let us proceed and perform some relational operations using this data.
NOTE: All the analysis is performed in local mode. To work in local mode, you need to start pig grunt shell using “pig -x local” command.
Apache Pig, developed by Yahoo! helps in analyzing large datasets and spend less time in writing mapper and reducer programs. Pig enables users to write complex data analysis code without prior knowledge of Java. Pig’s simple SQL-like scripting language is called Pig Latin and has its own Pig runtime environment where PigLatin programs are executed. For more details, I would suggest you to go through this blog.
Once you complete this blog, I would suggest you to get your hands dirty with a POC from this blog.
Below are two datasets that will be used in this post.
Employee_details.txt
This data set have 4 columns i.e.
Emp_id: unique id for each employee
Name: name of the employee
Salary: salary of an employee
Ratings: Rating of an employee.
Employee_expenses.txt
This data set have 2 columns i.e.
Emp_id: id of an employee
Expense: expenses made by an employee
As you have got the idea of datasets, let us proceed and perform some relational operations using this data.
NOTE: All the analysis is performed in local mode. To work in local mode, you need to start pig grunt shell using “pig -x local” command.
Relational Operators:
Load
To load the data either from local filesystem or Hadoop filesystem.
Syntax:
LOAD ‘path_of_data’ [USING function] [AS schema];
Where;
path_of_data : file/directory name in single quotes.
USING : is the keyword.
function : If you choose to omit this, default load function PigStorage() is used.
AS : is the keyword
schema : schema of your data along with data type.
Eg:
The file named employee_details.txt is comma separated file and we are going to load it from local file system.
A = LOAD ‘/home/acadgild/pig/employee_details.txt’ USING PigStorage(‘,’) AS (id:int, name:chararray, salary:int, ratings:int);
To check the result, you can use DUMP command.
Similarly, you can load another data into another relation, say ‘B’
B = LOAD ‘/home/acadgild/pig/employee_expenses.txt’ USING PigStorage(‘\t’) AS (id:int, expenses:int);
As the fields in this file are tab separated, you need to use ‘\t’
NOTE: If you load this dataset in relation A, the earlier dataset will not be accessible.
Syntax:
LOAD ‘path_of_data’ [USING function] [AS schema];
Where;
path_of_data : file/directory name in single quotes.
USING : is the keyword.
function : If you choose to omit this, default load function PigStorage() is used.
AS : is the keyword
schema : schema of your data along with data type.
Eg:
The file named employee_details.txt is comma separated file and we are going to load it from local file system.
A = LOAD ‘/home/acadgild/pig/employee_details.txt’ USING PigStorage(‘,’) AS (id:int, name:chararray, salary:int, ratings:int);
To check the result, you can use DUMP command.
Similarly, you can load another data into another relation, say ‘B’
B = LOAD ‘/home/acadgild/pig/employee_expenses.txt’ USING PigStorage(‘\t’) AS (id:int, expenses:int);
As the fields in this file are tab separated, you need to use ‘\t’
NOTE: If you load this dataset in relation A, the earlier dataset will not be accessible.
Limit
Used to limit the number of outputs to the desired number.
Syntax:
Alias = LIMIT alias n;
Where;
alias : name of the relation.
n : number of tuples to be displayed.
Ex:
We will be limiting the result of relation A (described above) to 5.
limited_val = LIMIT A 5;
NOTE: there is no guarantee which 5 tuples will be the output.
Syntax:
Alias = LIMIT alias n;
Where;
alias : name of the relation.
n : number of tuples to be displayed.
Ex:
We will be limiting the result of relation A (described above) to 5.
limited_val = LIMIT A 5;
NOTE: there is no guarantee which 5 tuples will be the output.
Order
Sorts a relation based on single or multiple fields.
Syntax:
alias = ORDER alias BY {field_name [ASC | DESC] Where;
alias : is the relation
ORDER : is the keyword.
BY : is the keyword.
field_name : column on which you want to sort the relation.
ASC : sort in ascending order
DESC : sort in descending order.
Eg:
We will sort the relation A based on the ratings field and get top 3 employee details with highest ratings.
Sorted = ORDER A by ratings DESC;
Result = LIMIT Sorted 3;
You can also Order the relation based on multiple fields. Let’s order the relation A based on Descending ‘ratings’ and Ascending ‘names’ and generate top 3 result.
Ex:
Double_sorted = ORDER A by ratings DESC, name ASC;
Final_result = LIMIT Double_sorted 3;
Now, compare the ‘Result’ and ‘Final_result’ relation.
Syntax:
alias = ORDER alias BY {field_name [ASC | DESC] Where;
alias : is the relation
ORDER : is the keyword.
BY : is the keyword.
field_name : column on which you want to sort the relation.
ASC : sort in ascending order
DESC : sort in descending order.
Eg:
We will sort the relation A based on the ratings field and get top 3 employee details with highest ratings.
Sorted = ORDER A by ratings DESC;
Result = LIMIT Sorted 3;
You can also Order the relation based on multiple fields. Let’s order the relation A based on Descending ‘ratings’ and Ascending ‘names’ and generate top 3 result.
Ex:
Double_sorted = ORDER A by ratings DESC, name ASC;
Final_result = LIMIT Double_sorted 3;
Now, compare the ‘Result’ and ‘Final_result’ relation.
Group
Groups the data based on one or multiple fields. It groups together tuples that have the same group key (key field). The key field will be a tuple if the group key has more than one field, otherwise it will be the same type as that of the group key.
Syntax:
alias = GROUP alias {ALL | BY field};
Where;
alias : is the relation
GROUP : is the keyword
ALL : keyword. Use ALL if you want all tuples to go to a single group
BY : keyword
Field : field name on which you want to group your data.
Ex:
We will group our relation A based on ratings.
Grouped = GROUP A BY ratings;
You can see that the output is a tuple based on ‘ratings’ with multiple bags. You can also group the data based on more than one fields. For example,
Multi_group = GROUP A BY (ratings, salary);
Syntax:
alias = GROUP alias {ALL | BY field};
Where;
alias : is the relation
GROUP : is the keyword
ALL : keyword. Use ALL if you want all tuples to go to a single group
BY : keyword
Field : field name on which you want to group your data.
Ex:
We will group our relation A based on ratings.
Grouped = GROUP A BY ratings;
You can see that the output is a tuple based on ‘ratings’ with multiple bags. You can also group the data based on more than one fields. For example,
Multi_group = GROUP A BY (ratings, salary);
Foreach
It generates data transformations based on desired columns of data.
Syntax:
alias = FOREACH alias GENERATE {expression | field};
Where;
alias : is the relation
FOREACH : is the keyword
GENERATE : is the keyword
Ex:
In our previous example we saw how to group a relation. Now, using FOREACH, we will generate the count of employees belonging to a particular group.
Result = FOREACH Grouped GENERATE group,COUNT(A.ratings);
From the result, we can conclude that there are 4 employees who got 1 as their rating. It is indicated by the first row. Basically, if want to operate at column level, you can use Foreach.
Syntax:
alias = FOREACH alias GENERATE {expression | field};
Where;
alias : is the relation
FOREACH : is the keyword
GENERATE : is the keyword
Ex:
In our previous example we saw how to group a relation. Now, using FOREACH, we will generate the count of employees belonging to a particular group.
Result = FOREACH Grouped GENERATE group,COUNT(A.ratings);
From the result, we can conclude that there are 4 employees who got 1 as their rating. It is indicated by the first row. Basically, if want to operate at column level, you can use Foreach.
Filter
Filters a relation based on certain condition.
Syntax:
alias = FILTER alias BY expression;
Where;
alias : is the relation
FILTER : is the keyword
BY : is the keyword
expression : condition on which filter will be performed.
Ex:
We will filter our data (relation A) based on ratings greater than or equal to 4.
Filtered = FILTER A BY ratings >= 4;
Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH …GENERATE operation).
FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don’t want.
We can also use multiple conditions to filter data at one go.
Eg:
Multi_condition = FILTER A BY (ratings >= 4) AND (salary >1000)
This will produce results that follows in the category of ratings greater than equals 4 and salary greater than 1000.
Syntax:
alias = FILTER alias BY expression;
Where;
alias : is the relation
FILTER : is the keyword
BY : is the keyword
expression : condition on which filter will be performed.
Ex:
We will filter our data (relation A) based on ratings greater than or equal to 4.
Filtered = FILTER A BY ratings >= 4;
Use the FILTER operator to work with tuples or rows of data (if you want to work with columns of data, use the FOREACH …GENERATE operation).
FILTER is commonly used to select the data that you want; or, conversely, to filter out (remove) the data you don’t want.
We can also use multiple conditions to filter data at one go.
Eg:
Multi_condition = FILTER A BY (ratings >= 4) AND (salary >1000)
This will produce results that follows in the category of ratings greater than equals 4 and salary greater than 1000.
Store
Stores and saves the data into a filesystem.
Syntax:
STORE alias INTO ‘directory’ [USING function] Where;
STORE : is a keyword
alias : is the relation which you want to store.
INTO : is the keyword
directory : name of directory where you want to store your result.
NOTE: If directory already exists, you will receive an error and STORE operation will fail.
function : the store function. By default, PigStorage is the default storage function and hence it is not mandatory to mention this explicitly.
Ex:
We will store our result named “Multi_condition” (achieved in previous operation) into local file system.
Only catch here is, I have specified PigStorage(‘|’) , means, I will loading result into local fs and the delimiter will be pipeline i.e. ‘|’
Let’s check the result.
Hope this blog helped you in understanding/refreshing the working of relational operators in pig.
Syntax:
STORE alias INTO ‘directory’ [USING function] Where;
STORE : is a keyword
alias : is the relation which you want to store.
INTO : is the keyword
directory : name of directory where you want to store your result.
NOTE: If directory already exists, you will receive an error and STORE operation will fail.
function : the store function. By default, PigStorage is the default storage function and hence it is not mandatory to mention this explicitly.
Ex:
We will store our result named “Multi_condition” (achieved in previous operation) into local file system.
Only catch here is, I have specified PigStorage(‘|’) , means, I will loading result into local fs and the delimiter will be pipeline i.e. ‘|’
Let’s check the result.
Hope this blog helped you in understanding/refreshing the working of relational operators in pig.
No comments:
Post a Comment