Group Operator:
TheGROUPoperator is used to group the data in one or more relations. It collects the data having the same key.
Syntax
Given below is the syntax of thegroupoperator.
grunt> Group_data = GROUP Relation_name BY age;
Example
Assume that we have a file namedstudent_details.txtin the HDFS directory/pig_data/as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
And we have loaded this file into Apache Pig with the relation namestudent_detailsas shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
Now, let us group the records/tuples in the relation by age as shown below.
grunt> group_data = GROUP student_details by age;
Verification
Verify the relation group_data using the DUMP operator as shown below.
grunt> Dump group_data;
Output
Then you will get output displaying the contents of the relation namedgroup_dataas shown below. Here you can observe that the resulting schema has two columns −
One is age, by which we have grouped the relation.
The other is a bag, which contains the group of tuples, student records with the respective age.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hydera bad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,984802233 8,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336 ,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334, trivendram)})
You can see the schema of the table after grouping the data using the describe command as shown below.
grunt> Describe group_data;
group_data: {group: int,student_details: {(id: int,firstname: chararray,
lastname: chararray,age: int,phone: chararray,city: chararray)}}
In the same way, you can get the sample illustration of the schema using the illustrate command as shown below.
$ Illustrate group_data;
It will produce the following output −
-------------------------------------------------------------------------------------------------
|group_data| group:int | student_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)}|
-------------------------------------------------------------------------------------------------
| | 21 | { 4, Preethi, Agarwal, 21, 9848022330, Pune), (1, Rajiv, Reddy, 21, 9848022337, Hyderabad)}|
| | 2 | {(2,siddarth,Battacharya,22,9848022338,Kolkata),(003,Rajesh,Khanna,22,9848022339,Delhi)}|
-------------------------------------------------------------------------------------------------
Grouping by Multiple Columns
Let us group the relation by age and city as shown below.
grunt> group_multiple = GROUP student_details by (age, city);
You can verify the content of the relation named group_multiple using the Dump operator as shown below.
grunt> Dump group_multiple;
((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
(24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})
Group All
You can group a relation by all the columns as shown below.
grunt>group_all = GROUP student_details All;
Now, verify the content of the relation group_all as shown below.
grunt> Dump group_all;
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334 ,trivendram),
(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuw aneshwar),
(4,Preethi,Agarwal,21,9848022330,Pune),(3,Rajesh,Khanna,22,9848022339,Delhi),
(2,siddarth,Battacharya,22,9848022338,Kolkata),(1,Rajiv,Reddy,21,9848022337,Hyd erabad)})
Cogroup Operator:
TheCOGROUPoperator works more or less in the same way as theGROUPoperator. The only difference between the two operators is that thegroupoperator is normally used with one relation, while thecogroupoperator is used in statements involving two or more relations.
Grouping Two Relations using Cogroup
Assume that we have two files namelystudent_details.txtandemployee_details.txtin the HDFS directory/pig_data/as shown below.
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
employee_details.txt
001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai
And we have loaded these files into Pig with the relation names student_details and employee_details respectively, as shown below.
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
grunt> emp_details = LOAD 'hdfs://localhost:9000/pig_data/emp_details.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
Now, let us group the records/tuples of the relations student_details and employee_details with the key age, as shown below.
grunt> cogroup_data = COGROUP student_details by age, employee_details by age;
Verification
Verify the relationcogroup_datausing theDUMPoperator as shown below.
grunt> Dump cogroup_data;
Output
It will produce the following output, displaying the contents of the relation namedcogroup_dataas shown below.
(21,{(4,Preethi,Agarwal,21,9848022330,Pune), (1,Rajiv,Reddy,21,9848022337,Hyderabad)},
{ })
(22,{ (3,Rajesh,Khanna,22,9848022339,Delhi), (2,siddarth,Battacharya,22,9848022338,Kolkata) },
{ (6,Maggy,22,Chennai),(1,Robin,22,newyork) })
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336 ,Bhuwaneshwar)},
{(5,David,23,Bhuwaneshwar),(3,Maya,23,Tokyo),(2,BOB,23,Kolkata)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334, trivendram)},
{ })
(25,{ },
{(4,Sara,25,London)})
The cogroup operator groups the tuples from each relation according to age where each group depicts a particular age value.
For example, if we consider the 1st tuple of the result, it is grouped by age 21. And it contains two bags:
the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and
the second bag contains all the tuples from the second relation (employee_details in this case) having age 21.
In case a relation doesn’t have tuples having the age value 21, it returns an empty bag.