Group Operator:

TheGROUPoperator is used to group the data in one or more relations. It collects the data having the same key.

Syntax

Given below is the syntax of thegroupoperator.

grunt> Group_data = GROUP Relation_name BY age;

Example

Assume that we have a file namedstudent_details.txtin the HDFS directory/pig_data/as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

And we have loaded this file into Apache Pig with the relation namestudent_detailsas shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' 
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, 
   city:chararray );

Now, let us group the records/tuples in the relation by age as shown below.

grunt> group_data = GROUP student_details by age;

Verification

Verify the relation group_data using the DUMP operator as shown below.

grunt> Dump group_data;

Output

Then you will get output displaying the contents of the relation namedgroup_dataas shown below. Here you can observe that the resulting schema has two columns −

One is age, by which we have grouped the relation.
The other is a bag, which contains the group of tuples, student records with the respective age.

(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hydera bad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,984802233 8,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336 ,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334, trivendram)})

You can see the schema of the table after grouping the data using the describe command as shown below.

grunt> Describe group_data;


group_data: {group: int,student_details: {(id: int,firstname: chararray,
               lastname: chararray,age: int,phone: chararray,city: chararray)}}

In the same way, you can get the sample illustration of the schema using the illustrate command as shown below.

$ Illustrate group_data;

It will produce the following output −

------------------------------------------------------------------------------------------------- 
|group_data|  group:int | student_details:bag{:tuple(id:int,firstname:chararray,lastname:chararray,age:int,phone:chararray,city:chararray)}|
------------------------------------------------------------------------------------------------- 
|          |     21     | { 4, Preethi, Agarwal, 21, 9848022330, Pune), (1, Rajiv, Reddy, 21, 9848022337, Hyderabad)}| 
|          |     2      | {(2,siddarth,Battacharya,22,9848022338,Kolkata),(003,Rajesh,Khanna,22,9848022339,Delhi)}| 
-------------------------------------------------------------------------------------------------

Grouping by Multiple Columns

Let us group the relation by age and city as shown below.

grunt> group_multiple = GROUP student_details by (age, city);

You can verify the content of the relation named group_multiple using the Dump operator as shown below.

grunt> Dump group_multiple;


((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
(24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})

Group All

You can group a relation by all the columns as shown below.

grunt>group_all = GROUP student_details All;

Now, verify the content of the relation group_all as shown below.

grunt> Dump group_all;


(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334 ,trivendram), 
(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuw aneshwar), 
(4,Preethi,Agarwal,21,9848022330,Pune),(3,Rajesh,Khanna,22,9848022339,Delhi), 
(2,siddarth,Battacharya,22,9848022338,Kolkata),(1,Rajiv,Reddy,21,9848022337,Hyd erabad)})

Cogroup Operator:

TheCOGROUPoperator works more or less in the same way as theGROUPoperator. The only difference between the two operators is that thegroupoperator is normally used with one relation, while thecogroupoperator is used in statements involving two or more relations.

Grouping Two Relations using Cogroup

Assume that we have two files namelystudent_details.txtandemployee_details.txtin the HDFS directory/pig_data/as shown below.

student_details.txt

001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

employee_details.txt

001,Robin,22,newyork 
002,BOB,23,Kolkata 
003,Maya,23,Tokyo 
004,Sara,25,London 
005,David,23,Bhuwaneshwar 
006,Maggy,22,Chennai

And we have loaded these files into Pig with the relation names student_details and employee_details respectively, as shown below.

grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' 
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, 
   city:chararray );

grunt> emp_details = LOAD 'hdfs://localhost:9000/pig_data/emp_details.txt' 
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, 
   city:chararray );

Now, let us group the records/tuples of the relations student_details and employee_details with the key age, as shown below.

grunt> cogroup_data = COGROUP student_details by age, employee_details by age;

Verification

Verify the relationcogroup_datausing theDUMPoperator as shown below.

grunt> Dump cogroup_data;

Output

It will produce the following output, displaying the contents of the relation namedcogroup_dataas shown below.

(21,{(4,Preethi,Agarwal,21,9848022330,Pune), (1,Rajiv,Reddy,21,9848022337,Hyderabad)}, 
   {    })  
(22,{ (3,Rajesh,Khanna,22,9848022339,Delhi), (2,siddarth,Battacharya,22,9848022338,Kolkata) },  
   { (6,Maggy,22,Chennai),(1,Robin,22,newyork) })  
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336 ,Bhuwaneshwar)}, 
   {(5,David,23,Bhuwaneshwar),(3,Maya,23,Tokyo),(2,BOB,23,Kolkata)}) 
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334, trivendram)}, 
   { })  
(25,{   }, 
   {(4,Sara,25,London)})

The cogroup operator groups the tuples from each relation according to age where each group depicts a particular age value.

For example, if we consider the 1st tuple of the result, it is grouped by age 21. And it contains two bags:

the first bag holds all the tuples from the first relation (student_details in this case) having age 21, and
the second bag contains all the tuples from the second relation (employee_details in this case) having age 21.

In case a relation doesn’t have tuples having the age value 21, it returns an empty bag.

Group Operations

Group Operator:

Syntax

Example

Verification

Output

Grouping by Multiple Columns

Group All

Cogroup Operator:

Grouping Two Relations using Cogroup

Verification

Output

results matching ""

No results matching ""