Join Operation:

The JOIN operator is used to combine records from two or more relations. While performing a join operation, we declare one (or a group of) tuple(s) from each relation, as keys. When these keys match, the two particular tuples are matched, else the records are dropped. Joins can be of the following types −

Self-join
Inner-join
Outer-join − left join, right join, and full join

This chapter explains with examples how to use the join operator in Pig Latin. Assume that we have two files namely customers.txt and orders.txt in the /pig_data/ directory of HDFS as shown below.

customers.txt

1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00 
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00

orders.txt

102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060

And we have loaded these two files into Pig with the relationscustomersandordersas shown below.

grunt
>
 customers 
=
 LOAD 
'hdfs://localhost:9000/pig_data/customers.txt'
 USING 
PigStorage
(
','
)
as
(
id
:
int
,
 name
:
chararray
,
 age
:
int
,
 address
:
chararray
,
 salary
:
int
);


grunt
>
 orders 
=
 LOAD 
'hdfs://localhost:9000/pig_data/orders.txt'
 USING 
PigStorage
(
','
)
as
(
oid
:
int
,
 date
:
chararray
,
 customer_id
:
int
,
 amount
:
int
);

Let us now perform various Join operations on these two relations.

Self - join

Self-joinis used to join a table with itself as if the table were two relations, temporarily renaming at least one relation.

Generally, in Apache Pig, to perform self-join, we will load the same data multiple times, under different aliases (names). Therefore let us load the contents of the filecustomers.txtas two tables as shown below.

grunt
>
 customers1 
=
 LOAD 
'hdfs://localhost:9000/pig_data/customers.txt'
 USING 
PigStorage
(
','
)
as
(
id
:
int
,
 name
:
chararray
,
 age
:
int
,
 address
:
chararray
,
 salary
:
int
);


grunt
>
 customers2 
=
 LOAD 
'hdfs://localhost:9000/pig_data/customers.txt'
 USING 
PigStorage
(
','
)
as
(
id
:
int
,
 name
:
chararray
,
 age
:
int
,
 address
:
chararray
,
 salary
:
int
);

Syntax

Given below is the syntax of performingself-joinoperation using theJOINoperator.

grunt
>
 Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;

Example

Let us performself-joinoperation on the relationcustomers, by joining the two relationscustomers1andcustomers2as shown below.

grunt
>
 customers3 
=
 JOIN customers1 BY id
,
 customers2 BY id
;

Verification

Verify the relationcustomers3using theDUMPoperator as shown below.

grunt
>
Dump
 customers3
;

Output

It will produce the following output, displaying the contents of the relationcustomers.

(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join

Inner Joinis used quite frequently; it is also referred to asequijoin. An inner join returns rows when there is a match in both tables.

It creates a new relation by combining column values of two relations (say A and B) based upon the join-predicate. The query compares each row of A with each row of B to find all pairs of rows which satisfy the join-predicate. When the join-predicate is satisfied, the column values for each matched pair of rows of A and B are combined into a result row.

Syntax

Here is the syntax of performinginner joinoperation using theJOINoperator.

grunt
>
 result = JOIN relation1 BY columnname, relation2 BY columnname;

Example

Let us performinner joinoperation on the two relationscustomersandordersas shown below.

grunt
>
 coustomer_orders 
=
 JOIN customers BY id
,
 orders BY customer_id
;

Verification

Verify the relationcoustomer_ordersusing theDUMPoperator as shown below.

grunt
>
Dump
 coustomer_orders
;

Output

You will get the following output that will the contents of the relation namedcoustomer_orders.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Note−

Outer Join: Unlike inner join,outer joinreturns all the rows from at least one of the relations. An outer join operation is carried out in three ways −

Left outer join
Right outer join
Full outer join

Left Outer Join

Theleft outer Joinoperation returns all rows from the left table, even if there are no matches in the right relation.

Syntax

Given below is the syntax of performingleft outer joinoperation using theJOINoperator.

grunt
>
 Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;

Example

Let us perform left outer join operation on the two relations customers and orders as shown below.

grunt
>
 outer_left 
=
 JOIN customers BY id LEFT OUTER
,
 orders BY customer_id
;

Verification

Verify the relationouter_leftusing theDUMPoperator as shown below.

grunt
>
Dump
 outer_left
;

Output

It will produce the following output, displaying the contents of the relationouter_left.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Right Outer Join

Theright outer joinoperation returns all rows from the right table, even if there are no matches in the left table.

Syntax

Given below is the syntax of performingright outer joinoperation using theJOINoperator.

grunt
>
 outer_right = JOIN customers BY id RIGHT, orders BY customer_id;

Example

Let us performright outer joinoperation on the two relationscustomersandordersas shown below.

grunt
>
 outer_right 
=
 JOIN customers BY id RIGHT
,
 orders BY customer_id
;

Verification

Verify the relationouter_rightusing theDUMPoperator as shown below.

grunt
>
Dump
 outer_right

Output

It will produce the following output, displaying the contents of the relationouter_right.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Full Outer Join

Thefull outer joinoperation returns rows when there is a match in one of the relations.

Syntax

Given below is the syntax of performingfull outer joinusing theJOINoperator.

grunt
>
 outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

Example

Let us performfull outer joinoperation on the two relationscustomersandordersas shown below.

grunt
>
 outer_full 
=
 JOIN customers BY id FULL OUTER
,
 orders BY customer_id
;

Verification

Verify the relationouter_fullusing theDUMPoperator as shown below.

grun
>
Dump
 outer_full
;

Output

It will produce the following output, displaying the contents of the relationouter_full.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)

Using Multiple Keys

We can perform JOIN operation using multiple keys.

Syntax

Here is how you can perform a JOIN operation on two tables using multiple keys.

grunt
>
 Relation3_name = JOIN Relation2_name BY (key1, key2), Relation3_name BY (key1, key2);

Assume that we have two files namelyemployee.txtandemployee_contact.txtin the/pig_data/directory of HDFS as shown below.

employee.txt

001,Rajiv,Reddy,21,programmer,003
002,siddarth,Battacharya,22,programmer,003
003,Rajesh,Khanna,22,programmer,003
004,Preethi,Agarwal,21,programmer,003
005,Trupthi,Mohanthy,23,programmer,003
006,Archana,Mishra,23,programmer,003
007,Komal,Nayak,24,teamlead,002
008,Bharathi,Nambiayar,24,manager,001

employee_contact.txt

001,9848022337,[email protected],Hyderabad,003
002,9848022338,[email protected],Kolkata,003
003,9848022339,[email protected],Delhi,003
004,9848022330,[email protected],Pune,003
005,9848022336,[email protected],Bhuwaneshwar,003
006,9848022335,[email protected],Chennai,003
007,9848022334,[email protected],trivendram,002
008,9848022333,[email protected],Chennai,001

And we have loaded these two files into Pig with relationsemployeeandemployee_contactas shown below.

grunt
>
 employee 
=
 LOAD 
'hdfs://localhost:9000/pig_data/employee.txt'
 USING 
PigStorage(',')as(id:int, firstname:chararray, lastname:chararray, age:int, designation:chararray, jobid:int);

grunt> employee_contact = LOAD 'hdfs://localhost:9000/pig_data/employee_contact.txt' USING PigStorage(',')as(id:int, phone:chararray, email:chararray, city:chararray, jobid:int);

Now, let us join the contents of these two relations using theJOINoperator as shown below.

grunt> emp = JOIN employee BY (id,jobid), employee_contact BY (id,jobid);

Verification

Verify the relation emp using the DUMP operator as shown below.

grunt>Dump emp;

Output

It will produce the following output, displaying the contents of the relation namedempas shown below.

(1,Rajiv,Reddy,21,programmer,113,1,9848022337,[email protected],Hyderabad,113)
(2,siddarth,Battacharya,22,programmer,113,2,9848022338,[email protected],Kolka ta,113)  
(3,Rajesh,Khanna,22,programmer,113,3,9848022339,[email protected],Delhi,113)  
(4,Preethi,Agarwal,21,programmer,113,4,9848022330,[email protected],Pune,113)  
(5,Trupthi,Mohanthy,23,programmer,113,5,9848022336,[email protected],Bhuwaneshw ar,113)  
(6,Archana,Mishra,23,programmer,113,6,9848022335,[email protected],Chennai,113)  
(7,Komal,Nayak,24,teamlead,112,7,9848022334,[email protected],trivendram,112)  
(8,Bharathi,Nambiayar,24,manager,111,8,9848022333,[email protected],Chennai,111)

Join Operation

Join Operation:

Self - join

Syntax

Example

Verification

Output

Inner Join

Syntax

Example

Verification

Output

Left Outer Join

Syntax

Example

Verification

Output

Right Outer Join

Syntax

Example

Verification

Output

Full Outer Join

Syntax

Example

Verification

Output

Using Multiple Keys

Syntax

Verification

Output

results matching ""

No results matching ""