Hadoop Programming
Basic Hadoop API:
Mapper Class:
The Mapper class defines the Map job. Maps input key-value pairs to a set of intermediate key-value pairs. Maps are the individual tasks that transform the input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
map is the most prominent method of the Mapper class. The syntax is defined below:
//Mapper
void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
void configure(JobConf job)
void close() throws IOException
Reducer Class:
The Reducer class defines the Reduce job in MapReduce. It reduces a set of intermediate values that share a key to a smaller set of values. Reducer implementations can access the Configuration for a job via the JobContext.getConfiguration() method. A Reducer has three primary phases − Shuffle, Sort, and Reduce.
Shuffle− The Reducer copies the sorted output from each Mapper using HTTP across the network.
Sort− The framework merge-sorts the Reducer inputs by keys (since different Mappers may have output the same key). The shuffle and sort phases occur simultaneously, i.e., while outputs are being fetched, they are merged.
Reduce− In this phase the reduce (Object, Iterable, Context) method is called for each <key, (collection of values)> in the sorted inputs.
reduce is the most prominent method of the Reducer class. The syntax is defined below −
//Reducer
void reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output,
Reporter reporter)
void configure(JobConf job)
void close() throws IOException
Partitioner Class:
A partitioner works like a condition in processing an input dataset. The partition phase takes place after the Map phase and before the Reduce phase.
The number of partitioners is equal to the number of reducers. That means a partitioner will divide the data according to the number of reducers. Therefore, the data passed from a single partitioner is processed by a single Reducer.
A partitioner partitions the key-value pairs of intermediate Map-outputs. It partitions the data using a user-defined condition, which works like a hash function. The total number of partitions is same as the number of Reducer tasks for the job. Let us take an example to understand how the partitioner works.
The syntax is defined below:
//Partitioner
void getPartition(K2 key, V2 value, int numPartitions)
Example: "Hello World" Word Count
Map(String docid, String text):
for each word w in text:
Emit(w, 1);
Reduce(String term, Iterator<Int> values):
int sum = 0;
for each v in values:
sum += v;
Emit(term, value);
Three Gotchas:
- Avoid Object Creation, at all costs
- Execution framework reuses value in reducer
- Passing parameters into mappers and reducers: DistributedCache for larger data