MapReduce:
MapReduce is a core component of the Apache Hadoop software framework.
MapReduce serves two essential functions:
- It parcels out work to the various nodes within the cluster (the map step).
- It organizes and reduces the results from each node into a cohesive answer to a query (the reduce step).
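These two functions are easiest to see in the classic word-count example. The sketch below is illustrative Python, not Hadoop code: the map step emits intermediate (word, 1) pairs, and the reduce step combines each word's values into a total.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (word, 1) pair for every word."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Group intermediate pairs by key (the framework does this between the phases)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: combine each key's list of values into a single result."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["the quick brown fox", "the lazy dog"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
result = reduce_phase(shuffle(pairs))
# result["the"] == 2, result["fox"] == 1
```

In a real cluster, map_phase would run on many nodes in parallel over different input splits, and reduce_phase would run on the nodes assigned each key range.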
Components of MapReduce:
JobTracker : the master node that manages all jobs and resources in a cluster
TaskTrackers : agents deployed to each machine in the cluster to run the map and reduce tasks
JobHistoryServer : a component that tracks completed jobs; it is typically deployed as a separate function or alongside the JobTracker
To distribute input data and collate results, MapReduce operates in parallel across massive cluster sizes. Because cluster size doesn't affect a processing job's final results, jobs can be split across almost any number of servers. MapReduce is also fault-tolerant, with each node periodically reporting its status to a master node. If a node doesn't respond as expected, the master node re-assigns that piece of the job to other available nodes in the cluster. This creates resiliency and makes it practical for MapReduce to run on inexpensive commodity servers.
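The heartbeat-and-reassign behavior described above can be sketched as a toy master (a simplification, not actual JobTracker code): the master records the last status report from each node and, when a node goes silent past a timeout, reclaims that node's tasks so they can be re-assigned elsewhere.

```python
HEARTBEAT_TIMEOUT = 10.0  # seconds; illustrative value, not Hadoop's default

class Master:
    def __init__(self):
        self.last_heartbeat = {}  # node -> timestamp of last status report
        self.assignments = {}     # node -> list of task ids running there
        self.pending = []         # tasks waiting to be re-assigned

    def heartbeat(self, node, now):
        """A node periodically reports its status to the master."""
        self.last_heartbeat[node] = now

    def check_failures(self, now):
        """Reclaim tasks from any node that has missed its heartbeat window."""
        for node, seen in list(self.last_heartbeat.items()):
            if now - seen > HEARTBEAT_TIMEOUT:
                self.pending.extend(self.assignments.pop(node, []))
                del self.last_heartbeat[node]
```

For example, if "node-a" reports at time 0 and the master checks at time 20, node-a's tasks move to the pending queue for re-assignment to healthy nodes.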
MapReduce handles:
- Scheduling: assigning workers to map and reduce tasks.
- Data distribution: moving computation to where the data lives.
- Synchronization: gathering, sorting, and shuffling intermediate data.
- Errors and faults: detecting worker failures and restarting failed tasks.
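The synchronization step above (gather, sort, shuffle) can be sketched in a few lines of Python. The intermediate pairs here are hypothetical mapper output; sorting by key and grouping is what guarantees each reducer sees all values for its keys together.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical intermediate pairs emitted by several mappers.
intermediate = [("b", 1), ("a", 1), ("b", 1), ("a", 1), ("c", 1)]

# Shuffle/sort: sort by key, then group so each key's values arrive together.
intermediate.sort(key=itemgetter(0))
grouped = {key: [v for _, v in group]
           for key, group in groupby(intermediate, key=itemgetter(0))}
# grouped == {"a": [1, 1], "b": [1, 1], "c": [1]}
```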