In this Hadoop Reducer tutorial, we will learn what the Reducer is in Hadoop MapReduce, the different phases of the Hadoop MapReduce Reducer (shuffling, sorting, and the reduce phase), how the Hadoop Reducer class functions, how many reducers are required in Hadoop, and how to change the number of reducers in Hadoop MapReduce.
2. Reducer in Hadoop MapReduce
The output of the mapper is processed by the Reducer. After processing the data, it produces a new set of output, which is stored in HDFS.
The Reducer takes the set of intermediate key-value pairs produced by the mapper as input and runs a reduce function on each of them. This (key, value) data can be aggregated, filtered, and combined in a number of ways, supporting a wide range of processing. The Reducer first processes the intermediate values for a particular key generated by the map function and then generates the output (zero or more key-value pairs). Each key is assigned to exactly one reducer, though a single reducer typically handles many keys. Reducers run in parallel since they are independent of one another. The user decides the number of reducers; by default, the number of reducers is 1.
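The reduce step can be sketched in plain Java, outside the Hadoop API: group the intermediate (key, value) pairs by key and apply one reduce call per key. The word-count data and the summing function below are illustrative stand-ins for a real job's mapper output and aggregation logic:

```java
import java.util.*;

public class ReduceSketch {
    // Group intermediate (key, value) pairs by key and sum each group,
    // mimicking one reduce() call per key. Summing stands in for whatever
    // aggregation a real job would perform.
    public static Map<String, Integer> reduceByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : pairs) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            output.put(e.getKey(), e.getValue().stream().mapToInt(Integer::intValue).sum());
        }
        return output;
    }

    public static void main(String[] args) {
        // Pairs a word-count mapper might emit (made-up data).
        List<Map.Entry<String, Integer>> intermediate = List.of(
                Map.entry("hadoop", 1), Map.entry("reduce", 1),
                Map.entry("hadoop", 1), Map.entry("map", 1));
        System.out.println(reduceByKey(intermediate)); // {hadoop=2, map=1, reduce=1}
    }
}
```

In a real job, the same logic lives in a subclass of Hadoop's `Reducer` class, and the framework performs the grouping before invoking `reduce()` once per key.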
3. Phases of Reducer
There are 3 phases of Reducer in Hadoop MapReduce.
a. Shuffle phase
In this phase, the sorted output from the mappers is the input to the Reducer. The framework fetches the relevant partition of the output of every mapper over HTTP.
b. Sort phase
In this phase, the input fetched from the different mappers is merged and sorted by key, so that all values for the same key are grouped together. The shuffle and sort phases occur concurrently: map outputs are merged as they are fetched.
c. Reduce phase
In this phase, after shuffling and sorting, the reduce task aggregates the key-value pairs. The output of the reduce task is written to the filesystem via OutputCollector.collect(). Reducer output is not sorted.
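The shuffle and sort phases above can be illustrated, again outside the Hadoop API, by combining the already key-sorted outputs of several mappers into one key-ordered stream. The two mapper partitions below are hypothetical:

```java
import java.util.*;

public class ShuffleSortSketch {
    // Merge the key-sorted output lists of several mappers into one
    // key-ordered stream, as the framework does before grouping for reduce().
    public static List<Map.Entry<String, Integer>> shuffleAndSort(
            List<List<Map.Entry<String, Integer>>> mapperOutputs) {
        List<Map.Entry<String, Integer>> merged = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> out : mapperOutputs) {
            merged.addAll(out); // "shuffle": fetch each mapper's partition
        }
        // "sort": order by key (the real framework merge-sorts the
        // already-sorted partitions rather than re-sorting from scratch).
        merged.sort(Map.Entry.comparingByKey());
        return merged;
    }

    public static void main(String[] args) {
        // Sorted partitions from two hypothetical mappers.
        var m1 = List.of(Map.entry("apple", 1), Map.entry("pear", 1));
        var m2 = List.of(Map.entry("apple", 1), Map.entry("fig", 1));
        System.out.println(shuffleAndSort(List.of(m1, m2)));
        // [apple=1, apple=1, fig=1, pear=1]
    }
}
```

After this merge, all values for a given key sit adjacent in the stream, which is exactly what the reduce phase needs to process one key per call.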
4. How many Reducers in Hadoop?
With the help of Job.setNumReduceTasks(int), the user sets the number of reducers for the job. The right number of reducers is around 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
With 0.95, all reducers launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes finish their first round of reducers and launch a second wave, doing a much better job of load balancing.
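As a worked example of the formula above, take a hypothetical cluster of 10 nodes with 8 reducer-capable containers per node:

```java
public class ReducerCount {
    // reducers ≈ factor * nodes * max containers per node, rounded down;
    // factor is 0.95 (single wave) or 1.75 (two waves) as described above.
    public static int reducers(double factor, int nodes, int containersPerNode) {
        return (int) (factor * nodes * containersPerNode);
    }

    public static void main(String[] args) {
        // Hypothetical cluster: 10 nodes, 8 containers each (80 slots).
        System.out.println(reducers(0.95, 10, 8)); // 76 reducers, one wave
        System.out.println(reducers(1.75, 10, 8)); // 140 reducers, two waves
    }
}
```

The computed value would then be passed to Job.setNumReduceTasks(int) in the job driver.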
Increasing the number of reducers:
- Increases the framework overhead.
- Improves load balancing.
- Lowers the cost of failures.
Learn about the Hadoop Mapper class here.