What is Mapper in Hadoop MapReduce

Datetime: 2017-04-20 05:44:23         Topic: Hadoop

1. Objective

In this Hadoop mapper tutorial, we will learn what a mapper is in MapReduce, how key-value pairs are generated in Hadoop, what InputSplit and RecordReader are, how the mapper works, and how to calculate the number of mappers required for a given amount of data.

2. Mapper in Hadoop MapReduce

Let us understand what a mapper is in MapReduce and what it does.

A mapper task processes each input record and generates new <key, value> pairs. These <key, value> pairs can be completely different from the input pair. The output of a mapper task is the full collection of all these <key, value> pairs. Before the output of each mapper task is written, it is partitioned on the basis of the key and then sorted. Partitioning ensures that all the values for a given key end up in the same partition, so that they can be grouped together.
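The flow described above (map each record, partition the output by key, then sort) can be sketched in plain Python. This is only a conceptual illustration, not Hadoop's API: the word-count map function, the `partition` helper and its sum-of-bytes hash are all assumptions made for the example.

```python
from collections import defaultdict

def map_record(offset, line):
    """Hypothetical word-count map function: emits one (word, 1) pair per word."""
    for word in line.split():
        yield (word.lower(), 1)

def partition(key, num_partitions):
    """Stand-in partitioner: a deterministic hash of the key modulo the partition count."""
    return sum(key.encode("utf-8")) % num_partitions

def run_map_task(records, num_partitions):
    """Apply the map function to every record, then partition and sort the output."""
    partitions = defaultdict(list)
    for offset, line in records:
        for key, value in map_record(offset, line):
            partitions[partition(key, num_partitions)].append((key, value))
    # Sort each partition by key, as happens before the map output is written.
    return {p: sorted(pairs) for p, pairs in partitions.items()}

# Input records are (byte offset, line) pairs.
records = [(0, "the quick fox"), (14, "the lazy dog")]
output = run_map_task(records, num_partitions=2)
for p, pairs in sorted(output.items()):
    print(p, pairs)
```

Because the partition is computed from the key alone, both ("the", 1) pairs land in the same partition, which is what lets all the values for one key be grouped together later.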

The MapReduce framework generates one map task for each InputSplit generated by the InputFormat for the job.

A mapper only understands <key, value> pairs, so data must first be converted into <key, value> pairs before it is passed to the mapper.

3. How are key-value pairs generated in Hadoop?

  • InputSplit – It is the logical representation of data. It describes a unit of work that is processed by a single map task in a MapReduce program.
  • RecordReader – It communicates with the InputSplit and converts the data into key-value pairs suitable for reading by the mapper. By default, a job uses TextInputFormat, whose RecordReader turns each line of text into one key-value pair. The RecordReader keeps reading from the InputSplit until the whole split has been read.

4. Mapper process in Hadoop

Let us now see how the mapper works in Hadoop.

InputSplits convert the physical representation of the data (blocks) into a logical representation for the mapper. For example, a 100 MB file stored with a 64 MB block size occupies two blocks, so two InputSplits are required to read it. One InputSplit is created for each block, and one RecordReader and one mapper are created for each InputSplit.

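The number of InputSplits does not always depend on the number of blocks. We can customize the number of splits for a particular file by setting the mapred.max.split.size property during job execution (in newer Hadoop releases this property is named mapreduce.input.fileinputformat.split.maxsize).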
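To make the effect of the maximum split size concrete, here is a small Python sketch of how a file might be cut into splits. It assumes a 64 MB block size (the text does not state one) and uses the rule split size = max(min size, min(max size, block size)); the function names are hypothetical, and the sketch ignores details of the real implementation such as the tolerance that lets the last split be slightly larger.

```python
MB = 1024 * 1024

def compute_split_size(block_size, min_size=1, max_size=None):
    """Split size = max(min_size, min(max_size, block_size)).

    With no maximum configured, the split size equals the block size.
    """
    if max_size is None:
        max_size = block_size
    return max(min_size, min(max_size, block_size))

def compute_splits(file_size, split_size):
    """Return a list of (start_offset, length) tuples covering the file."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append((offset, length))
        offset += length
    return splits

# 100 MB file, 64 MB blocks (assumed), no maximum split size: one split per block.
default_splits = compute_splits(100 * MB, compute_split_size(64 * MB))

# Same file with the maximum split size set to 25 MB: more, smaller splits.
custom_splits = compute_splits(100 * MB, compute_split_size(64 * MB, max_size=25 * MB))

print(len(default_splits), len(custom_splits))
```

Under these assumptions the default case yields two splits (64 MB and 36 MB), while a 25 MB maximum yields four splits, and therefore four mappers, for the same file.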

The RecordReader's responsibility is to keep reading data and converting it into key-value pairs until the end of the file. It assigns a byte offset (a unique number) to each line in the file, which serves as the key, with the line's content as the value. Each key-value pair is then sent to the mapper. The output of the mapper program is called intermediate data (key-value pairs that the reducer can understand).
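The byte-offset keys described above can be illustrated with a short Python sketch. It is not Hadoop's RecordReader; the `line_records` function is a hypothetical stand-in that reads an in-memory byte string and ignores split boundaries.

```python
def line_records(data: bytes):
    """Yield (byte_offset, line_text) pairs, one per line of the input."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        # The key is the offset of the line's first byte; the value is the
        # line's text without its trailing newline.
        yield offset, raw_line.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw_line)

data = b"hello world\nfoo bar\nbaz\n"
records = list(line_records(data))
print(records)  # [(0, 'hello world'), (12, 'foo bar'), (20, 'baz')]
```

Each offset is unique within the file because it counts the bytes of every preceding line, including the newline characters.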

5. How many map tasks in Hadoop?

The number of map tasks in a program is usually driven by the total size of the input, that is, the total number of blocks of the input files. For maps, the right level of parallelism seems to be around 10-100 maps per node, although it has been set as high as 300 maps for CPU-light map tasks. Since task setup takes some time, it is better if each map takes at least a minute to execute.

If we have a block size of 128 MB and we expect 10 TB of input data, we will have about 82,000 maps (10 * 1024 * 1024 MB / 128 MB = 81,920). Ultimately the number of maps is determined by the InputFormat.

Number of mappers = (total data size) / (input split size)

Example: the data size is 1 TB and the input split size is 100 MB. Taking 1 TB as 1000 * 1000 MB:

Number of mappers = (1000 * 1000) / 100 = 10,000
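The same arithmetic can be checked with a few lines of Python. The `estimate_mappers` helper is a hypothetical name used only for this sketch; it rounds up, on the assumption that a final partial split still needs its own mapper.

```python
def estimate_mappers(total_size_mb, split_size_mb):
    """Estimate the mapper count as total size / split size, rounded up."""
    return (total_size_mb + split_size_mb - 1) // split_size_mb

# 1 TB taken as 1000 * 1000 MB, with 100 MB splits.
print(estimate_mappers(1000 * 1000, 100))       # 10000

# 10 TB taken as 10 * 1024 * 1024 MB, with 128 MB splits.
print(estimate_mappers(10 * 1024 * 1024, 128))  # 81920
```

This is only an estimate from sizes; the actual number of map tasks is whatever number of InputSplits the job's InputFormat produces.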

Learn about Reducer in Hadoop MapReduce.