MapReduce: Simplified Data Processing on Large Clusters

Dilip Kumar

MapReduce is a programming model and an associated implementation for processing and generating large data sets.

Example: counting words in a large collection of documents

We can write a map function that reads the document contents and, for each word, emits an intermediate count of 1.

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

We can also write a reduce function that sums all of the counts emitted for a given word.

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));

The user then writes code to fill in a MapReduce specification object with the names of the input and output files, and optional tuning parameters.
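
To make this concrete, below is a minimal single-machine Python sketch of the same word-count logic. The names map_fn, reduce_fn, and run_word_count are mine for illustration and are not part of the MapReduce library's API; the real library distributes the map and reduce work across many machines rather than running it in one process.

from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit (word, 1) for every word in the document.
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum all of the counts emitted for a given word.
    return word, sum(counts)

def run_word_count(documents):
    # Group intermediate pairs by key, as the library's shuffle would.
    grouped = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            grouped[word].append(count)
    return dict(reduce_fn(w, cs) for w, cs in sorted(grouped.items()))

print(run_word_count({"doc1": "the quick fox", "doc2": "the lazy dog"}))
# {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}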

Execution Overview

When the user program calls the MapReduce function, the following sequence of actions occurs.

  1. The MapReduce library in the user program first splits the input files into M pieces of typically 64 megabytes per piece. It then starts up many copies of the program on a cluster of machines.
  2. One of the copies of the program is special — the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.
  3. A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.
  4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function (by default, hash(key) mod R). The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.
  5. When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.
  6. The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
  7. When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.
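
The toy Python sketch below simulates this flow sequentially in a single process, reusing the map_fn and reduce_fn from the word-count example above. It illustrates only the data movement; there is no master, no workers, no RPCs, and no disk, and run_mapreduce is a name I made up for the example. The hash(k) % R step mirrors the default partitioning function, hash(key) mod R.

from collections import defaultdict

def run_mapreduce(splits, map_fn, reduce_fn, R):
    # Map phase: each split yields intermediate pairs, which are
    # partitioned into R regions by hash(key) mod R.
    regions = [defaultdict(list) for _ in range(R)]
    for key, value in splits:                 # one iteration ~ one map task
        for k, v in map_fn(key, value):
            regions[hash(k) % R][k].append(v)
    # Reduce phase: each region is sorted by key, and each unique key
    # plus its grouped values is passed to the reduce function.
    return [[reduce_fn(k, region[k]) for k in sorted(region)]
            for region in regions]            # one inner list per output file

outputs = run_mapreduce([("doc1", "the quick fox"),
                         ("doc2", "the lazy dog")],
                        map_fn, reduce_fn, R=3)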

Locality

The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data.
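
As a rough sketch of that decision (my simplification, not the actual scheduler), the master can prefer an idle worker on a machine that stores a replica of the task's input split and fall back to any idle worker otherwise; the paper also mentions trying a worker near a replica, such as one on the same network switch.

def assign_map_task(task, idle_workers, replica_hosts):
    # Prefer an idle worker on a machine holding a replica of the
    # task's input split; otherwise take any idle worker.
    local = [w for w in idle_workers if w in replica_hosts[task]]
    return local[0] if local else idle_workers[0]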

Task Granularity

Ideally, M and R should be much larger than the number of worker machines. Having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails: the many map tasks it has completed can be spread out across all the other worker machines. In practice, the paper reports computations with M = 200,000 and R = 5,000 on 2,000 worker machines.

Ordering Guarantees

MapReduce guarantees that within a given partition, the intermediate key/value pairs are processed in increasing key order. This ordering guarantee makes it easy to generate a sorted output file per partition, which is useful when the output file format needs to support efficient random access lookups by key, or when users of the output find it convenient to have the data sorted.
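
As an illustration (mine, not from the paper), a key-sorted partition output lets a reader binary-search for a key instead of scanning the whole file:

import bisect

# keys and values stand in for one reduce partition's sorted output.
keys = ["apple", "banana", "cherry"]
values = [3, 7, 2]

def lookup(key):
    # Binary search works only because the output is sorted by key.
    i = bisect.bisect_left(keys, key)
    return values[i] if i < len(keys) and keys[i] == key else None

print(lookup("banana"))  # 7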

Local Execution

To help facilitate debugging, profiling, and small-scale testing, the MapReduce library can also sequentially execute all of the work for a MapReduce operation on the local machine.

Status Information

The master runs an internal HTTP server and exports a set of status pages for human consumption. The status pages show the progress of the computation, such as how many tasks have been completed, how many are in progress, bytes of input, bytes of intermediate data, bytes of output, processing rates, etc.

Counters

The MapReduce library provides a counter facility to count occurrences of various events. For example, user code may want to count the total number of words processed or the number of German documents indexed.
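
Below is a sketch of how user code might use such a facility, modeled on the paper's example of counting capitalized words. The Python Counter object is my stand-in: in the real library, each worker periodically propagates its counter values to the master, which aggregates them, shows them on the status pages, and returns them to the user code when the operation completes.

from collections import Counter

counters = Counter()  # stand-in for the library's counter facility

def map_fn(doc_name, contents):
    for word in contents.split():
        if word[:1].isupper():
            counters["uppercase"] += 1  # count a user-defined event
        yield word, 1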

Reference

To read more about MapReduce, refer to the original paper: https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

Happy learning :-)
