17.2. MapReduceTask Distributed Execution
Distributed Execution of the MapReduceTask occurs in three phases:
- Mapping phase.
- Outgoing Key and Outgoing Value Migration.
- Reduce phase.
The Mapping phase occurs as follows:
Procedure 17.1. Mapping Phase
- The input keys are grouped according to their owner nodes.
- On each node the
Mapper
function processes all key/value pairs local to that node. - The results of the mapping process are collected using a
Collector
. - If a
Reducer
is specified, it is applied to all intermediate values collected for each outgoing key (KOut
,VOut
).
The Outgoing Key and Outgoing Value migration phase occurs as follows:
Procedure 17.2. Outgoing Key and Outgoing Value Migration Phase
- Intermediate keys exposed by the
Mapper
are grouped by the intermediate outgoing key (KOut
values. This grouping preserves the keys, as the mapping phase, when applied to other nodes in the cluster, may generate identical intermediate keys. - Once the Reduce phase has been invoked, (as described in Step 4 of the Mapping Phase above), an underlying hashing mechanism, a temporary distributed cache, and a DeltaAware cache insertion mechanism are used to:
- hash the intermediate key
KOut
using its owner node. - migrate the hashed
KOut
key and its correspondingVOut
value to the same owner node.
- The list of
KOut
keys are returned to the master node.VOut
values are not returned as they are not required by the master node.
The Reduce phase occurs as follows:
Procedure 17.3. Reduce Phase
KOut
keys are grouped according to owner node.- The reduce operation applies to each node and its input (grouped
KOut
keys) as follows:- The reduce operation locates the temporary distributed cache created during the migration phase on the target node.
- For each
KOut
key, a list ofVOut
values is taken from the temporary cache. - The
KOut
andVOut
values are wrapped in anIterator
and the reduce operation is applied to the result.
- The reduce operation generates a map where each key is
KOut
and each value isVOut
.Each node has its own map. - The maps from each node in the cluster are combined into a single map (
M
). - The MapReduce task returns map (
M
) to the node that initiated theMapReduce
task.