Chapter 16. MapReduce
16.1. About MapReduce
The JBoss Data Grid MapReduce model is an adaptation of Google's MapReduce model.
MapReduce is a programming model used to process and generate large data sets. It is typically used in distributed computing environments where nodes are clustered. In JBoss Data Grid, MapReduce allows transparent distributed processing of very large amounts of data across the data grid by performing most computations as close as possible to where the data is stored.
MapReduce uses the two distinct computational phases of map and reduce to process information requests through the data grid. The process occurs as follows:
- The user initiates a task on a cache instance, which runs on a cluster node (the master node).
- The master node receives the task input, divides the task, and sends tasks for map phase execution on the grid.
- Each node executes a Mapper function on its input, and returns intermediate results back to the master node.
  - If the distributedReducePhase parameter is set to "true", the map results are inserted in an intermediary cache, rather than being returned to the master node.
  - If a Combiner has been specified with task.combinedWith(Reducer), the Combiner is called on the Mapper results and the combiner's results are returned to the master node or inserted in the intermediary cache.
- The master node collects all intermediate results from the map phase and merges all intermediate values associated with the same intermediate key.
  - If the distributedReducePhase parameter is set to "true", the merging of the intermediate values is done on each node, as the Mapper or Combiner results are inserted in the intermediary cache. The master node only receives the intermediate keys.
- The master node sends intermediate key/value pairs for reduction on the grid.
  - If the distributedReducePhase parameter is set to "false", the reduction phase is executed only on the master node.
- The final results of the reduction phase are returned.
  - If the distributedReducePhase parameter is set to "true", the master node running the task receives all results from the reduction phase and returns the final result to the MapReduce task initiator.
  - If a Collator has been specified with task.execute(Collator), the Collator is executed on the reduction phase results, and the Collator result is returned to the task initiator.
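The phases described above can be illustrated with a small word-count sketch in plain Java. It does not use the JBoss Data Grid API: the class MapReduceSketch and its methods (map, combine, group, reduce, run, collate) are hypothetical names chosen for illustration, each "node" is modeled as an in-memory map inside a single JVM, and the flow roughly corresponds to running with a Combiner and with distributedReducePhase set to "false" (merging and reduction happen in one place).

```java
import java.util.*;

public class MapReduceSketch {

    // Map phase (hypothetical stand-in for a Mapper): emit (word, 1) for each word held by one node.
    static List<Map.Entry<String, Integer>> map(Map<String, String> localEntries) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String text : localEntries.values()) {
            for (String word : text.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return out;
    }

    // Merge step: collect all intermediate values that share the same intermediate key.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase (stand-in for a Reducer): sum the values of one key.
    static int reduce(String key, Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // Combiner: apply the reduce logic to one node's map output before it leaves that node.
    static List<Map.Entry<String, Integer>> combine(List<Map.Entry<String, Integer>> mapped) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : group(mapped).entrySet()) {
            out.add(new AbstractMap.SimpleEntry<>(e.getKey(), reduce(e.getKey(), e.getValue().iterator())));
        }
        return out;
    }

    // "Master node" logic: gather combined results from every node, merge by key, then reduce.
    static Map<String, Integer> run(List<Map<String, String>> nodes) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (Map<String, String> node : nodes) {
            intermediate.addAll(combine(map(node)));
        }
        Map<String, Integer> reduced = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(intermediate).entrySet()) {
            reduced.put(e.getKey(), reduce(e.getKey(), e.getValue().iterator()));
        }
        return reduced;
    }

    // Collator stand-in: turn the reduced map into a single result (the most frequent word).
    static String collate(Map<String, Integer> reduced) {
        String best = null;
        for (Map.Entry<String, Integer> e : reduced.entrySet()) {
            if (best == null || e.getValue() > reduced.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> nodeA = new HashMap<>();
        nodeA.put("k1", "data grid data");
        Map<String, String> nodeB = new HashMap<>();
        nodeB.put("k2", "grid data cache");

        Map<String, Integer> counts = run(Arrays.asList(nodeA, nodeB));
        System.out.println(counts);          // {cache=1, data=3, grid=2}
        System.out.println(collate(counts)); // data
    }
}
```

In this sketch the combiner reduces nodeA's three (word, 1) pairs to two pairs before they are gathered, which is the kind of saving a Combiner is meant to provide. With distributedReducePhase set to "true", the grouping and reduction would instead take place on the nodes through the intermediary cache, which this single-JVM sketch does not model.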