27.3. Using the Hadoop Connector
InfinispanInputFormat and InfinispanOutputFormat
In Hadoop, the InputFormat interface indicates how a specific data source is partitioned, along with how to read data from each of the partitions, while the OutputFormat interface specifies how to write data.
Two important methods are defined in the InputFormat interface:
List<InputSplit> getSplits(JobContext context);
RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context);
The getSplits method defines a data partitioner, returning one or more InputSplit instances that contain information regarding a certain section of the data. The InputSplit can then be used to obtain a RecordReader which will be used to iterate over the resulting dataset. These two operations allow for parallelization of data processing across multiple nodes, resulting in Hadoop's high throughput over large datasets.
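The split-then-read contract can be illustrated with a small in-memory analogue. This is only a sketch: the class SplitSketch, the RangeSplit type, and the totalLength helper are hypothetical names invented for illustration, and the code deliberately avoids the Hadoop API so that it runs on its own. It partitions a list into index ranges, obtains an iterator per range, and processes the ranges independently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical in-memory analogue of the InputFormat contract; this is not Hadoop's API.
class SplitSketch {

    // Stands in for InputSplit: describes one section of the data as a half-open index range.
    static final class RangeSplit {
        final int start;
        final int end;

        RangeSplit(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Stands in for getSplits: partitions `size` records into at most `parts` ranges.
    static List<RangeSplit> getSplits(int size, int parts) {
        if (parts < 1) {
            throw new IllegalArgumentException("parts must be at least 1");
        }
        List<RangeSplit> splits = new ArrayList<>();
        int chunk = (size + parts - 1) / parts; // ceiling division
        for (int start = 0; start < size; start += chunk) {
            splits.add(new RangeSplit(start, Math.min(start + chunk, size)));
        }
        return splits;
    }

    // Stands in for createRecordReader: iterates only over the records of one split.
    static Iterator<String> createRecordReader(List<String> data, RangeSplit split) {
        return data.subList(split.start, split.end).iterator();
    }

    // Processes each split independently (here on a parallel stream) and sums the results.
    static int totalLength(List<String> data, int parts) {
        return getSplits(data.size(), parts).parallelStream()
                .mapToInt(split -> {
                    int sum = 0;
                    Iterator<String> reader = createRecordReader(data, split);
                    while (reader.hasNext()) {
                        sum += reader.next().length();
                    }
                    return sum;
                })
                .sum();
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "bb", "ccc", "dddd", "eeeee");
        System.out.println(getSplits(data.size(), 2).size()); // number of splits
        System.out.println(totalLength(data, 2));             // combined result of all splits
    }
}
```

Because each split is read through its own iterator, no worker needs to see data outside its range, which is what lets the work be spread across nodes in a real cluster.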
In JBoss Data Grid, partitions are generated based on segment ownership, meaning that each partition is a set of segments on a certain server. By default there will be as many partitions as servers in the cluster, and each partition will contain all segments associated with that specific server.
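The default grouping described above can be modelled in a few lines. This is a minimal sketch, not the connector's actual code: the class SegmentPartitionSketch, the partitionsByServer method, and the server names are hypothetical, and it assumes for simplicity that each segment is assigned to exactly one server.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of segment-based partitioning; not the connector's actual code.
class SegmentPartitionSketch {

    // owners[i] is the name of the server that owns segment i (an assumed input shape).
    // Returns one partition per server, holding all segments owned by that server.
    static Map<String, List<Integer>> partitionsByServer(String[] owners) {
        Map<String, List<Integer>> partitions = new TreeMap<>();
        for (int segment = 0; segment < owners.length; segment++) {
            partitions.computeIfAbsent(owners[segment], server -> new ArrayList<>())
                      .add(segment);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Six segments spread over three servers (made-up names).
        String[] owners = {"server-a", "server-b", "server-a", "server-c", "server-b", "server-c"};
        partitionsByServer(owners).forEach((server, segments) ->
                System.out.println(server + " -> " + segments));
    }
}
```

With the sample ownership array, the three servers yield three partitions: segments 0 and 2 for server-a, 1 and 4 for server-b, and 3 and 5 for server-c.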
Running a Hadoop Map Reduce job on JBoss Data Grid
Example of configuring a Map Reduce job targeting a JBoss Data Grid cluster:
To target JBoss Data Grid, configure the job with the InfinispanInputFormat and InfinispanOutputFormat classes, typically by passing them to the setInputFormatClass and setOutputFormatClass methods of the Hadoop Job.