Chapter 32. Integration with Apache Hadoop

32.1. Integration with Apache Hadoop

The JBoss Data Grid connector allows JBoss Data Grid to act as a Hadoop-compliant data source. It does so by providing implementations of Hadoop’s InputFormat and OutputFormat, which let applications read data from and write data to a JBoss Data Grid server while preserving data locality. Although these implementations make it possible to run traditional Hadoop Map/Reduce jobs, they can also be used with any tool or utility that supports Hadoop’s InputFormat data source.

32.2. Hadoop Dependencies

The JBoss Data Grid implementations of Hadoop’s formats are found in the following Maven dependency:

<dependency>
    <groupId>org.infinispan.hadoop</groupId>
    <artifactId>infinispan-hadoop-core</artifactId>
    <version>0.2.0.Final-redhat-1</version>
</dependency>

32.3. Supported Hadoop Configuration Parameters

The following parameters are supported:


Table 32.1. Supported Hadoop Configuration Parameters
Parameter Name	Description	Default Value
`hadoop.ispn.input.filter.factory`	The name of the filter factory deployed on the server to pre-filter data before reading.	null (no filtering enabled)
`hadoop.ispn.input.cache.name`	The name of the cache from which data is read.	default
`hadoop.ispn.input.remote.cache.servers`	List of servers of the input cache, in the format `host1:port1;host2:port2`.	localhost:11222
`hadoop.ispn.output.cache.name`	The name of the cache to which data is written.	default
`hadoop.ispn.output.remote.cache.servers`	List of servers of the output cache, in the format `host1:port1;host2:port2`.	null (no output cache)
`hadoop.ispn.input.read.batch`	Batch size when reading from the cache.	5000
`hadoop.ispn.output.write.batch`	Batch size when writing to the cache.	500
`hadoop.ispn.input.converter`	Name of a class that implements `org.infinispan.hadoop.KeyValueConverter`, applied after reading from the cache.	null (no conversion enabled)
`hadoop.ispn.output.converter`	Name of a class that implements `org.infinispan.hadoop.KeyValueConverter`, applied before writing to the cache.	null (no conversion enabled)
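
The server list parameters above use a semicolon-separated `host:port` format. As an illustration of that format only, the following minimal sketch splits such a value into individual addresses. The `ServerListParser` class and its `parse` method are hypothetical names introduced here; they are not part of the connector, and the connector's own parsing may differ.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Hadoop connector.
class ServerListParser {

    // Splits a value such as "host1:11222;host2:11223" into unresolved addresses.
    static List<InetSocketAddress> parse(String serverList) {
        List<InetSocketAddress> servers = new ArrayList<>();
        for (String entry : serverList.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing or doubled separator
            }
            int colon = trimmed.lastIndexOf(':');
            if (colon <= 0 || colon == trimmed.length() - 1) {
                throw new IllegalArgumentException("Expected host:port but got: " + trimmed);
            }
            String host = trimmed.substring(0, colon);
            int port = Integer.parseInt(trimmed.substring(colon + 1));
            // createUnresolved avoids a DNS lookup while parsing
            servers.add(InetSocketAddress.createUnresolved(host, port));
        }
        return servers;
    }

    public static void main(String[] args) {
        for (InetSocketAddress address : parse("localhost:11222;host2:11223")) {
            System.out.println(address.getHostString() + " -> " + address.getPort());
        }
    }
}
```

A check of this kind can be useful for validating a server list before it is placed in the job configuration, because a malformed entry then fails early rather than at job runtime.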

32.4. Using the Hadoop Connector

InfinispanInputFormat and InfinispanOutputFormat

In Hadoop, the InputFormat interface indicates how a specific data source is partitioned, along with how to read data from each of the partitions, while the OutputFormat interface specifies how to write data.

Two methods defined in the InputFormat interface are of particular importance:

The getSplits method defines a data partitioner, returning one or more InputSplit instances that contain information regarding a certain section of the data.
```
List<InputSplit> getSplits(JobContext context);
```
The InputSplit can then be used to obtain a RecordReader which will be used to iterate over the resulting dataset.
```
RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context);
```

These two operations allow for parallelization of data processing across multiple nodes, resulting in Hadoop’s high throughput over large datasets.

In JBoss Data Grid, partitions are generated based on segment ownership: each partition is a set of segments located on a particular server. By default there are as many partitions as servers in the cluster, and each partition contains all segments associated with that server.
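
The default partitioning described above can be pictured as grouping segments by the server associated with them. The following is a minimal sketch of that idea under simplifying assumptions: each segment maps to exactly one server, and servers are identified by plain strings. The `SegmentPartitioner` class and the sample server names are hypothetical, and the code does not reflect the connector's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration of "one partition per server"; not connector code.
class SegmentPartitioner {

    // Input: segment id -> server associated with that segment.
    // Output: server -> all segments associated with it (one partition per server).
    static Map<String, Set<Integer>> partitionByOwner(Map<Integer, String> segmentOwners) {
        Map<String, Set<Integer>> partitions = new TreeMap<>();
        for (Map.Entry<Integer, String> entry : segmentOwners.entrySet()) {
            partitions
                .computeIfAbsent(entry.getValue(), server -> new TreeSet<>())
                .add(entry.getKey());
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Assumed sample data: six segments spread over two servers.
        Map<Integer, String> owners = new HashMap<>();
        for (int segment = 0; segment < 6; segment++) {
            owners.put(segment, segment % 2 == 0 ? "server-a:11222" : "server-b:11222");
        }
        partitionByOwner(owners).forEach((server, segments) ->
            System.out.println(server + " holds segments " + segments));
    }
}
```

With two servers the sketch yields two partitions, matching the default of one partition per server, so each Hadoop task can read the segments of a single server.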

Running a Hadoop Map/Reduce job on JBoss Data Grid

The following example configures a Map/Reduce job that targets a JBoss Data Grid cluster:

import org.infinispan.hadoop.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

[...]
Configuration configuration = new Configuration();
configuration.set(InfinispanConfiguration.INPUT_REMOTE_CACHE_SERVER_LIST, "localhost:11222");
configuration.set(InfinispanConfiguration.INPUT_REMOTE_CACHE_NAME, "map-reduce-in");
configuration.set(InfinispanConfiguration.OUTPUT_REMOTE_CACHE_SERVER_LIST, "localhost:11222");
configuration.set(InfinispanConfiguration.OUTPUT_REMOTE_CACHE_NAME, "map-reduce-out");

Job job = Job.getInstance(configuration, "Infinispan Integration");
[...]

To target JBoss Data Grid, configure the job with the InfinispanInputFormat and InfinispanOutputFormat classes:

[...]
// Define the Map and Reduce classes
job.setMapperClass(MapClass.class);
job.setReducerClass(ReduceClass.class);

// Define the JBoss Data Grid implementations
job.setInputFormatClass(InfinispanInputFormat.class);
job.setOutputFormatClass(InfinispanOutputFormat.class);
[...]

