Chapter 35. Integration with Apache Hadoop


35.1. Integration with Apache Hadoop

The JBoss Data Grid connector allows JBoss Data Grid to act as a Hadoop-compliant data source. It accomplishes this integration by providing implementations of Hadoop’s InputFormat and OutputFormat, allowing applications to read data from and write data to a JBoss Data Grid server with optimal data locality. While JBoss Data Grid’s implementations of InputFormat and OutputFormat allow traditional Hadoop Map/Reduce jobs to be run, they may also be used with any tool or utility that supports Hadoop’s InputFormat data source.

35.2. Hadoop Dependencies

The JBoss Data Grid implementations of Hadoop’s formats are found in the following Maven dependency:

<dependency>
    <groupId>org.infinispan.hadoop</groupId>
    <artifactId>infinispan-hadoop-core</artifactId>
    <version>0.3.0.Final-redhat-9</version>
</dependency>

35.3. Supported Hadoop Configuration Parameters

The following parameters are supported:

Table 35.1. Supported Hadoop Configuration Parameters

hadoop.ispn.input.filter.factory
    The name of the filter factory deployed on the server to pre-filter data before reading.
    Default: null (no filtering enabled)

hadoop.ispn.input.cache.name
    The name of the cache from which data is read.
    Default: ___defaultcache

hadoop.ispn.input.remote.cache.servers
    List of servers for the input cache, in the format host1:port1;host2:port2.
    Default: localhost:11222

hadoop.ispn.output.cache.name
    The name of the cache to which data is written.
    Default: default

hadoop.ispn.output.remote.cache.servers
    List of servers for the output cache, in the format host1:port1;host2:port2.
    Default: null (no output cache)

hadoop.ispn.input.read.batch
    Batch size when reading from the cache.
    Default: 5000

hadoop.ispn.output.write.batch
    Batch size when writing to the cache.
    Default: 500

hadoop.ispn.input.converter
    Class name of an implementation of org.infinispan.hadoop.KeyValueConverter, applied after reading from the cache.
    Default: null (no conversion)

hadoop.ispn.output.converter
    Class name of an implementation of org.infinispan.hadoop.KeyValueConverter, applied before writing.
    Default: null (no conversion)
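These keys can be set programmatically on a Hadoop Configuration before building a job. The following is a minimal sketch; the cache names, server addresses, and the batch-size override are illustrative placeholders, not values required by the connector:

```java
import org.apache.hadoop.conf.Configuration;

public class IspnJobConfig {
    // Builds a Configuration carrying the connector keys from Table 35.1.
    public static Configuration build() {
        Configuration conf = new Configuration();

        // Where and what to read (placeholder host and cache name).
        conf.set("hadoop.ispn.input.remote.cache.servers", "host1:11222;host2:11222");
        conf.set("hadoop.ispn.input.cache.name", "map-reduce-in");

        // Where and what to write (placeholder host and cache name).
        conf.set("hadoop.ispn.output.remote.cache.servers", "host1:11222");
        conf.set("hadoop.ispn.output.cache.name", "map-reduce-out");

        // Optional: override the read batch size (default is 5000).
        conf.setInt("hadoop.ispn.input.read.batch", 10000);

        return conf;
    }
}
```

Keys that are left unset fall back to the defaults listed in the table above.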

35.4. Using the Hadoop Connector

InfinispanInputFormat and InfinispanOutputFormat

In Hadoop, the InputFormat interface indicates how a specific data source is partitioned, along with how to read data from each of the partitions, while the OutputFormat interface specifies how to write data.

There are two methods of importance defined in the InputFormat interface:

  1. The getSplits method defines a data partitioner, returning one or more InputSplit instances that contain information regarding a certain section of the data.

    List<InputSplit> getSplits(JobContext context);
  2. The InputSplit can then be used to obtain a RecordReader which will be used to iterate over the resulting dataset.

    RecordReader<K,V> createRecordReader(InputSplit split, TaskAttemptContext context);

These two operations allow for parallelization of data processing across multiple nodes, resulting in Hadoop’s high throughput over large datasets.
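To make the two methods concrete, here is a toy InputFormat that exposes a single hard-coded partition producing exactly one record. SingleSplitInputFormat is a hypothetical illustration, not part of the connector; a split used in a real job must also implement Writable so Hadoop can serialize it to the task nodes:

```java
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Hypothetical InputFormat with a single partition, to illustrate
// getSplits and createRecordReader.
public class SingleSplitInputFormat extends InputFormat<Text, IntWritable> {

    // One InputSplit describing the whole (tiny) dataset.
    public static class TrivialSplit extends InputSplit {
        @Override
        public long getLength() { return 1; }                     // size hint for the scheduler
        @Override
        public String[] getLocations() { return new String[0]; } // no locality information
    }

    @Override
    public List<InputSplit> getSplits(JobContext context) {
        // A real implementation would partition the data source here.
        return Collections.singletonList(new TrivialSplit());
    }

    @Override
    public RecordReader<Text, IntWritable> createRecordReader(InputSplit split,
                                                              TaskAttemptContext context) {
        return new RecordReader<Text, IntWritable>() {
            private boolean consumed = false;

            @Override
            public void initialize(InputSplit s, TaskAttemptContext ctx) { }

            @Override
            public boolean nextKeyValue() {
                if (consumed) return false;
                consumed = true;    // emit exactly one record, then stop
                return true;
            }

            @Override
            public Text getCurrentKey() { return new Text("example-key"); }

            @Override
            public IntWritable getCurrentValue() { return new IntWritable(42); }

            @Override
            public float getProgress() { return consumed ? 1.0f : 0.0f; }

            @Override
            public void close() { }
        };
    }
}
```

A framework driving this InputFormat would create one reader per split and iterate it with nextKeyValue until it returns false.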

In JBoss Data Grid, partitions are generated based on segment ownership: each partition is the set of segments owned by a particular server. By default there are as many partitions as servers in the cluster, and each partition contains all segments associated with that specific server.

Running a Hadoop Map/Reduce job on JBoss Data Grid

The following example configures a Map/Reduce job targeting a JBoss Data Grid cluster:

import org.infinispan.hadoop.*;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

[...]
Configuration configuration = new Configuration();
configuration.set(InfinispanConfiguration.INPUT_REMOTE_CACHE_SERVER_LIST, "localhost:11222");
configuration.set(InfinispanConfiguration.INPUT_REMOTE_CACHE_NAME, "map-reduce-in");
configuration.set(InfinispanConfiguration.OUTPUT_REMOTE_CACHE_SERVER_LIST, "localhost:11222");
configuration.set(InfinispanConfiguration.OUTPUT_REMOTE_CACHE_NAME, "map-reduce-out");

Job job = Job.getInstance(configuration, "Infinispan Integration");
[...]

In order to target the JBoss Data Grid, the job needs to be configured with the InfinispanInputFormat and InfinispanOutputFormat classes:

[...]
// Define the Map and Reduce classes
job.setMapperClass(MapClass.class);
job.setReducerClass(ReduceClass.class);

// Define the JBoss Data Grid implementations
job.setInputFormatClass(InfinispanInputFormat.class);
job.setOutputFormatClass(InfinispanOutputFormat.class);
[...]
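The MapClass and ReduceClass referenced above are ordinary Hadoop Mapper and Reducer implementations. As a hypothetical sketch, assuming the cache keys and values arrive as Text, a word-count style pair might look like the following (the class names match the job setup above, but the logic is purely illustrative):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical mapper: emits (word, 1) for each whitespace-separated token
// in a cache value.
public class MapClass extends Mapper<Text, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Hypothetical reducer: sums the counts for each word.
class ReduceClass extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();  // total occurrences of this word
        }
        context.write(key, new IntWritable(sum));
    }
}
```

With InfinispanOutputFormat configured as above, the reducer output is written back to the cache named by hadoop.ispn.output.cache.name rather than to HDFS.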