Chapter 314. Apache Spark Component

Available as of Camel version 2.17

This documentation page covers the Apache Spark component for Apache Camel. The main purpose of the Spark integration with Camel is to provide a bridge between Camel connectors and Spark tasks. In particular, the Camel connector provides a way to route messages from various transports, dynamically choose a task to execute, use the incoming message as input data for that task, and finally deliver the results of the execution back to the Camel pipeline.

314.1. Supported architectural styles

The Spark component can be used as a driver application deployed into an application server (or executed as a fat jar).

The Spark component can also be submitted as a job directly into the Spark cluster.

While the Spark component is primarily designed to work as a long-running job serving as a bridge between a Spark cluster and other endpoints, you can also use it as a fire-once short job.

314.2. Running Spark in OSGi servers

Currently the Spark component doesn't support execution in an OSGi container. Spark has been designed to be executed as a fat jar, usually submitted as a job to a cluster. For those reasons, running Spark in an OSGi server is at least challenging and is not supported by Camel either.

314.3. URI format

Currently the Spark component supports only producers: it is intended to invoke a Spark job and return the results. You can call an RDD, data frame, or Hive SQL job.

Spark URI format

spark:{rdd|dataframe|hive}

314.3.1. Spark options

The Apache Spark component supports 3 options, which are listed below.

Name	Description	Default	Type
rdd (producer)	RDD to compute against.		JavaRDDLike
rddCallback (producer)	Function performing action against an RDD.		RddCallback
resolvePropertyPlaceholders (advanced)	Whether the component should resolve property placeholders on itself when starting. Only properties which are of String type can use property placeholders.	true	boolean

The Apache Spark endpoint is configured using URI syntax:

spark:endpointType

with the following path and query parameters:

314.3.2. Path Parameters (1 parameter):

Name	Description	Default	Type
endpointType	Required. Type of the endpoint (rdd, dataframe, hive).		EndpointType

314.3.3. Query Parameters (6 parameters):

Name	Description	Default	Type
collect (producer)	Indicates if results should be collected or counted.	true	boolean
dataFrame (producer)	DataFrame to compute against.		Dataset
dataFrameCallback (producer)	Function performing action against a DataFrame.		DataFrameCallback
rdd (producer)	RDD to compute against.		JavaRDDLike
rddCallback (producer)	Function performing action against an RDD.		RddCallback
synchronous (advanced)	Sets whether synchronous processing should be strictly used, or Camel is allowed to use asynchronous processing (if supported).	false	boolean
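
To make the split between the path parameter and the query parameters concrete, the following sketch takes a Spark endpoint URI apart using only the Java standard library. `SparkUriSketch` and its methods are hypothetical names used for illustration; this is not how Camel itself parses endpoint URIs, and it ignores URL encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative helper, not part of Camel: splits a Spark endpoint URI into its parts.
final class SparkUriSketch {

    private static final String SCHEME = "spark:";

    // Returns the path parameter (endpoint type), e.g. "rdd" for "spark:rdd?rdd=#myRdd".
    static String endpointType(String uri) {
        if (!uri.startsWith(SCHEME)) {
            throw new IllegalArgumentException("Not a spark URI: " + uri);
        }
        String rest = uri.substring(SCHEME.length());
        int queryStart = rest.indexOf('?');
        return queryStart < 0 ? rest : rest.substring(0, queryStart);
    }

    // Returns the query parameters in the order they appear. Values are kept verbatim,
    // so registry references keep their leading '#'.
    static Map<String, String> queryParameters(String uri) {
        Map<String, String> params = new LinkedHashMap<>();
        int queryStart = uri.indexOf('?');
        if (queryStart < 0 || queryStart == uri.length() - 1) {
            return params; // no query part
        }
        for (String pair : uri.substring(queryStart + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq < 0) {
                params.put(pair, "");
            } else {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        String uri = "spark:rdd?rdd=#myRdd&rddCallback=#countLinesContaining";
        System.out.println("endpointType = " + endpointType(uri));
        System.out.println("query        = " + queryParameters(uri));
    }
}
```

In the example URI, `rdd` before the question mark is the endpoint type, while `rdd=#myRdd` after it is a query parameter whose `#` prefix marks a registry lookup.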

314.4. Spring Boot Auto-Configuration

The component supports 4 options, which are listed below.

Name	Description	Default	Type
camel.component.spark.enabled	Enable spark component	true	Boolean
camel.component.spark.rdd	RDD to compute against. The option is an org.apache.spark.api.java.JavaRDDLike type.		String
camel.component.spark.rdd-callback	Function performing action against an RDD. The option is an org.apache.camel.component.spark.RddCallback type.		String
camel.component.spark.resolve-property-placeholders	Whether the component should resolve property placeholders on itself when starting. Only properties which are of String type can use property placeholders.	true	Boolean

314.5. RDD jobs

To invoke an RDD job, use the following URI:

Spark RDD producer

spark:rdd?rdd=#testFileRdd&rddCallback=#transformation

Here the `rdd` option refers to the name of an RDD instance (a subclass of `org.apache.spark.api.java.JavaRDDLike`) from the Camel registry, while `rddCallback` refers to an implementation of the `org.apache.camel.component.spark.RddCallback` interface (also from the registry). The RDD callback provides a single method used to apply incoming messages against the given RDD. The results of the callback computation are saved as the body of the exchange.

Spark RDD callback

public interface RddCallback<T> {
    T onRdd(JavaRDDLike rdd, Object... payloads);
}

The following snippet demonstrates how to send a message as input to the job and return the results:

Calling Spark job

String pattern = "job input";
long linesCount = producerTemplate.requestBody("spark:rdd?rdd=#myRdd&rddCallback=#countLinesContaining", pattern, long.class);

The RDD callback for the snippet above, registered as a Spring bean, could look as follows:

Spark RDD callback

@Bean
RddCallback<Long> countLinesContaining() {
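    return new RddCallback<Long>() {
        @Override
        public Long onRdd(JavaRDDLike rdd, Object... payloads) {
            String pattern = (String) payloads[0];
            // filter() is declared on JavaRDD, so cast the raw JavaRDDLike first;
            // this assumes the registered RDD is a JavaRDD<String>, as with textFile().
            JavaRDD<String> lines = (JavaRDD<String>) rdd;
            return lines.filter(line -> line.contains(pattern)).count();
        }
    };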
}

The RDD definition in Spring could look as follows:

Spark RDD definition

@Bean
JavaRDDLike myRdd(JavaSparkContext sparkContext) {
  return sparkContext.textFile("testrdd.txt");
}

314.5.1. Void RDD callbacks

If your RDD callback doesn't return any value back to the Camel pipeline, you can either return a null value or use the VoidRddCallback base class:

Spark void RDD callback

@Bean
RddCallback<Void> rddCallback() {
    return new VoidRddCallback() {
        @Override
        public void doOnRdd(JavaRDDLike rdd, Object... payloads) {
            rdd.saveAsTextFile(output.getAbsolutePath());
        }
    };
}
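
The relationship between a value-returning callback interface and a void base class can be illustrated without Spark on the classpath. In the sketch below every type is a stand-in invented for illustration (`LinesCallback` plays the role of `RddCallback`, `VoidLinesCallback` the role of `VoidRddCallback`, and a `List<String>` the role of the RDD). It shows the adapter idea only and is not the Camel source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class VoidCallbackSketch {

    // Stand-in for RddCallback<T>: every callback must return a value of type T.
    interface LinesCallback<T> {
        T onLines(List<String> lines, Object... payloads);
    }

    // Stand-in for a void base class: subclasses implement a void method,
    // and the adapter returns null to satisfy the generic interface.
    abstract static class VoidLinesCallback implements LinesCallback<Void> {
        protected abstract void doOnLines(List<String> lines, Object... payloads);

        @Override
        public final Void onLines(List<String> lines, Object... payloads) {
            doOnLines(lines, payloads);
            return null;
        }
    }

    // Runs a side-effect-only callback that copies the lines into the sink,
    // then hands back whatever the callback returned (always null here).
    static Object run(List<String> lines, List<String> sink) {
        LinesCallback<Void> callback = new VoidLinesCallback() {
            @Override
            protected void doOnLines(List<String> data, Object... payloads) {
                sink.addAll(data);
            }
        };
        return callback.onLines(lines);
    }

    public static void main(String[] args) {
        List<String> sink = new ArrayList<>();
        Object result = run(Arrays.asList("a", "b"), sink);
        System.out.println("result = " + result + ", sink = " + sink);
    }
}
```

The caller still receives a result object, which is simply null, so a pipeline that consumes the callback through the generic interface needs no special case for void jobs.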

314.5.2. Converting RDD callbacks

If you know what type of input data will be sent to the RDD callback, you can use ConvertingRddCallback and let Camel automatically convert incoming messages before passing them to the callback:

Spark converting RDD callback

@Bean
RddCallback<Long> rddCallback(CamelContext context) {
    return new ConvertingRddCallback<Long>(context, int.class, int.class) {
        @Override
        public Long doOnRdd(JavaRDDLike rdd, Object... payloads) {
            return rdd.count() * (int) payloads[0] * (int) payloads[1];
        }
    };
}
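
The idea of converting payloads before they reach the callback can be sketched with the standard library alone. The names below (`PayloadConversionSketch`, `convert`, `convertAll`, `multiply`) are hypothetical, and the tiny hand-written converter only stands in for Camel's type conversion; it is an assumption-laden illustration, not the ConvertingRddCallback implementation.

```java
import java.util.Arrays;
import java.util.List;

final class PayloadConversionSketch {

    // Hypothetical stand-in for a type converter: supports Integer and String targets only.
    static Object convert(Object value, Class<?> target) {
        if (target.isInstance(value)) {
            return value; // already the requested type
        }
        if (target == Integer.class) {
            return Integer.valueOf(value.toString().trim());
        }
        if (target == String.class) {
            return value.toString();
        }
        throw new IllegalArgumentException("No conversion to " + target.getName());
    }

    // Converts each payload to the type declared at the same position.
    static Object[] convertAll(List<?> payloads, Class<?>... targets) {
        if (payloads.size() != targets.length) {
            throw new IllegalArgumentException("Expected " + targets.length + " payloads");
        }
        Object[] converted = new Object[targets.length];
        for (int i = 0; i < targets.length; i++) {
            converted[i] = convert(payloads.get(i), targets[i]);
        }
        return converted;
    }

    // Stand-in for the callback body: multiplies a line count by two integer payloads.
    static long multiply(long lineCount, Object... payloads) {
        return lineCount * (Integer) payloads[0] * (Integer) payloads[1];
    }

    public static void main(String[] args) {
        // A mixed message body: one Integer and one String holding a number.
        Object[] payloads = convertAll(Arrays.asList(10, "10"), Integer.class, Integer.class);
        System.out.println(multiply(3L, payloads));
    }
}
```

Because the conversion happens before the callback body runs, the body can cast its payloads to the declared types without checking what the sender actually put in the message.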

314.5.3. Annotated RDD callbacks

Probably the easiest way to work with RDD callbacks is to provide a class with a method marked with the @RddCallback annotation:

Annotated RDD callback definition

import static org.apache.camel.component.spark.annotations.AnnotatedRddCallback.annotatedRddCallback;

@Bean
RddCallback<Long> rddCallback() {
    return annotatedRddCallback(new MyTransformation());
}

...

import org.apache.camel.component.spark.annotations.RddCallback;

public class MyTransformation {

    @RddCallback
    long countLines(JavaRDD<String> textFile, int first, int second) {
        return textFile.count() * first * second;
    }

}

If you pass a CamelContext to the annotated RDD callback factory method, the created callback will be able to convert incoming payloads to match the parameters of the annotated method:

Body conversions for annotated RDD callbacks

import static org.apache.camel.component.spark.annotations.AnnotatedRddCallback.annotatedRddCallback;

@Bean
RddCallback<Long> rddCallback(CamelContext camelContext) {
    return annotatedRddCallback(new MyTransformation(), camelContext);
}

...


import org.apache.camel.component.spark.annotations.RddCallback;

public class MyTransformation {

    @RddCallback
    long countLines(JavaRDD<String> textFile, int first, int second) {
        return textFile.count() * first * second;
    }

}

...

// The String "10" is converted to an integer to match the annotated method's parameters
long result = producerTemplate.requestBody("spark:rdd?rdd=#rdd&rddCallback=#rddCallback", Arrays.asList(10, "10"), long.class);

314.6. DataFrame jobs

Instead of working with RDDs, the Spark component can work with DataFrames as well.

To invoke a DataFrame job, use the following URI:

Spark DataFrame producer

spark:dataframe?dataFrame=#testDataFrame&dataFrameCallback=#transformation

Here the `dataFrame` option refers to the name of a DataFrame instance (an `org.apache.spark.sql.Dataset` of `org.apache.spark.sql.Row`) from the Camel registry, while `dataFrameCallback` refers to an implementation of the `org.apache.camel.component.spark.DataFrameCallback` interface (also from the registry). The DataFrame callback provides a single method used to apply incoming messages against the given DataFrame. The results of the callback computation are saved as the body of the exchange.

Spark DataFrame callback

public interface DataFrameCallback<T> {
    T onDataFrame(Dataset<Row> dataFrame, Object... payloads);
}

The following snippet demonstrates how to send a message as input to a job and return the results:

Calling Spark job

String model = "Micra";
long linesCount = producerTemplate.requestBody("spark:dataframe?dataFrame=#cars&dataFrameCallback=#findCarWithModel", model, long.class);

The DataFrame callback for the snippet above, registered as a Spring bean, could look as follows:

Spark DataFrame callback

@Bean
DataFrameCallback<Long> findCarWithModel() {
    return new DataFrameCallback<Long>() {
        @Override
        public Long onDataFrame(Dataset<Row> dataFrame, Object... payloads) {
            String model = (String) payloads[0];
            return dataFrame.where(dataFrame.col("model").eqNullSafe(model)).count();
        }
    };
}

The DataFrame definition in Spring could look as follows:

Spark DataFrame definition

@Bean
Dataset<Row> cars(HiveContext hiveContext) {
    Dataset<Row> jsonCars = hiveContext.read().json("/var/data/cars.json");
    jsonCars.registerTempTable("cars");
    return jsonCars;
}

314.7. Hive jobs

Instead of working with RDDs or DataFrames, the Spark component can also receive Hive SQL queries as payloads. To send a Hive query to the Spark component, use the following URI:

Spark Hive producer

spark:hive

The following snippet demonstrates how to send a message as input to a job and return the results:

Calling Spark job

long carsCount = template.requestBody("spark:hive?collect=false", "SELECT * FROM cars", Long.class);
List<Row> cars = template.requestBody("spark:hive", "SELECT * FROM cars", List.class);

The table we want to query should be registered in a HiveContext before we query it. For example, in Spring such a registration could look as follows:

Spark DataFrame definition

@Bean
Dataset<Row> cars(HiveContext hiveContext) {
    Dataset<Row> jsonCars = hiveContext.read().json("/var/data/cars.json");
    jsonCars.registerTempTable("cars");
    return jsonCars;
}

314.8. See Also

Configuring Camel
Component
Endpoint
Getting Started
