Chapter 6. Understanding compute engines in Feature Store
You can configure compute engines to manage feature pipelines, transformations, and materialization in Feature Store. By integrating Feature Store with distributed processing frameworks, you can centralize feature management and improve data reusability across your organization.
6.1. Using compute engines in Feature Store
You can use compute engines to run feature pipelines on back ends such as Spark, PyArrow, Pandas, or Ray. These pipelines perform transformations, aggregations, joins, and materializations.
Use the compute engine to build and run directed acyclic graphs (DAGs) for modular and scalable workflows.
Available operations:
- `materialize()`: Generates features for offline and online stores in batch and stream modes.
- `get_historical_features()`: Retrieves point-in-time training datasets.
Key concepts for compute engines
The following components work together to execute materialization and retrieval tasks:
| Concept | Definition |
|---|---|
| Compute engine | The interface for executing materialization and retrieval tasks. |
| Feature builder | Constructs a directed acyclic graph (DAG) from a feature view definition for a specific backend. |
| Feature resolver | Arranges tasks in the correct sequence, so each step runs only after its dependencies. |
| DAG node | A single operation in the DAG, such as read, aggregate, or join. |
| Execution plan | Defines the sequence in which nodes run and where their results are saved. |
| Execution context | Collects configuration, registry, stores, entity data, and node outputs. |
Understanding the feature resolver and builder
The feature builder starts a feature resolver that extracts a DAG from FeatureView definitions, resolving dependencies and ensuring the correct execution order. A FeatureView represents a logical data source, whereas a DataSource represents the physical data source.
When defining a feature view, the source can be a physical data source, a derived feature view, or a list of feature views. Use the feature resolver to organize data sources into a directed acyclic graph (DAG). The resolver identifies node dependencies to generate the final output. The FeatureBuilder then builds DAG node objects for each operation, such as read, join, filter, or aggregate.
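The resolver's dependency ordering is a topological sort. The following pure-Python sketch illustrates how feature-view dependencies can be resolved into an execution order; it is an illustration of the concept, not the Feast implementation, and the view names are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each feature view maps to the sources
# or derived views it depends on.
dependencies = {
    "driver_stats_view": ["driver_source"],
    "driver_features_view": ["driver_stats_view"],
    "training_view": ["driver_features_view", "trip_source"],
}

# TopologicalSorter yields each node only after all of its dependencies,
# giving a valid execution order for the DAG.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```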
| Compute engine | Description |
|---|---|
| Spark compute engine | Distributed DAG execution using Apache Spark. Supports point-in-time joins and large-scale materialization. Integrates with the Spark offline store and Spark materialization job. |
| Ray compute engine | Provides distributed DAG execution. Enables automatic resource management and optimization. Integrates with the Ray offline store and Ray materialization job. |
| Local compute engine | Runs on Arrow with a backend that you specify, such as Pandas or Polars. Enables local development, testing, and lightweight feature generation. |
Feature builder node details
Use the feature builder to build a directed acyclic graph (DAG) from a feature view definition to determine the operation order. The feature resolver identifies data sources and sorts the nodes to resolve dependencies.
| Node type | Description |
|---|---|
| Source read node | The process begins by reading the data source. |
| Transformation node or join node | If a feature transformation is defined, the system applies a transformation node. If there are multiple sources, the system applies a join node. |
| Filter node | The system always includes this node to apply time to live (TTL) parameters or user-defined filters. |
| Aggregation node | The system applies this node if the feature view includes defined aggregations. |
| Deduplication node | The system applies this node to remove duplicate rows during retrieval. |
| Validation node | The system applies this node if validation is configured. |
| Output node | The final node. Retrieval output is returned to the caller; materialization output is written to the stores. |
6.2. Understanding the Ray compute engine in Feature Store
The Ray compute engine is a distributed compute implementation that uses Ray for executing feature pipelines. This includes transformations, aggregations, joins, and materializations. It provides scalable and efficient distributed processing for both materialize() and get_historical_features() operations.
The Ray RAG template features:
- Parallel embedding generation: Uses the Ray compute engine to generate embeddings across multiple workers
- Vector search integration: Works with Milvus for semantic similarity search
- Complete RAG pipeline: The Ray compute engine automatically distributes the embedding generation across available workers, making it ideal for processing large datasets efficiently
Ray compute engine features:
- Distributed directed acyclic graph (DAG) execution: Executes the feature computation DAG across Ray clusters
- Intelligent join strategies: Automatically selects between broadcast and distributed joins
- Lazy evaluation: Defers execution for optimal performance
- Resource management: Automatically scales and optimizes resources
- Point-in-time joins: Performs efficient temporal joins for historical feature retrieval
6.3. Getting started using the Ray template
Use the Ray retrieval-augmented generation (RAG) template to build scalable, high-performance applications. This end-to-end framework enables parallel processing of large datasets, which reduces the processing time and memory intensity required on a single machine.
Prerequisites
- You have a data science project with an active workbench.
- Your workbench image includes the Feature Store.
Procedure
Apply the Ray RAG template.
Run the following commands for RAG (retrieval-augmented generation) applications with distributed embedding generation:

```shell
feast init -t ray_rag my_rag_project
cd my_rag_project/feature_repo
```

Your Ray template is now active.
6.4. Configuring Ray in your Feature Store YAML file
Configure the Ray compute engine in Feature Store by defining Ray-specific settings in the feature_store.yaml file. This enables distributed execution of feature pipelines for materialization and historical feature retrieval.
Prerequisites
- Your Ray cluster is running.
Procedure
- Configure the Ray compute engine in your `feature_store.yaml` file:
An example `feature_store.yaml` configuration, with the available options noted in comments:

```yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
  type: ray
  storage_path: data/ray_storage
batch_engine:
  type: ray.engine
  max_workers: 4                    # Maximum number of workers
  broadcast_join_threshold_mb: 100  # Broadcast join threshold (MB)
  max_parallelism_multiplier: 2     # Parallelism multiplier
  target_partition_size_mb: 64     # Target partition size (MB)
  window_size_for_joins: "1H"      # Time window for distributed joins
  ray_address: localhost           # Ray cluster address
```
| Option | Type | Default | Description |
|---|---|---|---|
| `type` | string | | Must be `ray.engine`. |
| `max_workers` | integer | none (uses all cores) | The maximum number of Ray workers. |
| | boolean | true | Enables performance optimizations. |
| `broadcast_join_threshold_mb` | integer | 100 | The size threshold for broadcast joins (MB). |
| `max_parallelism_multiplier` | integer | 2 | The parallelism multiplier applied to the available CPU cores. |
| `target_partition_size_mb` | integer | 64 | The target partition size (MB). |
| `window_size_for_joins` | string | 1H | The time window for distributed joins. |
| `ray_address` | string | none | The Ray cluster address. Setting this option triggers remote mode. |
| | boolean | none | Enables KubeRay mode (overrides `ray_address`). |
| `kuberay_conf` | dictionary | none | The KubeRay configuration dictionary. |
| | boolean | false | Enables Ray progress bars and logging. |
| | boolean | true | Enables distributed joins for large datasets. |
| | string | none | The remote path for batch materialization jobs. |
| | dictionary | none | Ray configuration parameters, such as memory and CPU limits. |
6.5. Understanding Ray mode detection precedence in Feature Store
The Ray compute engine automatically detects the execution mode by using the following precedence:
1. Environment variable → KubeRay mode (if `FEAST_RAY_USE_KUBERAY=true`)
2. Config `kuberay_conf` → KubeRay mode
3. Config `ray_address` → Remote mode
4. Default → Local mode
It is recommended that you use KubeRay mode.
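The precedence can be sketched as the following pure-Python function. This is an illustration of the ordering, not the actual Feast implementation:

```python
import os

def detect_ray_mode(config: dict, env=None) -> str:
    """Illustrative mode detection following the documented precedence."""
    env = os.environ if env is None else env
    # 1. The environment variable wins: KubeRay mode.
    if env.get("FEAST_RAY_USE_KUBERAY") == "true":
        return "kuberay"
    # 2. A kuberay_conf entry in the config also selects KubeRay mode.
    if config.get("kuberay_conf"):
        return "kuberay"
    # 3. A ray_address selects remote mode.
    if config.get("ray_address"):
        return "remote"
    # 4. Otherwise, run locally.
    return "local"

print(detect_ray_mode({"ray_address": "localhost:10001"}, env={}))  # remote
```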
For more information about Ray compute engine usage examples, see Ray compute engine usage examples.
6.6. Using Ray directed acyclic graph node types in Feature Store
Use Ray directed acyclic graph (DAG) node types to build scalable, high-performance feature generation workflows. Ray optimizes resources and reduces execution time by handling data partitioning and statically allocating buffers.
Ray read node
Reads data from Ray-compatible sources:
- Supports Parquet, comma-separated values (CSV), and other formats
- Handles partitioning and schema inference
- Applies field mappings and filters
Ray join node
Performs distributed joins:
- Broadcast join: Use for small datasets (<100MB)
- Distributed join: Use for large datasets with time-based windowing
- Automatic strategy selection: Based on dataset size and cluster resources
Ray filter node
Applies filters and time-based constraints:
- Time to live (TTL)-based filtering
- Timestamp range filtering
- Custom predicate filtering
Ray aggregation node
Handles feature aggregations:
- Windowed aggregations
- Grouped aggregations
- Custom aggregation functions
Ray transformation node
Applies feature transformations:
- Row-level transformations
- Column-level transformations
- Custom transformation functions
Ray write node
Writes results to various targets:
- Online stores
- Offline stores
- Temporary storage
6.7. Using Ray join strategies in Feature Store
The Ray compute engine automatically selects optimal join strategies:
Broadcast join
Used for small feature datasets:
- Selected automatically when feature data is smaller than 100 MB.
- Caches feature data in Ray's object store.
- Distributes entities across the cluster.
- Copies feature data to each worker.
Distributed window join
Used for large feature datasets:
- Selected automatically when feature data is 100 MB or larger.
- Partitions data by time windows.
- Performs point-in-time joins within each window.
- Combines results across windows.
Example of using strategy selection logic
```python
def select_join_strategy(feature_size_mb, threshold_mb):
    if feature_size_mb < threshold_mb:
        return "broadcast"
    else:
        return "distributed_windowed"
```
6.8. Understanding Ray performance optimization for Feature Store
Ray is a distributed execution engine that scales Feast feature engineering and retrieval-augmented generation (RAG) workloads. By processing large datasets in parallel, Ray accelerates pipelines and reduces costs compared to single-node processing.
Use the Ray automatic optimizations for increased efficiency.
Enabling automatic optimization
The Ray compute engine includes several automatic optimizations:
- Partition optimization: Automatically determines optimal partition sizes
- Join strategy selection: Chooses between broadcast and distributed joins
- Resource allocation: Scales workers based on available resources
- Memory management: Handles out-of-core processing for large datasets
Manual tuning example
If you have specific workloads that require custom tuning, you can fine-tune performance:
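For example, you might lower the broadcast threshold and partition size on a memory-constrained cluster. This is an illustrative sketch reusing the `batch_engine` options shown earlier; the values are assumptions, not recommendations:

```yaml
batch_engine:
  type: ray.engine
  max_workers: 8                    # cap workers on a shared cluster
  broadcast_join_threshold_mb: 50   # prefer distributed joins sooner
  target_partition_size_mb: 32      # smaller partitions for limited memory
  window_size_for_joins: "30min"    # tighter windows for dense event data
```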
6.8.1. Understanding Ray monitoring and metrics in Feature Store
You can check cluster resources and monitor job progress when working with the Ray compute engine.
See the following Python example for how to extract monitoring and metrics data:
6.9. Understanding the Spark compute engine in Feature Store
Use the Spark compute engine to run distributed batch materialization and historical retrieval operations. Batch materialization includes materialize and materialize-incremental operations. The engine processes large-scale data from offline stores, such as Snowflake, Google BigQuery, and Apache Spark SQL.
The Spark compute engine can read various data sources and perform distributed or custom transformations. You can use the engine to perform these tasks:
- Read from various data sources, such as Apache Spark SQL, Google BigQuery, and Snowflake.
- Execute distributed feature transformations and aggregations.
- Run custom transformations by using Apache Spark SQL or user-defined functions (UDFs).
6.10. Configuring Spark in your Feature Store YAML file
Configure the Spark compute engine in Feature Store by defining Spark-specific settings in the feature_store.yaml file or programmatically using a Feast RepoConfig. This enables distributed batch materialization and historical feature retrieval using Spark.
Prerequisites
- Your Spark cluster is running.
Procedure
- Configure the Spark compute engine in your `feature_store.yaml` file:
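For example, the following is an illustrative sketch; the engine type `spark.engine` and the `spark_conf` keys shown are assumptions based on common Feast Spark configurations:

```yaml
project: my_project
registry: data/registry.db
provider: local
batch_engine:
  type: spark.engine
  spark_conf:
    spark.master: "local[*]"              # or your cluster's master URL
    spark.app.name: "feast-materialization"
```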
Configuring the Spark offline store example
You can configure the feature store programmatically by using a feature_store.py file. This configuration uses Amazon DynamoDB for the online store and the Spark compute engine for the offline store.
In the following code, replace [YOUR_BUCKET] with the name of your specific S3 bucket.
6.11. Reference material for integrating Ray with other components in Feature Store
You can integrate Ray with Spark, cloud storage, and feature transformations. This enables distributed processing of large-scale machine learning workloads, from feature engineering to serving, and efficient handling of intensive tasks.
Integrating Ray with the Spark offline store in Feature Store
Integrating Ray with cloud storage in Feature Store
Integrating Ray with feature transformations
- Ray native transformations
If you have distributed transformations that use Ray's dataset and parallel processing capabilities, use `mode=ray` in your `BatchFeatureView`: