Chapter 3. Deploying a RAG stack in a data science project
This feature is currently available in Red Hat OpenShift AI as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
As an OpenShift cluster administrator, you can deploy a Retrieval-Augmented Generation (RAG) stack in OpenShift AI. This stack provides the infrastructure, including LLM inference, vector storage, and retrieval services that data scientists and AI engineers use to build conversational workflows in their projects.
To deploy the RAG stack in a data science project, complete the following tasks:
- Activate the Llama Stack Operator in OpenShift AI.
- Enable GPU support on the OpenShift cluster. This task includes installing the required NVIDIA Operators.
- Deploy an inference model, for example, the llama-3.2-3b-instruct model. This task includes creating a storage connection and configuring GPU allocation.
-
Create a
LlamaStackDistributioninstance to enable RAG functionality. This action deploys LlamaStack alongside a Milvus vector store and connects both components to the inference model. - Ingest domain data into Milvus by running Docling in a data science pipeline or Jupyter notebook. This process keeps the embeddings synchronized with the source data.
- Expose and secure the model endpoints.
3.1. Overview of RAG Copy linkLink copied to clipboard!
Retrieval-augmented generation (RAG) in OpenShift AI enhances large language models (LLMs) by integrating domain-specific data sources directly into the model’s context. Domain-specific data sources can be structured data, such as relational database tables, or unstructured data, such as PDF documents.
RAG indexes content and builds an embedding store that data scientists and AI engineers can query. When data scientists or AI engineers pose a question to a RAG chatbot, the RAG pipeline retrieves the most relevant pieces of data, passes them to the LLM as context, and generates a response that reflects both the prompt and the retrieved content.
By implementing RAG, data scientists and AI engineers can obtain tailored, accurate, and verifiable answers to complex queries based on their own datasets within a data science project.
3.1.1. Audience for RAG Copy linkLink copied to clipboard!
The target audience for RAG is practitioners who build data-grounded conversational AI applications using OpenShift AI infrastructure.
- For Data Scientists
- Data scientists can use RAG to prototype and validate models that answer natural-language queries against data sources without managing low-level embedding pipelines or vector stores. They can focus on creating prompts and evaluating model outputs instead of building retrieval infrastructure.
- For MLOps Engineers
- MLOps engineers typically deploy and operate RAG pipelines in production. Within OpenShift AI, they manage LLM endpoints, monitor performance, and ensure that both retrieval and generation scale reliably. RAG decouples vector store maintenance from the serving layer, enabling MLOps engineers to apply CI/CD workflows to data ingestion and model deployment alike.
- For Data Engineers
- Data engineers build workflows to load data into storage that OpenShift AI indexes. They keep embeddings in sync with source systems, such as S3 buckets or relational tables to ensure that chatbot responses are accurate.
- For AI Engineers
- AI engineers architect RAG chatbots by defining prompt templates, retrieval methods, and fallback logic. They configure agents and add domain-specific tools, such as OpenShift job triggers, enabling rapid iteration.
3.2. Overview of vector databases Copy linkLink copied to clipboard!
Vector databases are a crucial component of retrieval-augmented generation (RAG) in OpenShift AI. They store and index vector embeddings that represent the semantic meaning of text or other data. When you integrate vector databases with Llama Stack in OpenShift AI, you can build RAG applications that combine large language models (LLMs) with relevant, domain-specific knowledge.
Vector databases provide you with the following capabilities:
- Store vector embeddings generated by embedding models.
- Support efficient similarity search to retrieve semantically related content.
- Enable RAG workflows by supplying the LLM with contextually relevant data from a specific domain.
When you deploy RAG workloads in OpenShift AI, you can deploy vector databases through the Llama Stack Operator. Currently, OpenShift AI supports the following vector databases:
- Inline Milvus Lite An Inline Milvus vector database runs embedded within the Llama Stack Distribution (LSD) pod and is suitable for lightweight experimentation and small-scale development. Inline Milvus stores data in a local SQLite database and is limited in scale and persistence.
- Remote Milvus A remote Milvus vector database runs as a standalone service in your project namespace or as an external managed deployment. Remote Milvus is recommended for production-grade RAG use cases because it provides persistence, scalability, and isolation from the Llama Stack Distribution (LSD) pod. In OpenShift environments, you must deploy Milvus with an etcd service directly in your project. For more information on using etcd services, see Providing redundancy with etcd.
Consider the following points when you decide on the vector database to use for your RAG workloads:
- Use inline Milvus Lite if you want to experiment quickly with RAG in a self-contained setup and do not require persistence across pod restarts.
- Use remote Milvus if you need reliable storage, high availability, and the ability to scale out RAG workloads in your OpenShift AI environment.
3.3. Overview of Milvus vector databases Copy linkLink copied to clipboard!
Milvus is an open source vector database designed for high-performance similarity search across embedding data. In OpenShift AI, Milvus is supported as a remote vector database provider for the Llama Stack Operator. Milvus enables retrieval-augmented generation (RAG) workloads that require persistence, scalability, and efficient search across large document collections.
Milvus vector databases provide you with the following capabilities in OpenShift AI:
- Similarity search using Approximate Nearest Neighbor (ANN) algorithms.
- Persistent storage support for vectors.
- Indexing and query optimizations for embedding-based search.
- Integration with external metadata and APIs.
In OpenShift AI, you can use Milvus vector databases in the following operational modes:
- Inline Milvus Lite, which runs embedded in the Llama Stack Distribution pod for testing or small-scale experiments.
- Remote Milvus, which runs as a standalone service in your OpenShift project or as an external managed Milvus service. Remote Milvus is recommended for production workloads.
When you deploy a remote Milvus vector database, you must run the following components in your OpenShift project:
-
Secret (
milvus-secret): Stores sensitive data such as the Milvus root password. -
PersistentVolumeClaim (
milvus-pvc): Provides persistent storage for Milvus data. -
Deployment (
etcd-deployment): Runs an etcd instance that Milvus uses for metadata storage and service coordination. -
Service (
etcd-service): Exposes the etcd port for Milvus to connect to. -
Deployment (
milvus-standalone): Runs Milvus in standalone mode and connects it to the etcd service and PVC. -
Service (
milvus-service): Exposes Milvus gRPC (19530) and HTTP (9091 health check) ports for client access.
Milvus requires an etcd service to manage metadata such as collections, indexes, and partitions, and to provide service discovery and coordination among Milvus components. Even when running in standalone mode, Milvus depends on etcd to operate correctly and maintain metadata consistency. For more information on using etcd services, see Providing redundancy with etcd.
Do not use the OpenShift control plane etcd for Milvus. You must deploy a separate etcd instance inside your project or connect to an external etcd service.
Use Remote Milvus when you require a persistent, scalable, and production-ready vector database that integrates seamlessly with OpenShift AI. Consider choosing a remote Milvus vector database if your deployment must cater for the following requirements:
- Persistent vector storage across restarts or upgrades.
- Scalable indexing and high-performance vector search.
- A production-grade RAG architecture integrated with OpenShift AI.
3.4. Deploying a Llama model with KServe Copy linkLink copied to clipboard!
To use Llama Stack and retrieval-augmented generation (RAG) workloads in OpenShift AI, you must deploy a Llama model with a vLLM model server and configure KServe in KServe RawDeployment mode.
Prerequisites
- You have installed OpenShift 4.17 or newer.
- You have logged in to Red Hat OpenShift AI.
- You have cluster administrator privileges for your OpenShift cluster.
- You have activated the Llama Stack Operator.
- You have installed KServe.
- You have enabled the single-model serving platform. For more information about enabling the single-model serving platform, see Enabling the single-model serving platform.
- You can access the single-model serving platform in the dashboard configuration. For more information about setting dashboard configuration options, see Customizing the dashboard.
- You have enabled GPU support in OpenShift AI, including installing the Node Feature Discovery Operator and NVIDIA GPU Operator. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:- Installing the OpenShift CLI for OpenShift Dedicated
- Installing the OpenShift CLI for Red Hat OpenShift Service on AWS (classic architecture)
- You have created a data science project.
- The vLLM serving runtime is installed and available in your environment.
-
You have created a storage connection for your model that contains a
URI - v1connection type. This storage connection must define the location of your Llama 3.2 model artifacts. For example,oci://quay.io/redhat-ai-services/modelcar-catalog:llama-3.2-3b-instruct. For more information about creating storage connections, see Adding a connection to your data science project.
These steps are only supported in OpenShift AI versions 2.19 and later.
- In the OpenShift AI dashboard, navigate to the project details page and click the Models tab.
- In the Single-model serving platform tile, click Select single-model.
Click the Deploy model button.
The Deploy model dialog opens.
Configure the deployment properties for your model:
- In the Model deployment name field, enter a unique name for your deployment.
-
In the Serving runtime field, select
vLLM NVIDIA GPU serving runtime for KServefrom the drop-down list. - In the Deployment mode field, select KServe RawDeployment from the drop-down list.
-
Set Number of model server replicas to deploy to
1. In the Model server size field, select
Customfrom the drop-down list.-
Set CPUs requested to
1 core. -
Set Memory requested to
10 GiB. -
Set CPU limit to
2 core. -
Set Memory limit to
14 GiB. -
Set Accelerator to
NVIDIA GPUs. -
Set Accelerator count to
1.
-
Set CPUs requested to
- From the Connection type, select a relevant data connection from the drop-down list.
In the Additional serving runtime arguments field, specify the following recommended arguments:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Click Deploy.
NoteModel deployment can take several minutes, especially for the first model that is deployed on the cluster. Initial deployment may take more than 10 minutes while the relevant images download.
Verification
Verify that the
kserve-controller-managerandodh-model-controllerpods are running:- Open a new terminal window.
- Log in to your OpenShift cluster from the CLI:
- In the upper-right corner of the OpenShift web console, click your user name and select Copy login command.
- After you have logged in, click Display token.
Copy the Log in with this token command and paste it in the OpenShift CLI (
oc).oc login --token=<token> --server=<openshift_cluster_url>
$ oc login --token=<token> --server=<openshift_cluster_url>Copy to Clipboard Copied! Toggle word wrap Toggle overflow Enter the following command to verify that the
kserve-controller-managerandodh-model-controllerpods are running:oc get pods -n redhat-ods-applications | grep -E 'kserve-controller-manager|odh-model-controller'
$ oc get pods -n redhat-ods-applications | grep -E 'kserve-controller-manager|odh-model-controller'Copy to Clipboard Copied! Toggle word wrap Toggle overflow Confirm that you see output similar to the following example:
kserve-controller-manager-7c865c9c9f-xyz12 1/1 Running 0 4m21s odh-model-controller-7b7d5fd9cc-wxy34 1/1 Running 0 3m55s
kserve-controller-manager-7c865c9c9f-xyz12 1/1 Running 0 4m21s odh-model-controller-7b7d5fd9cc-wxy34 1/1 Running 0 3m55sCopy to Clipboard Copied! Toggle word wrap Toggle overflow If you do not see either of the
kserve-controller-managerandodh-model-controllerpods, there could be a problem with your deployment. In addition, if the pods appear in the list, but theirStatusis not set toRunning, check the pod logs for errors:oc logs <pod-name> -n redhat-ods-applications
$ oc logs <pod-name> -n redhat-ods-applicationsCopy to Clipboard Copied! Toggle word wrap Toggle overflow Check the status of the inference service:
oc get inferenceservice -n llamastack oc get pods -n <data science project name> | grep llama
$ oc get inferenceservice -n llamastack $ oc get pods -n <data science project name> | grep llamaCopy to Clipboard Copied! Toggle word wrap Toggle overflow The deployment automatically creates the following resources:
-
A
ServingRuntimeresource. -
An
InferenceServiceresource, aDeployment, a pod, and a service pointing to the pod.
-
A
Verify that the server is running. For example:
oc logs llama-32-3b-instruct-predictor-77f6574f76-8nl4r -n <data science project name>
$ oc logs llama-32-3b-instruct-predictor-77f6574f76-8nl4r -n <data science project name>Copy to Clipboard Copied! Toggle word wrap Toggle overflow Check for output similar to the following example log:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - The deployed model displays in the Models tab on the Data Science project details page for the project it was deployed under.
If you see a
ConvertTritonGPUToLLVMerror in the pod logs when querying the/v1/chat/completionsAPI, and the vLLM server restarts or returns a500 Internal Servererror, apply the following workaround:Before deploying the model, remove the
--enable-chunked-prefillargument from the Additional serving runtime arguments field in the deployment dialog.The error is displayed similar to the following:
/opt/vllm/lib64/python3.12/site-packages/vllm/attention/ops/prefix_prefill.py:36:0: error: Failures have been detected while processing an MLIR pass pipeline /opt/vllm/lib64/python3.12/site-packages/vllm/attention/ops/prefix_prefill.py:36:0: note: Pipeline failed while executing [`ConvertTritonGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` INFO: 10.129.2.8:0 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
/opt/vllm/lib64/python3.12/site-packages/vllm/attention/ops/prefix_prefill.py:36:0: error: Failures have been detected while processing an MLIR pass pipeline /opt/vllm/lib64/python3.12/site-packages/vllm/attention/ops/prefix_prefill.py:36:0: note: Pipeline failed while executing [`ConvertTritonGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.` INFO: 10.129.2.8:0 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server ErrorCopy to Clipboard Copied! Toggle word wrap Toggle overflow
3.5. Testing your vLLM model endpoints Copy linkLink copied to clipboard!
To verify that your deployed Llama 3.2 model is accessible externally, ensure that your vLLM model server is exposed as a network endpoint. You can then test access to the model from outside both the OpenShift cluster and the OpenShift AI interface.
If you selected Make deployed models available through an external route during deployment, your vLLM model endpoint is already accessible outside the cluster. You do not need to manually expose the model server. Manually exposing vLLM model endpoints, for example, by using oc expose, creates an unsecured route unless you configure authentication. Avoid exposing endpoints without security controls to prevent unauthorized access.
Prerequisites
- You have cluster administrator privileges for your OpenShift cluster.
- You have logged in to Red Hat OpenShift AI.
- You have activated the Llama Stack Operator in OpenShift AI.
- You have deployed an inference model, for example, the llama-3.2-3b-instruct model.
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:- Installing the OpenShift CLI for OpenShift Dedicated
- Installing the OpenShift CLI for Red Hat OpenShift Service on AWS (classic architecture)
Procedure
Open a new terminal window.
- Log in to your OpenShift cluster from the CLI:
- In the upper-right corner of the OpenShift web console, click your user name and select Copy login command.
- After you have logged in, click Display token.
Copy the Log in with this token command and paste it in the OpenShift CLI (
oc).oc login --token=<token> --server=<openshift_cluster_url>
$ oc login --token=<token> --server=<openshift_cluster_url>Copy to Clipboard Copied! Toggle word wrap Toggle overflow
If you enabled Require token authentication during model deployment, retrieve your token:
export MODEL_TOKEN=$(oc get secret default-name-llama-32-3b-instruct-sa -n <project name> --template={{ .data.token }} | base64 -d)$ export MODEL_TOKEN=$(oc get secret default-name-llama-32-3b-instruct-sa -n <project name> --template={{ .data.token }} | base64 -d)Copy to Clipboard Copied! Toggle word wrap Toggle overflow Obtain your model endpoint URL:
- If you enabled Make deployed models available through an external route during model deployment, click Endpoint details on the Model deployments page in the OpenShift AI dashboard to obtain your model endpoint URL.
In addition, if you did not enable Require token authentication during model deployment, you can also enter the following command to retrieve the endpoint URL:
export MODEL_ENDPOINT="https://$(oc get route llama-32-3b-instruct -n <project name> --template={{ .spec.host }})"$ export MODEL_ENDPOINT="https://$(oc get route llama-32-3b-instruct -n <project name> --template={{ .spec.host }})"Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Test the endpoint with a sample chat completion request:
If you did not enable Require token authentication during model deployment, enter a chat completion request. For example:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow If you enabled Require token authentication during model deployment, include a token in your request. For example:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow NoteThe
-kflag disables SSL verification and should only be used in test environments or with self-signed certificates.
Verification
Confirm that you received a JSON response containing a chat completion. For example:
If you do not receive a response similar to the example, verify that the endpoint URL and token are correct, and ensure your model deployment is running.
3.6. Deploying a remote Milvus vector database Copy linkLink copied to clipboard!
To use Milvus as a remote vector database provider for Llama Stack in OpenShift AI, you must deploy Milvus and its required etcd service in your OpenShift project. This procedure shows how to deploy Milvus in standalone mode without the Milvus Operator.
The following example configuration is intended for testing or evaluation environments. For production-grade deployments, see https://milvus.io/docs in the Milvus documentation.
Prerequisites
- You have installed OpenShift 4.17 or newer.
- You have enabled GPU support in OpenShift AI. This includes installing the Node Feature Discovery operator and NVIDIA GPU Operators. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs.
- You have cluster administrator privileges for your OpenShift cluster.
- You are logged in to Red Hat OpenShift AI.
- You have a StorageClass available that can provision persistent volumes.
- You created a root password to secure your Milvus service.
- You have deployed an inference model with vLLM, for example, the llama-3.2-3b-instruct model, and you have selected Make deployed models available through an external route and Require token authentication during model deployment.
- You have the correct inference model identifier, for example, llama-3-2-3b.
-
You have the model endpoint URL, ending with
/v1, such ashttps://llama-32-3b-instruct-predictor:8443/v1. - You have the API token required to access the model endpoint.
-
You have installed the OpenShift command line interface (
oc) as described in Installing the OpenShift CLI (OpenShift Dedicated) or Installing the OpenShift CLI (Red Hat OpenShift Service on AWS).
Procedure
-
In the OpenShift console, click the Quick Create (
) icon and then click the Import YAML option.
- Verify that your data science project is the selected project.
In the Import YAML editor, paste the following manifest and click Create:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Note-
Use the gRPC port (
19530) for theMILVUS_ENDPOINTsetting in Llama Stack. -
The HTTP port (
9091) is reserved for health checks. -
If you deploy Milvus in a different namespace, use the fully qualified service name in your Llama Stack configuration. For example:
http://milvus-service.<namespace>.svc.cluster.local:19530
-
Use the gRPC port (
Verification
-
In the OpenShift web console, click Workloads
Deployments. -
Verify that both
etcd-deploymentandmilvus-standaloneshow a status of 1 of 1 pods available. - Click Pods in the navigation panel and confirm that pods for both deployments are Running.
-
Click the
milvus-standalonepod name, then select the Logs tab. Verify that Milvus reports a healthy startup with output similar to:
Milvus Standalone is ready to serve ... Listening on 0.0.0.0:19530 (gRPC)
Milvus Standalone is ready to serve ... Listening on 0.0.0.0:19530 (gRPC)Copy to Clipboard Copied! Toggle word wrap Toggle overflow -
Click Networking
Services and confirm that the milvus-serviceandetcd-serviceresources exist and are exposed on ports19530and2379, respectively. (Optional) Click Pods
milvus-standalone Terminal and run the following health check: curl http://localhost:9091/healthz
curl http://localhost:9091/healthzCopy to Clipboard Copied! Toggle word wrap Toggle overflow A response of
{"status": "healthy"}confirms that Milvus is running correctly.
3.7. Deploying a LlamaStackDistribution instance Copy linkLink copied to clipboard!
You can deploy Llama Stack with retrieval-augmented generation (RAG) by pairing it with a vLLM-served Llama 3.2 model. This module provides two deployment examples of the LlamaStackDistribution custom resource (CR): one configured for Inline Milvus (single-node, embedded) and one for Remote Milvus (external Milvus service). When you create the CR, specify rh-dev in the spec.server.distribution.name field.
Prerequisites
- You have installed OpenShift 4.17 or newer.
- You have enabled GPU support in OpenShift AI. This includes installing the Node Feature Discovery Operator and NVIDIA GPU Operator. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.
- You have cluster administrator privileges for your OpenShift cluster.
- You are logged in to Red Hat OpenShift AI.
- You have activated the Llama Stack Operator in OpenShift AI.
- You have deployed an inference model with vLLM (for example, llama-3.2-3b-instruct) and selected Make deployed models available through an external route and Require token authentication during model deployment.
-
You have the correct inference model identifier, for example,
llama-3-2-3b. -
You have the model endpoint URL ending with
/v1, for example,https://llama-32-3b-instruct-predictor:8443/v1. - You have the API token required to access the model endpoint.
You have installed the OpenShift CLI (
oc) as described in the appropriate documentation for your cluster:- Installing the OpenShift CLI for OpenShift Dedicated
- Installing the OpenShift CLI for Red Hat OpenShift Service on AWS (classic architecture)
Procedure
Open a new terminal window and log in to your OpenShift cluster from the CLI:
In the upper-right corner of the OpenShift web console, click your user name and select Copy login command. After you have logged in, click Display token. Copy the Log in with this token command and paste it in the OpenShift CLI (
oc).oc login --token=<token> --server=<openshift_cluster_url>
$ oc login --token=<token> --server=<openshift_cluster_url>Copy to Clipboard Copied! Toggle word wrap Toggle overflow Create a secret that contains the inference model environment variables:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - Choose one of the following deployment examples:
3.7.1. Example A: LlamaStackDistribution with Inline Milvus Copy linkLink copied to clipboard!
Use this example for development or small datasets where an embedded, single-node Milvus is sufficient. No MILVUS_* connection variables are required.
In the OpenShift web console, select Administrator
Quick Create (
) Import YAML, and create a CR similar to the following: Copy to Clipboard Copied! Toggle word wrap Toggle overflow NoteThe
rh-devvalue is an internal image reference. When you create theLlamaStackDistributioncustom resource, the OpenShift AI Operator automatically resolvesrh-devto the container image in the appropriate registry. This internal image reference allows the underlying image to update without requiring changes to your custom resource.
3.7.2. Example B: LlamaStackDistribution with Remote Milvus Copy linkLink copied to clipboard!
Use this example for production-grade or large datasets with an external Milvus service. This configuration reads both MILVUS_ENDPOINT and MILVUS_TOKEN from a dedicated secret.
Create the Milvus connection secret:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow ImportantUse the gRPC port
19530forMILVUS_ENDPOINT. Ports such as9091are typically used for health checks and are not valid for client traffic.In the OpenShift web console, select Administrator
Quick Create (
) Import YAML, and create a CR similar to the following: Copy to Clipboard Copied! Toggle word wrap Toggle overflow - Click Create.
Verification
-
In the left-hand navigation, click Workloads
Pods and verify that the Llama Stack pod is running in the correct namespace. To verify that the Llama Stack server is running, click the pod name and select the Logs tab. Look for output similar to the following:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow -
Confirm that a Service resource for the Llama Stack backend is present in your namespace and points to the running pod: Networking
Services.
If you switch from Inline Milvus to Remote Milvus, delete the existing pod to ensure the new environment variables and backing store are picked up cleanly.
3.8. Ingesting content into a Llama model Copy linkLink copied to clipboard!
You can quickly customize and prototype your retrievable content by ingesting raw text into your model from inside a Jupyter notebook. This approach voids requiring a separate ingestion pipeline. By using the LlamaStack SDK, you can embed and store text in your vector store in real-time, enabling immediate RAG workflows.
Prerequisites
- You have installed OpenShift 4.17 or newer.
- You have deployed a Llama 3.2 model with a vLLM model server and you have integrated LlamaStack.
- You have created a project workbench within a data science project.
- You have opened a Jupyter notebook and it is running in your workbench environment.
-
You have installed the
llama_stack_clientversion 0.2.22 or later in your workbench environment. - You have a vector database identifier, or you plan to create or register one in this procedure.
Procedure
In a new notebook cell, install the
llama_stack_clientpackage and its dependencies:%pip install llama_stack_client fire
%pip install llama_stack_client fireCopy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, import RAGDocument and LlamaStackClient:
from llama_stack_client import RAGDocument, LlamaStackClient
from llama_stack_client import RAGDocument, LlamaStackClientCopy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, assign your deployment endpoint to the
base_urlparameter to create a LlamaStackClient instance:client = LlamaStackClient(base_url="<your deployment endpoint>")
client = LlamaStackClient(base_url="<your deployment endpoint>")Copy to Clipboard Copied! Toggle word wrap Toggle overflow List the available models:
# Fetch all registered models models = client.models.list()
# Fetch all registered models models = client.models.list()Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the list of registered models includes your Llama model and an embedding model. Here is an example of a list of registered models:
[Model(identifier='llama-32-3b-instruct', metadata={}, api_model_type='llm', provider_id='vllm-inference', provider_resource_id='llama-32-3b-instruct', type='model', model_type='llm'), Model(identifier='ibm-granite/granite-embedding-125m-english', metadata={'embedding_dimension': 768.0}, api_model_type='embedding', provider_id='sentence-transformers', provider_resource_id='ibm-granite/granite-embedding-125m-english', type='model', model_type='embedding')][Model(identifier='llama-32-3b-instruct', metadata={}, api_model_type='llm', provider_id='vllm-inference', provider_resource_id='llama-32-3b-instruct', type='model', model_type='llm'), Model(identifier='ibm-granite/granite-embedding-125m-english', metadata={'embedding_dimension': 768.0}, api_model_type='embedding', provider_id='sentence-transformers', provider_resource_id='ibm-granite/granite-embedding-125m-english', type='model', model_type='embedding')]Copy to Clipboard Copied! Toggle word wrap Toggle overflow Select the first LLM and the first embedding model:
model_id = next(m.identifier for m in models if m.model_type == "llm") embedding_model = next(m for m in models if m.model_type == "embedding") embedding_model_id = embedding_model.identifier embedding_dimension = int(embedding_model.metadata["embedding_dimension"])
model_id = next(m.identifier for m in models if m.model_type == "llm") embedding_model = next(m for m in models if m.model_type == "embedding") embedding_model_id = embedding_model.identifier embedding_dimension = int(embedding_model.metadata["embedding_dimension"])Copy to Clipboard Copied! Toggle word wrap Toggle overflow (Optional) Register a vector database (choose one). Skip if you already have a vector DB ID.
Example 3.1. Option 1: Inline Milvus Lite (embedded)
Copy to Clipboard Copied! Toggle word wrap Toggle overflow NoteUse inline Milvus Lite for development and small datasets. Persistence and scale are limited compared to remote Milvus.
Example 3.2. Option 2: Remote Milvus (recommended for production)
-
Ensure your
LlamaStackDistributionsetsMILVUS_ENDPOINT(gRPC:19530) andMILVUS_TOKEN. -
Aside from the
provider_id, ingestion and query APIs are identical for inline and remote Milvus.
If you already have a vector database, set its identifier:
# If a DB already exists, set it here instead of registering above # Example: vector_db_id = "<your existing vector database ID>"
# If a DB already exists, set it here instead of registering above # Example: # vector_db_id = "<your existing vector database ID>"Copy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, define the raw text that you want to ingest into the vector store:
# Example raw text passage raw_text = """ LlamaStack can embed raw text into a vector store for retrieval. This example ingests a small passage for demonstration. """
# Example raw text passage raw_text = """ LlamaStack can embed raw text into a vector store for retrieval. This example ingests a small passage for demonstration. """Copy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, create a RAGDocument object to contain the raw text:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, ingest the raw text:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, create a RAGDocument from an HTML source and ingest it into the vector store:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, ingest the content into the vector store:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Verification
- Review the output to confirm successful ingestion. A typical response after ingestion includes the number of text chunks inserted and any warnings or errors.
-
The model list returned by
client.models.list()includes your Llama 3.2 model and an embedding model.
3.9. Querying ingested content in a Llama model Copy linkLink copied to clipboard!
You can use the LlamaStack SDK in your Jupyter notebook to query ingested content by running retrieval-augmented generation (RAG) queries on raw text or HTML sources stored in your vector database. When you query the ingested content, you can perform one-off lookups or start multi-turn conversational flows without setting up a separate retrieval service.
Prerequisites
- You have installed OpenShift 4.17 or newer.
- You have enabled GPU support in OpenShift AI. This includes installing the Node Feature Discovery operator and NVIDIA GPU Operators. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs.
- If you are using GPU acceleration, you have at least one NVIDIA GPU available.
- You have logged in to OpenShift web console.
- You have activated the Llama Stack Operator in OpenShift AI.
- You have deployed an inference model, for example, the llama-3.2-3b-instruct model.
-
You have configured a Llama Stack deployment by creating a
LlamaStackDistributioninstance to enable RAG functionality. - You have created a project workbench within a data science project.
- You have opened a Jupyter notebook and it is running in your workbench environment.
-
You have installed the
llama_stack_clientversion 0.2.14 or later in your workbench environment. - You have ingested content into your model.
This procedure does not require any specific type of content. It only requires that you have already ingested some text, HTML, or document data into your vector database, and that this content is available for retrieval. If you have previously ingested content, that content will be available to query. If you have not ingested any content yet, the queries in this procedure will return empty results or errors.
Procedure
In a new notebook cell, install the
llama_stackclient package:%pip install llama_stack_client
%pip install llama_stack_clientCopy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, import
Agent,AgentEventLogger, andLlamaStackClient:from llama_stack_client import Agent, AgentEventLogger, LlamaStackClient
from llama_stack_client import Agent, AgentEventLogger, LlamaStackClientCopy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, assign your deployment endpoint to the
base_urlparameter to create aLlamaStackClientinstance. For example:client = LlamaStackClient(base_url="http://lsd-llama-milvus-service:8321/")
client = LlamaStackClient(base_url="http://lsd-llama-milvus-service:8321/")Copy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, list the available models:
models = client.models.list()
models = client.models.list()Copy to Clipboard Copied! Toggle word wrap Toggle overflow Verify that the list of registered models includes your Llama model and an embedding model. Here is an example of a list of registered models:
[Model(identifier='llama-32-3b-instruct', metadata={}, api_model_type='llm', provider_id='vllm-inference', provider_resource_id='llama-32-3b-instruct', type='model', model_type='llm'), Model(identifier='ibm-granite/granite-embedding-125m-english', metadata={'embedding_dimension': 768.0}, api_model_type='embedding', provider_id='sentence-transformers', provider_resource_id='ibm-granite/granite-embedding-125m-english', type='model', model_type='embedding')][Model(identifier='llama-32-3b-instruct', metadata={}, api_model_type='llm', provider_id='vllm-inference', provider_resource_id='llama-32-3b-instruct', type='model', model_type='llm'), Model(identifier='ibm-granite/granite-embedding-125m-english', metadata={'embedding_dimension': 768.0}, api_model_type='embedding', provider_id='sentence-transformers', provider_resource_id='ibm-granite/granite-embedding-125m-english', type='model', model_type='embedding')]Copy to Clipboard Copied! Toggle word wrap Toggle overflow Select the first LLM:
model_id = next(m.identifier for m in models if m.model_type == "llm")
model_id = next(m.identifier for m in models if m.model_type == "llm")Copy to Clipboard Copied! Toggle word wrap Toggle overflow If you have not already created a vector store, select an embedding model for registration in the next step:
embedding = next(m for m in models if m.model_type == "embedding") embedding_model_id = embedding.identifier embedding_dimension = int(embedding.metadata["embedding_dimension"])
embedding = next(m for m in models if m.model_type == "embedding") embedding_model_id = embedding.identifier embedding_dimension = int(embedding.metadata["embedding_dimension"])Copy to Clipboard Copied! Toggle word wrap Toggle overflow If you do not already have a vector store ID, register a vector store of your choice:
Example 3.3. Option 1: Inline Milvus Lite (embedded)
Copy to Clipboard Copied! Toggle word wrap Toggle overflow NoteUse inline Milvus Lite for development and small datasets. Persistence and scale are limited compared to remote Milvus.
Example 3.4. Option 2: Remote Milvus (recommended for production)
-
Ensure your
LlamaStackDistributionsetsMILVUS_ENDPOINT(gRPC:19530) andMILVUS_TOKEN. -
Aside from the
provide_id, querying APIs are identical for inline and remote Milvus.
If you already have a vector database, set its identifier:
# If a DB already exists, set it here instead of registering above # Example: vector_db_id = "<your existing vector database ID>"
# If a DB already exists, set it here instead of registering above # Example: # vector_db_id = "<your existing vector database ID>"Copy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, query the ingested content using the low-level RAG tool:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, query the ingested content by using the high-level Agent API:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow
Verification
- The notebook prints query results for both the low-level RAG tool and the high-level Agent API.
- No errors appear in the output, confirming the model can retrieve and respond to ingested content.
3.10. Preparing documents with Docling for Llama Stack retrieval Copy linkLink copied to clipboard!
You can transform your source documents with a Docling-enabled data science pipeline and ingest the output into a Llama Stack vector store by using the Llama Stack SDK. This modular approach separates document preparation from ingestion, yet still delivers an end-to-end, retrieval-augmented generation (RAG) workflow.
The pipeline registers a Milvus vector database and downloads the source PDFs, then splits them for parallel processing and converts each batch to Markdown with Docling. It generates sentence-transformer embeddings from the Markdown and stores them in the vector store, making the documents instantly searchable in Llama Stack.
Prerequisites
- You have installed OpenShift 4.17 or newer.
- You have enabled GPU support in OpenShift AI. This includes installing the Node Feature Discovery operator and NVIDIA GPU Operators. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs.
- You have logged in to OpenShift web console.
- You have a data science project and access to pipelines in the OpenShift AI dashboard.
- You have created and configured a pipeline server within the data science project that contains your workbench.
- You have activated the Llama Stack Operator in OpenShift AI.
- You have deployed an inference model, for example, the llama-3.2-3b-instruct model.
-
You have configured a Llama Stack deployment by creating a
LlamaStackDistributioninstance to enable RAG functionality. - You have created a project workbench within a data science project.
- You have opened a Jupyter notebook and it is running in your workbench environment.
-
You have installed the
llama_stack_clientversion 0.2.14 or later in your workbench environment. - You have installed local object storage buckets and created connections, as described in Adding a connection to your data science project.
- You have compiled to YAML a data science pipeline that includes a Docling transform, either one of the RAG demo samples or your own custom pipeline.
- Your data science project quota allows between 500 millicores (0.5 CPU) and 4 CPU cores for the pipeline run.
- Your data science project quota allows from 2 GiB up to 6 GiB of RAM for the pipeline run.
- If you are using GPU acceleration, you have at least one NVIDIA GPU available.
Procedure
In a new notebook cell, install the
llama_stackclient package:%pip install llama_stack_client
%pip install llama_stack_clientCopy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, import Agent, AgentEventLogger, and LlamaStackClient:
from llama_stack_client import Agent, AgentEventLogger, LlamaStackClient
from llama_stack_client import Agent, AgentEventLogger, LlamaStackClientCopy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, assign your deployment endpoint to the
base_urlparameter to create a LlamaStackClient instance:client = LlamaStackClient(base_url="<your deployment endpoint>")
client = LlamaStackClient(base_url="<your deployment endpoint>")Copy to Clipboard Copied! Toggle word wrap Toggle overflow List the available models:
models = client.models.list()
models = client.models.list()Copy to Clipboard Copied! Toggle word wrap Toggle overflow Select the first LLM and the first embedding model:
model_id = next(m.identifier for m in models if m.model_type == "llm") embedding_model = next(m for m in models if m.model_type == "embedding") embedding_model_id = embedding_model.identifier embedding_dimension = embedding_model.metadata["embedding_dimension"]
model_id = next(m.identifier for m in models if m.model_type == "llm") embedding_model = next(m for m in models if m.model_type == "embedding") embedding_model_id = embedding_model.identifier embedding_dimension = embedding_model.metadata["embedding_dimension"]Copy to Clipboard Copied! Toggle word wrap Toggle overflow In a new notebook cell, register a vector database (choose one option):
Example 3.5. Option 1: Inline Milvus Lite (embedded)
Copy to Clipboard Copied! Toggle word wrap Toggle overflow NoteInline Milvus Lite is best for development. Data durability and scale are limited compared to remote Milvus.
Example 3.6. Option 2: Remote Milvus (recommended for production)
-
Ensure your
LlamaStackDistributionincludesMILVUS_ENDPOINTandMILVUS_TOKEN. -
Aside from the
provider_id, ingestion and query APIs are identical between inline and remote Milvus.
+
If you are using the sample Docling pipeline from the RAG demo repository, the pipeline registers the database automatically and you can skip this step. However, if you are using your own pipeline, you must register the database yourself.
- In the OpenShift web console, import the YAML file containing your docling pipeline into your data science project, as described in Importing a data science pipeline.
Create a pipeline run to execute your Docling pipeline, as described in Executing a pipeline run. The pipeline run inserts your PDF documents into the vector database. If you run the Docling pipeline from the RAG demo samples repository, you can optionally customize the following parameters before starting the pipeline run:
-
base_url: The base URL to fetch PDF files from. -
pdf_filenames: A comma-separated list of PDF filenames to download and convert. -
num_workers: The number of parallel workers. -
vector_db_id: The Milvus vector database ID. -
service_url: The Milvus service URL. -
embed_model_id: The embedding model to use. -
max_tokens: The maximum tokens for each chunk. -
use_gpu: Enable or disable GPU acceleration.
-
Verification
In your Jupyter notebook, query the LLM with a question that relates to the ingested content. For example:
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Query chunks from the vector database:
query_result = client.vector_io.query( vector_db_id=vector_db_id, query="what do you know about?", ) print(query_result)query_result = client.vector_io.query( vector_db_id=vector_db_id, query="what do you know about?", ) print(query_result)Copy to Clipboard Copied! Toggle word wrap Toggle overflow
3.11. About Llama stack search types Copy linkLink copied to clipboard!
Llama Stack supports keyword, vector, and hybrid search modes for retrieving context in retrieval-augmented generation (RAG) workloads. Each mode offers different tradeoffs in precision, recall, semantic depth, and computational cost.
3.11.1. Supported search modes Copy linkLink copied to clipboard!
3.11.1.1. Keyword search Copy linkLink copied to clipboard!
Keyword search applies lexical matching techniques, such as TF-IDF or BM25, to locate documents that contain exact or near-exact query terms. This approach is effective when precise term-matching is critical and remains widely used in information-retrieval systems. For more information, see The Probabilistic Relevance Framework: BM25 and Beyond.
3.11.1.2. Vector search Copy linkLink copied to clipboard!
Vector search encodes documents and queries as dense numerical vectors, known as embeddings, and measures similarity with metrics such as cosine similarity or inner product. This approach captures contextual meaning and supports semantic matching beyond exact word overlap. For more information, see Billion-scale similarity search with GPUs.
3.11.1.3. Hybrid search Copy linkLink copied to clipboard!
Hybrid search blends keyword and vector techniques, typically by combining individual scores with a weighted sum or methods, such as Reciprocal Rank Fusion (RRF). This approach returns results that balance exact matches with semantic relevance. For more information, see Sparse, Dense, and Hybrid Retrieval for Answer Ranking.
3.11.2. Retrieval database support Copy linkLink copied to clipboard!
Milvus is the supported retrieval database for Llama Stack. It currently provides vector search. However, keyword and hybrid search capabilities are not currently supported.