Chapter 4. Llama Stack application examples


Llama Stack allows you to create AI-driven applications in your OpenShift AI cluster. Llama Stack includes a default run.yaml file that defines which APIs are enabled and which backend providers serve them. You can update and customize this run.yaml file and create a ConfigMap that supplies your customized configuration to the server.
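As an illustration, a customized run.yaml file can be packaged in a ConfigMap similar to the following sketch. The ConfigMap name, namespace, and the configuration values shown are placeholders, and the exact run.yaml keys can differ between Llama Stack versions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: llama-stack-config   # illustrative name
  namespace: my-project      # replace with your project
data:
  run.yaml: |
    version: '2'
    apis:
      - inference
      - vector_io
    providers:
      inference:
        - provider_id: vllm
          provider_type: remote::vllm
          config:
            url: ${env.VLLM_URL}   # endpoint of your vLLM model server
```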

This chapter includes the following example applications that you can deploy:

  • Deploying a RAG stack in a data science project.
  • Evaluating RAG systems with Llama Stack.
  • Configuring Llama Stack with OAuth authentication.

4.1. Deploying a RAG stack in a project

Important

This feature is currently available in Red Hat OpenShift AI 3.2 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

As an OpenShift cluster administrator, you can deploy a Retrieval-Augmented Generation (RAG) stack in OpenShift AI. This stack provides the infrastructure, including LLM inference, vector storage, and retrieval services that data scientists and AI engineers use to build conversational workflows in their projects.

To deploy the RAG stack in a project, complete the following tasks:

  • Activate the Llama Stack Operator in OpenShift AI.
  • Enable GPU support on the OpenShift cluster. This task includes installing the required NVIDIA Operators.
  • Deploy an inference model, for example, the llama-3.2-3b-instruct model. This task includes creating a storage connection and configuring GPU allocation.
  • Create a LlamaStackDistribution instance to enable RAG functionality. This action deploys LlamaStack alongside a Milvus vector store and connects both components to the inference model.
  • Ingest domain data into the configured vector store by running Docling in an AI pipeline or Jupyter notebook. This process keeps the embeddings synchronized with the source data.
  • Expose and secure the model endpoints.

4.1.1. Overview of RAG

Retrieval-augmented generation (RAG) in OpenShift AI enhances large language models (LLMs) by integrating domain-specific data sources directly into the model’s context. Domain-specific data sources can be structured data, such as relational database tables, or unstructured data, such as PDF documents.

RAG indexes content and builds an embedding store that data scientists and AI engineers can query. When data scientists or AI engineers pose a question to a RAG chatbot, the RAG pipeline retrieves the most relevant pieces of data, passes them to the LLM as context, and generates a response that reflects both the prompt and the retrieved content.

By implementing RAG, data scientists and AI engineers can obtain tailored, accurate, and verifiable answers to complex queries based on their own datasets within a project.

4.1.1.1. Audience for RAG

The target audience for RAG is practitioners who build data-grounded conversational AI applications using OpenShift AI infrastructure.

For Data Scientists
Data scientists can use RAG to prototype and validate models that answer natural-language queries against data sources without managing low-level embedding pipelines or vector stores. They can focus on creating prompts and evaluating model outputs instead of building retrieval infrastructure.
For MLOps Engineers
MLOps engineers typically deploy and operate RAG pipelines in production. Within OpenShift AI, they manage LLM endpoints, monitor performance, and ensure that both retrieval and generation scale reliably. RAG decouples vector store maintenance from the serving layer, enabling MLOps engineers to apply CI/CD workflows to data ingestion and model deployment alike.
For Data Engineers
Data engineers build workflows to load data into storage that OpenShift AI indexes. They keep embeddings in sync with source systems, such as S3 buckets or relational tables, to ensure that chatbot responses are accurate.
For AI Engineers
AI engineers architect RAG chatbots by defining prompt templates, retrieval methods, and fallback logic. They configure agents and add domain-specific tools, such as OpenShift job triggers, enabling rapid iteration.

4.1.2. Overview of vector databases

Vector databases are a core component of retrieval-augmented generation (RAG) in OpenShift AI. They store and index vector embeddings that represent the semantic meaning of text or other data. When integrated with Llama Stack, vector databases enable applications to retrieve relevant context and combine it with large language model (LLM) inference.

Vector databases provide the following capabilities:

  • Store vector embeddings generated by embedding models.
  • Support efficient similarity search to retrieve semantically related content.
  • Enable RAG workflows by supplying the LLM with contextually relevant data.

In OpenShift AI, vector databases are configured and managed through the Llama Stack Operator as part of a LlamaStackDistribution. Starting with version 3.2, PostgreSQL is the default and recommended metadata store for Llama Stack, supporting production-ready persistence, concurrency, and scalability.

The following vector database options are supported in OpenShift AI:

  • Inline Milvus Inline Milvus runs embedded within the Llama Stack Distribution (LSD) pod and is suitable for development and small-scale RAG workloads. In OpenShift AI 3.2 and later, Inline Milvus uses PostgreSQL as the backing metadata store by default. This option provides a simplified deployment model while retaining durable metadata storage.
  • Inline FAISS Inline FAISS uses the FAISS (Facebook AI Similarity Search) library to provide an in-process vector store for RAG workflows. Inline FAISS is designed for experimentation, prototyping, and development scenarios where simplicity and low operational overhead are priorities. In OpenShift AI 3.2 and later, Inline FAISS also relies on PostgreSQL for metadata storage.
  • Remote Milvus Remote Milvus runs as a standalone vector database service, either within the cluster or as an external managed deployment. This option is suitable for large-scale or production-grade RAG workloads that require high availability, horizontal scalability, and isolation from the Llama Stack server. In OpenShift environments, Milvus typically requires an accompanying etcd service for coordination. For more information, see Providing redundancy with etcd.
  • Remote PostgreSQL with pgvector PostgreSQL with the pgvector extension provides a production-ready vector database option that integrates vector similarity search directly into PostgreSQL. This option is well suited for environments that already operate PostgreSQL and require durable storage, transactional consistency, and centralized management. pgvector enables Llama Stack to store embeddings and perform similarity search without deploying a separate vector database service.
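As an illustration, these options map to different vector_io provider entries in the Llama Stack run.yaml file. The following is a hedged sketch, not a definitive configuration; provider configuration keys can differ between Llama Stack versions, and the paths and endpoint values are placeholders:

```yaml
providers:
  vector_io:
    # Inline Milvus: runs embedded in the Llama Stack pod.
    - provider_id: milvus-inline
      provider_type: inline::milvus
      config:
        db_path: /opt/app-root/data/milvus.db   # placeholder path
    # Remote Milvus: points at a standalone Milvus service.
    - provider_id: milvus-remote
      provider_type: remote::milvus
      config:
        uri: http://milvus-service:19530        # gRPC endpoint, not the HTTP health port
        token: ${env.MILVUS_TOKEN}
```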

Consider the following guidance when choosing a vector database for your RAG workloads:

  • Use Inline Milvus or Inline FAISS for development, testing, or early experimentation.
  • Use Remote Milvus when you require large-scale vector indexing and high-throughput similarity search.
  • Use PostgreSQL with pgvector when you want production-ready persistence and integration with existing PostgreSQL-based data platforms.

Starting with OpenShift AI 3.2, SQLite-based storage is no longer recommended for production deployments. PostgreSQL-based backends provide improved reliability, concurrency, and scalability as Llama Stack moves toward general availability.

4.1.2.1. Overview of Milvus vector databases

Milvus is an open source vector database designed for high-performance similarity search across large volumes of embedding data. In OpenShift AI, Milvus is supported as a vector store provider for Llama Stack and enables retrieval-augmented generation (RAG) workloads that require efficient vector indexing, scalable search, and durable storage.

Starting with OpenShift AI 3.2, production-grade Llama Stack deployments default to PostgreSQL for metadata persistence. When Milvus is used as the vector store, PostgreSQL is typically used for Llama Stack metadata, while Milvus manages vector indexes and similarity search.

Milvus vector databases provide the following capabilities in OpenShift AI:

  • High-performance similarity search using Approximate Nearest Neighbor (ANN) algorithms
  • Efficient indexing and query optimization for dense embeddings
  • Persistent storage of vector data
  • Integration with Llama Stack through an OpenAI-compatible Vector Stores API

In a typical RAG workflow in OpenShift AI, the following responsibilities are separated:

  • Embedding generation Embeddings are generated by the configured embedding provider. In OpenShift AI 3.2, remote embedding models are the recommended and default option for production deployments.
  • Vector storage and retrieval Milvus stores embedding vectors and performs similarity search operations.
  • Metadata persistence Llama Stack stores vector store metadata, file references, and configuration state using PostgreSQL in production deployments.
  • Llama Stack server Coordinates ingestion, retrieval, and model inference through a unified API surface.

In OpenShift AI, Milvus can be used in the following operational modes:

  • Inline Milvus Lite Runs embedded within the Llama Stack Distribution pod. Inline Milvus Lite is intended for experimentation, development, or small datasets. It does not provide high availability or horizontal scalability and is not recommended for production use.
  • Remote Milvus Runs as a standalone service within your OpenShift project or as an external managed Milvus deployment. Remote Milvus is recommended for production-grade RAG workloads.

A remote Milvus deployment typically includes the following components:

  • A Milvus service that exposes a gRPC endpoint (port 19530) for client traffic
  • An etcd service that Milvus uses for metadata coordination, collection state, and index management
  • Persistent storage for durable vector data

Milvus requires a dedicated etcd instance for metadata coordination, even when running in standalone mode. Do not use the OpenShift control plane etcd for this purpose. For more information about etcd, see Providing redundancy with etcd.

Important

You must deploy a dedicated etcd service for Milvus or connect Milvus to an external etcd instance. Do not share the OpenShift control plane etcd with application workloads.

Use Remote Milvus when you require scalable vector search, high-performance retrieval, and integration with production-grade Llama Stack deployments in OpenShift AI.

4.1.2.2. Overview of FAISS vector databases

The FAISS (Facebook AI Similarity Search) library is an open source framework for high-performance vector search and clustering. It is optimized for dense numerical embeddings and supports both CPU and GPU execution. In OpenShift AI, FAISS is supported as an inline vector store provider for Llama Stack, enabling fast, in-process similarity search without requiring a separate vector database service.

When you enable inline FAISS in a LlamaStackDistribution, Llama Stack uses FAISS as an embedded vector index that runs inside the Llama Stack server container. This configuration is designed for lightweight development, experimentation, and single-node retrieval-augmented generation (RAG) workflows.

Inline FAISS provides the following capabilities in OpenShift AI:

  • In-process similarity search using FAISS indexes.
  • Low-latency embedding ingestion and query operations.
  • Simple deployment with no external vector database service.
  • Compatibility with OpenAI-compatible Vector Stores API endpoints.

In OpenShift AI 3.2, inline FAISS relies on the Llama Stack metadata and persistence backend for managing vector store state. PostgreSQL is the default and recommended backend for production-grade deployments, even when FAISS is used as the inline vector index.

You can explicitly configure SQLite for local or short-lived development scenarios, but it is not recommended for production use.
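The following is a hedged sketch of an inline FAISS provider entry in run.yaml, using a SQLite kvstore for a development scenario. The field names reflect common Llama Stack conventions but can vary between versions, and the path is a placeholder:

```yaml
providers:
  vector_io:
    - provider_id: faiss
      provider_type: inline::faiss
      config:
        kvstore:
          type: sqlite   # development only; use PostgreSQL for production-grade deployments
          db_path: /opt/app-root/data/faiss_store.db   # placeholder path
```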

Inline FAISS is suitable for the following use cases:

  • Rapid prototyping of RAG workflows.
  • Development or testing environments.
  • Disconnected or single-node deployments where external vector databases are not required.
Note

Inline FAISS does not provide distributed storage, replication, or high availability. For production-grade RAG workloads that require durability, scalability, or multi-node access, use a remote vector database such as Milvus or PostgreSQL with the pgvector extension.

4.1.2.3. Overview of pgvector vector databases

pgvector is an open source PostgreSQL extension that enables vector similarity search on embedding data stored in relational tables. In OpenShift AI, PostgreSQL with the pgvector extension is supported as a remote vector database provider for the Llama Stack Operator. pgvector supports retrieval-augmented generation workflows that require persistent vector storage while integrating with existing PostgreSQL environments.

pgvector vector databases provide the following capabilities in OpenShift AI:

  • Storage of vector embeddings in PostgreSQL tables.
  • Similarity search across embeddings by using pgvector distance metrics.
  • Persistent storage of vectors alongside structured relational data.
  • Integration with existing PostgreSQL security and operational tooling.

In a typical retrieval-augmented generation workflow in OpenShift AI, your application uses the following components:

  • Inference provider Generates embeddings and model responses.
  • Vector store provider Stores embeddings and performs similarity search. When you use pgvector, PostgreSQL provides this capability as a remote vector store.
  • File storage provider Stores the source files that are ingested into vector stores.
  • Llama Stack server Provides a unified API surface, including an OpenAI-compatible Vector Stores API.

When you ingest content, Llama Stack splits source material into chunks, generates embeddings, and stores them in PostgreSQL through the pgvector extension. When you query a vector store, Llama Stack performs similarity search and returns the most relevant chunks for use in prompts.

In OpenShift AI, pgvector is used in the following operational mode:

  • Remote PostgreSQL with pgvector, which runs as a standalone PostgreSQL database service accessed by the Llama Stack server. This mode is suitable for development and production workloads that require persistent storage and integration with existing PostgreSQL infrastructure.
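The following is a hedged sketch of a remote pgvector provider entry for run.yaml. The host, database, and credential values are placeholders, and the key names can differ between Llama Stack versions:

```yaml
providers:
  vector_io:
    - provider_id: pgvector
      provider_type: remote::pgvector
      config:
        host: postgres.my-project.svc.cluster.local   # placeholder hostname
        port: 5432
        db: rag_store                                 # placeholder database name
        user: ${env.PGVECTOR_USER}
        password: ${env.PGVECTOR_PASSWORD}
```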

When you deploy PostgreSQL with the pgvector extension, you typically manage the following components:

  • Secrets for PostgreSQL connection credentials.
  • Persistent storage for durable database data.
  • A PostgreSQL service that exposes a network endpoint.

PostgreSQL with pgvector does not require an external coordination service. Vector data, indexes, and metadata are stored directly in PostgreSQL tables and managed through standard database mechanisms.

Use PostgreSQL with pgvector when you require persistent vector storage and want to integrate vector search into existing PostgreSQL-based data platforms within OpenShift AI.

4.1.3. Deploying a Llama model with KServe

To use Llama Stack and retrieval-augmented generation (RAG) workloads in OpenShift AI, you must deploy a Llama model with a vLLM model server and configure KServe in KServe RawDeployment mode.

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have logged in to Red Hat OpenShift AI.
  • You have cluster administrator privileges for your OpenShift cluster.
  • You have activated the Llama Stack Operator.
  • You have installed KServe.
  • You have enabled the model serving platform. For more information about enabling the model serving platform, see Enabling the model serving platform.
  • You can access the model serving platform in the dashboard configuration. For more information about setting dashboard configuration options, see Customizing the dashboard.
  • You have enabled GPU support in OpenShift AI, including installing the Node Feature Discovery Operator and NVIDIA GPU Operator. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.
  • You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster.

  • You have created a project.
  • The vLLM serving runtime is installed and available in your environment.
  • You have created a storage connection for your model that contains a URI - v1 connection type. This storage connection must define the location of your Llama 3.2 model artifacts. For example, oci://quay.io/redhat-ai-services/modelcar-catalog:llama-3.2-3b-instruct. For more information about creating storage connections, see Adding a connection to your project.
Procedure

These steps are only supported in OpenShift AI versions 2.19 and later.

  1. In the OpenShift AI dashboard, navigate to the project details page and click the Deployments tab.
  2. In the Model serving platform tile, click Select model.
  3. Click the Deploy model button.

    The Deploy model dialog opens.

  4. Configure the deployment properties for your model:

    1. In the Model deployment name field, enter a unique name for your deployment.
    2. In the Serving runtime field, select vLLM NVIDIA GPU serving runtime for KServe from the drop-down list.
    3. In the Deployment mode field, select KServe RawDeployment from the drop-down list.
    4. Set Number of model server replicas to deploy to 1.
    5. In the Model server size field, select Custom from the drop-down list.

      • Set CPUs requested to 1 core.
      • Set Memory requested to 10 GiB.
      • Set CPU limit to 2 cores.
      • Set Memory limit to 14 GiB.
      • Set Accelerator to NVIDIA GPUs.
      • Set Accelerator count to 1.
    6. In the Connection type field, select a relevant connection from the drop-down list.
  5. In the Additional serving runtime arguments field, specify the following recommended arguments:

    --dtype=half
    --max-model-len=20000
    --gpu-memory-utilization=0.95
    --enable-chunked-prefill
    --enable-auto-tool-choice
    --tool-call-parser=llama3_json
    --chat-template=/app/data/template/tool_chat_template_llama3.2_json.jinja
    1. Click Deploy.

      Note

      Model deployment can take several minutes, especially for the first model that is deployed on the cluster. Initial deployment might take more than 10 minutes while the relevant images download.

Verification

  1. Verify that the kserve-controller-manager and odh-model-controller pods are running:

    1. Open a new terminal window.
    2. Log in to your OpenShift cluster from the CLI:
    3. In the upper-right corner of the OpenShift web console, click your user name and select Copy login command.
    4. After you have logged in, click Display token.
    5. Copy the Log in with this token command and paste it in the OpenShift CLI (oc).

      $ oc login --token=<token> --server=<openshift_cluster_url>
    6. Enter the following command to verify that the kserve-controller-manager and odh-model-controller pods are running:

      $ oc get pods -n redhat-ods-applications | grep -E 'kserve-controller-manager|odh-model-controller'
    7. Confirm that you see output similar to the following example:

      kserve-controller-manager-7c865c9c9f-xyz12   1/1     Running   0          4m21s
      odh-model-controller-7b7d5fd9cc-wxy34        1/1     Running   0          3m55s
    8. If you do not see the kserve-controller-manager or odh-model-controller pods, there might be a problem with your deployment. If the pods appear in the list but their Status is not set to Running, check the pod logs for errors:

      $ oc logs <pod-name> -n redhat-ods-applications
    9. Check the status of the inference service:

      $ oc get inferenceservice -n <project name>
      $ oc get pods -n <project name> | grep llama
      • The deployment automatically creates the following resources:

        • A ServingRuntime resource.
        • An InferenceService resource, a Deployment, a pod, and a service pointing to the pod.
      • Verify that the server is running. For example:

        $ oc logs llama-32-3b-instruct-predictor-77f6574f76-8nl4r -n <project name>

        Check for output similar to the following example log:

        INFO     2025-05-15 11:23:52,750 __main__:498 server: Listening on ['::', '0.0.0.0']:8321
        INFO:     Started server process [1]
        INFO:     Waiting for application startup.
        INFO     2025-05-15 11:23:52,765 __main__:151 server: Starting up
        INFO:     Application startup complete.
        INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
      • The deployed model displays in the Deployments tab on the project details page for the project it was deployed under.
  2. If you see a ConvertTritonGPUToLLVM error in the pod logs when querying the /v1/chat/completions API, and the vLLM server restarts or returns a 500 Internal Server error, apply the following workaround:

    Before deploying the model, remove the --enable-chunked-prefill argument from the Additional serving runtime arguments field in the deployment dialog.

    The displayed error is similar to the following:

    /opt/vllm/lib64/python3.12/site-packages/vllm/attention/ops/prefix_prefill.py:36:0: error: Failures have been detected while processing an MLIR pass pipeline
    /opt/vllm/lib64/python3.12/site-packages/vllm/attention/ops/prefix_prefill.py:36:0: note: Pipeline failed while executing [`ConvertTritonGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs, please share the reproducer above with Triton project.`
    INFO:     10.129.2.8:0 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error

4.1.4. Testing your vLLM model endpoints

To verify that your deployed Llama 3.2 model is accessible externally, ensure that your vLLM model server is exposed as a network endpoint. You can then test access to the model from outside both the OpenShift cluster and the OpenShift AI interface.

Important

If you selected Make deployed models available through an external route during deployment, your vLLM model endpoint is already accessible outside the cluster. You do not need to manually expose the model server. Manually exposing vLLM model endpoints, for example, by using oc expose, creates an unsecured route unless you configure authentication. Avoid exposing endpoints without security controls to prevent unauthorized access.

Prerequisites

  • You have cluster administrator privileges for your OpenShift cluster.
  • You have logged in to Red Hat OpenShift AI.
  • You have activated the Llama Stack Operator in OpenShift AI.
  • You have deployed an inference model, for example, the llama-3.2-3b-instruct model.
  • You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster.

Procedure

  1. Open a new terminal window.

    1. Log in to your OpenShift cluster from the CLI:
    2. In the upper-right corner of the OpenShift web console, click your user name and select Copy login command.
    3. After you have logged in, click Display token.
    4. Copy the Log in with this token command and paste it in the OpenShift CLI (oc).

      $ oc login --token=<token> --server=<openshift_cluster_url>
  2. If you enabled Require token authentication during model deployment, retrieve your token:

    $ export MODEL_TOKEN=$(oc get secret default-name-llama-32-3b-instruct-sa -n <project name> --template='{{ .data.token }}' | base64 -d)
  3. Obtain your model endpoint URL:

    • If you enabled Make deployed models available through an external route during model deployment, click Endpoint details on the Deployments page in the OpenShift AI dashboard to obtain your model endpoint URL.
    • Alternatively, if you did not enable Require token authentication during model deployment, you can enter the following command to retrieve the endpoint URL:

      $ export MODEL_ENDPOINT="https://$(oc get route llama-32-3b-instruct -n <project name> --template='{{ .spec.host }}')"
  4. Test the endpoint with a sample chat completion request:

    • If you did not enable Require token authentication during model deployment, enter a chat completion request. For example:

      $ curl -X POST $MODEL_ENDPOINT/v1/chat/completions \
       -H "Content-Type: application/json" \
       -d '{
       "model": "llama-32-3b-instruct",
       "messages": [
         {
           "role": "user",
           "content": "Hello"
         }
       ]
      }'
    • If you enabled Require token authentication during model deployment, include a token in your request. For example:

      curl -s -k $MODEL_ENDPOINT/v1/chat/completions \
      --header "Authorization: Bearer $MODEL_TOKEN" \
      --header 'Content-Type: application/json' \
      -d '{
        "model": "llama-32-3b-instruct",
        "messages": [
          {
            "role": "user",
            "content": "can you tell me a funny joke?"
          }
        ]
      }' | jq .
      Note

      The -k flag disables TLS certificate verification and should be used only in test environments or with self-signed certificates.

Verification

Confirm that you received a JSON response containing a chat completion. For example:

{
  "id": "chatcmpl-05d24b91b08a4b78b0e084d4cc91dd7e",
  "object": "chat.completion",
  "created": 1747279170,
  "model": "llama-32-3b-instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "reasoning_content": null,
      "content": "Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?",
      "tool_calls": []
    },
    "logprobs": null,
    "finish_reason": "stop",
    "stop_reason": null
  }],
  "usage": {
    "prompt_tokens": 37,
    "total_tokens": 62,
    "completion_tokens": 25,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null
}

If you do not receive a response similar to the example, verify that the endpoint URL and token are correct, and ensure your model deployment is running.
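When scripting against the endpoint, you can extract individual fields from a chat completion response with jq. The following sketch parses a trimmed copy of the example response above from a local file rather than calling a live endpoint, so it runs without cluster access:

```shell
# Save a trimmed sample chat completion response, then pull out fields with jq.
cat > /tmp/sample_response.json <<'EOF'
{
  "model": "llama-32-3b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! It's nice to meet you."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 37, "total_tokens": 62, "completion_tokens": 25}
}
EOF

# The assistant's reply text:
jq -r '.choices[0].message.content' /tmp/sample_response.json
# → Hello! It's nice to meet you.

# The token usage:
jq '.usage.total_tokens' /tmp/sample_response.json
# → 62
```

In a live test, you can pipe the curl output from the previous step directly into the same jq filters.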

4.1.5. Deploying a remote Milvus vector database

To use Milvus as a remote vector database provider for Llama Stack in OpenShift AI, you must deploy Milvus and its required etcd service in your OpenShift project. This procedure shows how to deploy Milvus in standalone mode without the Milvus Operator.

Note

The following example configuration is intended for testing or evaluation environments. For production-grade deployments, see the Milvus documentation at https://milvus.io/docs.

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have enabled GPU support in OpenShift AI. This includes installing the Node Feature Discovery operator and NVIDIA GPU Operators. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs.
  • You have cluster administrator privileges for your OpenShift cluster.
  • You are logged in to Red Hat OpenShift AI.
  • You have a StorageClass available that can provision persistent volumes.
  • You have created a root password to secure your Milvus service.
  • You have deployed an inference model with vLLM, for example, the llama-3.2-3b-instruct model, and you have selected Make deployed models available through an external route and Require token authentication during model deployment.
  • You have the correct inference model identifier, for example, llama-3-2-3b.
  • You have the model endpoint URL, ending with /v1, such as https://llama-32-3b-instruct-predictor:8443/v1.
  • You have the API token required to access the model endpoint.
  • You have installed the OpenShift command line interface (oc) as described in Installing the OpenShift CLI.

Procedure

  1. In the OpenShift console, click the Quick Create icon and then click the Import YAML option.
  2. Verify that your project is the selected project.
  3. In the Import YAML editor, paste the following manifest and click Create:

    apiVersion: v1
    kind: Secret
    metadata:
      name: milvus-secret
    type: Opaque
    stringData:
      root-password: "<root_password>" # Replace with the root password that you created
    ---
    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: milvus-pvc
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      volumeMode: Filesystem
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: etcd-deployment
      labels:
        app: etcd
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: etcd
      strategy:
        type: Recreate
      template:
        metadata:
          labels:
            app: etcd
        spec:
          containers:
            - name: etcd
              image: quay.io/coreos/etcd:v3.5.5
              command:
                - etcd
                - --advertise-client-urls=http://127.0.0.1:2379
                - --listen-client-urls=http://0.0.0.0:2379
                - --data-dir=/etcd
              ports:
                - containerPort: 2379
              volumeMounts:
                - name: etcd-data
                  mountPath: /etcd
              env:
                - name: ETCD_AUTO_COMPACTION_MODE
                  value: revision
                - name: ETCD_AUTO_COMPACTION_RETENTION
                  value: "1000"
                - name: ETCD_QUOTA_BACKEND_BYTES
                  value: "4294967296"
                - name: ETCD_SNAPSHOT_COUNT
                  value: "50000"
          volumes:
            - name: etcd-data
              emptyDir: {}
          restartPolicy: Always
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: etcd-service
    spec:
      ports:
        - port: 2379
          targetPort: 2379
      selector:
        app: etcd
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      labels:
        app: milvus-standalone
      name: milvus-standalone
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: milvus-standalone
      strategy:
        type: Recreate
      template:
        metadata:
          labels:
            app: milvus-standalone
        spec:
          containers:
            - name: milvus-standalone
              image: milvusdb/milvus:v2.6.0
              args: ["milvus", "run", "standalone"]
              env:
                - name: DEPLOY_MODE
                  value: standalone
                - name: ETCD_ENDPOINTS
                  value: etcd-service:2379
                - name: COMMON_STORAGETYPE
                  value: local
                - name: MILVUS_ROOT_PASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: milvus-secret
                      key: root-password
              livenessProbe:
                exec:
                  command: ["curl", "-f", "http://localhost:9091/healthz"]
                initialDelaySeconds: 90
                periodSeconds: 30
                timeoutSeconds: 20
                failureThreshold: 5
              ports:
                - containerPort: 19530
                  protocol: TCP
                - containerPort: 9091
                  protocol: TCP
              volumeMounts:
                - name: milvus-data
                  mountPath: /var/lib/milvus
          restartPolicy: Always
          volumes:
            - name: milvus-data
              persistentVolumeClaim:
                claimName: milvus-pvc
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: milvus-service
    spec:
      selector:
        app: milvus-standalone
      ports:
        - name: grpc
          port: 19530
          targetPort: 19530
        - name: http
          port: 9091
          targetPort: 9091
    Note
    • Use the gRPC port (19530) for the MILVUS_ENDPOINT setting in Llama Stack.
    • The HTTP port (9091) is reserved for health checks.
    • If you deploy Milvus in a different namespace, use the fully qualified service name in your Llama Stack configuration. For example: http://milvus-service.<namespace>.svc.cluster.local:19530
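The endpoint rules in this note can be sketched as a small helper. The milvus_endpoint function is a hypothetical convenience shown only for illustration, not part of any shipped tooling; it follows the tcp:// form used by the connection secret later in this module.

```python
def milvus_endpoint(namespace="", service="milvus-service"):
    """Build the value for Llama Stack's MILVUS_ENDPOINT setting.

    Port 19530 is the Milvus gRPC (client traffic) port; port 9091 is
    reserved for health checks and must not be used here. When Milvus
    runs in a different namespace, the fully qualified service name is
    required.
    """
    host = f"{service}.{namespace}.svc.cluster.local" if namespace else service
    return f"tcp://{host}:19530"

# Milvus in the same namespace as Llama Stack:
print(milvus_endpoint())       # tcp://milvus-service:19530
# Milvus deployed in a different namespace:
print(milvus_endpoint("rag"))  # tcp://milvus-service.rag.svc.cluster.local:19530
```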

Verification

  1. In the OpenShift web console, click Workloads → Deployments.
  2. Verify that both etcd-deployment and milvus-standalone show a status of 1 of 1 pods available.
  3. Click Pods in the navigation panel and confirm that pods for both deployments are Running.
  4. Click the milvus-standalone pod name, then select the Logs tab.
  5. Verify that Milvus reports a healthy startup with output similar to:

    Milvus Standalone is ready to serve ...
    Listening on 0.0.0.0:19530 (gRPC)
  6. Click Networking → Services and confirm that the milvus-service and etcd-service resources exist and are exposed on ports 19530 and 2379, respectively.
  7. (Optional) Click Pods → milvus-standalone → Terminal and run the following health check:

    curl http://localhost:9091/healthz

    A response of {"status": "healthy"} confirms that Milvus is running correctly.

4.1.6. Deploying a LlamaStackDistribution instance

You can deploy Llama Stack with retrieval-augmented generation (RAG) by pairing it with a vLLM-served Llama 3.2 model. This module provides the following deployment examples of the LlamaStackDistribution custom resource (CR):

  • Example A: Inline Milvus (embedded, single-node, remote embeddings)
  • Example B: Remote Milvus (external service, inline embeddings served with the sentence-transformers library)
  • Example C: Inline FAISS (embedded, single node, inline embeddings served with the sentence-transformers library)
  • Example D: Remote PostgreSQL with pgvector (external service, remote embeddings)

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have enabled GPU support in OpenShift AI. This includes installing the Node Feature Discovery Operator and NVIDIA GPU Operator. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.
  • You have cluster administrator privileges for your OpenShift cluster.
  • You are logged in to Red Hat OpenShift AI.
  • You have activated the Llama Stack Operator in OpenShift AI.
  • You have deployed an inference model with vLLM (for example, llama-3.2-3b-instruct) and selected Make deployed models available through an external route and Require token authentication during model deployment. In addition, in Add custom runtime arguments, you have added --enable-auto-tool-choice.
  • You have the correct inference model identifier, for example, llama-3-2-3b.
  • You have the model endpoint URL ending with /v1, for example, https://llama-32-3b-instruct-predictor:8443/v1.
  • You have the API token required to access the model endpoint.
  • You have installed the PostgreSQL Operator version 14 or later and configured a PostgreSQL database for Llama Stack metadata storage. For more information, see the documentation for "Deploying a Llama Stack server".
  • You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster.
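The endpoint conventions in the prerequisites (an HTTPS model endpoint URL ending in /v1 and a non-empty API token) can be sanity-checked before you create the secret. The check_vllm_settings function is a hypothetical helper for illustration only:

```python
from urllib.parse import urlparse

def check_vllm_settings(url, token):
    """Return a list of problems with the vLLM connection settings.

    Per the prerequisites, the model endpoint URL must use https and
    end with /v1, and a token is required when token authentication
    was enabled during model deployment.
    """
    problems = []
    parsed = urlparse(url)
    if parsed.scheme != "https":
        problems.append("VLLM_URL should use https")
    if not parsed.path.rstrip("/").endswith("/v1"):
        problems.append("VLLM_URL should end with /v1")
    if not token:
        problems.append("VLLM_API_TOKEN must not be empty")
    return problems

print(check_vllm_settings("https://llama-32-3b-instruct-predictor:8443/v1", "tok"))  # []
print(check_vllm_settings("http://example:8080", ""))
```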

Procedure

  1. Open a new terminal window and log in to your OpenShift cluster from the CLI:

    In the upper-right corner of the OpenShift web console, click your user name and select Copy login command. After you have logged in, click Display token. Copy the Log in with this token command and paste it in the OpenShift CLI (oc).

    $ oc login --token=<token> --server=<openshift_cluster_url>
  2. Create a secret that contains the inference model and the remote embeddings environment variables:

    # Remote LLM
    export INFERENCE_MODEL="llama-3-2-3b"
    export VLLM_URL="https://llama-32-3b-instruct-predictor:8443/v1"
    export VLLM_TLS_VERIFY="false"   # Use "true" in production
    export VLLM_API_TOKEN="<token identifier>"
    export VLLM_MAX_TOKENS=16384
    
    # Remote embedding configuration
    export EMBEDDING_MODEL="nomic-embed-text-v1-5"
    export EMBEDDING_PROVIDER_MODEL_ID="nomic-embed-text-v1-5"
    export VLLM_EMBEDDING_URL="<embedding-endpoint>/v1"
    export VLLM_EMBEDDING_API_TOKEN="<embedding-token>"
    export VLLM_EMBEDDING_MAX_TOKENS=8192
    export VLLM_EMBEDDING_TLS_VERIFY="true"
    
    oc create secret generic llama-stack-secret -n <project-name> \
      --from-literal=INFERENCE_MODEL="$INFERENCE_MODEL" \
      --from-literal=VLLM_URL="$VLLM_URL" \
      --from-literal=VLLM_TLS_VERIFY="$VLLM_TLS_VERIFY" \
      --from-literal=VLLM_API_TOKEN="$VLLM_API_TOKEN" \
      --from-literal=VLLM_MAX_TOKENS="$VLLM_MAX_TOKENS" \
      --from-literal=EMBEDDING_MODEL="$EMBEDDING_MODEL" \
      --from-literal=EMBEDDING_PROVIDER_MODEL_ID="$EMBEDDING_PROVIDER_MODEL_ID" \
      --from-literal=VLLM_EMBEDDING_URL="$VLLM_EMBEDDING_URL" \
      --from-literal=VLLM_EMBEDDING_TLS_VERIFY="$VLLM_EMBEDDING_TLS_VERIFY" \
      --from-literal=VLLM_EMBEDDING_API_TOKEN="$VLLM_EMBEDDING_API_TOKEN" \
      --from-literal=VLLM_EMBEDDING_MAX_TOKENS="$VLLM_EMBEDDING_MAX_TOKENS"
  3. Choose one of the following deployment examples:
Important

To enable inline embeddings in a disconnected environment, add the following parameters to your LlamaStackDistribution custom resource:

# Enable inline embeddings with sentence-transformers
- name: ENABLE_SENTENCE_TRANSFORMERS
  value: "true"
- name: EMBEDDING_PROVIDER
  value: "sentence-transformers"

# Additional required configuration for disconnected environments
- name: SENTENCE_TRANSFORMERS_HOME
  value: /opt/app-root/src/.cache/huggingface/hub
- name: HF_HUB_OFFLINE
  value: "1"
- name: TRANSFORMERS_OFFLINE
  value: "1"
- name: HF_DATASETS_OFFLINE
  value: "1"

The built-in Llama Stack tool websearch is not available in the Red Hat Llama Stack Distribution in disconnected environments. In addition, the built-in Llama Stack tool wolfram_alpha is not available in the Red Hat Llama Stack Distribution in any cluster.

Example A: Inline Milvus

Use this example for development or small datasets where an embedded, single-node Milvus is sufficient. This example uses remote embeddings.

  1. In the Administrator perspective of the OpenShift web console, click the Quick create (+) icon, select Import YAML, and create a CR similar to the following:

    apiVersion: llamastack.io/v1alpha1
    kind: LlamaStackDistribution
    metadata:
      name: lsd-llama-milvus-inline
    spec:
      replicas: 1
      server:
        containerSpec:
          resources:
            requests:
              cpu: "250m"
              memory: "500Mi"
            limits:
              cpu: 4
              memory: "12Gi"
          env:
            # PostgreSQL metadata store (required in OpenShift AI 3.2)
            - name: POSTGRES_HOST
              value: <postgres-host>
            - name: POSTGRES_PORT
              value: "5432"
            - name: POSTGRES_DB
              value: <postgres-database>
            - name: POSTGRES_USER
              value: <postgres-username>
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: <postgres-secret-name>
                  key: <postgres-password-key>
    
            # Remote LLM configuration
            - name: INFERENCE_MODEL
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: INFERENCE_MODEL
            - name: VLLM_URL
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_URL
            - name: VLLM_TLS_VERIFY
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_TLS_VERIFY
            - name: VLLM_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_API_TOKEN
            - name: VLLM_MAX_TOKENS
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_MAX_TOKENS
    
            # Remote embedding configuration
            - name: EMBEDDING_MODEL
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: EMBEDDING_MODEL
            - name: EMBEDDING_PROVIDER_MODEL_ID
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: EMBEDDING_PROVIDER_MODEL_ID
            - name: VLLM_EMBEDDING_URL
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_EMBEDDING_URL
            - name: VLLM_EMBEDDING_TLS_VERIFY
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_EMBEDDING_TLS_VERIFY
            - name: VLLM_EMBEDDING_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_EMBEDDING_API_TOKEN
            - name: VLLM_EMBEDDING_MAX_TOKENS
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_EMBEDDING_MAX_TOKENS
    
            - name: FMS_ORCHESTRATOR_URL
              value: "http://localhost"
          name: llama-stack
          port: 8321
        distribution:
          name: rh-dev
        storage:
          size: 5Gi
    Note

    The rh-dev value is an internal image reference. When you create the LlamaStackDistribution custom resource, the OpenShift AI Operator automatically resolves rh-dev to the container image in the appropriate registry. This internal image reference allows the underlying image to update without requiring changes to your custom resource.

Example B: Remote Milvus

Use this example for production-grade or large datasets with an external Milvus service. This example uses inline embeddings served with the sentence-transformers library.

  1. Create the Milvus connection secret:

    # Required: gRPC endpoint on port 19530
    export MILVUS_ENDPOINT="tcp://milvus-service:19530"
    export MILVUS_TOKEN="<milvus-root-or-user-token>"
    export MILVUS_CONSISTENCY_LEVEL="Bounded"   # Optional; choose per your deployment
    
    oc create secret generic milvus-secret -n <project-name> \
      --from-literal=MILVUS_ENDPOINT="$MILVUS_ENDPOINT" \
      --from-literal=MILVUS_TOKEN="$MILVUS_TOKEN" \
      --from-literal=MILVUS_CONSISTENCY_LEVEL="$MILVUS_CONSISTENCY_LEVEL"
    Important

    Use the gRPC port 19530 for MILVUS_ENDPOINT. Ports such as 9091 are typically used for health checks and are not valid for client traffic.
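The port rule in this note can be enforced with a quick check before you store the endpoint in the secret. The assert_grpc_port function is a hypothetical helper for illustration:

```python
def assert_grpc_port(endpoint):
    """Reject endpoints that do not target Milvus's gRPC port 19530.

    Health-check ports such as 9091 are not valid for client traffic
    and must not appear in MILVUS_ENDPOINT.
    """
    if not endpoint.rstrip("/").endswith(":19530"):
        raise ValueError(f"use gRPC port 19530 in MILVUS_ENDPOINT, got: {endpoint}")
    return endpoint

assert_grpc_port("tcp://milvus-service:19530")    # accepted
# assert_grpc_port("http://milvus-service:9091")  # would raise ValueError
```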

  2. In the Administrator perspective of the OpenShift web console, click the Quick create (+) icon, select Import YAML, and create a CR similar to the following:

    apiVersion: llamastack.io/v1alpha1
    kind: LlamaStackDistribution
    metadata:
      name: lsd-llama-milvus-remote
    spec:
      replicas: 1
      server:
        containerSpec:
          resources:
            requests:
              cpu: "250m"
              memory: "500Mi"
            limits:
              cpu: 4
              memory: "12Gi"
          env:
            # PostgreSQL metadata store (required in OpenShift AI 3.2)
            - name: POSTGRES_HOST
              value: <postgres-host>
            - name: POSTGRES_PORT
              value: "5432"
            - name: POSTGRES_DB
              value: <postgres-database>
            - name: POSTGRES_USER
              value: <postgres-username>
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: <postgres-secret-name>
                  key: <postgres-password-key>
    
            # Inline embeddings (sentence-transformers)
            - name: ENABLE_SENTENCE_TRANSFORMERS
              value: "true"
            - name: EMBEDDING_PROVIDER
              value: "sentence-transformers"
    
            # Remote LLM configuration
            - name: INFERENCE_MODEL
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: INFERENCE_MODEL
            - name: VLLM_MAX_TOKENS
              value: "4096"
            - name: VLLM_URL
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_URL
            - name: VLLM_TLS_VERIFY
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_TLS_VERIFY
            - name: VLLM_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_API_TOKEN
    
            # Remote Milvus configuration from secret
            - name: MILVUS_ENDPOINT
              valueFrom:
                secretKeyRef:
                  name: milvus-secret
                  key: MILVUS_ENDPOINT
            - name: MILVUS_TOKEN
              valueFrom:
                secretKeyRef:
                  name: milvus-secret
                  key: MILVUS_TOKEN
            - name: MILVUS_CONSISTENCY_LEVEL
              valueFrom:
                secretKeyRef:
                  name: milvus-secret
                  key: MILVUS_CONSISTENCY_LEVEL
          name: llama-stack
          port: 8321
        distribution:
          name: rh-dev

Example C: Inline FAISS

Use this example to enable the inline FAISS vector store. This example uses inline embeddings served with the sentence-transformers library.

  1. In the Administrator perspective of the OpenShift web console, click the Quick create (+) icon, select Import YAML, and create a CR similar to the following:

    apiVersion: llamastack.io/v1alpha1
    kind: LlamaStackDistribution
    metadata:
      name: lsd-llama-faiss-inline
    spec:
      replicas: 1
      server:
        containerSpec:
          resources:
            requests:
              cpu: "250m"
              memory: "500Mi"
            limits:
              cpu: "8"
              memory: "12Gi"
          env:
            # PostgreSQL metadata store (required in OpenShift AI 3.2)
            - name: POSTGRES_HOST
              value: <postgres-host>
            - name: POSTGRES_PORT
              value: "5432"
            - name: POSTGRES_DB
              value: <postgres-database>
            - name: POSTGRES_USER
              value: <postgres-username>
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: <postgres-secret-name>
                  key: <postgres-password-key>
    
            # Inline embeddings (sentence-transformers)
            - name: ENABLE_SENTENCE_TRANSFORMERS
              value: "true"
            - name: EMBEDDING_PROVIDER
              value: "sentence-transformers"
    
            # Remote LLM configuration
            - name: INFERENCE_MODEL
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: INFERENCE_MODEL
            - name: VLLM_URL
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_URL
            - name: VLLM_TLS_VERIFY
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_TLS_VERIFY
            - name: VLLM_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_API_TOKEN
    
            # Enable inline FAISS
            - name: ENABLE_FAISS
              value: "faiss"
    
            - name: FMS_ORCHESTRATOR_URL
              value: "http://localhost"
          name: llama-stack
          port: 8321
        distribution:
          name: rh-dev

Example D: Remote PostgreSQL with pgvector

Use this example when you want to use a PostgreSQL database with the pgvector extension as the vector store backend. This configuration enables the pgvector provider and reads connection values from a secret. This example uses remote embeddings.

  1. Create the pgvector connection secret:

    export PGVECTOR_HOST="<pgvector-hostname>"
    export PGVECTOR_PORT="5432"
    export PGVECTOR_DB="<pgvector-database>"
    export PGVECTOR_USER="<pgvector-username>"
    export PGVECTOR_PASSWORD="<pgvector-password>"
    
    oc create secret generic pgvector-connection -n <project-name> \
      --from-literal=PGVECTOR_HOST="$PGVECTOR_HOST" \
      --from-literal=PGVECTOR_PORT="$PGVECTOR_PORT" \
      --from-literal=PGVECTOR_DB="$PGVECTOR_DB" \
      --from-literal=PGVECTOR_USER="$PGVECTOR_USER" \
      --from-literal=PGVECTOR_PASSWORD="$PGVECTOR_PASSWORD"
  2. In the Administrator perspective of the OpenShift web console, click the Quick create (+) icon, select Import YAML, and create a custom resource similar to the following:

    apiVersion: llamastack.io/v1alpha1
    kind: LlamaStackDistribution
    metadata:
      name: lsd-llama-pgvector-remote
    spec:
      replicas: 1
      server:
        containerSpec:
          resources:
            requests:
              cpu: "250m"
              memory: "500Mi"
            limits:
              cpu: 4
              memory: "12Gi"
          env:
            # PostgreSQL metadata store (required in OpenShift AI 3.2)
            - name: POSTGRES_HOST
              value: <postgres-host>
            - name: POSTGRES_PORT
              value: "5432"
            - name: POSTGRES_DB
              value: <postgres-database>
            - name: POSTGRES_USER
              value: <postgres-username>
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: <postgres-secret-name>
                  key: <postgres-password-key>
    
            # Remote LLM configuration
            - name: INFERENCE_MODEL
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: INFERENCE_MODEL
            - name: VLLM_URL
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_URL
            - name: VLLM_TLS_VERIFY
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_TLS_VERIFY
            - name: VLLM_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_API_TOKEN
            - name: VLLM_MAX_TOKENS
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_MAX_TOKENS
    
            # Remote embedding configuration
            - name: EMBEDDING_MODEL
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: EMBEDDING_MODEL
            - name: EMBEDDING_PROVIDER_MODEL_ID
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: EMBEDDING_PROVIDER_MODEL_ID
            - name: VLLM_EMBEDDING_URL
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_EMBEDDING_URL
            - name: VLLM_EMBEDDING_TLS_VERIFY
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_EMBEDDING_TLS_VERIFY
            - name: VLLM_EMBEDDING_API_TOKEN
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_EMBEDDING_API_TOKEN
            - name: VLLM_EMBEDDING_MAX_TOKENS
              valueFrom:
                secretKeyRef:
                  name: llama-stack-secret
                  key: VLLM_EMBEDDING_MAX_TOKENS
    
            # Enable and configure pgvector provider
            - name: ENABLE_PGVECTOR
              value: "true"
            - name: PGVECTOR_HOST
              valueFrom:
                secretKeyRef:
                  name: pgvector-connection
                  key: PGVECTOR_HOST
            - name: PGVECTOR_PORT
              valueFrom:
                secretKeyRef:
                  name: pgvector-connection
                  key: PGVECTOR_PORT
            - name: PGVECTOR_DB
              valueFrom:
                secretKeyRef:
                  name: pgvector-connection
                  key: PGVECTOR_DB
            - name: PGVECTOR_USER
              valueFrom:
                secretKeyRef:
                  name: pgvector-connection
                  key: PGVECTOR_USER
            - name: PGVECTOR_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: pgvector-connection
                  key: PGVECTOR_PASSWORD
    
            - name: FMS_ORCHESTRATOR_URL
              value: "http://localhost"
          name: llama-stack
          port: 8321
        distribution:
          name: rh-dev
  3. Click Create.

Verification

  • In the left-hand navigation, click Workloads → Pods and verify that the Llama Stack pod is running in the correct namespace.
  • To verify that the Llama Stack server is running, click the pod name and select the Logs tab. Look for output similar to the following:

    INFO     2025-05-15 11:23:52,750 __main__:498 server: Listening on ['::', '0.0.0.0']:8321
    INFO:     Started server process [1]
    INFO:     Waiting for application startup.
    INFO     2025-05-15 11:23:52,765 __main__:151 server: Starting up
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://['::', '0.0.0.0']:8321 (Press CTRL+C to quit)
Tip

If you switch between vector store configurations, delete the existing pod to ensure the new environment variables and backing store are picked up cleanly.

4.1.7. Ingesting content into a Llama model

You can quickly customize and prototype retrievable content by uploading a document and adding it to a vector store from inside a Jupyter notebook. This approach avoids building a separate ingestion pipeline. By using the Llama Stack SDK, you can ingest documents into a vector store and enable retrieval-augmented generation (RAG) workflows.

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have deployed a Llama 3.2 model with a vLLM model server.
  • You have created a LlamaStackDistribution instance.
  • You have configured a PostgreSQL database for Llama Stack metadata storage.
  • You have configured an embedding model:

    • Recommended: You have configured a remote embedding model by using environment variables in the LlamaStackDistribution.
    • Optional: You have enabled inline embeddings with the sentence-transformers library for development or testing.
  • You have created a workbench within a project.
  • You have opened a Jupyter notebook and it is running in your workbench environment.
  • You have installed llama_stack_client version 0.3.1 or later in your workbench environment.
  • You have installed requests in your workbench environment. This is required for downloading example documents.
  • If you use a remote vector store or remote embedding model, your environment has network access to those services through OpenShift.

Procedure

  1. In a new notebook cell, install the client:

    %pip install llama_stack_client
  2. Install the requests library if it is not already available:

    %pip install requests
  3. Import LlamaStackClient and create a client instance:

    from llama_stack_client import LlamaStackClient
    
    # Use the Llama Stack service or route URL that is reachable from the workbench.
    # Do not append /v1 when using llama_stack_client.
    client = LlamaStackClient(base_url="<llama-stack-base-url>")
  4. List the available models:

    models = client.models.list()
  5. Verify that the list includes:

    • At least one LLM model.
    • At least one embedding model (remote or inline).

      [Model(identifier='llama-32-3b-instruct', model_type='llm', provider_id='vllm-inference'),
       Model(identifier='nomic-embed-text-v1-5', model_type='embedding', metadata={'embedding_dimension': 768})]
  6. Select one LLM and one embedding model:

    model_id = next(m.identifier for m in models if m.model_type == "llm")
    
    embedding_model = next(m for m in models if m.model_type == "embedding")
    embedding_model_id = embedding_model.identifier
    embedding_dimension = int(embedding_model.metadata["embedding_dimension"])
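    The next(...) calls above raise StopIteration when no matching model is registered. A slightly more defensive sketch fails with a clearer message instead; the Model dataclass here is a stand-in for the records that client.models.list() returns, and pick_models is a hypothetical helper:

    ```python
    from dataclasses import dataclass, field

    # Stand-ins for the records returned by client.models.list();
    # the real objects expose the same attributes.
    @dataclass
    class Model:
        identifier: str
        model_type: str
        metadata: dict = field(default_factory=dict)

    def pick_models(models):
        """Return (llm_id, embedding_id, embedding_dimension), failing
        clearly when either model type is missing."""
        llm = next((m for m in models if m.model_type == "llm"), None)
        emb = next((m for m in models if m.model_type == "embedding"), None)
        if llm is None or emb is None:
            raise RuntimeError("register at least one LLM and one embedding model")
        return llm.identifier, emb.identifier, int(emb.metadata.get("embedding_dimension", 768))

    models = [
        Model("llama-32-3b-instruct", "llm"),
        Model("nomic-embed-text-v1-5", "embedding", {"embedding_dimension": 768}),
    ]
    print(pick_models(models))  # ('llama-32-3b-instruct', 'nomic-embed-text-v1-5', 768)
    ```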
  7. (Optional) Create a vector store. Skip this step if you already have one.

    Note

    Provider IDs can differ between interfaces. In the Python SDK, you typically use the provider name directly (for example, provider_id: "pgvector"). In some CLI tools and examples, remote providers might use a prefixed identifier (for example, --vector-db-provider-id remote-pgvector). Use the provider ID format that matches the interface you are using.

Example 4.1. Option 1: Inline Milvus (embedded)

vector_store = client.vector_stores.create(
    name="my_inline_milvus",
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "milvus",
    },
)
vector_store_id = vector_store.id
Note

Inline Milvus is suitable for development and small datasets. In OpenShift AI 3.2 and later, metadata persistence uses PostgreSQL by default.

Example 4.2. Option 2: Remote Milvus (recommended for production)

vector_store = client.vector_stores.create(
    name="my_remote_milvus",
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "milvus-remote",
    },
)
vector_store_id = vector_store.id
Note

Ensure your LlamaStackDistribution is configured with MILVUS_ENDPOINT and MILVUS_TOKEN.

Example 4.3. Option 3: Inline FAISS

vector_store = client.vector_stores.create(
    name="my_inline_faiss",
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "faiss",
    },
)
vector_store_id = vector_store.id
Note

Inline FAISS is an in-process vector store intended for development and testing. In OpenShift AI 3.2 and later, FAISS uses PostgreSQL as the default metadata store.

Example 4.4. Option 4: Remote PostgreSQL with pgvector

vector_store = client.vector_stores.create(
    name="my_pgvector_store",
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "pgvector",
    },
)
vector_store_id = vector_store.id
Note

Ensure that the pgvector provider is enabled in your LlamaStackDistribution and that the PostgreSQL instance has the pgvector extension installed.

  1. If you already have a vector store, set its identifier:

    # vector_store_id = "<existing-vector-store-id>"
  2. Download a PDF, upload it to Llama Stack, and add it to your vector store:

    import requests
    
    pdf_url = "https://www.federalreserve.gov/aboutthefed/files/quarterly-report-20250822.pdf"
    filename = "quarterly-report-20250822.pdf"
    
    response = requests.get(pdf_url)
    response.raise_for_status()
    
    with open(filename, "wb") as f:
        f.write(response.content)
    
    with open(filename, "rb") as f:
        file_info = client.files.create(
            file=(filename, f),
            purpose="assistants",
        )
    
    vector_store_file = client.vector_stores.files.create(
        vector_store_id=vector_store_id,
        file_id=file_info.id,
        chunking_strategy={
            "type": "static",
            "static": {
                "max_chunk_size_tokens": 800,
                "chunk_overlap_tokens": 400,
            },
        },
    )
    
    print(vector_store_file)
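With the static chunking strategy above, consecutive chunks advance by max_chunk_size_tokens minus chunk_overlap_tokens, so each 800-token chunk repeats the last 400 tokens of its predecessor. The following toy sketch over a plain list of tokens illustrates the arithmetic only; real ingestion tokenizes the document first:

```python
def static_chunks(tokens, max_chunk_size=800, overlap=400):
    """Split a token sequence into overlapping chunks, mirroring the
    static chunking_strategy used above: each chunk holds up to
    max_chunk_size tokens and overlaps its predecessor by `overlap`."""
    step = max_chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_chunk_size])
        if start + max_chunk_size >= len(tokens):
            break
    return chunks

chunks = static_chunks(list(range(2000)))
print(len(chunks))   # 4
print(chunks[1][0])  # 400 (the second chunk starts 400 tokens in)
```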

Verification

  • The call to client.vector_stores.files.create() succeeds and returns metadata for the ingested file.
  • The vector store contains indexed chunks associated with the uploaded document.
  • Subsequent RAG queries can retrieve content from the vector store.

4.1.8. Querying ingested content in a Llama model

You can use the Llama Stack SDK in your Jupyter notebook to query ingested content by running retrieval-augmented generation (RAG) queries on content stored in your vector store. You can perform one-off lookups without setting up a separate retrieval service.

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have enabled GPU support in OpenShift AI. This includes installing the Node Feature Discovery Operator and NVIDIA GPU Operator. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs.
  • If you are using GPU acceleration, you have at least one NVIDIA GPU available.
  • You have activated the Llama Stack Operator in OpenShift AI.
  • You have deployed an inference model, for example, the llama-3.2-3b-instruct model.
  • You have created a LlamaStackDistribution instance with:

    • PostgreSQL configured as the metadata store.
    • An embedding model configured, preferably as a remote embedding provider.
  • You have created a workbench within a project and opened a running Jupyter notebook.
  • You have installed llama_stack_client version 0.3.1 or later in your workbench environment.
  • You have already ingested content into a vector store.
Note

This procedure requires that content has already been ingested into a vector store. If no content is available, RAG queries return empty or non-contextual responses.

Procedure

  1. In a new notebook cell, install the client:

    %pip install -q llama_stack_client
  2. Import LlamaStackClient:

    from llama_stack_client import LlamaStackClient
  3. Create a client instance:

    # Use the Llama Stack service or route URL that is reachable from the workbench.
    # Do not append /v1 when using llama_stack_client.
    client = LlamaStackClient(base_url="<llama-stack-base-url>")
  4. List available models:

    models = client.models.list()
  5. Select an LLM. If you plan to register a new vector store, also capture an embedding model:

    model_id = next(m.identifier for m in models if m.model_type == "llm")
    
    embedding = next((m for m in models if m.model_type == "embedding"), None)
    if embedding:
        embedding_model_id = embedding.identifier
        embedding_dimension = int(embedding.metadata.get("embedding_dimension", 768))
  6. If you do not already have a vector store ID, register a vector store (choose one):

    Example 4.5. Option 1: Inline Milvus (embedded)

    vector_store = client.vector_stores.create(
        name="my_inline_milvus",
        extra_body={
            "embedding_model": embedding_model_id,
            "embedding_dimension": embedding_dimension,
            "provider_id": "milvus",
        },
    )
    vector_store_id = vector_store.id
    Note

    Inline Milvus is suitable for development and small datasets. In OpenShift AI 3.2 and later, metadata persistence uses PostgreSQL by default.

Example 4.6. Option 2: Remote Milvus (recommended for production)

vector_store = client.vector_stores.create(
    name="my_remote_milvus",
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "milvus-remote",
    },
)
vector_store_id = vector_store.id
Note

Ensure your LlamaStackDistribution sets MILVUS_ENDPOINT (gRPC port 19530) and MILVUS_TOKEN.

Example 4.7. Option 3: Inline FAISS

vector_store = client.vector_stores.create(
    name="my_inline_faiss",
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "faiss",
    },
)
vector_store_id = vector_store.id
Note

Inline FAISS is an in-process vector store intended for development and testing. In OpenShift AI 3.2 and later, FAISS uses PostgreSQL as the default metadata store.

Example 4.8. Option 4: Remote PostgreSQL with pgvector

vector_store = client.vector_stores.create(
    name="my_pgvector_store",
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "pgvector",
    },
)
vector_store_id = vector_store.id
Note

Ensure the pgvector provider is enabled in your LlamaStackDistribution and that the PostgreSQL instance has the pgvector extension installed. This option is suitable for production-grade RAG workloads that require durability and concurrency.

  7. If you already have a vector store, set its identifier:

    # vector_store_id = "<existing-vector-store-id>"
  8. Query without using a vector store:

    system_instructions = """You are a precise and reliable AI assistant.
    Use retrieved context when it is available.
    If nothing relevant is found, say so clearly."""
    
    query = "How do you do great work?"
    
    response = client.responses.create(
        model=model_id,
        input=query,
        instructions=system_instructions,
    )
    
    print(response.output_text)
  9. Query by using the Responses API with file search:

    response = client.responses.create(
        model=model_id,
        input=query,
        instructions=system_instructions,
        tools=[
            {
                "type": "file_search",
                "vector_store_ids": [vector_store_id],
            }
        ],
    )
    
    print(response.output_text)
Note

When you include the file_search tool with vector_store_ids, Llama Stack retrieves relevant chunks from the specified vector store and provides them to the model as context for the response.
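Conceptually, file search augments the model input with the retrieved chunks. The following sketch illustrates that idea only; the helper function and chunk texts are hypothetical and do not reflect Llama Stack internals:

```python
def build_augmented_input(query: str, chunks: list[str]) -> str:
    """Illustrative only: prepend retrieved chunks to the user query."""
    context = "\n\n".join(f"[Context {i + 1}]\n{c}" for i, c in enumerate(chunks))
    return f"{context}\n\nQuestion: {query}"

# Hypothetical chunks, as if returned by a vector store query.
retrieved = [
    "Word processors began replacing typewriters in the 1970s.",
    "Early systems combined dedicated hardware and software.",
]
print(build_augmented_input("What is the origin of word processing?", retrieved))
```

The model then answers the question with the retrieved context in scope, which is what produces the context-aware responses described in the verification steps.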

Verification

  • The notebook returns a response without vector stores and a context-aware response when vector stores are enabled.
  • No errors appear, confirming successful retrieval and model execution.

You can transform your source documents with a Docling-enabled pipeline and ingest the output into a Llama Stack vector store by using the Llama Stack SDK. This modular approach separates document preparation from ingestion while still enabling an end-to-end, retrieval-augmented generation (RAG) workflow.

The pipeline registers a vector store and downloads the source PDFs, then splits them for parallel processing and converts each batch to Markdown with Docling. It generates embeddings from the Markdown and stores them in the vector store, making the documents searchable through Llama Stack.
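The split step can be pictured as follows. This is an illustrative sketch of round-robin batching for parallel workers (the filenames and worker count are placeholders), not the pipeline's actual code:

```python
def split_into_batches(filenames: list[str], num_workers: int) -> list[list[str]]:
    """Distribute PDF filenames round-robin across num_workers batches."""
    batches: list[list[str]] = [[] for _ in range(num_workers)]
    for i, name in enumerate(filenames):
        batches[i % num_workers].append(name)
    return [b for b in batches if b]  # drop empty batches

pdfs = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"]
print(split_into_batches(pdfs, 2))  # [['a.pdf', 'c.pdf', 'e.pdf'], ['b.pdf', 'd.pdf']]
```

Each batch is then converted to Markdown by a separate Docling worker before embedding and ingestion.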

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have enabled GPU support in OpenShift AI. This includes installing the Node Feature Discovery operator and NVIDIA GPU Operators. For more information, see Installing the Node Feature Discovery operator and Enabling NVIDIA GPUs.
  • You have logged in to the OpenShift web console.
  • You have a project and access to pipelines in the OpenShift AI dashboard.
  • You have created and configured a pipeline server within the project that contains your workbench.
  • You have activated the Llama Stack Operator in OpenShift AI.
  • You have deployed an inference model, for example, the llama-3.2-3b-instruct model.
  • You have configured a Llama Stack deployment by creating a LlamaStackDistribution instance to enable RAG functionality.
  • You have created a workbench within a project.
  • You have opened a Jupyter notebook and it is running in your workbench environment.
  • You have installed the llama_stack_client version 0.3.1 or later in your workbench environment.
  • You have installed local object storage buckets and created connections, as described in Adding a connection to your project.
  • You have compiled to YAML a pipeline that includes a Docling transform, either one of the RAG demo samples or your own custom pipeline.
  • Your project quota allows between 500 millicores (0.5 CPU) and 4 CPU cores for the pipeline run.
  • Your project quota allows from 2 GiB up to 6 GiB of RAM for the pipeline run.
  • If you are using GPU acceleration, you have at least one NVIDIA GPU available.

Procedure

  1. In a new notebook cell, install the client:

    %pip install -q llama_stack_client
  2. In a new notebook cell, import LlamaStackClient:

    from llama_stack_client import LlamaStackClient
  3. In a new notebook cell, assign your deployment endpoint to the base_url parameter to create a LlamaStackClient instance:

    client = LlamaStackClient(base_url="http://<llama-stack-service>:8321")
    Note

    LlamaStackClient requires the service root without the /v1 path suffix. For example, use http://llama-stack-service:8321.

    The /v1 suffix is required only when you use OpenAI-compatible SDKs or send raw HTTP requests to the OpenAI-compatible API surface.

  4. List the available models:

    models = client.models.list()
  5. Select the first LLM and the first embedding model:

    model_id = next(m.identifier for m in models if m.model_type == "llm")
    embedding_model = next(m for m in models if m.model_type == "embedding")
    embedding_model_id = embedding_model.identifier
    embedding_dimension = int(embedding_model.metadata.get("embedding_dimension", 768))
  6. Register a vector store (choose one option). Skip this step if your pipeline registers the store automatically.

Example 4.9. Option 1: Inline Milvus Lite (embedded)

vector_store_name = "my_inline_db"
vector_store = client.vector_stores.create(
    name=vector_store_name,
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "milvus",   # inline Milvus Lite
    },
)
vector_store_id = vector_store.id
print(f"Registered inline Milvus Lite DB: {vector_store_id}")
Note

Inline Milvus Lite is best for development. Data durability and scale are limited compared to remote Milvus.

Example 4.10. Option 2: Remote Milvus (recommended for production)

vector_store_name = "my_remote_db"
vector_store = client.vector_stores.create(
    name=vector_store_name,
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "milvus-remote",  # remote Milvus provider
    },
)
vector_store_id = vector_store.id
print(f"Registered remote Milvus DB: {vector_store_id}")
Note

Ensure your LlamaStackDistribution sets MILVUS_ENDPOINT (gRPC port 19530) and MILVUS_TOKEN.

Example 4.11. Option 3: Inline FAISS

vector_store_name = "my_faiss_db"
vector_store = client.vector_stores.create(
    name=vector_store_name,
    extra_body={
        "embedding_model": embedding_model_id,
        "embedding_dimension": embedding_dimension,
        "provider_id": "faiss",   # inline FAISS provider
    },
)
vector_store_id = vector_store.id
print(f"Registered inline FAISS DB: {vector_store_id}")
Note

Inline FAISS (available in OpenShift AI 3.0 and later) is a lightweight, in-process vector store. It is best for local experimentation, disconnected environments, or single-node RAG deployments.

Important

If you are using the sample Docling pipeline from the RAG demo repository, the pipeline registers the vector store automatically and you can skip the previous step. If you are using your own pipeline, you must register the vector store yourself.

  7. In the OpenShift web console, import the YAML file containing your Docling pipeline into your project, as described in Importing a pipeline.
  8. Create a pipeline run to execute your Docling pipeline, as described in Executing a pipeline run. The pipeline run inserts your PDF documents into the vector store. If you run the Docling pipeline from the RAG demo samples repository, you can optionally customize the following parameters before starting the pipeline run:

    • base_url: The base URL to fetch PDF files from.
    • pdf_filenames: A comma-separated list of PDF filenames to download and convert.
    • num_workers: The number of parallel workers.
    • vector_store_id: The vector store identifier.
    • service_url: The Milvus service URL (only for remote Milvus).
    • embed_model_id: The embedding model to use.
    • max_tokens: The maximum tokens for each chunk.
    • use_gpu: Enable or disable GPU acceleration.

Verification

  1. In your Jupyter notebook, query the LLM with a question that relates to the ingested content:

    system_instructions = """You are a precise and reliable AI assistant.
    Use retrieved context when it is available.
    If nothing relevant is found in the available files, say so clearly."""
    
    prompt = "What can you tell me about the birth of word processing?"
    
    # Query using the Responses API with file search
    response = client.responses.create(
        model=model_id,
        input=prompt,
        instructions=system_instructions,
        tools=[
            {
                "type": "file_search",
                "vector_store_ids": [vector_store_id],
            }
        ],
    )
    
    print("Answer (with vector stores):")
    print(response.output_text)
  2. Query chunks from the vector store:

    query_result = client.vector_io.query(
        vector_store_id=vector_store_id,
        query="word processing",
    )
    print(query_result)
    • The pipeline run completes successfully in your project.
    • Document embeddings are stored in the vector store and are available for retrieval.
    • No errors or warnings appear in the pipeline logs or your notebook output.

4.1.10. About Llama Stack search types

Llama Stack supports keyword, vector, and hybrid search modes for retrieving context in retrieval-augmented generation (RAG) workloads. Each mode offers different tradeoffs in precision, recall, semantic depth, and computational cost.

4.1.10.1. Supported search modes

Llama Stack supports the following search modes:

  • Keyword search: Matches query terms against document text by using lexical scoring. This mode is precise for exact terminology but can miss paraphrases.
  • Vector search: Compares query and document embeddings by semantic similarity. This mode captures meaning beyond exact wording, at a higher computational cost.
  • Hybrid search: Combines keyword and vector results, typically through weighted scoring or rank fusion, to balance precision and recall.
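Hybrid retrieval is often implemented with reciprocal rank fusion (RRF), which merges the ranked results of keyword and vector search. The following sketch is illustrative (the document IDs and the k constant are assumptions), not Llama Stack's internal implementation:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from keyword and vector search.
keyword_hits = ["doc-3", "doc-1", "doc-7"]
vector_hits = ["doc-1", "doc-5", "doc-3"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Documents that rank well in both lists rise to the top, which is why hybrid search balances lexical precision with semantic recall.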

4.2. Evaluating RAG systems with Llama Stack

You can use the evaluation providers that Llama Stack exposes to measure and improve the quality of your Retrieval-Augmented Generation (RAG) workloads in OpenShift AI. This section introduces RAG evaluation providers, describes how to use Ragas with Llama Stack, shows how to benchmark embedding models with BEIR, and helps you choose the right provider for your use case.

4.2.1. Understanding RAG evaluation providers

Llama Stack supports pluggable evaluation providers that measure the quality and performance of Retrieval-Augmented Generation (RAG) pipelines. Evaluation providers assess how accurately, faithfully, and relevantly the generated responses align with the retrieved context and the original user query. Each provider implements its own metrics and evaluation methodology. You can enable a specific provider through the configuration of the LlamaStackDistribution custom resource.

OpenShift AI supports the following evaluation providers:

  • Ragas: A lightweight, Python-based framework that evaluates factuality, contextual grounding, and response relevance.
  • BEIR: A benchmarking framework for retrieval performance across multiple datasets.
  • TrustyAI: A Red Hat framework that evaluates explainability, fairness, and reliability of model outputs.

Evaluation providers operate independently of model serving and retrieval components. You can run evaluations asynchronously and aggregate results for quality tracking over time.

4.2.2. Using Ragas with Llama Stack

You can use the Ragas (Retrieval-Augmented Generation Assessment) evaluation provider with Llama Stack to measure the quality of your Retrieval-Augmented Generation (RAG) workflows in OpenShift AI. Ragas integrates with the Llama Stack evaluation API to compute metrics such as faithfulness, answer relevancy, and context precision for your RAG workloads.

Llama Stack exposes evaluation providers as part of its API surface. When you configure Ragas as a provider, the Llama Stack server sends RAG inputs and outputs to Ragas and records the resulting metrics for later analysis.
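As a simplified illustration of the kind of metric Ragas computes, context precision can be thought of as the fraction of retrieved chunks that are actually relevant to the question. This sketch is a simplification for intuition only; the real Ragas implementation uses LLM-based judgments rather than a fixed relevance set:

```python
def toy_context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks judged relevant (simplified)."""
    if not retrieved:
        return 0.0
    return sum(1 for chunk in retrieved if chunk in relevant) / len(retrieved)

# Hypothetical chunk identifiers for illustration.
retrieved_chunks = ["chunk-a", "chunk-b", "chunk-c", "chunk-d"]
relevant_chunks = {"chunk-a", "chunk-c"}
print(toy_context_precision(retrieved_chunks, relevant_chunks))  # 0.5
```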

Ragas evaluation with Llama Stack in OpenShift AI supports the following deployment modes:

  • Inline provider for development and small-scale experiments.
  • Remote provider for production-scale evaluations that run as pipelines in OpenShift AI.

You choose the mode that best fits your workflow:

  • Use the inline provider when you want fast, low-overhead evaluation while you iterate on prompts, retrieval configuration, or model choices.
  • Use the remote provider when you need to evaluate large datasets, integrate with CI/CD pipelines, or run repeated benchmarks at scale.

For information on evaluating RAG systems with Ragas in OpenShift AI, see Evaluating RAG systems with RAGAS.

This procedure explains how to set up, run, and verify embedding model benchmarks by using the Llama Stack framework. Embedding models are neural networks that convert text or other data into dense numerical vectors called embeddings, which capture semantic meaning. In retrieval augmented generation systems, embeddings enable semantic search so that the system retrieves the documents most relevant to a query.
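As a toy illustration of how embeddings drive semantic search, the following sketch ranks documents by cosine similarity to a query vector. The three-dimensional vectors are made up for illustration; real embedding models produce hundreds of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings" for illustration only.
docs = {"cats": [0.9, 0.1, 0.0], "finance": [0.0, 0.2, 0.9]}
query = [0.8, 0.2, 0.1]  # semantically close to the "cats" document
best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)  # cats
```

BEIR benchmarks measure exactly how well this kind of similarity ranking retrieves the correct documents for each query.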

Selecting an embedding model depends on several factors, such as the content type, accuracy requirements, performance needs, and model license. The beir_benchmarks.py script compares the retrieval accuracy of embedding models by using standardized information retrieval benchmarks from the BEIR framework. The script is included in the RAG repository, which provides demonstrations, benchmarking scripts, and deployment guides for the RAG stack on OpenShift.

The examples use the sentence-transformers inference provider, which you can replace with another provider if required.

Prerequisites

  • You have cloned the https://github.com/opendatahub-io/rag repository.
  • You have changed into the /rag/benchmarks/beir-benchmarks directory.
  • You have initialized and activated a virtual environment.
  • You have defined and installed the relevant script package dependencies to a requirements.txt file.
  • You have built the Llama Stack starter distribution to install all dependencies.
  • You have verified that your vector database is accessible and configured in the run.yaml file, and that any required embedding models were preloaded or registered with Llama Stack.
Important

Before you run the benchmark script, the Llama Stack server must be running and a vector database provider must be enabled and reachable. If you plan to compare embedding models beyond the default set, you must also register those embedding models with Llama Stack.

Procedure

  1. Optional: Start the Llama Stack server and enable a vector database provider. If you have not already started Llama Stack with a vector database provider enabled, start the server by using a configuration similar to one of the following examples:

    • Using inline Milvus:

      MILVUS_URL=milvus uv run llama stack run run.yaml
    • Using remote PostgreSQL with pgvector:

      ENABLE_PGVECTOR=true PGVECTOR_DB=pgvector PGVECTOR_USER=<user> PGVECTOR_PASSWORD=<password> uv run llama stack run run.yaml
  2. Optional: Register additional embedding models. The default supported embedding models are granite-embedding-30m and granite-embedding-125m, served by the sentence-transformers framework. If you want to benchmark additional embedding models, register them with Llama Stack before running the benchmark script.

    For example, register an embedding model by using the Llama Stack client:

    llama-stack-client models register all-MiniLM-L6-v2 \
      --provider-id sentence-transformers \
      --provider-model-id all-minilm:latest \
      --metadata '{"embedding_dimension": 384}' \
      --model-type embedding
    Note

    Any embedding models specified in the --embedding-models option must be registered before running the benchmark script.

  3. Run the beir_benchmarks.py benchmarking script.

    • Enter the following command to use the configuration from run.yaml and the default dataset scifact with inline Milvus:

      MILVUS_URL=milvus uv run python beir_benchmarks.py
    • Enter the following command to run the benchmark by using remote PostgreSQL with pgvector:

      ENABLE_PGVECTOR=true PGVECTOR_DB=pgvector uv run python beir_benchmarks.py \
        --vector-db-provider-id pgvector
    • Alternatively, enter the following command to connect to a custom Llama Stack server:

      LLAMA_STACK_URL="http://localhost:8321" MILVUS_URL=milvus uv run python beir_benchmarks.py
  4. Use environment variables and command line options to modify the benchmark run. For example, set the environment variable for the vector database provider before executing the script.

    • Enter the following command to use a larger batch size for document ingestion:

      MILVUS_URL=milvus uv run python beir_benchmarks.py --batch-size 300
    • Enter the following command to benchmark multiple datasets, for example, scifact and scidocs:

      MILVUS_URL=milvus uv run python beir_benchmarks.py \
        --dataset-names scifact scidocs
    • Enter the following command to compare embedding models, for example, granite-embedding-30m and all-MiniLM-L6-v2:

      MILVUS_URL=milvus uv run python beir_benchmarks.py \
        --embedding-models granite-embedding-30m all-MiniLM-L6-v2
      Note

      Ensure that all-MiniLM-L6-v2 is registered with Llama Stack before running this command. See step 2 for registration instructions.

    • Enter the following command to use a custom BEIR compatible dataset:

      MILVUS_URL=milvus uv run python beir_benchmarks.py \
        --dataset-names my-dataset \
        --custom-datasets-urls https://example.com/my-beir-dataset.zip
    • Enter the following command to change the vector database provider:

      # Use remote PostgreSQL with pgvector
      ENABLE_PGVECTOR=true PGVECTOR_DB=llama-stack PGVECTOR_USER=<user> PGVECTOR_PASSWORD=<password> uv run python beir_benchmarks.py \
        --vector-db-provider-id pgvector

Command line options

  • --vector-db-provider-id

    • Description: Specifies the vector database provider to use. The provider must also be enabled through the appropriate environment variable.
    • Type: String.
    • Default: milvus.
    • Example values: milvus, pgvector, faiss.
    • Example:

      --vector-db-provider-id pgvector
  • --dataset-names

    • Description: Specifies which BEIR datasets to use for benchmarking. Use this option together with --custom-datasets-urls when testing custom datasets.
    • Type: List of strings.
    • Default: ["scifact"].
    • Example:

      --dataset-names scifact scidocs nq
  • --embedding-models

    • Description: Specifies the embedding models to compare. Models must be defined in the run.yaml file.
    • Type: List of strings.
    • Default: ["granite-embedding-30m", "granite-embedding-125m"].
    • Example:

      --embedding-models all-MiniLM-L6-v2 granite-embedding-125m
  • --batch-size

    • Description: Controls how many documents are processed per batch during ingestion. Larger batch sizes improve speed but use more memory.
    • Type: Integer.
    • Default: 150.
    • Example:

      --batch-size 50
      --batch-size 300
  • --custom-datasets-urls

    • Description: Specifies URLs for custom BEIR compatible datasets. Use this option with --dataset-names.
    • Type: List of strings.
    • Default: [].
    • Example:

      --dataset-names my-custom-dataset \
        --custom-datasets-urls https://example.com/my-dataset.zip
Note

Custom BEIR datasets must follow the required file structure and format:

dataset-name.zip/
├── qrels/
│   └── test.tsv
├── corpus.jsonl
└── queries.jsonl
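A dataset in this layout can be generated with a short script. The following sketch is illustrative: the document IDs and contents are made up, and the field names follow common BEIR JSONL conventions:

```python
import json
import os
import tempfile

def write_minimal_beir_dataset(root: str) -> None:
    """Write a tiny dataset in the BEIR file layout shown above."""
    os.makedirs(os.path.join(root, "qrels"), exist_ok=True)
    with open(os.path.join(root, "corpus.jsonl"), "w") as f:
        f.write(json.dumps({"_id": "d1", "title": "Word processing",
                            "text": "Word processors replaced typewriters."}) + "\n")
    with open(os.path.join(root, "queries.jsonl"), "w") as f:
        f.write(json.dumps({"_id": "q1", "text": "history of word processing"}) + "\n")
    # qrels map each query to its relevant documents with a relevance score.
    with open(os.path.join(root, "qrels", "test.tsv"), "w") as f:
        f.write("query-id\tcorpus-id\tscore\n")
        f.write("q1\td1\t1\n")

with tempfile.TemporaryDirectory() as tmp:
    write_minimal_beir_dataset(tmp)
    print(sorted(os.listdir(tmp)))  # ['corpus.jsonl', 'qrels', 'queries.jsonl']
```

Zip the resulting directory and pass its URL through --custom-datasets-urls.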

Verification

To verify that the benchmark completed successfully and to review the results, perform the following steps:

  1. Locate the results directory. All output files are saved to the following path:

    <path-to>/rag/benchmarks/beir-benchmarks/results

  2. Examine the output. Compare your results with the sample output structure. The report includes performance metrics such as map_cut_k and ndcg_cut_k for each dataset and embedding model pair. The script also performs a statistical significance test and reports a p value.

    Example output for scifact and map_cut_10:

    scifact map_cut_10
     granite-embedding-125m : 0.6879
     granite-embedding-30m  : 0.6578
     p_value                : 0.0150
    
     p_value < 0.05 indicates a statistically significant difference.
     The granite-embedding-125m model performs better for this dataset and metric.
  3. Interpret the results. A p value below 0.05 indicates that the performance difference between models is statistically significant. The model with the higher metric value performs better. Use these results to identify which embedding model performs best for your dataset.
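The p value compares per-query scores between two models. A paired sign-flip permutation test on made-up per-query scores illustrates the idea; this sketch is for intuition and is not necessarily the exact test the benchmark script uses:

```python
import itertools

def paired_permutation_p_value(scores_a: list[float], scores_b: list[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired score differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Enumerate every way of flipping the sign of each per-query difference.
    for signs in itertools.product([1, -1], repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            count += 1
    return count / total

# Made-up per-query map scores for two hypothetical embedding models.
model_a = [0.70, 0.68, 0.72, 0.69, 0.71, 0.73]
model_b = [0.62, 0.60, 0.65, 0.61, 0.63, 0.66]
print(paired_permutation_p_value(model_a, model_b))  # 0.03125
```

Because model_a beats model_b on every query, almost no sign-flip reproduces a difference as large as the observed one, so the p value falls below 0.05.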

4.3. Using PostgreSQL in Llama Stack

PostgreSQL is a required dependency for Llama Stack deployments in OpenShift AI, where it serves as the mandatory metadata storage backend for supported vector storage configurations. Additionally, you can configure PostgreSQL as a remote vector database provider by enabling the pgvector extension.

4.3.1. Understanding Llama Stack metadata storage

In OpenShift AI, Llama Stack requires PostgreSQL as a metadata storage backend to persist state and configuration data across multiple components. Metadata storage provides durable persistence for vector stores, file management, agent state, conversation history, and other Llama Stack services.

PostgreSQL is required as a metadata storage backend for all OpenShift AI deployments.

4.3.1.1. Role of metadata storage in Llama Stack

Llama Stack components require persistent storage beyond in-memory data structures. Without metadata storage, component state would be lost on pod restarts or application failures.

Llama Stack uses metadata storage to persist:

  • Vector store metadata, such as collection identifiers and document mappings.
  • File metadata, including file locations, identifiers, and attributes.
  • Agent state and conversation history.
  • Dataset configurations and batch processing state.
  • Model registry information and prompt templates.

This persistent storage allows Llama Stack to maintain operational state across pod restarts, rescheduling, and application updates.

4.3.1.2. PostgreSQL metadata storage backends

Llama Stack uses PostgreSQL to store multiple categories of metadata, including vector store metadata, file records, agent state, conversation history, and configuration data. These data types have different storage characteristics but are managed automatically within a single PostgreSQL instance.

Important

Starting with OpenShift AI 3.2, PostgreSQL version 14 or later is required for all Llama Stack deployments, including development, testing, and production environments.

If validation errors occur, confirm that the deployed Llama Stack image version matches the configuration schema referenced by your run.yaml.

You can connect Llama Stack in OpenShift AI to an existing PostgreSQL instance that has the pgvector extension enabled. For development or evaluation, you can also deploy a PostgreSQL instance with the pgvector extension directly in your OpenShift project by creating Kubernetes resources through the OpenShift web console.

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have permissions to create resources in a project in your OpenShift cluster.
  • You have PostgreSQL connection details available, including the database name, user name, and password.
  • If you plan to deploy PostgreSQL in-cluster, you have a StorageClass that can provision persistent volumes.
  • If you are using an existing PostgreSQL instance, the pgvector extension is installed and enabled on the target database.

Procedure

  1. Log in to the OpenShift web console.
  2. Select the project where you want to deploy the PostgreSQL instance.
  3. Click the Quick Create (+) icon, and then click Import YAML.
  4. Verify that the correct project is selected.
  5. Copy the following YAML, replace the placeholder values, and paste it into the YAML editor.

    Important

    This example deploys a standalone PostgreSQL service with the pgvector extension enabled.

    Llama Stack does not automatically use this database. To use this PostgreSQL instance as a vector store, you must explicitly configure the pgvector provider in a LlamaStackDistribution.

    This example is intended for development or evaluation purposes. For production deployments, review and adapt the configuration to meet your organization’s security, availability, backup, and lifecycle requirements.

    Example PostgreSQL deployment with pgvector (development or evaluation)

    apiVersion: v1
    kind: Secret
    metadata:
      name: <pgvector-postgresql-credentials-secret>
    type: Opaque
    stringData:
      POSTGRES_DB: "<database-name>"
      POSTGRES_USER: "<database-username>"
      POSTGRES_PASSWORD: "<database-password>"
    
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: <pgvector-postgresql-pvc>
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: <storage-size>
    
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: <pgvector-postgresql-deployment>
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: <pgvector-postgresql-app-label>
      template:
        metadata:
          labels:
            app: <pgvector-postgresql-app-label>
        spec:
          containers:
          - name: postgres
            image: pgvector/pgvector:pg16
            ports:
            - name: postgres
              containerPort: 5432
            env:
            - name: POSTGRES_DB
              valueFrom:
                secretKeyRef:
                  name: <pgvector-postgresql-credentials-secret>
                  key: POSTGRES_DB
            - name: POSTGRES_USER
              valueFrom:
                secretKeyRef:
                  name: <pgvector-postgresql-credentials-secret>
                  key: POSTGRES_USER
            - name: POSTGRES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: <pgvector-postgresql-credentials-secret>
                  key: POSTGRES_PASSWORD
            volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql/data
    
            # Replace TCP socket probes with exec probes that validate SQL readiness.
            readinessProbe:
              exec:
                command:
                - /bin/sh
                - -c
                - pg_isready -h 127.0.0.1 -U "$POSTGRES_USER" -d "$POSTGRES_DB"
              initialDelaySeconds: 10
              periodSeconds: 10
              timeoutSeconds: 5
              failureThreshold: 6
            livenessProbe:
              exec:
                command:
                - /bin/sh
                - -c
                - pg_isready -h 127.0.0.1 -U "$POSTGRES_USER" -d "$POSTGRES_DB"
              initialDelaySeconds: 30
              periodSeconds: 20
              timeoutSeconds: 5
              failureThreshold: 6
    
            # Create the pgvector extension after PostgreSQL is actually accepting SQL.
            lifecycle:
              postStart:
                exec:
                  command:
                  - /bin/sh
                  - -c
                  - |
                    set -e
                    echo "Waiting for PostgreSQL to be ready before enabling pgvector..."
                    until PGPASSWORD="$POSTGRES_PASSWORD" psql -h 127.0.0.1 -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "SELECT 1" >/dev/null 2>&1; do
                      sleep 2
                    done
                    PGPASSWORD="$POSTGRES_PASSWORD" psql -h 127.0.0.1 -U "$POSTGRES_USER" -d "$POSTGRES_DB" -c "CREATE EXTENSION IF NOT EXISTS vector;"
    
          volumes:
          - name: pgdata
            persistentVolumeClaim:
              claimName: <pgvector-postgresql-pvc>
    
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: <pgvector-postgresql-service>
    spec:
      selector:
        app: <pgvector-postgresql-app-label>
      ports:
      - name: postgres
        port: 5432
        targetPort: 5432
      type: ClusterIP

  6. Click Create.

Verification

  1. Navigate to Networking → Services.
  2. Confirm that the PostgreSQL Service is listed and exposes port 5432.
  3. Navigate to Workloads → Pods.
  4. Confirm that the PostgreSQL pod is running.
Note

This procedure verifies only that PostgreSQL with pgvector is deployed and reachable within the project. It does not verify integration with Llama Stack.

To use PostgreSQL with the pgvector extension as a remote vector store, configure pgvector in a LlamaStackDistribution and supply PostgreSQL connection details as environment variables. This configuration enables retrieval augmented generation (RAG) workflows in OpenShift AI by using PostgreSQL-based vector storage.

Prerequisites

  • You have installed and enabled the Llama Stack Operator in OpenShift AI.
  • You have a PostgreSQL database with the pgvector extension enabled.
  • You have the PostgreSQL connection details, including the host name, port number, database name, user name, and password.
  • You have permissions to create Secrets and edit custom resources in your project.

Procedure

  1. In the OpenShift web console, switch to the Administrator perspective.
  2. Create a Secret that stores the PostgreSQL connection details.

    1. Ensure that the correct project is selected.
    2. Click Workloads → Secrets.
    3. Click Create From YAML.
    4. Paste the following YAML, update the placeholder values, and then click Create.

      Example Secret for pgvector connection details

      apiVersion: v1
      kind: Secret
      metadata:
        name: pgvector-connection
      type: Opaque
      stringData:
        PGVECTOR_HOST: "<pgvector-hostname>"
        PGVECTOR_PORT: "<pgvector-port>"
        PGVECTOR_DB: "<database-name>"
        PGVECTOR_USER: "<database-username>"
        PGVECTOR_PASSWORD: "<database-password>"

      Important

      The pgvector provider is not enabled automatically.

      You must explicitly enable pgvector and supply its connection details through environment variables in your LlamaStackDistribution.

      In OpenShift AI, the pgvector provider is enabled when the ENABLE_PGVECTOR environment variable is set.

  3. Update your LlamaStackDistribution custom resource to enable pgvector and reference the Secret.

    1. Click Operators → Installed Operators.
    2. Select the Llama Stack Operator.
    3. Click the LlamaStackDistribution tab.
    4. Select your LlamaStackDistribution resource.
    5. Click YAML.
    6. Update the resource to include the following fields, and then click Save.

      Example LlamaStackDistribution configuration for pgvector

      apiVersion: llamastack.io/v1alpha1
      kind: LlamaStackDistribution
      metadata:
        name: llamastack
      spec:
        server:
          containerSpec:
            env:
              - name: ENABLE_PGVECTOR
                value: "true"
              - name: PGVECTOR_HOST
                valueFrom:
                  secretKeyRef:
                    name: pgvector-connection
                    key: PGVECTOR_HOST
              - name: PGVECTOR_PORT
                valueFrom:
                  secretKeyRef:
                    name: pgvector-connection
                    key: PGVECTOR_PORT
              - name: PGVECTOR_DB
                valueFrom:
                  secretKeyRef:
                    name: pgvector-connection
                    key: PGVECTOR_DB
              - name: PGVECTOR_USER
                valueFrom:
                  secretKeyRef:
                    name: pgvector-connection
                    key: PGVECTOR_USER
              - name: PGVECTOR_PASSWORD
                valueFrom:
                  secretKeyRef:
                    name: pgvector-connection
                    key: PGVECTOR_PASSWORD

Verification

  1. Click Workloads → Pods.
  2. Confirm that the Llama Stack pod restarts and reaches the Running state.
  3. Open the pod logs and confirm that the server starts successfully and initializes the pgvector provider without errors.
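As an illustration of how the injected PGVECTOR_* environment variables fit together, the following Python sketch assembles them into a standard PostgreSQL connection string. The values are stand-ins for what your Secret provides, and the exact DSN format that the pgvector provider builds internally may differ:

```python
import os

# Stand-in values for what the pgvector-connection Secret provides:
os.environ.update({
    "PGVECTOR_HOST": "pgvector.example.svc",
    "PGVECTOR_PORT": "5432",
    "PGVECTOR_DB": "rag",
    "PGVECTOR_USER": "rag_user",
    "PGVECTOR_PASSWORD": "s3cret",
})

# Assemble a standard PostgreSQL DSN from the injected environment variables.
dsn = (
    f"postgresql://{os.environ['PGVECTOR_USER']}:{os.environ['PGVECTOR_PASSWORD']}"
    f"@{os.environ['PGVECTOR_HOST']}:{os.environ['PGVECTOR_PORT']}/{os.environ['PGVECTOR_DB']}"
)
print(dsn)
```
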

You can configure Llama Stack to enable Role-Based Access Control (RBAC) for model access using OAuth authentication on OpenShift AI. The following example shows how to configure Llama Stack so that a vLLM model can be accessed by all authenticated users, while an OpenAI model is restricted to specific users. This example uses Keycloak to issue and validate tokens.

Before you begin, you must already have Keycloak set up with the following parameters:

  • Realm: llamastack-demo
  • Client: llamastack with direct access grants enabled
  • Role: inference_max, which grants access to restricted models
  • A protocol mapper that adds realm roles to the access token under the claim name llamastack_roles
  • Two test users:

    • user1, a basic user with no assigned roles
    • user2, an advanced user assigned the inference_max role
  • The client secret generated by Keycloak, saved for later use in token requests

This document assumes that the Keycloak server is available at https://my-keycloak-server.com.

Important

When accessing Llama Stack APIs, the required base URL depends on the client you are using.

  • OpenAI-compatible clients or raw HTTP requests: You must include the /v1 path suffix in the base URL. For example: http://llama-stack-service:8321/v1
  • LlamaStackClient SDK: Do not include the /v1 path suffix. For example: http://llama-stack-service:8321

Using an incorrect base URL results in request failures.
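The base URL rule can be summarized with hypothetical values; the following sketch calls no real service and only shows the URL shape each client type expects:

```python
# Hypothetical service host; substitute your own route or service address.
base = "http://llama-stack-service:8321"

openai_compatible_base_url = f"{base}/v1"  # OpenAI-compatible clients and raw HTTP: keep /v1
llamastack_sdk_base_url = base             # LlamaStackClient SDK: omit /v1

print(openai_compatible_base_url)
print(llamastack_sdk_base_url)
```
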

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have logged in to Red Hat OpenShift AI.
  • You have cluster administrator privileges for your OpenShift cluster.
  • You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster.

Procedure

  1. To configure Llama Stack to use Role-Based Access Control (RBAC) for model access, first view and verify the OAuth provider token structure.

    1. Generate a Keycloak test token to view the structure with the following command:

      $ curl -d client_id=llamastack -d client_secret=YOUR_CLIENT_SECRET -d username=user1 -d password=user-password -d grant_type=password https://my-keycloak-server.com/realms/llamastack-demo/protocol/openid-connect/token | jq -r .access_token > test.token
    2. View the token claims by running the following command:

      $ cat test.token | cut -d . -f 2 | base64 -d 2>/dev/null | jq .

      Example token structure from Keycloak

      {
        "iss": "https://my-keycloak-server.com/realms/llamastack-demo",
        "aud": "account",
        "sub": "761cdc99-80e5-4506-9b9e-26a67a8566f7",
        "preferred_username": "user1",
        "llamastack_roles": []
      }
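If the base64 -d pipeline fails silently, it is usually because JWT payloads are base64url-encoded without padding. A small Python helper, shown here with a hypothetical locally built token rather than a real Keycloak token, decodes the payload reliably:

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    """Decode the payload (second segment) of a JWT, restoring base64 padding."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # base64url strips '=' padding
    return json.loads(base64.urlsafe_b64decode(payload))

# Hypothetical token built locally for demonstration (header.payload.signature):
claims = {"preferred_username": "user2", "llamastack_roles": ["inference_max"]}
body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode().rstrip("=")
token = f"header.{body}.signature"

print(decode_jwt_payload(token)["llamastack_roles"])  # ['inference_max']
```
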

  2. Create a run.yaml file that defines the necessary configurations for OAuth.

    1. Define a configuration with two inference providers and OAuth authentication with the following run.yaml example:

      version: 2
      image_name: rh
      apis:
        - inference
        - agents
        - safety
        - telemetry
        - tool_runtime
        - vector_io
      providers:
        inference:
          - provider_id: vllm-inference
            provider_type: remote::vllm
            config:
              url: ${env.VLLM_URL:=http://localhost:8000/v1}
              max_tokens: ${env.VLLM_MAX_TOKENS:=4096}
              api_token: ${env.VLLM_API_TOKEN:=fake}
              tls_verify: ${env.VLLM_TLS_VERIFY:=true}
          - provider_id: openai
            provider_type: remote::openai
            config:
              api_key: ${env.OPENAI_API_KEY:=}
              base_url: ${env.OPENAI_BASE_URL:=https://api.openai.com/v1}  # OpenAI-compatible API requires /v1
        telemetry:
        - provider_id: meta-reference
          provider_type: inline::meta-reference
          config:
            service_name: "${env.OTEL_SERVICE_NAME:=}"
            sinks: ${env.TELEMETRY_SINKS:=console}
            sqlite_db_path: /opt/app-root/src/.llama/distributions/rh/trace_store.db
            otel_exporter_otlp_endpoint: ${env.OTEL_EXPORTER_OTLP_ENDPOINT:=}
        agents:
        - provider_id: meta-reference
          provider_type: inline::meta-reference
          config:
            persistence_store:
              type: sqlite
              namespace: null
              db_path: /opt/app-root/src/.llama/distributions/rh/agents_store.db
            responses_store:
              type: sqlite
              db_path: /opt/app-root/src/.llama/distributions/rh/responses_store.db
      models:
        - model_id: llama-3-2-3b
          provider_id: vllm-inference
          model_type: llm
          metadata: {}
      
        - model_id: gpt-4o-mini
          provider_id: openai
          model_type: llm
          metadata: {}
      
      server:
        port: 8321
        auth:
          provider_config:
            type: "oauth2_token"
            jwks:
              uri: "https://my-keycloak-server.com/realms/llamastack-demo/protocol/openid-connect/certs" 1
              key_recheck_period: 3600
            issuer: "https://my-keycloak-server.com/realms/llamastack-demo" 2
            audience: "account"
            verify_tls: true
            claims_mapping:
              llamastack_roles: "roles" 3
          access_policy:
            - permit: 4
                actions: [read]
                resource: model::vllm-inference/llama-3-2-3b
              description: Allow all authenticated users to access the Llama 3.2 model
            - permit: 5
                actions: [read]
                resource: model::openai/gpt-4o-mini
              when: user with inference_max in roles
              description: Allow only users with the inference_max role to access OpenAI models

      1 2 Specify your Keycloak host and realm in the URL.
      3 Maps the llamastack_roles claim from the token to the roles field.
      4 Policy 1: Allow all authenticated users to access vLLM models.
      5 Policy 2: Restrict OpenAI models to users with the inference_max role.
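To make the access_policy semantics concrete, the following Python sketch models how the two permit rules evaluate against a user's mapped roles. This is an illustration of the policy logic only, not Llama Stack's actual authorization implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Permit:
    resource: str
    required_role: Optional[str] = None  # models "when: user with <role> in roles"

POLICY = [
    Permit(resource="model::vllm-inference/llama-3-2-3b"),
    Permit(resource="model::openai/gpt-4o-mini", required_role="inference_max"),
]

def is_allowed(resource: str, roles: list) -> bool:
    """Return True if a permit rule matches the resource and its role condition holds."""
    for rule in POLICY:
        if rule.resource == resource:
            return rule.required_role is None or rule.required_role in roles
    return False  # no matching permit rule: deny by default

print(is_allowed("model::vllm-inference/llama-3-2-3b", []))        # user1: True
print(is_allowed("model::openai/gpt-4o-mini", []))                 # user1: False
print(is_allowed("model::openai/gpt-4o-mini", ["inference_max"]))  # user2: True
```

These three outcomes correspond to the HTTP 200 and 403 responses checked in the Verification steps.
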
  3. Create a ConfigMap that uses the run.yaml configuration by running the following command:

    $ oc create configmap llamastack-custom-config --from-file=run.yaml=run.yaml -n redhat-ods-operator
  4. Create a llamastack-distribution.yaml file with the following parameters:

    apiVersion: llamastack.io/v1alpha1
    kind: LlamaStackDistribution
    metadata:
      name: llamastack-distribution
      namespace: redhat-ods-operator
    spec:
      replicas: 1
      server:
        distribution:
          name: rh-dev
        containerSpec:
          name: llama-stack
          port: 8321
          env:
            # vLLM Provider Configuration
            - name: VLLM_URL
              value: "https://your-vllm-service:8000/v1"
            - name: VLLM_API_TOKEN
              value: "your-vllm-token"
            - name: VLLM_TLS_VERIFY
              value: "false"
            # OpenAI Provider Configuration
            - name: OPENAI_API_KEY
              value: "your-openai-api-key"
            - name: OPENAI_BASE_URL
              value: "https://api.openai.com/v1"
        userConfig:
          configMapName: llamastack-custom-config
          configMapNamespace: redhat-ods-operator
  5. To apply the distribution, run the following command:

    $ oc apply -f llamastack-distribution.yaml
  6. Wait for the distribution to be ready by running the following command:

    $ oc wait --for=jsonpath='{.status.phase}'=Ready llamastackdistribution/llamastack-distribution -n redhat-ods-operator --timeout=300s
  7. Generate the OAuth tokens for each user account to authenticate API requests.

    1. To request a basic access token, and to add the token to a user1.token file, run the following command:

      $ curl -d client_id=llamastack \
        -d client_secret=YOUR_CLIENT_SECRET \
        -d username=user1 \
        -d password=user1-password \
        -d grant_type=password \
        https://my-keycloak-server.com/realms/llamastack-demo/protocol/openid-connect/token \
        | jq -r .access_token > user1.token
    2. To request a full access token and add it to a user2.token file, run the following command:

      $ curl -d client_id=llamastack \
        -d client_secret=YOUR_CLIENT_SECRET \
        -d username=user2 \
        -d password=user2-password \
        -d grant_type=password \
        https://my-keycloak-server.com/realms/llamastack-demo/protocol/openid-connect/token \
        | jq -r .access_token > user2.token
    3. Verify the credentials by running the following command:

      $ cat user2.token | cut -d . -f 2 | base64 -d 2>/dev/null | jq .

Verification

  1. Set the Llama Stack base URL:

    export LLAMASTACK_URL="http://<llama-stack-host>:8321"
  2. Verify basic access for user1 (no privileged roles).

    Load the token:

    USER1_TOKEN=$(cat user1.token)

    Confirm that user1 can access the vLLM-served model:

    curl -s -o /dev/null -w "%{http_code}\n" \
      -X POST "${LLAMASTACK_URL}/v1/openai/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${USER1_TOKEN}" \
      -d '{"model":"vllm-inference/llama-3-2-3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'

    Expected result: HTTP 200.

    Confirm that user1 is denied access to the restricted OpenAI model:

    curl -s -o /dev/null -w "%{http_code}\n" \
      -X POST "${LLAMASTACK_URL}/v1/openai/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${USER1_TOKEN}" \
      -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'

    Expected result: HTTP 403 (forbidden).

  3. Verify privileged access for user2 (assigned the inference_max role).

    Load the token:

    USER2_TOKEN=$(cat user2.token)

    Confirm that user2 can access both models:

    curl -s -o /dev/null -w "%{http_code}\n" \
      -X POST "${LLAMASTACK_URL}/v1/openai/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${USER2_TOKEN}" \
      -d '{"model":"vllm-inference/llama-3-2-3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'
    curl -s -o /dev/null -w "%{http_code}\n" \
      -X POST "${LLAMASTACK_URL}/v1/openai/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer ${USER2_TOKEN}" \
      -d '{"model":"openai/gpt-4o-mini","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'

    Expected result: HTTP 200 for both requests.

  4. Verify that requests without a Bearer token are denied.

    curl -s -o /dev/null -w "%{http_code}\n" \
      -X POST "${LLAMASTACK_URL}/v1/openai/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d '{"model":"vllm-inference/llama-3-2-3b","messages":[{"role":"user","content":"Hello!"}],"max_tokens":50}'

    Expected result: HTTP 401 (unauthorized).

You can configure Llama Stack servers to remain operational in the event of a single point of failure. If a pod restarts, an application crashes, or a node undergoes maintenance, you can maintain availability by enabling PostgreSQL high-availability settings in your Llama Stack server. You can also enable autoscaling to adjust server capacity and resources automatically. The following documentation describes how to configure high availability and autoscaling in your LlamaStackDistribution custom resource.

Prerequisites

  • You have installed OpenShift 4.19 or newer.
  • You have logged in to Red Hat OpenShift AI.
  • You have cluster administrator privileges for your OpenShift cluster.
  • You have installed the PostgreSQL Operator version 14 or later.
  • You have activated the Llama Stack Operator in your cluster.
  • You have installed the OpenShift CLI (oc) as described in the appropriate documentation for your cluster.

Procedure

  1. To enable high availability for your Llama Stack server, add the following parameters to your LlamaStackDistribution CR:

    spec:
      replicas: 2 1
      server:
        podDisruptionBudget:
          minAvailable: 1 2
        topologySpreadConstraints: 3
          - maxSkew: 1 4
            topologyKey: topology.kubernetes.io/zone 5
            whenUnsatisfiable: ScheduleAnyway 6
            labelSelector:
              matchLabels:
                app.kubernetes.io/instance: llamastackdistribution-sample 7

    1 This example runs two Llama Stack pods for high availability.
    2 Specifies voluntary disruption tolerance for the pods. For example, in a voluntary disruption, this configuration keeps at least one server pod available.
    3 Specifies how to spread matching pods across the topology.
    4 Instructs the scheduler to minimize replica imbalance across zones. With a skew of one and two replicas, the scheduler targets one pod per zone when multiple zones are available.
    5 Uses the node's zone label as the failure domain for pod spreading.
    6 Allows scheduling to proceed even when spread constraints cannot be met. For example, if the cluster has insufficient capacity, pods are scheduled instead of remaining Pending.
    7 Ensures that only pods from the same application instance are considered when calculating spread.
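The maxSkew setting can be illustrated numerically: skew is the difference in matching pod counts between the most- and least-populated topology domains. A minimal sketch, using hypothetical zone names:

```python
def skew(pods_per_zone: dict) -> int:
    """Difference between the most- and least-populated topology domains."""
    return max(pods_per_zone.values()) - min(pods_per_zone.values())

print(skew({"zone-a": 1, "zone-b": 1}))  # 0: one pod per zone satisfies maxSkew: 1
print(skew({"zone-a": 2, "zone-b": 0}))  # 2: would violate maxSkew: 1
```
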
  2. To enable autoscaling for your Llama Stack server, add the following parameters to your LlamaStackDistribution CR:

    spec:
      server:
        autoscaling: 1
          minReplicas: 1 2
          maxReplicas: 5 3
          targetCPUUtilizationPercentage: 75 4
          targetMemoryUtilizationPercentage: 70 5

    1 Configures a HorizontalPodAutoscaler (HPA) for the server pods.
    2 Specifies the lower-bound replica count maintained by the HPA.
    3 Specifies the upper-bound replica count maintained by the HPA.
    4 Configures CPU-based scaling.
    5 Configures memory-based scaling.
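For intuition on how the target utilization values drive scaling, the Kubernetes HPA computes desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), clamped to the min and max bounds. The following sketch applies that documented formula to the bounds from the example above:

```python
import math

def desired_replicas(current: int, utilization: float, target: float,
                     min_r: int = 1, max_r: int = 5) -> int:
    """Kubernetes HPA scaling formula, clamped to [min_r, max_r]."""
    return max(min_r, min(max_r, math.ceil(current * utilization / target)))

print(desired_replicas(2, 150, 75))  # 4: CPU at twice the target doubles replicas
print(desired_replicas(5, 200, 75))  # 5: capped at maxReplicas
print(desired_replicas(3, 10, 75))   # 1: scales down to minReplicas
```
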