Chapter 1. Overview of Llama Stack
Llama Stack is a unified AI runtime environment designed to simplify the deployment and management of generative AI workloads on OpenShift AI. Llama Stack integrates model inference, embedding generation, vector storage, and retrieval services into a single stack that is optimized for retrieval-augmented generation (RAG) and agent-based AI workflows. In OpenShift, the Llama Stack Operator manages the deployment lifecycle of these components, ensuring scalability, consistency, and integration with OpenShift AI projects.
Llama Stack concepts
- Llama Stack Operator: Installs and manages Llama Stack server instances in OpenShift AI, handling lifecycle operations such as deployment, scaling, and updates.
- The run.yaml file: Defines which APIs are enabled and how backend providers are configured for a Llama Stack server. Red Hat ships a default run.yaml that supports common deployment scenarios. You can provide a custom run.yaml to enable advanced workflows or integrate additional providers.
- LlamaStackDistribution custom resource: Declares the runtime configuration for a Llama Stack server, including model providers, embedding configuration, vector storage, and persistence settings.
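For illustration, a LlamaStackDistribution custom resource might look like the following sketch. This is an assumption-laden example, not a definitive manifest: the apiVersion, field names, distribution name, and environment variables depend on the operator version installed in your cluster, so verify them against the CRD that the Llama Stack Operator generates.

```yaml
# Illustrative sketch only; field names and values are assumptions.
apiVersion: llamastack.io/v1alpha1        # version may differ in your cluster
kind: LlamaStackDistribution
metadata:
  name: example-llama-stack
  namespace: my-ai-project                # hypothetical project name
spec:
  replicas: 1
  server:
    distribution:
      name: rh-distribution               # placeholder for the distribution shipped with OpenShift AI
    containerSpec:
      env:
        - name: VLLM_URL                  # hypothetical: points the stack at an existing vLLM deployment
          value: "https://my-vllm-route/v1"
```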
OpenShift AI ships with a Llama Stack Distribution that runs the Llama Stack server in a containerized environment. OpenShift AI 3.2.0 includes Open Data Hub Llama Stack version 0.3.5+rhai0, which is based on upstream Llama Stack version 0.3.5.
Llama Stack integration is currently available in Red Hat OpenShift AI 3.2 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production.
These features provide early access to upcoming product capabilities, enabling customers to test functionality and provide feedback during development.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Llama Stack includes the following core components:
- Integration with OpenShift AI: Uses the LlamaStackDistribution custom resource to simplify configuration and deployment of AI workloads.
- Inference model connections: Acts as a proxy between Llama Stack APIs and model inference servers, such as vLLM deployments.
- Embedding generation: Generates vector embeddings used for retrieval. In OpenShift AI 3.2, remote embedding models are the recommended and default option for production deployments. Inline embedding models remain available for development and testing scenarios.
- Vector storage: Stores and indexes embeddings by using supported vector databases, such as Milvus or PostgreSQL with the pgvector extension.
- Metadata persistence: Stores vector store metadata, file references, and configuration state. In OpenShift AI 3.2, PostgreSQL is the default backend for production-grade deployments.
- Retrieval workflows: Manages ingestion, chunking, embedding, and similarity search to support RAG workflows.
- Agentic workflows: Enables agent-based interactions through supported APIs, such as the OpenAI-compatible Responses and Chat Completions APIs.
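The retrieval workflow described above can be sketched end to end in plain Python. This toy stands in for the real services: a deterministic hash-based "embedding" replaces a real embedding model, and a brute-force cosine search replaces Milvus or pgvector, but the ingest, chunk, embed, and similarity-search shape is the same.

```python
import hashlib
import math

def chunk(text, size=40):
    """Split a document into fixed-size character chunks (real pipelines chunk by tokens)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text, dim=16):
    """Toy deterministic embedding: hash character trigrams into a fixed-size unit vector."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].lower().encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

# Ingest: chunk the corpus and index the embeddings (a vector database does this in production).
corpus = ("Llama Stack integrates model inference, embedding generation, and vector storage. "
          "RAG workflows retrieve relevant chunks before calling the model.")
index = [(c, embed(c)) for c in chunk(corpus)]

# Query: embed the question and rank chunks by cosine similarity.
query_vec = embed("vector storage")
best = max(index, key=lambda item: cosine(query_vec, item[1]))
```

In production, the embedding step calls a served embedding model and the search runs inside the configured vector database; only the orchestration shown here is handled by Llama Stack.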
For information about deploying Llama Stack in OpenShift AI, see Deploying a RAG stack in a project.
The Llama Stack Operator is not currently supported on IBM Power or IBM Z platforms.
1.1. Llama Stack APIs
You can use the following APIs from Llama Stack for AI actions such as evaluation, scoring, and inference:
1.1.1. Supported Llama Stack APIs in OpenShift AI
1.1.1.1. Agents API
- Endpoint: /v1alpha/agents
- Providers: All agent backends deployed through OpenShift AI.
- Support level: Developer Preview.
The Agents API allows you to create and manage AI agents.
1.1.1.2. Datasets_IO API
- Endpoint: /v1beta/datasetio
- Providers: All dataset_io backends deployed through OpenShift AI.
- Support level: Technology Preview.
The Dataset_IO API manages the input and output of datasets and their content.
1.1.1.3. Evaluation API
- Endpoint: /v1beta/eval
- Providers: All evaluation backends deployed through OpenShift AI.
- Support level: Developer Preview.
The Evaluation API defines an evaluation task for models and datasets.
1.1.1.4. Inference API
- Endpoint: /v1alpha/inference
- Providers: All inference backends deployed through OpenShift AI.
- Support level: Developer Preview.
Most of the Inference API is deprecated; inference providers now use the Completions and Chat Completions APIs instead.
The Inference API enables conversational, message-based interactions with models served by Llama Stack in OpenShift AI.
1.1.1.5. Safety API
- Endpoint: /v1/safety
- Providers: All safety backends deployed through OpenShift AI.
- Support level: Technology Preview.
The Safety API detects and prevents harmful content in model inputs and outputs.
1.1.1.6. Tool Runtime API
- Endpoint: /v1/tool-runtime
- Providers: All tool runtime backends deployed through OpenShift AI.
- Support level: Developer Preview.
The Tool Runtime API allows a model to dynamically call a tool at runtime.
1.1.1.7. Vector_IO API
- Endpoint: /v1/vector-io
- Providers: All vector_io backends deployed through OpenShift AI.
- Support level: Developer Preview.
The Vector_IO API allows you to manage and query vector embeddings: numeric representations of data.
1.2. OpenAI-compatible APIs in Llama Stack
OpenShift AI includes a Llama Stack component that exposes OpenAI-compatible APIs. These APIs enable you to reuse existing OpenAI SDKs, tools, and workflows directly within your OpenShift environment, without changing your client code. This compatibility layer supports retrieval-augmented generation (RAG), inference, and embedding workloads by using the same endpoints, schemas, and authentication model as OpenAI.
This compatibility layer has the following capabilities:
- Standardized endpoints: REST API paths align with OpenAI specifications.
- Schema parity: Request and response fields follow OpenAI data structures.
When connecting OpenAI SDKs or third-party tools to OpenShift AI, you must update the client configuration to use your deployment’s Llama Stack route as the base_url. Whether you use an OpenAI-compatible SDK or send raw HTTP requests, the base_url must include the /v1 path suffix so that requests are routed to the OpenAI-compatible API surface exposed by Llama Stack.
For example: http://llama-stack-service:8321/v1
Using the service endpoint without /v1 results in request failures.
These endpoints are exposed under the OpenAI compatibility layer and are distinct from the native Llama Stack APIs.
1.2.1. Supported OpenAI-compatible APIs in OpenShift AI
1.2.1.1. Chat Completions API
- Endpoint: /v1/openai/v1/chat/completions
- Providers: All inference backends deployed through OpenShift AI.
- Support level: Technology Preview.
The Chat Completions API enables conversational, message-based interactions with models served by Llama Stack in OpenShift AI.
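As a sketch, a minimal Chat Completions request body follows the standard OpenAI schema. The service route and model name below are placeholders for whatever your deployment actually serves.

```python
import json

# Endpoint path from this section; the host and port are a hypothetical service route.
url = "http://llama-stack-service:8321" + "/v1/openai/v1/chat/completions"

payload = {
    "model": "my-vllm-model",  # placeholder: use a model registered with your Llama Stack server
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is retrieval-augmented generation?"},
    ],
}
body = json.dumps(payload)  # send as the POST body with Content-Type: application/json
```

Because the schema matches OpenAI, the same payload works unchanged through an OpenAI-compatible SDK pointed at the Llama Stack base URL.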
1.2.1.2. Completions API
- Endpoint: /v1/openai/v1/completions
- Providers: All inference backends managed by OpenShift AI.
- Support level: Technology Preview.
The Completions API supports single-turn text generation and prompt completion.
1.2.1.3. Embeddings API
- Endpoint: /v1/openai/v1/embeddings
- Providers: All embedding models enabled in OpenShift AI.
The Embeddings API generates numerical embeddings for text or documents that can be used in downstream semantic search or RAG applications.
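An embeddings request likewise uses the OpenAI schema. The route and model name here are placeholders; the response mirrors OpenAI's shape, with data[i].embedding holding the vector for input[i].

```python
import json

# Hypothetical route; the endpoint path comes from this section.
url = "http://llama-stack-service:8321/v1/openai/v1/embeddings"

payload = {
    "model": "my-embedding-model",  # placeholder embedding model ID
    "input": ["Llama Stack overview", "Deploying a RAG stack"],
}
body = json.dumps(payload)  # POST body; vectors come back one per input string
```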
1.2.1.4. Files API
- Endpoint: /v1/openai/v1/files
- Providers: File system-based file storage provider for managing files and documents stored locally in your cluster.
- Support level: Technology Preview.
The Files API manages file uploads for use in embedding and retrieval workflows.
1.2.1.5. Vector Stores API
- Endpoint: /v1/openai/v1/vector_stores/
- Providers: Inline and remote vector store providers configured in OpenShift AI.
- Support level: Technology Preview.
The Vector Stores API manages the creation, configuration, and lifecycle of vector store resources in Llama Stack. Through this API, you can create new vector stores, list existing ones, delete unused stores, and query their metadata, all using OpenAI-compatible request and response formats.
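The lifecycle operations map onto OpenAI-style REST calls. The sketch below only builds the URLs and payloads; the route, store name, and store ID are placeholders, and the ID would normally come from the create response.

```python
import json

base = "http://llama-stack-service:8321/v1/openai/v1"  # hypothetical service route

# POST /vector_stores creates a store; "name" follows the OpenAI vector store schema.
create_url = base + "/vector_stores"
create_body = json.dumps({"name": "product-docs"})  # hypothetical store name

# GET /vector_stores lists stores; DELETE /vector_stores/{id} removes one.
store_id = "vs_abc123"  # placeholder: use the ID returned by the create call
delete_url = f"{base}/vector_stores/{store_id}"
```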
1.2.1.6. Vector Store Files API
- Endpoint: /v1/openai/v1/vector_stores/{vector_store_id}/files
- Providers: Local inline provider configured for file storage and retrieval.
- Support level: Developer Preview.
The Vector Store Files API implements the OpenAI Vector Store Files interface and manages the association between document files and vector stores used for RAG workflows.
1.2.1.7. Models API
- Endpoint: /v1/openai/v1/models
- Providers: All model-serving backends configured within OpenShift AI.
- Support level: Technology Preview.
The Models API lists and retrieves available model resources from the Llama Stack deployment running on OpenShift AI. By using the Models API, you can enumerate models, view their capabilities, and verify deployment status through a standardized OpenAI-compatible interface.
1.2.1.8. Responses API
- Endpoint: /v1/openai/v1/responses
- Providers: All agents, inference, and vector providers configured in OpenShift AI.
- Support level: Developer Preview.
The Responses API generates model outputs by combining inference, file search, and tool-calling capabilities through a single OpenAI-compatible endpoint. It is particularly useful for retrieval-augmented generation (RAG) workflows that rely on the file_search tool to retrieve context from vector stores.
The Responses API is an experimental feature that is still under active development in OpenShift AI. While the API is already functional and suitable for evaluation, some endpoints and parameters remain under implementation and might change in future releases. This API is provided for testing and feedback purposes only and is not recommended for production use.
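A RAG-style Responses request combines a model call with the file_search tool. This is a request-shape sketch only: the route, model name, and vector store ID are placeholders, and given the API's experimental status the accepted parameters may change between releases.

```python
import json

url = "http://llama-stack-service:8321/v1/openai/v1/responses"  # hypothetical route

# The file_search tool lets the model retrieve context from a vector store
# before generating its answer; model name and store ID are placeholders.
payload = {
    "model": "my-vllm-model",
    "input": "What do the product docs say about GPU requirements?",
    "tools": [
        {"type": "file_search", "vector_store_ids": ["vs_abc123"]},
    ],
}
body = json.dumps(payload)  # POST body with Content-Type: application/json
```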
1.2.2. OpenAI compatibility for RAG APIs in Llama Stack
OpenShift AI supports OpenAI-compatible request and response schemas for Llama Stack retrieval-augmented generation (RAG) workflows. This compatibility allows you to use OpenAI clients, tools, and schemas with Llama Stack for managing files, vector stores, and executing RAG queries through the Responses API.
OpenAI compatibility enables the following capabilities:
- You can use OpenAI SDKs and tools with Llama Stack by pointing the client to the Llama Stack OpenAI-compatible API path.
-
You can manage files and vector stores by using OpenAI-compatible endpoints and invoke RAG workflows by using the Responses API with the
file_searchtool.
When configuring clients, the required base_url depends on the SDK that you use:
- OpenAI SDKs: When you use an OpenAI-compatible SDK (for example, the OpenAI Python client), you must include the /v1 path suffix in the base URL. For example: http://llama-stack-service:8321/v1
- Llama Stack SDK (llama_stack_client): When you use the native Llama Stack SDK, set the base URL to the Llama Stack service endpoint without the /v1 suffix. The SDK automatically appends the correct API paths. For example: http://llama-stack-service:8321
When you use OpenAI-compatible SDKs or send raw HTTP requests to Llama Stack, always include the /v1 path suffix in the base URL.
Using the service endpoint without /v1 results in request failures.
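The difference between the two SDK configurations can be made concrete. The route below is a placeholder; only the presence or absence of the /v1 suffix matters, and the client constructor calls are shown as comments because they assume those SDKs are installed.

```python
service = "http://llama-stack-service:8321"  # placeholder service route

# OpenAI-compatible SDKs: the base URL must carry the /v1 suffix.
openai_base_url = service + "/v1"
# e.g. OpenAI(base_url=openai_base_url, api_key=...)  # client init sketch

# Native Llama Stack SDK (llama_stack_client): use the bare endpoint;
# the SDK appends the correct API paths itself.
llama_stack_base_url = service
# e.g. LlamaStackClient(base_url=llama_stack_base_url)  # client init sketch
```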
1.3. Llama Stack API provider support
You can use Llama Stack to enable various provider APIs and providers in OpenShift AI. The following table lists the supported providers included in OpenShift AI.
The support status of the Llama Stack API providers has shifted between Technology Preview and Developer Preview across OpenShift AI versions.
| Provider API | Providers | How to Enable | Disconnected support | Support status |
|---|---|---|---|---|
| Agents | Note: The Responses API is accessible from the Agents provider API. | Enabled by default | Yes | Developer Preview |
| Dataset_IO | | Enabled by default | Yes | Technology Preview |
| Dataset_IO | | Enabled by default | No | Technology Preview |
| Evaluation | | Set the | No | Technology Preview |
| Evaluation | | Enabled by default | No | Technology Preview |
| Evaluation | | See the "Configuring the Ragas remote provider for production" documentation | No | Technology Preview |
| Files | | Enabled by default | No | Technology Preview |
| Inference | | Enabled by default | Yes | Technology Preview |
| Inference | | Set the | Yes | Technology Preview |
| Inference | | Set the | No | Technology Preview |
| Inference | | Set the | No | Technology Preview |
| Inference | | Set the | No | Technology Preview |
| Inference | | Set the | No | Technology Preview |
| Inference | | Set the | No | Technology Preview |
| Safety | | Enabled by default | No | Technology Preview |
| Scoring | | Enabled by default | No | Technology Preview |
| Scoring | | Enabled by default | No | Technology Preview |
| Scoring | | Enabled by default | No | Technology Preview |
| Tool_Runtime | | Enabled by default | No | Developer Preview |
| Tool_Runtime | | Enabled by default | No | Developer Preview |
| Tool_Runtime | | Enabled by default | No | Developer Preview |
| Tool_Runtime | | Enabled by default | No | Developer Preview |
| Vector_IO | | Set the | No | Technology Preview |
| Vector_IO | | Enabled by default | Yes | Technology Preview |
| Vector_IO | | Set the | Yes | Technology Preview |
| Vector_IO | | Set the | Yes | Technology Preview |