Chapter 1. Overview of Llama Stack
Llama Stack is a unified AI runtime environment designed to simplify the deployment and management of generative AI workloads on OpenShift AI. Llama Stack integrates model inference, embedding generation, vector storage, and retrieval services into a single stack that is optimized for retrieval-augmented generation (RAG) and agent-based AI workflows. In OpenShift, the Llama Stack Operator manages the deployment lifecycle of these components, ensuring scalability, consistency, and integration with OpenShift AI projects.
Llama Stack concepts
- Llama Stack Operator: Installs and manages Llama Stack server instances in OpenShift AI, handling lifecycle operations such as deployment, scaling, and updates.
- The run.yaml file: Defines which APIs are enabled and how backend providers are configured for a Llama Stack server. Red Hat ships a default run.yaml that supports common deployment scenarios. You can provide a custom run.yaml to enable advanced workflows or integrate additional providers.
- LlamaStackDistribution custom resource: Declares the runtime configuration for a Llama Stack server, including model providers, embedding configuration, vector storage, and persistence settings.
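For illustration, a LlamaStackDistribution custom resource might look like the following sketch. This is an assumption-laden example, not a definitive manifest: the apiVersion, field names, distribution name, and environment variables depend on the operator version installed in your cluster, so verify them against the CRD that the Llama Stack Operator generates.

```yaml
# Illustrative sketch only; field names and values are assumptions.
apiVersion: llamastack.io/v1alpha1        # version may differ in your cluster
kind: LlamaStackDistribution
metadata:
  name: example-llama-stack
  namespace: my-ai-project                # hypothetical project name
spec:
  replicas: 1
  server:
    distribution:
      name: rh-distribution               # placeholder for the distribution shipped with OpenShift AI
    containerSpec:
      env:
        - name: VLLM_URL                  # hypothetical: points the stack at an existing vLLM deployment
          value: "https://my-vllm-route/v1"
```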
OpenShift AI ships with a Llama Stack Distribution that runs the Llama Stack server in a containerized environment. OpenShift AI 3.2.0 includes Open Data Hub Llama Stack version 0.3.5+rhai0, which is based on upstream Llama Stack version 0.3.5.
Llama Stack integration is currently available in Red Hat OpenShift AI 3.2 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production.
These features provide early access to upcoming product capabilities, enabling customers to test functionality and provide feedback during development.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Llama Stack includes the following core components:
- Integration with OpenShift AI: Uses the LlamaStackDistribution custom resource to simplify configuration and deployment of AI workloads.
- Inference model connections: Acts as a proxy between Llama Stack APIs and model inference servers, such as vLLM deployments.
- Embedding generation: Generates vector embeddings used for retrieval. In OpenShift AI 3.2, remote embedding models are the recommended and default option for production deployments. Inline embedding models remain available for development and testing scenarios.
- Vector storage: Stores and indexes embeddings by using supported vector databases, such as Milvus or PostgreSQL with the pgvector extension.
- Metadata persistence: Stores vector store metadata, file references, and configuration state. In OpenShift AI 3.2, PostgreSQL is the default backend for production-grade deployments.
- Retrieval workflows: Manages ingestion, chunking, embedding, and similarity search to support RAG workflows.
- Agentic workflows: Enables agent-based interactions through supported APIs, such as the OpenAI-compatible Responses and Chat Completions APIs.
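The retrieval workflow described above can be sketched end to end in plain Python. This toy stands in for the real services: a deterministic hash-based "embedding" replaces a real embedding model, and a brute-force cosine search replaces Milvus or pgvector, but the ingest, chunk, embed, and similarity-search shape is the same.

```python
import hashlib
import math

def chunk(text, size=40):
    """Split a document into fixed-size character chunks (real pipelines chunk by tokens)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text, dim=16):
    """Toy deterministic embedding: hash character trigrams into a fixed-size unit vector."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].lower().encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

# Ingest: chunk the corpus and index the embeddings (a vector database does this in production).
corpus = ("Llama Stack integrates model inference, embedding generation, and vector storage. "
          "RAG workflows retrieve relevant chunks before calling the model.")
index = [(c, embed(c)) for c in chunk(corpus)]

# Query: embed the question and rank chunks by cosine similarity.
query_vec = embed("vector storage")
best = max(index, key=lambda item: cosine(query_vec, item[1]))
```

In production, the embedding step calls a served embedding model and the search runs inside the configured vector database; only the orchestration shown here is handled by Llama Stack.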
For information about deploying Llama Stack in OpenShift AI, see Deploying a RAG stack in a project.
The Llama Stack Operator is not currently supported on IBM Power or IBM Z platforms.
1.1. Llama Stack APIs
You can use the following APIs from Llama Stack for AI actions such as evaluation, scoring, and inference:
1.1.1. Supported Llama Stack APIs in OpenShift AI
1.1.1.1. Agents API
- Endpoint: /v1alpha/agents
- Providers: All agent backends deployed through OpenShift AI.
- Support level: Developer Preview.
The Agents API allows you to create and manage AI agents.
1.1.1.2. Datasets_IO API
- Endpoint: /v1beta/datasetio
- Providers: All dataset_io backends deployed through OpenShift AI.
- Support level: Technology Preview.
The Dataset_IO API manages the input and output of datasets and their content.
1.1.1.3. Evaluation API
- Endpoint: /v1beta/eval
- Providers: All evaluation backends deployed through OpenShift AI.
- Support level: Developer Preview.
The Evaluation API defines an evaluation task for models and datasets.
1.1.1.4. Inference API
- Endpoint: /v1alpha/inference
- Providers: All inference backends deployed through OpenShift AI.
- Support level: Developer Preview.
Most of the Inference API is deprecated; inference providers now use the Completions and Chat Completions APIs instead.
The Inference API enables conversational, message-based interactions with models served by Llama Stack in OpenShift AI.
1.1.1.5. Safety API
- Endpoint: /v1/safety
- Providers: All safety backends deployed through OpenShift AI.
- Support level: Technology Preview.
The Safety API detects and prevents harmful content in model inputs and outputs.
1.1.1.6. Tool Runtime API
- Endpoint: /v1/tool-runtime
- Providers: All tool runtime backends deployed through OpenShift AI.
- Support level: Developer Preview.
The Tool Runtime API allows a model to dynamically call a tool at runtime.
1.1.1.7. Vector_IO API
- Endpoint: /v1/vector-io
- Providers: All vector_io backends deployed through OpenShift AI.
- Support level: Developer Preview.
The Vector_IO API allows you to manage and query vector embeddings: numeric representations of data.
1.2. OpenAI-compatible APIs in Llama Stack
OpenShift AI includes a Llama Stack component that exposes OpenAI-compatible APIs. These APIs enable you to reuse existing OpenAI SDKs, tools, and workflows directly within your OpenShift environment, without changing your client code. This compatibility layer supports retrieval-augmented generation (RAG), inference, and embedding workloads by using the same endpoints, schemas, and authentication model as OpenAI.
This compatibility layer has the following capabilities:
- Standardized endpoints: REST API paths align with OpenAI specifications.
- Schema parity: Request and response fields follow OpenAI data structures.
When connecting OpenAI SDKs or third-party tools to OpenShift AI, you must update the client configuration to use your deployment’s Llama Stack route as the base_url. Whether you use an OpenAI-compatible SDK or send raw HTTP requests, the base_url must include the /v1 path suffix so that requests are routed to the OpenAI-compatible API surface exposed by Llama Stack.
For example: http://llama-stack-service:8321/v1
Using the service endpoint without /v1 results in request failures.
These endpoints are exposed under the OpenAI compatibility layer and are distinct from the native Llama Stack APIs.
1.2.1. Supported OpenAI-compatible APIs in OpenShift AI
1.2.1.1. Chat Completions API
- Endpoint: /v1/openai/v1/chat/completions
- Providers: All inference backends deployed through OpenShift AI.
- Support level: Technology Preview.
The Chat Completions API enables conversational, message-based interactions with models served by Llama Stack in OpenShift AI.
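As a sketch, a minimal Chat Completions request body follows the standard OpenAI schema. The service route and model name below are placeholders for whatever your deployment actually serves.

```python
import json

# Endpoint path from this section; the host and port are a hypothetical service route.
url = "http://llama-stack-service:8321" + "/v1/openai/v1/chat/completions"

payload = {
    "model": "my-vllm-model",  # placeholder: use a model registered with your Llama Stack server
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "What is retrieval-augmented generation?"},
    ],
}
body = json.dumps(payload)  # send as the POST body with Content-Type: application/json
```

Because the schema matches OpenAI, the same payload works unchanged through an OpenAI-compatible SDK pointed at the Llama Stack base URL.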
1.2.1.2. Completions API
- Endpoint: /v1/openai/v1/completions
- Providers: All inference backends managed by OpenShift AI.
- Support level: Technology Preview.
The Completions API supports single-turn text generation and prompt completion.
1.2.1.3. Embeddings API
- Endpoint: /v1/openai/v1/embeddings
- Providers: All embedding models enabled in OpenShift AI.
The Embeddings API generates numerical embeddings for text or documents that can be used in downstream semantic search or RAG applications.
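An embeddings request likewise uses the OpenAI schema. The route and model name here are placeholders; the response mirrors OpenAI's shape, with data[i].embedding holding the vector for input[i].

```python
import json

# Hypothetical route; the endpoint path comes from this section.
url = "http://llama-stack-service:8321/v1/openai/v1/embeddings"

payload = {
    "model": "my-embedding-model",  # placeholder embedding model ID
    "input": ["Llama Stack overview", "Deploying a RAG stack"],
}
body = json.dumps(payload)  # POST body; vectors come back one per input string
```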
1.2.1.4. Files API
- Endpoint: /v1/openai/v1/files
- Providers: File system-based file storage provider for managing files and documents stored locally in your cluster.
- Support level: Technology Preview.
The Files API manages file uploads for use in embedding and retrieval workflows.
1.2.1.5. Vector Stores API
- Endpoint: /v1/openai/v1/vector_stores/
- Providers: Inline and remote vector store providers configured in OpenShift AI.
- Support level: Technology Preview.
The Vector Stores API manages the creation, configuration, and lifecycle of vector store resources in Llama Stack. Through this API, you can create new vector stores, list existing ones, delete unused stores, and query their metadata, all using OpenAI-compatible request and response formats.
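The lifecycle operations map onto OpenAI-style REST calls. The sketch below only builds the URLs and payloads; the route, store name, and store ID are placeholders, and the ID would normally come from the create response.

```python
import json

base = "http://llama-stack-service:8321/v1/openai/v1"  # hypothetical service route

# POST /vector_stores creates a store; "name" follows the OpenAI vector store schema.
create_url = base + "/vector_stores"
create_body = json.dumps({"name": "product-docs"})  # hypothetical store name

# GET /vector_stores lists stores; DELETE /vector_stores/{id} removes one.
store_id = "vs_abc123"  # placeholder: use the ID returned by the create call
delete_url = f"{base}/vector_stores/{store_id}"
```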
1.2.1.6. Vector Store Files API
- Endpoint: /v1/openai/v1/vector_stores/{vector_store_id}/files
- Providers: Local inline provider configured for file storage and retrieval.
- Support level: Developer Preview.
The Vector Store Files API implements the OpenAI Vector Store Files interface and manages the association between document files and vector stores used for RAG workflows.
1.2.1.7. Models API
- Endpoint: /v1/openai/v1/models
- Providers: All model-serving backends configured within OpenShift AI.
- Support level: Technology Preview.
The Models API lists and retrieves available model resources from the Llama Stack deployment running on OpenShift AI. By using the Models API, you can enumerate models, view their capabilities, and verify deployment status through a standardized OpenAI-compatible interface.
1.2.1.8. Responses API
- Endpoint: /v1/openai/v1/responses
- Providers: All agents, inference, and vector providers configured in OpenShift AI.
- Support level: Developer Preview.
The Responses API generates model outputs by combining inference, file search, and tool-calling capabilities through a single OpenAI-compatible endpoint. It is particularly useful for retrieval-augmented generation (RAG) workflows that rely on the file_search tool to retrieve context from vector stores.
The Responses API is an experimental feature that is still under active development in OpenShift AI. While the API is already functional and suitable for evaluation, some endpoints and parameters remain under implementation and might change in future releases. This API is provided for testing and feedback purposes only and is not recommended for production use.
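A RAG-style Responses request combines a model call with the file_search tool. This is a request-shape sketch only: the route, model name, and vector store ID are placeholders, and given the API's experimental status the accepted parameters may change between releases.

```python
import json

url = "http://llama-stack-service:8321/v1/openai/v1/responses"  # hypothetical route

# The file_search tool lets the model retrieve context from a vector store
# before generating its answer; model name and store ID are placeholders.
payload = {
    "model": "my-vllm-model",
    "input": "What do the product docs say about GPU requirements?",
    "tools": [
        {"type": "file_search", "vector_store_ids": ["vs_abc123"]},
    ],
}
body = json.dumps(payload)  # POST body with Content-Type: application/json
```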
1.2.2. OpenAI compatibility for RAG APIs in Llama Stack
OpenShift AI supports OpenAI-compatible request and response schemas for Llama Stack retrieval-augmented generation (RAG) workflows. This compatibility allows you to use OpenAI clients, tools, and schemas with Llama Stack for managing files, vector stores, and executing RAG queries through the Responses API.
OpenAI compatibility enables the following capabilities:
- You can use OpenAI SDKs and tools with Llama Stack by pointing the client to the Llama Stack OpenAI-compatible API path.
-
You can manage files and vector stores by using OpenAI-compatible endpoints and invoke RAG workflows by using the Responses API with the
file_searchtool.
When configuring clients, the required base_url depends on the SDK that you use:
- OpenAI SDKs: When you use an OpenAI-compatible SDK (for example, the OpenAI Python client), you must include the /v1 path suffix in the base URL. For example: http://llama-stack-service:8321/v1
- Llama Stack SDK (llama_stack_client): When you use the native Llama Stack SDK, set the base URL to the Llama Stack service endpoint without the /v1 suffix. The SDK automatically appends the correct API paths. For example: http://llama-stack-service:8321
When you use OpenAI-compatible SDKs or send raw HTTP requests to Llama Stack, always include the /v1 path suffix in the base URL.
Using the service endpoint without /v1 results in request failures.
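The difference between the two SDK configurations can be made concrete. The route below is a placeholder; only the presence or absence of the /v1 suffix matters, and the client constructor calls are shown as comments because they assume those SDKs are installed.

```python
service = "http://llama-stack-service:8321"  # placeholder service route

# OpenAI-compatible SDKs: the base URL must carry the /v1 suffix.
openai_base_url = service + "/v1"
# e.g. OpenAI(base_url=openai_base_url, api_key=...)  # client init sketch

# Native Llama Stack SDK (llama_stack_client): use the bare endpoint;
# the SDK appends the correct API paths itself.
llama_stack_base_url = service
# e.g. LlamaStackClient(base_url=llama_stack_base_url)  # client init sketch
```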
1.3. Llama Stack API provider support
You can use Llama Stack to enable various provider APIs and providers in OpenShift AI. The following table lists the supported providers included in OpenShift AI.
The support status of the Llama Stack API providers has shifted between Technology Preview and Developer Preview across OpenShift AI versions.
| Provider API | Providers | How to Enable | Disconnected support | Support status |
|---|---|---|---|---|
| Agents | Note: The Responses API is accessible from the Agents provider API. | Enabled by default | Yes | Developer Preview |
| Dataset_IO | | Enabled by default | Yes | Technology Preview |
| Dataset_IO | | Enabled by default | No | Technology Preview |
| Evaluation | | Set the | No | Technology Preview |
| Evaluation | | Enabled by default | No | Technology Preview |
| Evaluation | | See the "Configuring the Ragas remote provider for production" documentation | No | Technology Preview |
| Files | | Enabled by default | No | Technology Preview |
| Inference | | Enabled by default | Yes | Technology Preview |
| Inference | | Set the | Yes | Technology Preview |
| Inference | | Set the | No | Technology Preview |
| Inference | | Set the | No | Technology Preview |
| Inference | | Set the | No | Technology Preview |
| Inference | | Set the | No | Technology Preview |
| Inference | | Set the | No | Technology Preview |
| Safety | | Enabled by default | No | Technology Preview |
| Scoring | | Enabled by default | No | Technology Preview |
| Scoring | | Enabled by default | No | Technology Preview |
| Scoring | | Enabled by default | No | Technology Preview |
| Tool_Runtime | | Enabled by default | No | Developer Preview |
| Tool_Runtime | | Enabled by default | No | Developer Preview |
| Tool_Runtime | | Enabled by default | No | Developer Preview |
| Tool_Runtime | | Enabled by default | No | Developer Preview |
| Vector_IO | | Set the | No | Technology Preview |
| Vector_IO | | Enabled by default | Yes | Technology Preview |
| Vector_IO | | Set the | Yes | Technology Preview |
| Vector_IO | | Set the | Yes | Technology Preview |