Chapter 1. Overview of Llama Stack


Llama Stack is a unified AI runtime environment designed to simplify the deployment and management of generative AI workloads on OpenShift AI. It integrates model inference, embedding generation, vector storage, and retrieval services into a single stack that is optimized for retrieval-augmented generation (RAG) and agent-based AI workflows. In OpenShift, the Llama Stack Operator manages the deployment lifecycle of these components, ensuring scalability, consistency, and integration with OpenShift AI projects.

Llama Stack concepts

  • Llama Stack Operator: Installs and manages Llama Stack server instances in OpenShift AI, handling lifecycle operations such as deployment, scaling, and updates.
  • The run.yaml file: Defines which APIs are enabled and how backend providers are configured for a Llama Stack server. Red Hat ships a default run.yaml that supports common deployment scenarios. You can provide a custom run.yaml to enable advanced workflows or integrate additional providers.
  • LlamaStackDistribution custom resource: Declares the runtime configuration for a Llama Stack server, including model providers, embedding configuration, vector storage, and persistence settings.
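These concepts come together in a single manifest. The following is an illustrative sketch of a LlamaStackDistribution resource, not a verbatim schema: the apiVersion, distribution name, and exact field layout vary by OpenShift AI release, so verify every value against the product documentation for your version.

```yaml
apiVersion: llamastack.io/v1alpha1        # illustrative; confirm the API group/version in your cluster
kind: LlamaStackDistribution
metadata:
  name: example-llama-stack               # hypothetical name
spec:
  replicas: 1
  server:
    distribution:
      name: rh-distro                     # placeholder distribution name
    containerSpec:
      port: 8321                          # default Llama Stack server port
      env:
        - name: VLLM_URL                  # points inference at a vLLM deployment
          value: https://my-vllm-route/v1 # hypothetical model endpoint
```

Applying a manifest like this (with values adjusted for your environment) causes the operator to deploy and manage the corresponding Llama Stack server.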

OpenShift AI ships with a Llama Stack Distribution that runs the Llama Stack server in a containerized environment. OpenShift AI 3.2.0 includes Open Data Hub Llama Stack version 0.3.5+rhai0, which is based on upstream Llama Stack version 0.3.5.

Important

Llama Stack integration is currently available in Red Hat OpenShift AI 3.2 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production.

These features provide early access to upcoming product capabilities, enabling customers to test functionality and provide feedback during development.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Llama Stack includes the following core components:

  • Integration with OpenShift AI: Uses the LlamaStackDistribution custom resource to simplify configuration and deployment of AI workloads.
  • Inference model connections: Acts as a proxy between Llama Stack APIs and model inference servers, such as vLLM deployments.
  • Embedding generation: Generates vector embeddings used for retrieval. In OpenShift AI 3.2, remote embedding models are the recommended and default option for production deployments. Inline embedding models remain available for development and testing scenarios.
  • Vector storage: Stores and indexes embeddings by using supported vector databases, such as Milvus or PostgreSQL with the pgvector extension.
  • Metadata persistence: Stores vector store metadata, file references, and configuration state. In OpenShift AI 3.2, PostgreSQL is the default backend for production-grade deployments.
  • Retrieval workflows: Manages ingestion, chunking, embedding, and similarity search to support RAG workflows.
  • Agentic workflows: Enables agent-based interactions through supported APIs, such as the OpenAI-compatible Responses and Chat Completions APIs.
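The retrieval workflow (ingestion, chunking, embedding, similarity search) is handled by the stack itself, but a sketch of the chunking step shows what happens to a document before it is embedded. The chunk size and overlap below are arbitrary illustrative values, not the stack's defaults.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows before embedding.

    Overlap preserves context that would otherwise be cut at chunk borders.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

chunks = chunk_text("a" * 500, chunk_size=200, overlap=50)
# Each window starts 150 characters after the previous one: 0, 150, 300, 450 -> 4 chunks
```

Each chunk is then embedded and indexed in the vector store, and similarity search later retrieves the chunks closest to a query embedding.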

For information about deploying Llama Stack in OpenShift AI, see Deploying a RAG stack in a project.

Note

The Llama Stack Operator is not currently supported on IBM Power or IBM Z platforms.

1.1. Llama Stack APIs

You can use the following APIs from Llama Stack for AI actions such as evaluation, scoring, and inference:

1.1.1. Supported Llama Stack APIs in OpenShift AI

1.1.1.1. Agents API

  • Endpoint: /v1alpha/agents.
  • Providers: All agent backends deployed through OpenShift AI.
  • Support level: Developer Preview.

The Agents API allows you to create and manage AI agents.

1.1.1.2. Dataset_IO API

  • Endpoint: /v1beta/datasetio.
  • Providers: All dataset_io backends deployed through OpenShift AI.
  • Support level: Technology Preview.

The Dataset_IO API manages the input and output of datasets and their content.

1.1.1.3. Evaluation API

  • Endpoint: /v1beta/eval.
  • Providers: All evaluation backends deployed through OpenShift AI.
  • Support level: Developer Preview.

The Evaluation API defines evaluation tasks for models and datasets.

1.1.1.4. Inference API

  • Endpoint: /v1alpha/inference.
  • Providers: All inference backends deployed through OpenShift AI.
  • Support level: Developer Preview.
Warning

Most of the Inference API is deprecated. Inference providers now use the Completions and Chat Completions APIs instead.

The Inference API enables conversational, message-based interactions with models served by Llama Stack in OpenShift AI.

1.1.1.5. Safety API

  • Endpoint: /v1/safety.
  • Providers: All safety backends deployed through OpenShift AI.
  • Support level: Technology Preview.

The Safety API detects and prevents harmful content in model inputs and outputs.

1.1.1.6. Tool Runtime API

  • Endpoint: /v1/tool-runtime.
  • Providers: All tool runtime backends deployed through OpenShift AI.
  • Support level: Developer Preview.

The Tool Runtime API allows a model to dynamically call a tool at runtime.

1.1.1.7. Vector_IO API

  • Endpoint: /v1/vector-io.
  • Providers: All vector_io backends deployed through OpenShift AI.
  • Support level: Developer Preview.

The Vector_IO API allows you to manage and query vector embeddings: numeric representations of data.

1.2. OpenAI-compatible APIs in Llama Stack

OpenShift AI includes a Llama Stack component that exposes OpenAI-compatible APIs. These APIs enable you to reuse existing OpenAI SDKs, tools, and workflows directly within your OpenShift environment, without changing your client code. This compatibility layer supports retrieval-augmented generation (RAG), inference, and embedding workloads by using the same endpoints, schemas, and authentication model as OpenAI.

This compatibility layer has the following capabilities:

  • Standardized endpoints: REST API paths align with OpenAI specifications.
  • Schema parity: Request and response fields follow OpenAI data structures.
Important

When you connect OpenAI SDKs or third-party tools to OpenShift AI, update the client configuration to use your deployment’s Llama Stack route as the base_url, and always include the /v1 path suffix so that requests are routed to the OpenAI-compatible API surface that Llama Stack exposes.

For example: http://llama-stack-service:8321/v1

Using the service endpoint without /v1 results in request failures.

These endpoints are exposed under the OpenAI compatibility layer and are distinct from the native Llama Stack APIs.

1.2.1.1. Chat Completions API

  • Endpoint: /v1/openai/v1/chat/completions.
  • Providers: All inference backends deployed through OpenShift AI.
  • Support level: Technology Preview.

The Chat Completions API enables conversational, message-based interactions with models served by Llama Stack in OpenShift AI.
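Because the endpoint follows the OpenAI chat schema, a request is the familiar model-plus-messages body. The sketch below only builds that JSON payload; the model name and service URL are illustrative placeholders, and you would send the payload with any HTTP client or an OpenAI SDK.

```python
import json

# Illustrative full URL: hypothetical service host plus the endpoint path listed above.
URL = "http://llama-stack-service:8321/v1/openai/v1/chat/completions"

def chat_request(model: str, user_message: str) -> str:
    """Serialize an OpenAI-style chat completion request body."""
    body = {
        "model": model,  # a model served by your Llama Stack deployment
        "messages": [{"role": "user", "content": user_message}],
    }
    return json.dumps(body)

payload = chat_request("my-vllm-model", "Summarize Llama Stack in one sentence.")
```

Sending `payload` as the POST body to `URL` returns a chat completion in the OpenAI response format.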

1.2.1.2. Completions API

  • Endpoint: /v1/openai/v1/completions.
  • Providers: All inference backends managed by OpenShift AI.
  • Support level: Technology Preview.

The Completions API supports single-turn text generation and prompt completion.

1.2.1.3. Embeddings API

  • Endpoint: /v1/openai/v1/embeddings.
  • Providers: All embedding models enabled in OpenShift AI.

The Embeddings API generates numerical embeddings for text or documents that can be used in downstream semantic search or RAG applications.
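The request schema likewise mirrors OpenAI's: a model plus one or more input strings. A minimal payload sketch, with placeholder model and URL values:

```python
import json

# Illustrative full URL built from a hypothetical service host and the path listed above.
URL = "http://llama-stack-service:8321/v1/openai/v1/embeddings"

def embeddings_request(model: str, texts: list[str]) -> str:
    """Serialize an OpenAI-style embeddings request body."""
    return json.dumps({"model": model, "input": texts})

payload = embeddings_request("my-embedding-model", ["first chunk", "second chunk"])
```

The response carries one embedding vector per input string, in the OpenAI embeddings response format.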

1.2.1.4. Files API

  • Endpoint: /v1/openai/v1/files.
  • Providers: File system-based file storage provider for managing files and documents stored locally in your cluster.
  • Support level: Technology Preview.

The Files API manages file uploads for use in embedding and retrieval workflows.
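Uploads follow the OpenAI multipart-form convention: a file part plus a purpose field. The sketch below only builds the form fields; the purpose value "assistants" is what OpenAI-style clients commonly use for retrieval workloads, so verify the accepted values for your deployment.

```python
# Hypothetical upload target; endpoint path as listed above.
URL = "http://llama-stack-service:8321/v1/openai/v1/files"

def file_upload_fields(filename: str, content: bytes) -> dict:
    """Form fields for an OpenAI-style multipart file upload."""
    return {
        "purpose": "assistants",      # common purpose for retrieval; verify for your release
        "file": (filename, content),  # (name, bytes) pair as HTTP clients expect for multipart
    }

fields = file_upload_fields("notes.txt", b"Llama Stack overview notes")
```

A successful upload returns a file object whose ID you can later attach to a vector store.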

1.2.1.5. Vector Stores API

  • Endpoint: /v1/openai/v1/vector_stores.
  • Providers: Inline and remote vector store providers configured in OpenShift AI.
  • Support level: Technology Preview.

The Vector Stores API manages the creation, configuration, and lifecycle of vector store resources in Llama Stack. Through this API, you can create new vector stores, list existing ones, delete unused stores, and query their metadata, all using OpenAI-compatible request and response formats.
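Creating a store is a small JSON POST. The sketch below builds a minimal OpenAI-style creation body; the store name is arbitrary, and any additional parameters depend on the vector store providers enabled in your deployment.

```python
import json

# Hypothetical creation target; endpoint path as listed above.
URL = "http://llama-stack-service:8321/v1/openai/v1/vector_stores"

def create_vector_store_request(name: str) -> str:
    """Serialize a minimal OpenAI-style vector store creation body."""
    return json.dumps({"name": name})

payload = create_vector_store_request("docs-store")
```

The response includes the store's generated ID, which the Vector Store Files API and the Responses API file_search tool both reference.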

1.2.1.6. Vector Store Files API

  • Endpoint: /v1/openai/v1/vector_stores/{vector_store_id}/files.
  • Providers: Local inline provider configured for file storage and retrieval.
  • Support level: Developer Preview.

The Vector Store Files API implements the OpenAI Vector Store Files interface and manages the association between document files and vector stores used for RAG workflows.

1.2.1.7. Models API

  • Endpoint: /v1/openai/v1/models.
  • Providers: All model-serving backends configured within OpenShift AI.
  • Support level: Technology Preview.

The Models API lists and retrieves available model resources from the Llama Stack deployment running on OpenShift AI. By using the Models API, you can enumerate models, view their capabilities, and verify deployment status through a standardized OpenAI-compatible interface.

1.2.1.8. Responses API

  • Endpoint: /v1/openai/v1/responses.
  • Providers: All agents, inference, and vector providers configured in OpenShift AI.
  • Support level: Developer Preview.

The Responses API generates model outputs by combining inference, file search, and tool-calling capabilities through a single OpenAI-compatible endpoint. It is particularly useful for retrieval-augmented generation (RAG) workflows that rely on the file_search tool to retrieve context from vector stores.

Note

The Responses API is an experimental feature that is still under active development in OpenShift AI. While the API is already functional and suitable for evaluation, some endpoints and parameters remain under implementation and might change in future releases. This API is provided for testing and feedback purposes only and is not recommended for production use.
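A RAG query through this endpoint attaches the file_search tool and names the vector stores to search. The sketch below builds such a request body following the OpenAI Responses schema; the model name and vector store ID are placeholders.

```python
import json

# Hypothetical target; endpoint path as listed above.
URL = "http://llama-stack-service:8321/v1/openai/v1/responses"

def rag_responses_request(model: str, question: str, vector_store_ids: list[str]) -> str:
    """Serialize a Responses request that retrieves context via file_search."""
    body = {
        "model": model,
        "input": question,
        "tools": [
            {
                "type": "file_search",
                "vector_store_ids": vector_store_ids,  # stores created via the Vector Stores API
            }
        ],
    }
    return json.dumps(body)

payload = rag_responses_request("my-vllm-model", "What does the operator manage?", ["vs_123"])
```

The server searches the named stores for relevant chunks, feeds them to the model as context, and returns the generated answer in one response.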

OpenShift AI supports OpenAI-compatible request and response schemas for Llama Stack retrieval-augmented generation (RAG) workflows. This compatibility allows you to use OpenAI clients, tools, and schemas with Llama Stack for managing files, vector stores, and executing RAG queries through the Responses API.

OpenAI compatibility enables the following capabilities:

  • You can use OpenAI SDKs and tools with Llama Stack by pointing the client to the Llama Stack OpenAI-compatible API path.
  • You can manage files and vector stores by using OpenAI-compatible endpoints and invoke RAG workflows by using the Responses API with the file_search tool.

When configuring clients, the required base_url depends on the SDK that you use:

  • OpenAI SDKs: When you use an OpenAI-compatible SDK (for example, the OpenAI Python client), you must include the /v1 path suffix in the base URL. For example: http://llama-stack-service:8321/v1
  • Llama Stack SDK (llama_stack_client): When you use the native Llama Stack SDK, set the base URL to the Llama Stack service endpoint without the /v1 suffix. The SDK automatically appends the correct API paths. For example: http://llama-stack-service:8321
Important

When you use OpenAI-compatible SDKs or send raw HTTP requests to Llama Stack, always include the /v1 path suffix in the base URL.

Using the service endpoint without /v1 results in request failures.
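The distinction above reduces to two base URLs. A minimal sketch, assuming a hypothetical in-cluster service endpoint:

```python
SERVICE = "http://llama-stack-service:8321"  # hypothetical in-cluster service endpoint

# OpenAI SDKs: must include the /v1 suffix.
openai_base_url = f"{SERVICE}/v1"
# e.g. OpenAI(base_url=openai_base_url, api_key="unused")  # with the openai package

# Llama Stack SDK: service root only; the SDK appends the API paths itself.
llama_stack_base_url = SERVICE
# e.g. LlamaStackClient(base_url=llama_stack_base_url)  # with llama_stack_client
```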

1.3. Llama Stack API provider support

You can use Llama Stack to enable various provider APIs and their providers in OpenShift AI. The following table lists the supported providers included in OpenShift AI.

Warning

The support status of the Llama Stack API providers has shifted between Technology Preview and Developer Preview across OpenShift AI versions.

| Provider API | Provider | How to enable | Disconnected support | Support status |
|---|---|---|---|---|
| Agents | inline::meta-reference | Enabled by default | Yes | Developer Preview |
| Dataset_IO | inline::localfs | Enabled by default | Yes | Technology Preview |
| Dataset_IO | remote::huggingface | Enabled by default | No | Technology Preview |
| Evaluation | inline::ragas | Set the EMBEDDING_MODEL environment variable | No | Technology Preview |
| Evaluation | remote::lmeval | Enabled by default | No | Technology Preview |
| Evaluation | remote::ragas | See the "Configuring the Ragas remote provider for production" documentation | No | Technology Preview |
| Files | inline::localfs | Enabled by default | No | Technology Preview |
| Inference | inline::sentence-transformers | Enabled by default | Yes | Technology Preview |
| Inference | remote::vllm | Set the VLLM_URL environment variable | Yes | Technology Preview |
| Inference | remote::azure | Set the AZURE_API_KEY environment variable | No | Technology Preview |
| Inference | remote::bedrock | Set the AWS_ACCESS_KEY_ID environment variable | No | Technology Preview |
| Inference | remote::openai | Set the OPENAI_API_KEY environment variable | No | Technology Preview |
| Inference | remote::vertexai | Set the VERTEX_AI_PROJECT environment variable | No | Technology Preview |
| Inference | remote::watsonx | Set the WATSONX_API_KEY environment variable | No | Technology Preview |
| Safety | remote::trustyai_fms | Enabled by default | No | Technology Preview |
| Scoring | inline::basic | Enabled by default | No | Technology Preview |
| Scoring | inline::braintrust | Enabled by default | No | Technology Preview |
| Scoring | inline::llm-as-a-judge | Enabled by default | No | Technology Preview |
| Tool_Runtime | inline::rag-runtime | Enabled by default | No | Developer Preview |
| Tool_Runtime | remote::brave-search | Enabled by default | No | Developer Preview |
| Tool_Runtime | remote::model-context-protocol | Enabled by default | No | Developer Preview |
| Tool_Runtime | remote::tavily-search | Enabled by default | No | Developer Preview |
| Vector_IO | inline::faiss | Set the ENABLE_FAISS environment variable | No | Technology Preview |
| Vector_IO | inline::milvus | Enabled by default | Yes | Technology Preview |
| Vector_IO | remote::milvus | Set the MILVUS_ENDPOINT environment variable | Yes | Technology Preview |
| Vector_IO | remote::pgvector | Set the ENABLE_PGVECTOR environment variable | Yes | Technology Preview |

Note

The Responses API is accessible from the Agents provider API.
