Chapter 2. Working with Llama Stack
Llama Stack is a unified AI runtime environment designed to simplify the deployment and management of generative AI workloads on OpenShift AI. It integrates LLM inference servers, vector databases, and retrieval services in a single stack optimized for Retrieval-Augmented Generation (RAG) and agent-based AI workflows. On OpenShift, the Llama Stack Operator manages the deployment lifecycle of these components, ensuring scalability, consistency, and integration with OpenShift AI projects.
Llama Stack integration is currently available in Red Hat OpenShift AI 2.23 as a Technology Preview feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Llama Stack includes the following components:
- Inference model servers such as vLLM, designed to efficiently serve large language models.
- Vector storage solutions, primarily Milvus, to store embeddings generated from your domain data.
- Retrieval and embedding management workflows using integrated tools, such as Docling, to handle continuous data ingestion and synchronization.
- Integration with OpenShift AI by using the LlamaStackDistribution custom resource, which simplifies configuration and deployment.
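As a rough illustration of the last point, a LlamaStackDistribution custom resource might look like the following. This is a minimal sketch based on the upstream llama-stack-k8s-operator; the distribution name, environment variable values, and API version shown here are assumptions and may differ in your OpenShift AI release, so consult the product documentation for the exact schema.

```yaml
# Hypothetical example of a LlamaStackDistribution custom resource.
# Field values (distribution name, model, port) are illustrative only.
apiVersion: llamastack.io/v1alpha1
kind: LlamaStackDistribution
metadata:
  name: example-llama-stack
  namespace: my-data-science-project   # an existing OpenShift AI project
spec:
  replicas: 1
  server:
    distribution:
      name: rh-dev                     # assumed distribution identifier
    containerSpec:
      port: 8321                       # default Llama Stack server port
      env:
        - name: INFERENCE_MODEL
          value: my-served-model       # model name exposed by your vLLM server
        - name: VLLM_URL
          value: http://my-vllm-service:8000/v1   # assumed in-cluster inference endpoint
```

You would typically apply this resource with `oc apply -f` in the target project, after which the Llama Stack Operator reconciles it into a running server deployment.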
For information about how to deploy Llama Stack in OpenShift AI, see Deploying a RAG stack in a Data Science Project.