Centralize company knowledge with an Enterprise RAG Chatbot
Use retrieval-augmented generation (RAG) to enhance large language models with specialized data sources for more accurate and context-aware responses.
This content is authored by Red Hat experts, but has not yet been tested on every supported configuration.
Detailed description
See how FantaCo, a fictional large enterprise, launched a secure RAG chatbot that connects employees to internal HR, procurement, sales, and IT documentation. From policies to startup guides, employees get fast, accurate answers through a single chat interface. Advanced users can extend the experience with AI agents for deeper workflows.
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by retrieving relevant external knowledge to improve accuracy, reduce hallucinations, and support domain-specific conversations.
This QuickStart allows users to explore the capabilities of RAG by:
- Exploring FantaCo's solution
- Uploading new documents to be embedded
- Tweaking sampling parameters to influence LLM responses (see the sketch after this list)
- Using custom system prompts
- Switching between simple and agent-based RAG
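To experiment with sampling parameters and custom system prompts outside the chatbot UI, you can also call the Llama Stack server directly. The sketch below is a minimal example under several assumptions: that the Llama Stack service is named llamastack and listens on port 8321, that it exposes an OpenAI-compatible chat completions endpoint, and that the model ID matches the LLM you deploy. None of these names come from this quickstart, so adjust them to your environment.

# Forward the Llama Stack service locally (service name and port are assumptions)
oc port-forward -n llama-stack-rag svc/llamastack 8321:8321 &

# Ask a question with a custom system prompt and a low temperature
curl -s http://localhost:8321/v1/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-2-3b-instruct",
    "messages": [
      {"role": "system", "content": "You are the FantaCo internal HR assistant. Answer only from company policy."},
      {"role": "user", "content": "How do I submit a travel expense report?"}
    ],
    "temperature": 0.2,
    "max_tokens": 256
  }'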
Architecture diagrams

This diagram illustrates both the ingestion pipeline for document processing and the RAG pipeline for query handling.
| Layer/Component | Technology | Purpose/Description |
|---|---|---|
| Orchestration | OpenShift AI | Container orchestration and GPU acceleration |
| Framework | LLaMA Stack | Standardizes core building blocks and simplifies AI application development |
| UI Layer | Streamlit | User-friendly chatbot interface for chat-based interaction |
| LLM | Llama-3.2-3B-Instruct | Generates contextual responses based on retrieved documents |
| Safety | Safety Guardrail | Blocks harmful requests and responses for secure AI interactions |
| Integration | MCP Servers | Model Context Protocol servers for enhanced functionality |
| Embedding | all-MiniLM-L6-v2 | Converts text to vector embeddings |
| Vector DB | PostgreSQL + PGVector | Stores embeddings and enables semantic search |
| Retrieval | Vector Search | Retrieves relevant documents based on query similarity |
| Data Ingestion | Kubeflow Pipelines | Multi-modal data ingestion with preprocessing pipelines for cleaning, chunking, and embedding generation |
| Storage | S3 Bucket | Document source for enterprise content |
Requirements
Minimum hardware requirements
- 1 GPU/HPU with 24 GB of VRAM for the LLM; refer to the Supported Models table below
- 1 GPU/HPU with 24 GB of VRAM for the safety/shield model (optional)
Minimum software requirements
- OpenShift Client CLI - oc
- OpenShift Cluster 4.18+
- OpenShift AI
- Helm CLI - helm
Required user permissions
- Regular user permission for default deployment
- Cluster admin required for advanced configurations
Deploy
The instructions below deploy this quickstart to your OpenShift environment.
See the Local Deployment section for additional deployment options.
Prerequisites
- huggingface-cli (optional)
- Hugging Face Token
- Access to Meta Llama model
- Access to Meta Llama Guard model
- Some of the example scripts use jq, a JSON parsing utility, which you can install via brew install jq
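If you have huggingface-cli installed, an optional sanity check is to log in with your token and pull a single small file from the gated Meta Llama repositories to confirm your access has been approved; the deployment itself downloads the full models later.

huggingface-cli login
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct --include "config.json"
huggingface-cli download meta-llama/Llama-Guard-3-8B --include "config.json"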
Supported Models
| Function | Model Name | Hardware | AWS Instance Type |
|---|---|---|---|
| Embedding | all-MiniLM-L6-v2 | CPU/GPU/HPU | |
| Generation | meta-llama/Llama-3.2-3B-Instruct | L4/HPU | g6.2xlarge |
| Generation | meta-llama/Llama-3.1-8B-Instruct | L4/HPU | g6.2xlarge |
| Generation | meta-llama/Meta-Llama-3-70B-Instruct | A100 x2/HPU | p4d.24xlarge |
| Safety | meta-llama/Llama-Guard-3-8B | L4/HPU | g6.2xlarge |
Note: the 70B model is NOT required for initial testing of this example. The safety/shield model Llama-Guard-3-8B is also optional.
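If the NVIDIA GPU Operator is installed in your cluster, the nvidia.com/gpu.product label that GPU Feature Discovery applies to nodes can help you match your hardware against the table above. This check is optional and assumes the operator's standard labels:

oc get nodes -l nvidia.com/gpu.present=true -L nvidia.com/gpu.product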
Installation Steps
- Clone Repository
git clone https://github.com/rh-ai-quickstart/RAG
- Login to OpenShift
oc login --server="<cluster-api-endpoint>" --token="sha256~XYZ"
- Hardware Configuration
Determine what hardware acceleration is available in your cluster and configure accordingly.
For NVIDIA GPU nodes: If GPU nodes are tainted, find the taint key. In the example below, the taint key is nvidia.com/gpu
oc get nodes -l nvidia.com/gpu.present=true -o yaml | grep -A 3 taint
For Intel Gaudi HPU nodes: If HPU nodes are tainted, find the taint key. The taint key is typically habana.ai/gaudi
oc get nodes -l habana.ai/gaudi.present=true -o yaml | grep -A 3 taint
The output of either command may be something like below:
taints:
- effect: NoSchedule
  key: nvidia.com/gpu # or habana.ai/gaudi for HPU
  value: "true"
You can work with your OpenShift cluster admin team to determine which labels and taints identify GPU-enabled or HPU-enabled worker nodes. It is also possible that all of your worker nodes have accelerators and therefore carry no distinguishing taint.
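A quick way to see which nodes actually expose accelerator resources, regardless of taints, is to inspect each node's allocatable resources. This uses the jq utility from the prerequisites and the standard NVIDIA and Intel Gaudi resource names:

oc get nodes -o json | jq '.items[] | {node: .metadata.name, gpu: (.status.allocatable["nvidia.com/gpu"] // "0"), hpu: (.status.allocatable["habana.ai/gaudi"] // "0")}'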
- Navigate to Deployment Directory
cd deploy/helm
- List Available Models
make list-models
The above command lists the models to use in the next command.
The "guard" models can be used to test shields for profanity, hate speech, violence, etc.
- Deploy with Helm
Use the taint key from above as the LLM_TOLERATION and SAFETY_TOLERATION. The namespace will be auto-created.
GPU Deployment Examples (Default):
To install only the RAG example, no shields:
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="nvidia.com/gpu"
To install the RAG example as well as the guard model to enable shields:
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="nvidia.com/gpu" SAFETY=llama-guard-3-8b SAFETY_TOLERATION="nvidia.com/gpu"
Note: DEVICE=gpu is the default and can be omitted.
Intel Gaudi HPU Deployment Examples:
To install only the RAG example on Intel Gaudi HPU nodes:
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="habana.ai/gaudi" DEVICE=hpu
To install both the RAG example and guard model on Intel Gaudi HPU nodes:
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct LLM_TOLERATION="habana.ai/gaudi" SAFETY=llama-guard-3-8b SAFETY_TOLERATION="habana.ai/gaudi" DEVICE=hpu
CPU Deployment Example:
To install on CPU nodes only:
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct DEVICE=cpu
Simplified Commands (No Tolerations Needed):
If you have no tainted nodes (for example, when all worker nodes have accelerators), you can omit the LLM_TOLERATION and SAFETY_TOLERATION variables.
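For example, the earlier GPU commands without the toleration variables should reduce to something like the following; this is a sketch based on the commands above rather than an explicitly documented invocation:

make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct
make install NAMESPACE=llama-stack-rag LLM=llama-3-2-3b-instruct SAFETY=llama-guard-3-8b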
When prompted, enter your Hugging Face Token.
Note: This process may take 10 to 30 minutes depending on the number and size of models to be downloaded.
- Monitor Deployment
oc get pods -n llama-stack-rag
Watch for all pods to reach Running or Completed status. Key pods to watch are those with predictor in their name (these are the KServe model servers running vLLM):
oc get pods -l component=predictor
Look for 3/3 in the READY column.
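If you prefer to block until the model servers are ready instead of polling, oc wait can do the same thing using the predictor label; the 30-minute timeout mirrors the download time noted above:

oc wait --for=condition=Ready pod -l component=predictor -n llama-stack-rag --timeout=30m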
- Verify Installation
Watch the llamastack pod; it becomes available only after all of the model servers are up:
oc get pods -l app.kubernetes.io/name=llamastack
Verify all resources:
oc get pods -n llama-stack-rag
oc get svc -n llama-stack-rag
oc get routes -n llama-stack-rag
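To find the URL of the chatbot UI, print the hostname behind each route in the namespace (the exact route name depends on the Helm chart, so this lists them all):

oc get routes -n llama-stack-rag -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.host}{"\n"}{end}'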
For detailed post-installation verification, configuration options, and usage instructions, see the complete OpenShift deployment guide.
Local Deployment
For local development and testing, see the Local Setup Guide.