Secure Model Inference with F5 AI Guardrails
Guard AI applications against prompt attacks and sensitive data leakage with enterprise-grade LLM protection using F5 AI Guardrails on Red Hat® OpenShift® AI.
This content is authored by Red Hat experts, but has not yet been tested on every supported configuration.
Secure Model Inference with F5 AI Guardrails Copy linkLink copied!
Guard AI applications against prompt attacks and sensitive data leakage with enterprise-grade LLM protection using F5 AI Guardrails on Red Hat® OpenShift® AI.
Table of contents Copy linkLink copied!
- Detailed description
- Requirements
- Deploy
- AI security capabilities
- References
- Document management
- Tags
Detailed description Copy linkLink copied!
Imagine a financial services team deploying an AI-powered assistant to help underwriters review policies, analyze risk documents, and answer questions about compliance guidelines. The assistant uses a large language model served on Red Hat® OpenShift® AI, with Retrieval-Augmented Generation (RAG) grounding its answers in the firm's own document corpus — underwriting manuals, regulatory filings, and internal procedures. Before moving the application to production, a security review reveals critical AI-specific risks: a crafted prompt could trick the model into ignoring its system instructions and leaking confidential data, the model might return personally identifiable information (PII) embedded in training data, responses could contain toxic or harmful content, and nothing prevents the assistant from answering questions outside its approved domain.
These are not traditional network-layer threats — they are AI-layer threats that operate within legitimate API calls. A WAF cannot inspect whether a model response contains a Social Security number or whether a prompt is attempting instruction injection.
This AI quickstart demonstrates a solution using F5 AI Guardrails (powered by Calypso AI). It deploys a complete RAG chatbot on Red Hat OpenShift AI and secures the model inference endpoints with AI-aware content inspection. You get a working application you can demonstrate to security, compliance, and risk stakeholders — complete with simulated attack scenarios that show exactly how each protection layer responds.
While the included demo content targets financial services, the same architecture applies to any industry handling sensitive data — healthcare organizations protecting patient records, government agencies securing citizen-facing AI services, or any enterprise that needs to enforce content safety policies on LLM endpoints before moving to production.
This quickstart allows you to explore AI security capabilities by:
- Querying financial documents through a RAG-powered chat assistant and seeing grounded, context-aware answers
- Simulating prompt injection attacks that attempt to override system instructions, then enabling a guardrail policy to block them
- Triggering responses that contain sensitive PII (SSNs, credit card numbers), then enabling PII detection to block or redact them
- Eliciting toxic or harmful content from the model, then enabling toxicity filtering to prevent it
- Asking off-topic questions outside the approved business domain, then enabling topic restriction to enforce boundaries
- Walking through each scenario end-to-end with the included AI guardrails use case guide
The solution is built on:
- Red Hat OpenShift AI — MLOps platform with KServe/vLLM model serving and GPU acceleration
- F5 AI Guardrails — AI-layer content inspection: prompt injection detection, PII filtering, toxicity scanning, and topic enforcement
- LLaMA Stack + Streamlit — RAG chatbot interface backed by PGVector for semantic document retrieval
- Helm-based deployment — One-command install and teardown of the RAG stack on any OpenShift cluster
- Manual deployment — Step-by-step operator installation for F5 AI Guardrails
Architecture Copy linkLink copied!

Data flow: The client sends a chat request to the F5 AI Guardrails Moderator endpoint. The Moderator passes the prompt through the Guardrails engine, which evaluates it against active policies (prompt injection, PII, toxicity, topic). If the prompt passes, it is forwarded to LlamaStack, which routes it to the vLLM ServingRuntime via KServe on OpenShift AI. For RAG queries, LlamaStack also retrieves grounding context from the PGVector database (running on OpenShift, outside the OpenShift AI plane). The model response is then scanned again on the way back. If either the prompt or response violates a policy, the request is blocked and the client receives an error.
The diagram above reflects this split: the F5 AI Security Operator manages the Guardrails and Red Team workloads on OpenShift; OpenShift AI (the inner dashed boundary) hosts only the model-serving layer — KServe and the vLLM ServingRuntime — while LlamaStack, the Vector DB, and the Frontend UI run on plain OpenShift.
| Layer/Component | Technology | Purpose |
|---|---|---|
| Platform | Red Hat OpenShift | Container orchestration; hosts F5 AI Security Operator, LlamaStack, Vector DB, and the UI |
| Model Serving | Red Hat OpenShift AI (KServe + vLLM ServingRuntime) | GPU-accelerated LLM serving via KServe InferenceServices |
| F5 AI Security Operator | f5-ai-security-operator (OLM) |
Deploys and manages the Guardrails and Red Team workloads |
| AI Security | F5 AI Guardrails (Moderator + Guardrails) | Prompt/response inspection, policy enforcement |
| Red Team | Calypso AI Red Team | Adversarial testing and vulnerability assessment |
| Policy Engine | Calypso AI Guardrails | Executes guardrail policies (injection, PII, toxicity, topic) |
| Framework | LLaMA Stack | AI application building blocks, OpenAI-compatible API |
| UI Layer | Streamlit | Chat interface for interactive demos |
| LLM | Llama-3.2-1B-Instruct (quantized) | Generates contextual responses |
| Embedding | all-MiniLM-L6-v2 | Text to vector embeddings |
| Vector DB | PostgreSQL + PGVector | Stores embeddings for semantic retrieval (on OpenShift) |
| Database | PostgreSQL | Moderator settings, policies, and scan results |
| Workflow | Prefect | Orchestrates scan and red-team jobs |
Requirements Copy linkLink copied!
Minimum hardware requirements Copy linkLink copied!
- GPU nodes (3x recommended): NVIDIA A40 or equivalent with minimum 24 GB VRAM each. One GPU per scanner/red-team model — GPUs must not be shared with other workloads.
- CPU node: 16 vCPUs, 32 GiB RAM, x86_64 architecture, 100 GiB persistent storage.
- Worker nodes (per GPU component): 4 vCPUs, 16 GiB RAM (32 GiB recommended for Red Team), 100 GiB persistent storage.
- Cluster resources for RAG stack: minimum 8 vCPUs, 32 GB RAM, 100 GB disk for model weights and vector database.
Minimum software requirements Copy linkLink copied!
- OpenShift Client CLI (oc)
- Red Hat OpenShift Container Platform 4.18+
- Red Hat OpenShift AI 2.16+ (tested with 2.22)
- Helm CLI
- F5 AI Security Operator license and registry credentials (contact F5 Sales)
- Optional: huggingface-cli, Hugging Face token, jq for example scripts
Required user permissions Copy linkLink copied!
- Cluster admin for F5 AI Security Operator installation and SCC configuration
- Regular user sufficient for RAG stack deployment
Deploy Copy linkLink copied!
Prerequisites Copy linkLink copied!
- OpenShift cluster with OpenShift AI installed and GPU nodes available
oclogged into the cluster- Helm installed
- F5 AI Guardrails license and container registry credentials from F5
- Hugging Face token and access to Meta Llama (optional, for local model serving)
Supported models Copy linkLink copied!
| Function | Model Name | Hardware | Notes |
|---|---|---|---|
| Embedding | all-MiniLM-L6-v2 |
CPU/GPU/HPU | — |
| Generation | RedHatAI/Llama-3.2-1B-Instruct-quantized.w8a8 |
1 GPU, ~2-3 GB VRAM | Default; quantized for efficiency |
| Generation | meta-llama/Llama-3.2-3B-Instruct |
1 GPU, ~6-8 GB VRAM | Full precision |
| Scanner | cai-phi-4 |
1 GPU, 24 GB VRAM | Policy evaluation model |
| Red Team | cai-mistral-nemo |
1 GPU, 48 GB VRAM | Adversarial testing model |
Installation steps Copy linkLink copied!
Step 1: Deploy the RAG stack Copy linkLink copied!
- Log in to OpenShift
oc login --token=<your_sha256_token> --server=<cluster-api-endpoint>
- Clone and go to the deployment directory
git clone https://github.com/rh-ai-quickstart/f5-ai-guardrails.git
cd f5-ai-guardrails/deploy/helm
- Configure and deploy
Create your values file:
Edit
rag-values.yamlto enable one or more models: Set your target namespace and deploy: A successful run ends with: - Verify the RAG stack
Get the LlamaStack route URL (uses
NAMESPACEfrom the previous step): List available models: Test chat completion: Replace<MODEL_ID>with the model identifier from the list models output (e.g.,llama-3-2-1b-instruct-quantized/RedHatAI/Llama-3.2-1B-Instruct-quantized.w8a8).
Step 2: Deploy F5 AI Guardrails Copy linkLink copied!
Follow the complete installation guide: Installing F5 AI Guardrails on OpenShift
This guide covers:
- Node Feature Discovery and GPU Operator prerequisites
- F5 AI Security Operator installation via OperatorHub
- SecurityOperator custom resource configuration
- Required OpenShift SCC policies
- Route configuration for UI access
- Prefect Worker RBAC setup
- LlamaStack endpoint integration
Step 3: Verify the complete stack Copy linkLink copied!
# F5 AI Guardrails components
oc get pods -n cai-moderator # cai-moderator + postgres: Running
oc get pods -n f5-ai-sec-inference # inference (kubeai) + model pods: Running
oc get pods -n prefect # prefect-server + prefect-worker: Running
# Moderator UI
echo "https://$(oc get route cai-moderator-ui -n cai-moderator -o jsonpath='{.spec.host}')"
Log into the Moderator UI with the default credentials (admin / pass) and update the admin email on first login.
Step 4: Test the secured endpoint Copy linkLink copied!
# Send a request through F5 AI Guardrails
curl -sk -X POST https://<MODERATOR_HOSTNAME>/openai/llamastack/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer <API_TOKEN>" \
-d '{
"model": "llama-3-2-1b-instruct-quantized/RedHatAI/Llama-3.2-1B-Instruct-quantized.w8a8",
"messages": [{"role": "user", "content": "say hi"}],
"max_tokens": 20
}'
Note: Create an API token in the Moderator UI under API tokens. Copy it immediately — it is shown only once.
Step 5: Run the Streamlit chat app Copy linkLink copied!
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
Opens at http://localhost:8501. Enter your API token and the Moderator endpoint URL in the sidebar.
Application access (frontend): The full-featured RAG frontend is also available. Get the route with oc get route -n <NAMESPACE>, open the URL in a browser, and configure the LLM endpoint settings in the web UI.
Step 6 (optional): Run the full-featured RAG frontend locally Copy linkLink copied!
The frontend/ directory contains a full-featured LlamaStack UI with RAG document management, model selection, and sampling parameters.
Prerequisites:
- Python 3.12+
- uv package manager
Setup and run:
cd frontend
LLAMA_STACK_ENDPOINT=http://$LLAMASTACK_URL ./start.sh
From deploy/helm, make deploy-ui-local (or ui-local) starts the UI in the background; make undeploy-ui-local stops it. Pid/log: .f5-guardrails-ui-local.* in the repo root. make deploy-ui-local-foreground runs in the terminal. Optional env: NAMESPACE, PORT_FORWARD=1 (see frontend/dev-on-cluster.sh).
The start.sh script handles virtual environment setup (uv sync), dependency installation, and launching Streamlit. If LLAMA_STACK_ENDPOINT is not set, the script auto-detects the LlamaStack route from OpenShift (requires oc login).
Opens at http://localhost:8501.
Configuring F5 AI Guardrails:
After completing the AI Guardrails Use Case Guide (Step 0: Configure AI Guardrails), you will have a Moderator endpoint and API token. To route chat through F5 AI Guardrails for policy scanning, go to Settings in the app and configure:
| Field | Value |
|---|---|
| Endpoint URL | https://<MODERATOR_HOSTNAME>/openai/<connection-name> |
| API Token | Bearer token created in the Moderator UI (API tokens page) |
When both fields are set, chat requests are routed through the guardrail proxy. Models and vector databases are always fetched from the direct LlamaStack endpoint.
Persistence (cluster / refresh): Settings are saved to a small JSON file. Default path is ~/.config/... locally; the Helm chart sets F5_GUARDRAILS_STATE_FILE=/data/guardrails_state.json with a /data emptyDir (swap for a PVC if the pod is replaced). You can also seed F5_GUARDRAIL_URL / F5_GUARDRAIL_API_TOKEN from a Secret; each field uses the file first, then the env if empty. Prefer a single UI replica for file-backed state.
Note: The
pyproject.tomluses__LLAMASTACK_VERSION__as a placeholder that is normally substituted during container builds (seeContainerfile). For local development, you must replace it manually with the target version (currently0.6.0). Do not commit this change — it will break the container build pipeline.
Features:
- Chat — Multi-turn conversation with model selection, system prompt, and sampling parameters (temperature, top_p, max_tokens)
- Settings — Configure optional F5 AI Guardrails endpoint and API token for policy scanning
- RAG — Upload documents, create vector database collections, and query with retrieval-augmented generation
- Direct/Guardrail modes — Chat directly with LlamaStack or route through F5 AI Guardrails for prompt injection, PII, toxicity, and topic enforcement
Uninstall Copy linkLink copied!
From deploy/helm, make uninstall removes the RAG Helm release, tears down the F5 AI Guardrails operator (SecurityOperator, subscription, CSVs matching f5-ai-security-operator), deletes the default operator namespaces (f5-ai-sec, cai-moderator, f5-ai-sec-inference, prefect), and deletes your RAG NAMESPACE project.
cd deploy/helm
make uninstall NAMESPACE=<NAMESPACE>
Override names if your install differs (see deploy/helm/Makefile defaults):
SECURITYOPERATOR_NAME, CAI_MODERATOR_NS, F5_OPERATOR_NS, OPERATOR_SUBSCRIPTION, GUARDRAILS_PROJECT_NS.
AI security capabilities Copy linkLink copied!
Once deployed, F5 AI Guardrails provides defense-in-depth across multiple AI threat categories. Each protection can be tested interactively through the included hands-on labs.
Out-of-the-box guardrail packages Copy linkLink copied!
| Package | What it catches | Scope |
|---|---|---|
| Prompt Injection | Instruction-override attacks, DAN prompts, system prompt extraction, obfuscation | Prompts |
| PII | SSNs, credit cards, emails, phone numbers, data exfiltration requests | Prompts & Responses |
| EU AI Act | Subliminal manipulation, biometric surveillance, emotion recognition in employment | Prompts & Responses |
| Restricted Topics | Unauthorized financial advice, medical diagnosis, legal guidance | Prompts |
Example: Prompt injection blocked in the chat app

Custom guardrails Copy linkLink copied!
| Type | How it works | Example use case |
|---|---|---|
| GenAI | AI-driven analysis of intent and context via natural-language description | Internal financial forecasts, competitor mentions |
| Keyword | Matches specific words or strings | Confidential project names, classified terminology |
| RegEx | Matches regular expression patterns | Employee IDs, internal account numbers |
Example: Same prompt allowed before custom guardrail, blocked after

Enforcement modes Copy linkLink copied!
Each guardrail operates in one of three modes:
| Mode | Behavior | When to use |
|---|---|---|
| Block | Reject the request — prompt never reaches the LLM | Production enforcement |
| Audit | Allow the request, flag it for review | Initial rollout and tuning |
| Redact | Mask sensitive data and continue the conversation | PII protection without interrupting workflow |
Example: Guardrail details in the Logs UI — full visibility into which guardrails fired

Hands-on labs Copy linkLink copied!
The AI Guardrails Use Case Guide provides step-by-step labs to configure, test, and observe these protections:
| Lab | What you will do |
|---|---|
| Lab 1 — Prompt and Response Scanning | Add OOTB guardrail packages, test safe and unsafe prompts, observe blocked events in the Logs dashboard |
| Lab 2 — Creating Custom Guardrails | Build GenAI, Keyword, and RegEx guardrails tailored to your organization, verify they block matching content |
The labs use the Streamlit chat app and the Moderator UI, with optional curl commands for scripted testing. The use case guide is updated as new guardrail capabilities are released.
References Copy linkLink copied!
- Make commands:
make help # Show all available commands make install # Deploy the RAG application make uninstall # Remove the RAG application make clean # Clean up all RAG resources including namespace make logs # Show logs for all pods make monitor # Monitor deployment status make status # Check deployment status make validate-config # Validate configuration values make deploy-ui-local # Local UI in background (use make undeploy-ui-local to stop) make undeploy-ui-local
Document management Copy linkLink copied!
Documents can be uploaded directly through the UI for RAG-based retrieval.
Supported formats:
- PDF documents — Underwriting guidelines, compliance policies, risk assessment reports
- Text files — Regulatory filings, internal procedure documents
Navigate to Settings → Vector Databases in the frontend to create vector databases and upload documents.