Chapter 7. Compressing language models with Red Hat AI Model Optimization Toolkit
Quantize and compress large language models with llm-compressor compression recipes by using the Red Hat AI Model Optimization Toolkit.
Prerequisites
- You have deployed a Red Hat Enterprise Linux AI instance with NVIDIA CUDA AI accelerators installed.
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
This example compression procedure uses the meta-llama/Meta-Llama-3-8B-Instruct model with the llama3_example.py compression recipe. To use this model, you must request access from the meta-llama/Meta-Llama-3-8B-Instruct Hugging Face page.
Procedure
- Open a shell prompt on the RHEL AI server.
- Stop the Red Hat AI Inference Server service:

  [cloud-user@localhost ~]$ systemctl stop rhaiis

- Create a working directory:

  [cloud-user@localhost ~]$ mkdir -p model-opt

- Change permissions on the project folder and enter the folder:
  [cloud-user@localhost ~]$ chmod 775 model-opt && cd model-opt

- Add the compression recipe Python script. For example, create the following example.py file that compresses the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model in quantized FP8 format:

  from transformers import AutoModelForCausalLM, AutoTokenizer
  from llmcompressor import oneshot
  from llmcompressor.modifiers.quantization import QuantizationModifier
  from llmcompressor.utils import dispatch_for_generation
  import os

  MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

  model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
  tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

  # Configure the quantization algorithm and scheme
  recipe = QuantizationModifier(
      targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
  )

  # Create log directory in a writable location
  LOG_DIR = "./sparse_logs"
  os.makedirs(LOG_DIR, exist_ok=True)

  # Apply quantization
  oneshot(model=model, recipe=recipe)

  # Confirm quantized model looks OK
  print("========== SAMPLE GENERATION ==============")
  dispatch_for_generation(model)
  input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
      model.device
  )
  output = model.generate(input_ids, max_new_tokens=20)
  print(tokenizer.decode(output[0]))
  print("==========================================")

  # Save to disk in compressed-tensors format
  SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
  model.save_pretrained(SAVE_DIR)
  tokenizer.save_pretrained(SAVE_DIR)

- Export your Hugging Face token:
  $ export HF=<YOUR_HUGGING_FACE_TOKEN>

- Run the compression recipe using the Red Hat AI Model Optimization Toolkit container. Because the entry point is set to python, pass only the script name as the container argument:

  [cloud-user@localhost ~]$ sudo podman run -it \
      -v ./model-opt:/opt/app-root/model-opt:z \
      -v /var/lib/rhaiis/models:/opt/app-root/models:z \
      --device nvidia.com/gpu=all \
      --workdir /opt/app-root/model-opt \
      -e HF_HOME=/opt/app-root/models \
      -e HF_TOKEN=$HF \
      --entrypoint python \
      registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.4.0-ea.1 \
      example.py
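When the container run completes, the saved checkpoint directory contains a config.json file that records the compression metadata in a quantization_config block. The helper below is a minimal sketch, not part of llm-compressor or the toolkit, that reads that block so you can confirm the checkpoint was actually quantized:

```python
import json
from pathlib import Path


def quantization_summary(save_dir: str) -> dict:
    """Return the quantization_config block from a saved checkpoint's
    config.json, or an empty dict if the model was not quantized."""
    config = json.loads(Path(save_dir, "config.json").read_text())
    return config.get("quantization_config", {})


# Example usage after the recipe has run (directory name comes from
# the SAVE_DIR logic in example.py):
# summary = quantization_summary("TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic")
# print(summary)
```

An empty result suggests the oneshot step did not apply the recipe; the exact keys inside quantization_config depend on the llm-compressor version.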
Verification
Monitor the compression run for successful completion and any error messages. The quantization process outputs progress information and saves the compressed model to the ./model-opt folder.

Example output

  2025-11-12T21:09:20.276558+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
  Compressing model: 154it [00:02, 59.18it/s]
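As an additional rough sanity check, not part of the documented procedure, you can compare the on-disk size of the compressed checkpoint against the original: FP8 weights should occupy roughly half the space of 16-bit originals. This sketch sums the weight files in a checkpoint directory (the helper name is ours):

```python
from pathlib import Path


def checkpoint_size_mib(model_dir: str) -> float:
    """Total size in MiB of the weight files (*.safetensors and *.bin)
    inside a checkpoint directory; other files are ignored."""
    total = sum(
        f.stat().st_size
        for pattern in ("*.safetensors", "*.bin")
        for f in Path(model_dir).glob(pattern)
    )
    return total / (1024 * 1024)


# Example usage (assumed paths):
# print(checkpoint_size_mib("TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic"))
```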