Chapter 6. Compressing language models with Red Hat AI Model Optimization Toolkit
Quantize and compress large language models by using llm-compressor recipes with the Red Hat AI Model Optimization Toolkit.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
This example compression procedure uses the meta-llama/Meta-Llama-3-8B-Instruct model with the llama3_example.py compression recipe. To use this model, you must request access from the meta-llama/Meta-Llama-3-8B-Instruct Hugging Face page.
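For reference, the llama3_example.py recipe follows the upstream W8A8 INT8 pattern: a SmoothQuant pass that folds activation outliers into the weights, followed by GPTQ quantization of the Linear layers. The following condensed sketch illustrates that pattern; the dataset name, sample counts, and output directory here are illustrative, so consult the file in the cloned repository for the authoritative version.

# Condensed, illustrative sketch of the W8A8 INT8 recipe pattern
# used by llama3_example.py. Calibration-dataset handling is simplified.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# SmoothQuant rebalances activation outliers into the weights; GPTQ then
# quantizes weights and activations to INT8, keeping lm_head in full precision.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset="open_platypus",  # illustrative; the example prepares its own calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Meta-Llama-3-8B-Instruct-W8A8", save_compressed=True)
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-W8A8")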
Procedure
Pull the Red Hat AI Model Optimization Toolkit container image:
$ podman pull registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0
Verify the LLM Compressor version installed in the container:
$ podman run --rm -it \
    registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0 \
    python -c "import llmcompressor; print(llmcompressor.__version__)"
Example output
v0.9.0.1
Create a working directory and clone the upstream LLM Compressor repository:
$ mkdir model-opt && \
    cd model-opt && \
    git clone https://github.com/vllm-project/llm-compressor.git
Check out the LLM Compressor tag that matches the version that is installed in the container:
$ cd llm-compressor && \
    git checkout v0.9.0.1
Add your HF_TOKEN Hugging Face token to the private.env file and source it:
$ echo "export HF_TOKEN=<YOUR_HF_TOKEN>" > private.env
$ source private.env
If your system has SELinux enabled, configure SELinux to allow device access:
$ sudo setsebool -P container_use_devices 1
Change back to the model-opt working directory and run the llama3_example.py compression example by using the Red Hat AI Inference Server model optimization container. Running the command from model-opt ensures that the mounted volume paths match the script location inside the container:
$ podman run --rm \
    -v "$(pwd):/opt/app-root/model-opt" \
    --device nvidia.com/gpu=all --ipc=host \
    -e HF_TOKEN=<YOUR_HF_TOKEN> \
    registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0 \
    python /opt/app-root/model-opt/llm-compressor/examples/quantization_w8a8_int8/llama3_example.py
Verification
Monitor the compression run for successful completion and any error messages. The quantization process outputs progress information and saves the compressed model to the mounted volume.
Example output
2025-09-18T14:42:27.377028+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 423it [00:13, 30.49it/s]
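After the run completes, you can confirm that a compressed checkpoint was written under the mounted volume. The following minimal sketch assumes the example saved the model to a Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token directory; the exact directory name is set by the SAVE_DIR variable in llama3_example.py, so adjust the path to match.

# Minimal post-run check: the compressed checkpoint's config.json should
# contain a quantization_config entry written by LLM Compressor.
import json
from pathlib import Path

# Assumed output directory; adjust to match SAVE_DIR in llama3_example.py.
save_dir = Path("Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")

config = json.loads((save_dir / "config.json").read_text())
quantization_config = config.get("quantization_config", {})

# LLM Compressor checkpoints use the compressed-tensors quantization method.
print(quantization_config.get("quant_method"))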