Chapter 6. Compressing language models with Red Hat AI Model Optimization Toolkit
Quantize and compress large language models by using llm-compressor recipes with the Red Hat AI Model Optimization Toolkit.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
This example compression procedure uses the meta-llama/Meta-Llama-3-8B-Instruct model with the llama3_example.py compression recipe. To use this model, you must request access from the meta-llama/Meta-Llama-3-8B-Instruct Hugging Face page.
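For reference, the llama3_example.py recipe follows the upstream W8A8 INT8 pattern: a SmoothQuant pass that folds activation outliers into the weights, followed by GPTQ quantization of the Linear layers. The following condensed sketch illustrates that pattern; the dataset name, sample counts, and output directory here are illustrative, so consult the file in the cloned repository for the authoritative version.

# Condensed, illustrative sketch of the W8A8 INT8 recipe pattern
# used by llama3_example.py. Calibration-dataset handling is simplified.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# SmoothQuant rebalances activation outliers into the weights; GPTQ then
# quantizes weights and activations to INT8, keeping lm_head in full precision.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset="open_platypus",  # illustrative; the example prepares its own calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained("Meta-Llama-3-8B-Instruct-W8A8", save_compressed=True)
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-W8A8")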
Procedure
Pull the Red Hat AI Model Optimization Toolkit container image:
$ podman pull registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0
Verify the LLM Compressor version installed in the container:
$ podman run --rm -it \
    registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0 \
    python -c "import llmcompressor; print(llmcompressor.__version__)"
Example output
v0.9.0.1
Create a working directory and clone the upstream LLM Compressor repository:
$ mkdir model-opt && \
    cd model-opt && \
    git clone https://github.com/vllm-project/llm-compressor.git
Check out the LLM Compressor tag that matches the version that is installed in the container:
$ cd llm-compressor && \
    git checkout v0.9.0.1
Add your HF_TOKEN Hugging Face token to the private.env file and source it:
$ echo "export HF_TOKEN=<YOUR_HF_TOKEN>" > private.env
$ source private.env
If your system has SELinux enabled, configure SELinux to allow device access:
$ sudo setsebool -P container_use_devices 1
Change back to the model-opt working directory and run the llama3_example.py compression example by using the Red Hat AI Inference Server model optimization container. Running the command from model-opt ensures that the mounted volume paths match the script location inside the container:
$ podman run --rm \
    -v "$(pwd):/opt/app-root/model-opt" \
    --device nvidia.com/gpu=all --ipc=host \
    -e HF_TOKEN=<YOUR_HF_TOKEN> \
    registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0 \
    python /opt/app-root/model-opt/llm-compressor/examples/quantization_w8a8_int8/llama3_example.py
Verification
Monitor the compression run for successful completion and any error messages. The quantization process outputs progress information and saves the compressed model to the mounted volume.
Example output
2025-09-18T14:42:27.377028+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 423it [00:13, 30.49it/s]
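After the run completes, you can confirm that a compressed checkpoint was written under the mounted volume. The following minimal sketch assumes the example saved the model to a Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token directory; the exact directory name is set by the SAVE_DIR variable in llama3_example.py, so adjust the path to match.

# Minimal post-run check: the compressed checkpoint's config.json should
# contain a quantization_config entry written by LLM Compressor.
import json
from pathlib import Path

# Assumed output directory; adjust to match SAVE_DIR in llama3_example.py.
save_dir = Path("Meta-Llama-3-8B-Instruct-W8A8-Dynamic-Per-Token")

config = json.loads((save_dir / "config.json").read_text())
quantization_config = config.get("quantization_config", {})

# LLM Compressor checkpoints use the compressed-tensors quantization method.
print(quantization_config.get("quant_method"))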