Chapter 5. Compressing language models with Red Hat AI Model Optimization Toolkit
Quantize and compress large language models with llm-compressor compression recipes and the Red Hat AI Model Optimization Toolkit.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in (see the example login command after this list).
- You have a Hugging Face account and have generated a Hugging Face access token.
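For example, you can log in to the registry before pulling images:

$ podman login registry.redhat.io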
This example compression procedure uses the meta-llama/Meta-Llama-3-8B-Instruct model with the llama3_example.py compression recipe. To use this model, you must request access from the meta-llama/Meta-Llama-3-8B-Instruct Hugging Face page.
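Recipes such as llama3_example.py follow a common pattern: load the model, define a quantization modifier, apply it in one shot with a small calibration dataset, and save the compressed result. The following abridged sketch of a W4A16 recipe illustrates that pattern; it is not the exact upstream file, and the quantization scheme, calibration dataset, sample counts, and output directory name are illustrative assumptions.

# Abridged, illustrative sketch of a W4A16 one-shot quantization recipe.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

# Load the model and tokenizer from Hugging Face.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize linear layer weights to 4-bit (W4A16) with GPTQ, keeping the output head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

# Apply the recipe in one shot with a small calibration dataset (dataset and sample counts are illustrative).
oneshot(
    model=model,
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save the compressed model and tokenizer.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-W4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)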
Procedure
Pull the Red Hat AI Model Optimization Toolkit container image:
$ podman pull registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.3

Verify the LLM Compressor version installed in the container:
$ podman run --rm -it \
    registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.3 \
    python -c "import llmcompressor; print(llmcompressor.__version__)"

Example output
0.8.1

Create a working directory and clone the upstream LLM Compressor repository:
$ mkdir model-opt && \
    cd model-opt && \
    git clone https://github.com/vllm-project/llm-compressor.git

Check out the LLM Compressor tag that matches the version that is installed in the container:
$ cd llm-compressor && \
    git checkout 0.8.1

Create or append your HF_TOKEN Hugging Face token to the private.env file and source it:

$ echo "export HF_TOKEN=<YOUR_HF_TOKEN>" > private.env
$ source private.env

If your system has SELinux enabled, configure SELinux to allow device access:
$ sudo setsebool -P container_use_devices 1

Run the llama3_example.py compression example using the Red Hat AI Inference Server model optimization container:
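The exact container invocation depends on your GPU setup and file layout. The following is a minimal sketch that assumes NVIDIA GPUs exposed through CDI, the HF_TOKEN variable sourced from private.env in the previous step, the cloned repository mounted into the container at /workspace, and the recipe located at examples/quantization_w4a16/llama3_example.py; adjust these values for your environment. Setting the working directory to the mount ensures that the compressed model is saved to the mounted volume.

$ podman run --rm -it \
    --device nvidia.com/gpu=all \
    --env "HF_TOKEN=$HF_TOKEN" \
    -v "$(pwd)":/workspace:Z \
    -w /workspace \
    registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.3 \
    python examples/quantization_w4a16/llama3_example.py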
Verification
Monitor the compression run for errors and confirm that it completes successfully. The quantization process prints progress information and saves the compressed model to the mounted volume.
Example output
2025-09-18T14:42:27.377028+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 423it [00:13, 30.49it/s]
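As an additional check (not part of the example output above), you can confirm that the saved model directory contains a quantization configuration. The output directory path below is a placeholder for wherever the recipe saved the compressed model:

$ python -c "import json; print(json.load(open('<OUTPUT_DIR>/config.json'))['quantization_config']['quant_method'])"

If the compression succeeded, this prints the quantization method that LLM Compressor recorded in the model configuration (typically compressed-tensors).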