Red Hat AI Model Optimization Toolkit
Compressing large language models with the LLM Compressor library
Abstract
Chapter 1. About Red Hat AI Model Optimization Toolkit
Red Hat AI Model Optimization Toolkit is an open source library that incorporates the latest research in model compression, allowing you to generate compressed models with minimal effort. Red Hat AI Model Optimization Toolkit is based on the upstream LLM Compressor project.
The Red Hat AI Model Optimization Toolkit framework leverages the latest quantization, sparsity, and general compression techniques to improve generative AI model efficiency, scalability, and performance while maintaining accuracy. With native Hugging Face and vLLM support, you can seamlessly integrate optimized models with deployment pipelines for faster, cost-saving inference at scale, powered by the compressed-tensors model format.
Chapter 2. Large language model optimization
As AI applications mature and new compression algorithms are published, there is a need for unified tools that can apply the compression algorithms specific to a user's inference needs, optimized to run on accelerated hardware.
Optimizing large language models (LLMs) involves balancing three key factors: model size, inference speed, and accuracy. Improving any one of these factors can have a negative effect on the other factors. For example, increasing model accuracy usually requires more parameters, which results in a larger model and potentially slower inference. The tradeoff between these factors is a core challenge when serving LLMs.
Red Hat AI Model Optimization Toolkit allows you to apply model optimization techniques such as quantization, sparsity, and compression to reduce memory use and model size and to improve inference speed, with minimal impact on the accuracy of model responses. Red Hat AI Model Optimization Toolkit supports the following compression methodologies:
- Quantization: Converts model weights and activations to lower-bit formats such as int8, reducing memory usage.
- Sparsity: Sets a portion of model weights to zero, often in fixed patterns, allowing for more efficient computation.
- Compression: Shrinks the saved model file size, ideally with minimal impact on performance.
Use these methods together to deploy models more efficiently on resource-limited hardware.
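Taken together, the effect of these methods can be illustrated with plain-Python arithmetic on a toy weight row. This is an illustrative sketch only, not LLM Compressor's implementation; the weight values and the 0.05 magnitude threshold are invented for the example:

```python
# Illustrative sketch: how sparsity and quantization shrink a toy weight
# row. Values and the pruning threshold are invented, not from the library.

weights = [0.41, -0.02, 0.75, 0.01, -0.58, 0.03, 0.92, -0.33]

# Sparsity: zero out weights with small magnitude (here, below 0.05).
sparse = [w if abs(w) >= 0.05 else 0.0 for w in weights]

# Quantization: map the remaining float weights to int8 with a single
# per-tensor scale so that the largest magnitude maps to 127.
scale = max(abs(w) for w in sparse) / 127
quantized = [round(w / scale) for w in sparse]

# Dequantize to inspect the approximation error that lower precision adds.
dequantized = [q * scale for q in quantized]
print(quantized)
print(max(abs(a - b) for a, b in zip(sparse, dequantized)))
```

Each int8 value costs 1 byte instead of 4 for float32, and the zeroed entries can additionally be skipped by sparse kernels; the rounding error stays bounded by half the scale.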
Chapter 3. Supported model compression workflows
LLM Compressor supports post-training quantization, a conversion technique that reduces model size and improves latency on CPUs and hardware accelerators without degrading model accuracy. A streamlined API applies quantization or sparsity calibrated on a data set that you provide.
The following advanced model types and deployment workflows are supported:
- Multimodal models: Includes vision-language models
- Mixture of experts (MoE) models: Supports models like DeepSeek and Mixtral, with calibration support, including NVFP4 quantization
- Large model support: Uses the Hugging Face accelerate library for multi-GPU and CPU offloading
- Multiple quantization schemes applied to a single model: Support for non-uniform quantization, such as combining NVFP4 and FP8 quantization
All workflows are Hugging Face–compatible, enabling models to be quantized, compressed, and deployed with vLLM for efficient inference. LLM Compressor supports several compression algorithms:
- AWQ: Weight-only INT4 quantization
- GPTQ: Weight-only INT4 quantization
- FP8: Dynamic per-token quantization and DeepSeekV3-style block quantization
- SparseGPT: Post-training sparsity
- SmoothQuant: Activation quantization
- QuIP transforms: Weight and activation quantization
- SpinQuant transforms: Weight and activation quantization
Each of these compression methods computes optimal scales and zero-points for weights and activations. Optimized scales can be per tensor, channel, group, or token. The final result is a compressed model saved with all its applied quantization parameters.
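The difference between per-tensor and per-channel scales can be sketched with a toy asymmetric int8 computation. This is illustrative arithmetic, not LLM Compressor's internal code; the weight values are invented:

```python
# Illustrative sketch of asymmetric int8 scale/zero-point computation,
# per tensor versus per channel. Not LLM Compressor's internal code.

def scale_and_zero_point(values, num_bits=8):
    """Map [min(values), max(values)] onto the unsigned integer
    range [0, 2**num_bits - 1] with a scale and a zero-point."""
    qmax = 2**num_bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)
    return scale, zero_point

weights = [
    [0.10, -0.40, 0.30],   # channel 0: small value range
    [2.00, -1.50, 0.75],   # channel 1: large value range
]

# Per tensor: one scale/zero-point shared by the whole matrix.
flat = [w for row in weights for w in row]
tensor_scale, tensor_zp = scale_and_zero_point(flat)

# Per channel: one scale/zero-point per output channel, so the
# small-range channel keeps a finer quantization step.
channel_params = [scale_and_zero_point(row) for row in weights]

print(tensor_scale, tensor_zp)
print(channel_params)
```

The per-channel scale for the small-range channel is smaller than the per-tensor scale, which is why finer granularities (channel, group, token) generally preserve more accuracy at the same bit width.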
Chapter 4. Integration with Red Hat AI Inference Server and vLLM
Quantized and sparse models that you create with Red Hat AI Model Optimization Toolkit are saved using the compressed-tensors library (an extension of Safetensors). The compression format matches the model’s quantization or sparsity type. These formats are natively supported in vLLM, enabling fast inference through optimized deployment kernels by using Red Hat AI Inference Server or other inference providers.
Chapter 5. Integration with Red Hat OpenShift AI
You can use Red Hat OpenShift AI and Red Hat AI Model Optimization Toolkit to experiment with model training, fine-tuning, and compression. The OpenShift AI integration of Red Hat AI Model Optimization Toolkit provides two introductory examples:
- A workbench image and notebook that demonstrate the compression of a tiny model that you can run on a CPU, highlighting how calibrated compression can improve on data-free approaches.
- A data science pipeline that extends the same workflow to a larger Llama 3.2 model, highlighting how users can build automated, GPU-accelerated experiments that can be shared with other stakeholders from a single URL.
Both are available in the Red Hat AI Examples repository.
The OpenShift AI integration of Red Hat AI Model Optimization Toolkit is a Developer Preview feature.
Chapter 6. Compressing language models with Red Hat AI Model Optimization Toolkit
Quantize and compress large language models with llm-compressor compression recipes and Red Hat AI Model Optimization Toolkit.
Prerequisites
- You have installed Podman or Docker.
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
This example compression procedure uses the meta-llama/Meta-Llama-3-8B-Instruct model with the llama3_example.py compression recipe. To use this model, you must request access from the meta-llama/Meta-Llama-3-8B-Instruct Hugging Face page.
Procedure
Pull the Red Hat AI Model Optimization Toolkit container image:
$ podman pull registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0
Verify the LLM Compressor version installed in the container:
$ podman run --rm -it \
    registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0 \
    python -c "import llmcompressor; print(llmcompressor.__version__)"
Example output
v0.9.0.1
Create a working directory and clone the upstream LLM Compressor repository:
$ mkdir model-opt && \
    cd model-opt && \
    git clone https://github.com/vllm-project/llm-compressor.git
Check out the LLM Compressor tag that matches the version that is installed in the container:
$ cd llm-compressor && \
    git checkout v0.9.0.1
Write your Hugging Face token to the private.env file as HF_TOKEN and source it:
$ echo "export HF_TOKEN=<YOUR_HF_TOKEN>" > private.env
$ source private.env
If your system has SELinux enabled, configure SELinux to allow device access:
$ sudo setsebool -P container_use_devices 1
Run the llama3_example.py compression example using the Red Hat AI Inference Server model optimization container:
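The exact command for this step was lost in extraction; the following is an illustrative sketch only. The GPU device flag, the volume mount, and the examples/quantization_w4a16/llama3_example.py path inside the cloned repository are assumptions that you should verify against your environment and the upstream llm-compressor repository before running:

```shell
# Illustrative sketch only: the device flag, mount path, and recipe path
# are assumptions; verify them against your environment before running.
podman run --rm -it \
    --device nvidia.com/gpu=all \
    --env HF_TOKEN \
    -v ./:/workspace:Z \
    registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0 \
    python /workspace/examples/quantization_w4a16/llama3_example.py
```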
Verification
Monitor the compression run for successful completion and any error messages. The quantization process outputs progress information and saves the compressed model to the mounted volume.
Example output
2025-09-18T14:42:27.377028+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 423it [00:13, 30.49it/s]