Red Hat AI Model Optimization Toolkit


Red Hat AI Inference Server 3.3

Compressing large language models with the LLM Compressor library

Red Hat AI Documentation Team

Abstract

Describes the LLM Compressor library and how you can use it to optimize and compress large language models before inferencing.

Red Hat AI Model Optimization Toolkit is an open source library that incorporates the latest research in model compression, allowing you to generate compressed models with minimal effort. Red Hat AI Model Optimization Toolkit is based on the upstream LLM Compressor project.

The Red Hat AI Model Optimization Toolkit framework leverages the latest quantization, sparsity, and general compression techniques to improve generative AI model efficiency, scalability, and performance while maintaining accuracy. With native Hugging Face and vLLM support, you can seamlessly integrate optimized models with deployment pipelines for faster, cost-saving inference at scale, powered by the compressed-tensors model format.

Chapter 2. Large language model optimization

As AI applications mature and new compression algorithms are published, there is a need for unified tools that can apply the compression algorithms best suited to a user's inference needs, optimized to run on accelerated hardware.

Optimizing large language models (LLMs) involves balancing three key factors: model size, inference speed, and accuracy. Improving any one of these factors can have a negative effect on the other factors. For example, increasing model accuracy usually requires more parameters, which results in a larger model and potentially slower inference. The tradeoff between these factors is a core challenge when serving LLMs.

Red Hat AI Model Optimization Toolkit allows you to apply model optimization techniques such as quantization, sparsity, and compression to reduce memory use and model size, and to improve inference speed, with minimal impact on the accuracy of model responses. Red Hat AI Model Optimization Toolkit supports the following compression methodologies:

Quantization
Converts model weights and activations to lower-bit formats such as int8, reducing memory usage.
Sparsity
Sets a portion of model weights to zero, often in fixed patterns, allowing for more efficient computation.
Compression
Shrinks the saved model file size, ideally with minimal impact on performance.

Use these methods together to deploy models more efficiently on resource-limited hardware.
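The arithmetic behind quantization and sparsity is straightforward to sketch. The following illustrative example is plain Python written for intuition only, not the LLM Compressor API: it quantizes a list of weights to int8 with a single symmetric scale, and applies a 2:4 semi-structured sparsity pattern (keep the two largest-magnitude values in every group of four).

```python
# Illustrative sketch of symmetric int8 quantization and 2:4 sparsity.
# Plain Python for intuition only; this is not the LLM Compressor API.

def quantize_int8(weights):
    """Map floats to int8 with a single symmetric per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127  # largest value maps to +/-127
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def sparsify_2_4(weights):
    """2:4 sparsity: keep the 2 largest-magnitude values in every
    group of 4 weights, and set the rest to zero."""
    out = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]))[-2:]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

weights = [0.12, -0.48, 0.03, 0.25, -0.91, 0.07, 0.33, -0.02]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
# Quantization is lossy but close: each element is within one scale step.
assert all(abs(a - b) <= scale for a, b in zip(weights, recovered))
print(sparsify_2_4(weights))  # half of each group of 4 is zeroed
```

In practice the library selects scales per tensor, channel, group, or token and calibrates them against your data set, but the round-trip above captures why lower-bit storage trades a small reconstruction error for a large memory saving.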

Chapter 3. Supported model compression workflows

LLM Compressor supports post-training quantization, a conversion technique that reduces model size and improves latency on CPUs and hardware accelerators, with minimal degradation of model accuracy. A streamlined API applies quantization or sparsity based on a data set that you provide.
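The API can also consume a declarative recipe that names the modifiers to apply. As an illustrative sketch only (field and modifier names follow upstream llm-compressor recipe conventions and should be checked against the version you install), a weight-only INT4 recipe might look like the following:

```yaml
# Illustrative llm-compressor recipe sketch: weight-only INT4 (W4A16)
# quantization with GPTQ. Verify field names against your installed version.
quant_stage:
  quant_modifiers:
    GPTQModifier:
      ignore: ["lm_head"]          # leave the output head in full precision
      config_groups:
        group_0:
          targets: ["Linear"]      # apply to all Linear layers
          weights:
            num_bits: 4
            type: "int"
            symmetric: true
            strategy: "group"
            group_size: 128        # one scale per group of 128 weights
```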

The following advanced model types and deployment workflows are supported:

  • Multimodal models: Includes vision-language models
  • Mixture of experts (MoE) models: Supports models such as DeepSeek and Mixtral, including calibration support for NVFP4 quantization
  • Large model support: Uses the Hugging Face accelerate library for multi-GPU and CPU offloading
  • Multiple quantization schemes applied to a single model: Support for non-uniform quantization, such as combining NVFP4 and FP8 quantization

All workflows are Hugging Face–compatible, enabling models to be quantized, compressed, and deployed with vLLM for efficient inference. LLM Compressor supports several compression algorithms:

  • AWQ: Weight-only INT4 quantization
  • GPTQ: Weight-only INT4 quantization
  • FP8: Dynamic per-token quantization and DeepSeekV3-style block quantization
  • SparseGPT: Post-training sparsity
  • SmoothQuant: Activation quantization
  • QuIP transforms: Weight and activation quantization
  • SpinQuant transforms: Weight and activation quantization

Each of these compression methods computes optimal scales and zero-points for weights and activations. Optimized scales can be per tensor, channel, group, or token. The final result is a compressed model saved with all its applied quantization parameters.
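The granularity of those scales matters. The following illustrative sketch (plain Python for intuition, not the library's internals) contrasts a single per-tensor scale with per-channel scales for a small weight matrix, showing why finer granularity usually reconstructs weights more accurately:

```python
# Illustrative comparison of per-tensor vs. per-channel quantization scales.
# Plain Python for intuition; LLM Compressor computes these internally.

def round_trip_error(rows, scales):
    """Quantize each row with its scale, dequantize, and sum abs error."""
    err = 0.0
    for row, s in zip(rows, scales):
        for w in row:
            q = max(-128, min(127, round(w / s)))
            err += abs(w - q * s)
    return err

# Two output channels with very different magnitudes.
weights = [
    [0.011, -0.007, 0.004],  # small-magnitude channel
    [2.5, -1.8, 0.9],        # large-magnitude channel
]

# Per-tensor: one scale fitted to the global max swamps the small channel.
global_scale = max(abs(w) for row in weights for w in row) / 127
per_tensor = round_trip_error(weights, [global_scale] * len(weights))

# Per-channel: each row gets a scale fitted to its own range.
channel_scales = [max(abs(w) for w in row) / 127 for row in weights]
per_channel = round_trip_error(weights, channel_scales)

assert per_channel <= per_tensor
print(f"per-tensor error {per_tensor:.5f} vs per-channel {per_channel:.5f}")
```

Per-group and per-token strategies extend the same idea to finer slices of the weights and activations, at the cost of storing more scale values.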

Quantized and sparse models that you create with Red Hat AI Model Optimization Toolkit are saved using the compressed-tensors library (an extension of Safetensors). The compression format matches the model’s quantization or sparsity type. These formats are natively supported in vLLM, enabling fast inference through optimized deployment kernels by using Red Hat AI Inference Server or other inference providers.
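In the saved checkpoint, the applied scheme is recorded in the model's config.json. The exact fields depend on the scheme and library version, but for an INT8 weight-and-activation model the quantization_config entry looks roughly like this (illustrative, abbreviated fragment; field names follow compressed-tensors conventions):

```json
"quantization_config": {
  "quant_method": "compressed-tensors",
  "format": "int-quantized",
  "config_groups": {
    "group_0": {
      "targets": ["Linear"],
      "weights": {
        "num_bits": 8,
        "type": "int",
        "symmetric": true,
        "strategy": "channel"
      },
      "input_activations": {
        "num_bits": 8,
        "type": "int",
        "symmetric": true,
        "strategy": "token",
        "dynamic": true
      }
    }
  },
  "ignore": ["lm_head"]
}
```

vLLM reads this configuration when loading the model and selects the matching optimized kernels automatically.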

Chapter 5. Integration with Red Hat OpenShift AI

You can use Red Hat OpenShift AI and Red Hat AI Model Optimization Toolkit to experiment with model training, fine-tuning, and compression. The OpenShift AI integration of Red Hat AI Model Optimization Toolkit provides two introductory examples:

  • A workbench image and notebook that demonstrate the compression of a tiny model that you can run on a CPU, highlighting how calibrated compression can improve over data-free approaches.
  • A data science pipeline that extends the same workflow to a larger Llama 3.2 model, highlighting how users can build automated, GPU-accelerated experiments that can be shared with other stakeholders from a single URL.

Both are available in the Red Hat AI Examples repository.

Important

The OpenShift AI integration of Red Hat AI Model Optimization Toolkit is a Developer Preview feature.

Quantize and compress large language models with llm-compressor compression recipes and Red Hat AI Model Optimization Toolkit.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to the registry.redhat.io image registry and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
Note

This example compression procedure uses the meta-llama/Meta-Llama-3-8B-Instruct model with the llama3_example.py compression recipe. To use this model, you must request access from the meta-llama/Meta-Llama-3-8B-Instruct Hugging Face page.

Procedure

  1. Pull the Red Hat AI Model Optimization Toolkit container image:

    $ podman pull registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0
  2. Verify the LLM Compressor version installed in the container:

    $ podman run --rm -it \
      registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0 \
      python -c "import llmcompressor; print(llmcompressor.__version__)"

    Example output

    v0.9.0.1

  3. Create a working directory and clone the upstream LLM Compressor repository:

    $ mkdir model-opt && \
    cd model-opt && \
    git clone https://github.com/vllm-project/llm-compressor.git
  4. Check out the LLM Compressor tag that matches the version that is installed in the container:

    $ cd llm-compressor && \
    git checkout v0.9.0.1
  5. Write your HF_TOKEN Hugging Face access token to the private.env file and source it:

    $ echo "export HF_TOKEN=<YOUR_HF_TOKEN>" > private.env
    $ source private.env
  6. If your system has SELinux enabled, configure SELinux to allow device access:

    $ sudo setsebool -P container_use_devices 1
  7. Run the llama3_example.py compression example using the Red Hat AI Inference Server model optimization container:

    $ podman run --rm \
      -v "$(pwd):/opt/app-root/model-opt" \
      --device nvidia.com/gpu=all --ipc=host \
      -e HF_TOKEN=$HF_TOKEN \
      registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0 \
      python /opt/app-root/model-opt/llm-compressor/examples/quantization_w8a8_int8/llama3_example.py

Verification

Monitor the compression run for successful completion and any error messages. The quantization process outputs progress information and saves the compressed model to the mounted volume.

Example output

2025-09-18T14:42:27.377028+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 423it [00:13, 30.49it/s]

Legal Notice

Copyright © Red Hat.
Except as otherwise noted below, the text of and illustrations in this documentation are licensed by Red Hat under the Creative Commons Attribution-Share Alike 3.0 Unported license. If you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, the Red Hat logo, JBoss, Hibernate, and RHCE are trademarks or registered trademarks of Red Hat, Inc. or its subsidiaries in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
XFS is a trademark or registered trademark of Hewlett Packard Enterprise Development LP or its subsidiaries in the United States and other countries.
The OpenStack® Word Mark and OpenStack logo are trademarks or registered trademarks of the Linux Foundation, used under license.
All other trademarks are the property of their respective owners.