Chapter 5. Compressing language models with Red Hat AI Model Optimization Toolkit


Quantize and compress large language models with LLM Compressor compression recipes and the Red Hat AI Model Optimization Toolkit.

Prerequisites

  • You have installed Podman or Docker.
  • You are logged in as a user with sudo access.
  • You have access to the registry.redhat.io image registry and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
Note

This example compression procedure uses the meta-llama/Meta-Llama-3-8B-Instruct model with the llama3_example.py compression recipe. To use this model, you must request access from the meta-llama/Meta-Llama-3-8B-Instruct Hugging Face page.

Procedure

  1. Pull the Red Hat AI Model Optimization Toolkit container image:

    $ podman pull registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.3
  2. Verify the LLM Compressor version installed in the container:

    $ podman run --rm -it \
      registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.3 \
      python -c "import llmcompressor; print(llmcompressor.__version__)"

    Example output

    0.8.1

  3. Create a working directory and clone the upstream LLM Compressor repository:

    $ mkdir model-opt && \
    cd model-opt && \
    git clone https://github.com/vllm-project/llm-compressor.git
  4. Check out the LLM Compressor tag that matches the version installed in the container, then return to the working directory so that later steps run from the model-opt directory:

    $ cd llm-compressor && \
    git checkout 0.8.1 && \
    cd ..
  5. Create a private.env file that exports your Hugging Face access token as HF_TOKEN, then source it:

    $ echo "export HF_TOKEN=<YOUR_HF_TOKEN>" > private.env
    $ source private.env
  6. If your system has SELinux enabled, configure SELinux to allow device access:

    $ sudo setsebool -P container_use_devices 1
  7. Run the llama3_example.py compression example using the Red Hat AI Inference Server model optimization container:

    $ podman run --rm \
      -v "$(pwd):/opt/app-root/model-opt" \
      --device nvidia.com/gpu=all --ipc=host \
      -e "HF_TOKEN=$HF_TOKEN" \
      registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.3 \
      python /opt/app-root/model-opt/llm-compressor/examples/quantization_w8a8_int8/llama3_example.py
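
    The llama3_example.py script quantizes the model weights and activations to INT8 (W8A8). A minimal sketch of that flow, based on the upstream LLM Compressor examples, looks roughly like the following; the exact imports, calibration dataset handling, and output directory in the pinned 0.8.1 script may differ, so treat the names below as illustrative rather than authoritative:

    # Minimal sketch of a W8A8 INT8 compression run with LLM Compressor.
    # Names and arguments follow the upstream examples and may differ in 0.8.1.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier
    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

    MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
    SAVE_DIR = "Meta-Llama-3-8B-Instruct-W8A8"  # illustrative output directory

    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # The "recipe": smooth activation outliers, then apply GPTQ INT8
    # quantization to all Linear layers except the output head.
    recipe = [
        SmoothQuantModifier(smoothing_strength=0.8),
        GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
    ]

    # One-shot calibration pass over a small calibration dataset.
    oneshot(
        model=model,
        dataset="ultrachat_200k",
        recipe=recipe,
        max_seq_length=2048,
        num_calibration_samples=512,
    )

    # Save in compressed-tensors format so vLLM can load the model directly.
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)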

Verification

Monitor the compression run for successful completion and any error messages. The quantization process outputs progress information and saves the compressed model to the mounted volume.

Example output

2025-09-18T14:42:27.377028+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 423it [00:13, 30.49it/s]
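
You can also confirm that the saved checkpoint is quantized by inspecting its config.json in the mounted working directory. The following is a minimal sketch; the output directory name is set by the SAVE_DIR variable in llama3_example.py, so substitute the directory that the script actually created:

# Minimal sketch: confirm the saved checkpoint carries a quantization_config.
# Replace the directory name with the SAVE_DIR that llama3_example.py created.
import json
from pathlib import Path

save_dir = Path("Meta-Llama-3-8B-Instruct-W8A8")  # illustrative name

config = json.loads((save_dir / "config.json").read_text())

# A compressed-tensors checkpoint records the applied scheme here.
print(json.dumps(config.get("quantization_config", "missing"), indent=2))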
