Chapter 6. Compressing language models with Red Hat AI Model Optimization Toolkit


Quantize and compress large language models with llm-compressor compression recipes by using the Red Hat AI Model Optimization Toolkit.

Prerequisites

  • You are logged in as a user with sudo access.
  • You have access to the registry.redhat.io image registry and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
Note

This example compression procedure uses the openly available TinyLlama/TinyLlama-1.1B-Chat-v1.0 model, which does not require an access request. If you instead compress a gated model, such as meta-llama/Meta-Llama-3-8B-Instruct, you must first request access from the model's Hugging Face page.

Procedure

  1. Open a shell prompt on the RHEL AI server.
  2. Stop the Red Hat AI Inference Server service:

    [cloud-user@localhost ~]$ sudo systemctl stop rhaiis
  3. Create a working directory:

    [cloud-user@localhost ~]$ mkdir -p model-opt
  4. Change permissions on the project folder and enter the folder:

    [cloud-user@localhost ~]$ chmod 775 model-opt && cd model-opt
  5. Add the compression recipe Python script. For example, create the following example.py file, which quantizes the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model to FP8 format. An alternative calibration-based recipe is sketched after this step.

    import os
    
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.utils import dispatch_for_generation
    
    MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    
    # Configure the quantization algorithm and scheme
    recipe = QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    )
    
    # Create log directory in a writable location
    LOG_DIR = "./sparse_logs"
    os.makedirs(LOG_DIR, exist_ok=True)
    
    # Apply quantization
    oneshot(model=model, recipe=recipe)
    
    # Confirm quantized model looks OK
    print("========== SAMPLE GENERATION ==============")
    dispatch_for_generation(model)
    input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
        model.device
    )
    output = model.generate(input_ids, max_new_tokens=20)
    print(tokenizer.decode(output[0]))
    print("==========================================")
    
    # Save to disk in compressed-tensors format
    SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)
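    The FP8_DYNAMIC scheme quantizes weights ahead of time and computes activation scales at runtime, so it needs no calibration data. Weight-only schemes such as W4A16 do require calibration samples. The following is a minimal sketch of such an alternative recipe, assuming the GPTQModifier and the built-in open_platypus calibration dataset that llm-compressor provides; adapt the model, dataset, and sample count to your workload before use:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier
    
    MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    
    # Weight-only INT4 quantization with GPTQ; activations stay in 16-bit
    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
    
    # Unlike FP8_DYNAMIC, W4A16 calibrates against sample data to minimize
    # quantization error
    oneshot(
        model=model,
        dataset="open_platypus",
        recipe=recipe,
        max_seq_length=2048,
        num_calibration_samples=512,
    )
    
    # Save to disk in compressed-tensors format
    SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-W4A16"
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)

    Compared with FP8 dynamic quantization, W4A16 trades a longer compression run for a smaller on-disk and in-memory footprint.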
  6. Export your Hugging Face token as an environment variable:

    [cloud-user@localhost ~]$ export HF=<YOUR_HUGGING_FACE_TOKEN>
  7. Run the compression recipe by using the Red Hat AI Model Optimization Toolkit container:

    [cloud-user@localhost ~]$ sudo podman run -it \
      -v ~/model-opt:/opt/app-root/model-opt:z \
      -v /var/lib/rhaiis/models:/opt/app-root/models:z \
      --device nvidia.com/gpu=all \
      --workdir /opt/app-root/model-opt \
      -e HF_HOME=/opt/app-root/models \
      -e HF_TOKEN=$HF \
      --entrypoint python \
      registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.3 \
      example.py

Verification

  • Monitor the compression run for completion and error messages. The quantization process prints progress information and saves the compressed model to the TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic subfolder of the model-opt folder. To confirm that the saved model loads and generates, see the sketch after the example output.

    Example output

    2025-11-12T21:09:20.276558+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
    Compressing model: 154it [00:02, 59.18it/s]
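    As an additional check, you can load the compressed model and run a short generation before returning it to service. The following is a minimal sketch, assuming vLLM is installed in a local Python environment with GPU access and that you run it from the model-opt folder; it is not part of the documented container workflow:

    from vllm import LLM, SamplingParams
    
    # Path to the folder that example.py saved, relative to ~/model-opt
    MODEL_PATH = "TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic"
    
    # vLLM detects the compressed-tensors checkpoint format automatically
    llm = LLM(model=MODEL_PATH)
    params = SamplingParams(max_tokens=20)
    
    outputs = llm.generate(["Hello my name is"], params)
    print(outputs[0].outputs[0].text)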
