Chapter 7. Compressing language models with Red Hat AI Model Optimization Toolkit
Quantize and compress large language models with llm-compressor compression recipes by using the Red Hat AI Model Optimization Toolkit.
Prerequisites
- You have deployed a Red Hat Enterprise Linux AI instance with NVIDIA CUDA AI accelerators installed.
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
This example compression procedure uses the meta-llama/Meta-Llama-3-8B-Instruct model with the llama3_example.py compression recipe. To use this model, you must request access from the meta-llama/Meta-Llama-3-8B-Instruct Hugging Face page.
Procedure
- Open a shell prompt on the RHEL AI server.
- Stop the Red Hat AI Inference Server service:

  [cloud-user@localhost ~]$ systemctl stop rhaiis

- Create a working directory:

  [cloud-user@localhost ~]$ mkdir -p model-opt

- Change permissions on the project folder and enter the folder:
  [cloud-user@localhost ~]$ chmod 775 model-opt && cd model-opt

- Add the compression recipe Python script. For example, create the following example.py file that compresses the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model in quantized FP8 format:

  from transformers import AutoModelForCausalLM, AutoTokenizer
  from llmcompressor import oneshot
  from llmcompressor.modifiers.quantization import QuantizationModifier
  from llmcompressor.utils import dispatch_for_generation
  import os

  MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

  model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
  tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

  # Configure the quantization algorithm and scheme
  recipe = QuantizationModifier(
      targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
  )

  # Create log directory in a writable location
  LOG_DIR = "./sparse_logs"
  os.makedirs(LOG_DIR, exist_ok=True)

  # Apply quantization
  oneshot(model=model, recipe=recipe)

  # Confirm quantized model looks OK
  print("========== SAMPLE GENERATION ==============")
  dispatch_for_generation(model)
  input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
      model.device
  )
  output = model.generate(input_ids, max_new_tokens=20)
  print(tokenizer.decode(output[0]))
  print("==========================================")

  # Save to disk in compressed-tensors format
  SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
  model.save_pretrained(SAVE_DIR)
  tokenizer.save_pretrained(SAVE_DIR)

- Export your Hugging Face token:
  $ export HF=<YOUR_HUGGING_FACE_TOKEN>

- Run the compression recipe using the Red Hat AI Model Optimization Toolkit container. Because the entry point is set to python, pass only the script name as the container argument:

  [cloud-user@localhost ~]$ sudo podman run -it \
      -v ./model-opt:/opt/app-root/model-opt:z \
      -v /var/lib/rhaiis/models:/opt/app-root/models:z \
      --device nvidia.com/gpu=all \
      --workdir /opt/app-root/model-opt \
      -e HF_HOME=/opt/app-root/models \
      -e HF_TOKEN=$HF \
      --entrypoint python \
      registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.4.0-ea.1 \
      example.py
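When the container run completes, the saved checkpoint directory contains a config.json file that records the compression metadata in a quantization_config block. The helper below is a minimal sketch, not part of llm-compressor or the toolkit, that reads that block so you can confirm the checkpoint was actually quantized:

```python
import json
from pathlib import Path


def quantization_summary(save_dir: str) -> dict:
    """Return the quantization_config block from a saved checkpoint's
    config.json, or an empty dict if the model was not quantized."""
    config = json.loads(Path(save_dir, "config.json").read_text())
    return config.get("quantization_config", {})


# Example usage after the recipe has run (directory name comes from
# the SAVE_DIR logic in example.py):
# summary = quantization_summary("TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic")
# print(summary)
```

An empty result suggests the oneshot step did not apply the recipe; the exact keys inside quantization_config depend on the llm-compressor version.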
Verification
Monitor the compression run for successful completion and any error messages. The quantization process outputs progress information and saves the compressed model to the ./model-opt folder.

Example output

  2025-11-12T21:09:20.276558+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
  Compressing model: 154it [00:02, 59.18it/s]
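As an additional rough sanity check, not part of the documented procedure, you can compare the on-disk size of the compressed checkpoint against the original: FP8 weights should occupy roughly half the space of 16-bit originals. This sketch sums the weight files in a checkpoint directory (the helper name is ours):

```python
from pathlib import Path


def checkpoint_size_mib(model_dir: str) -> float:
    """Total size in MiB of the weight files (*.safetensors and *.bin)
    inside a checkpoint directory; other files are ignored."""
    total = sum(
        f.stat().st_size
        for pattern in ("*.safetensors", "*.bin")
        for f in Path(model_dir).glob(pattern)
    )
    return total / (1024 * 1024)


# Example usage (assumed paths):
# print(checkpoint_size_mib("TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic"))
```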