Chapter 6. Compressing language models with Red Hat AI Model Optimization Toolkit
Quantize and compress large language models with llm-compressor compression recipes by using Red Hat AI Model Optimization Toolkit.
Prerequisites
- You are logged in as a user with sudo access.
- You have access to the registry.redhat.io image registry and have logged in.
- You have a Hugging Face account and have generated a Hugging Face access token.
This example compression procedure uses the meta-llama/Meta-Llama-3-8B-Instruct model with the llama3_example.py compression recipe. To use this model, you must request access from the meta-llama/Meta-Llama-3-8B-Instruct Hugging Face page.
Procedure
- Open a shell prompt on the RHEL AI server.
- Stop the Red Hat AI Inference Server service:

  [cloud-user@localhost ~]$ systemctl stop rhaiis

- Create a working directory:
  [cloud-user@localhost ~]$ mkdir -p model-opt

- Change permissions on the project folder and enter the folder:
  [cloud-user@localhost ~]$ chmod 775 model-opt && cd model-opt

- Add the compression recipe Python script. For example, create the following example.py file that compresses the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model in quantized FP8 format.
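The example.py script itself did not survive this page's rendering. A minimal sketch of such a recipe, based on the upstream llm-compressor FP8 quickstart (the exact script in the product documentation may differ, and the output directory name here is an assumption):

```python
# example.py -- sketch of an FP8 compression recipe using llm-compressor.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# Load the model and tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic quantization: weights are quantized ahead of time and
# activation scales are computed at runtime, so no calibration dataset
# is required. The lm_head layer is left unquantized.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# Apply the recipe in a single pass.
oneshot(model=model, recipe=recipe)

# Save the compressed model and tokenizer next to the script.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```

The FP8_DYNAMIC scheme is convenient for a first compression run precisely because it skips calibration; schemes such as INT4 weight quantization would additionally require a calibration dataset passed to oneshot().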
- Export your Hugging Face token:

  [cloud-user@localhost ~]$ export HF=<YOUR_HUGGING_FACE_TOKEN>

- Run the compression recipe using the Red Hat AI Model Optimization Toolkit container:
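The container command was also lost in this page's rendering. A sketch of what the invocation might look like, assuming a GPU-enabled host with CDI configured; the image path, tag, and container mount point are placeholders, not confirmed values, so check the product documentation for your release:

```shell
# Sketch only: image name and tag below are assumptions.
# HF is the variable exported in the previous step; HF_TOKEN is the
# environment variable that Hugging Face libraries read inside the container.
podman run --rm -it \
  --device nvidia.com/gpu=all \
  --security-opt=label=disable \
  -e HF_TOKEN="$HF" \
  -v "$(pwd)":/workspace:Z \
  registry.redhat.io/rhaiis/model-opt-cuda-rhel9:latest \
  python /workspace/example.py
```

Mounting the working directory into the container means the compressed model that example.py saves lands back in the model-opt folder on the host.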
Verification
Monitor the compression run for successful completion and any error messages. The quantization process outputs progress information and saves the compressed model to the ./model-opt folder.

Example output
2025-11-12T21:09:20.276558+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 154it [00:02, 59.18it/s]