
Chapter 6. Compressing language models with Red Hat AI Model Optimization Toolkit


Quantize and compress large language models with llm-compressor compression recipes by using Red Hat AI Model Optimization Toolkit.

Prerequisites

  • You are logged in as a user with sudo access.
  • You have access to the registry.redhat.io image registry and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
Note

This example compression procedure compresses the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model, which does not require gated access. If you compress a gated model, such as meta-llama/Meta-Llama-3-8B-Instruct, you must first request access from the model's Hugging Face page.
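
You can optionally confirm that your access token is valid before you start. The following is a minimal sketch, assuming the huggingface_hub Python package is available on the host; replace the placeholder with your own token:

    from huggingface_hub import whoami
    
    # Prints the account name that the token resolves to; an invalid or
    # expired token raises an authentication error instead.
    print(whoami(token="<YOUR_HUGGING_FACE_TOKEN>")["name"])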

Procedure

  1. Open a shell prompt on the RHEL AI server.
  2. Stop the Red Hat AI Inference Server service:

    [cloud-user@localhost ~]$ systemctl stop rhaiis
  3. Create a working directory:

    [cloud-user@localhost ~]$ mkdir -p model-opt
  4. Change permissions on the project folder and enter the folder:

    [cloud-user@localhost ~]$ chmod 775 model-opt && cd model-opt
  5. Add the compression recipe Python script. For example, create the following example.py file, which compresses the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model to a quantized FP8 format (a calibration-based alternative recipe is sketched after this procedure):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.utils import dispatch_for_generation
    
    import os
    
    MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    
    # Configure the quantization algorithm and scheme.
    # FP8_DYNAMIC quantizes weights to FP8 ahead of time and quantizes
    # activations dynamically at runtime, so no calibration data is needed.
    recipe = QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    )
    
    # Create log directory in a writable location
    LOG_DIR = "./sparse_logs"
    os.makedirs(LOG_DIR, exist_ok=True)
    
    # Apply quantization
    oneshot(model=model, recipe=recipe)
    
    # Confirm quantized model looks OK
    print("========== SAMPLE GENERATION ==============")
    dispatch_for_generation(model)
    input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
        model.device
    )
    output = model.generate(input_ids, max_new_tokens=20)
    print(tokenizer.decode(output[0]))
    print("==========================================")
    
    # Save to disk in compressed-tensors format
    SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
    model.save_pretrained(SAVE_DIR)
    tokenizer.save_pretrained(SAVE_DIR)
  6. Export your Hugging Face token as an environment variable:

    [cloud-user@localhost ~]$ export HF=<YOUR_HUGGING_FACE_TOKEN>
  7. Run the compression recipe by using the Red Hat AI Model Optimization Toolkit container:

    [cloud-user@localhost ~]$ sudo podman run -it \
      -v ~/model-opt:/opt/app-root/model-opt:z \
      -v /var/lib/rhaiis/models:/opt/app-root/models:z \
      --device nvidia.com/gpu=all \
      --workdir /opt/app-root/model-opt \
      -e HF_HOME=/opt/app-root/models \
      -e HF_TOKEN=$HF \
      --entrypoint python \
      registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.3 \
      example.py
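
The FP8_DYNAMIC scheme in example.py quantizes without calibration data. Recipes that calibrate against sample data follow the same pattern: build a list of modifiers and pass a calibration dataset to oneshot. The following is a minimal sketch of an INT8 weight-and-activation (W8A8) recipe, modeled on the upstream llm-compressor examples; the dataset name and sample counts are illustrative values, not part of this procedure:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier
    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
    
    MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    
    # Smooth activation outliers, then apply GPTQ-based INT8 quantization
    # to all Linear layers except the output head.
    recipe = [
        SmoothQuantModifier(smoothing_strength=0.8),
        GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
    ]
    
    # Calibrate on a small sample set; dataset and counts are illustrative.
    oneshot(
        model=model,
        dataset="open_platypus",
        recipe=recipe,
        max_seq_length=2048,
        num_calibration_samples=512,
    )
    
    # Save to disk in compressed-tensors format.
    SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-W8A8"
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)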

Verification

  • Monitor the compression run for successful completion and any error messages. The quantization process outputs progress information and saves the compressed model to the TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic subdirectory of the model-opt folder. A sketch for loading the compressed checkpoint with vLLM follows the example output.

    Example output

    2025-11-12T21:09:20.276558+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
    Compressing model: 154it [00:02, 59.18it/s]
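
To confirm that the compressed checkpoint loads for inference, you can point vLLM at the saved directory. The following is a minimal sketch, assuming vLLM is available in your serving environment (for example, inside the Red Hat AI Inference Server container with the model-opt directory mounted); adjust the path to where the compressed model was saved:

    from vllm import LLM, SamplingParams
    
    # Path to the directory written by example.py (the default SAVE_DIR).
    llm = LLM(model="model-opt/TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic")
    
    # Generate a short completion to verify that the FP8 weights load.
    outputs = llm.generate(["Hello my name is"], SamplingParams(max_tokens=20))
    print(outputs[0].outputs[0].text)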
