
Chapter 6. Compressing language models with Red Hat AI Model Optimization Toolkit


Quantize and compress large language models with llm-compressor compression recipes by using Red Hat AI Model Optimization Toolkit.

Prerequisites

  • You are logged in as a user with sudo access.
  • You have access to the registry.redhat.io image registry and have logged in.
  • You have a Hugging Face account and have generated a Hugging Face access token.
Note

This example compression procedure compresses the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model, which does not require gated access. If you compress a gated model, such as meta-llama/Meta-Llama-3-8B-Instruct, you must first request access from the model's Hugging Face page.
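
You can optionally confirm that your access token is valid before you start. The following is a minimal sketch, assuming the huggingface_hub Python package is available on the host; replace the placeholder with your own token:

    from huggingface_hub import whoami
    
    # Prints the account name that the token resolves to; an invalid or
    # expired token raises an authentication error instead.
    print(whoami(token="<YOUR_HUGGING_FACE_TOKEN>")["name"])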

Procedure

  1. Open a shell prompt on the RHEL AI server.
  2. Stop the Red Hat AI Inference Server service:

    [cloud-user@localhost ~]$ systemctl stop rhaiis
  3. Create a working directory:

    [cloud-user@localhost ~]$ mkdir -p model-opt
  4. Change permissions on the project folder and enter the folder:

    [cloud-user@localhost ~]$ chmod 775 model-opt && cd model-opt
  5. Add the compression recipe Python script. For example, create the following example.py file, which compresses the TinyLlama/TinyLlama-1.1B-Chat-v1.0 model to a quantized FP8 format (a calibration-based alternative recipe is sketched after this procedure):

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.utils import dispatch_for_generation
    
    import os
    
    MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    
    # Configure the quantization algorithm and scheme.
    # FP8_DYNAMIC quantizes weights to FP8 ahead of time and quantizes
    # activations dynamically at runtime, so no calibration data is needed.
    recipe = QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    )
    
    # Create log directory in a writable location
    LOG_DIR = "./sparse_logs"
    os.makedirs(LOG_DIR, exist_ok=True)
    
    # Apply quantization
    oneshot(model=model, recipe=recipe)
    
    # Confirm quantized model looks OK
    print("========== SAMPLE GENERATION ==============")
    dispatch_for_generation(model)
    input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
        model.device
    )
    output = model.generate(input_ids, max_new_tokens=20)
    print(tokenizer.decode(output[0]))
    print("==========================================")
    
    # Save to disk in compressed-tensors format
    SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
    model.save_pretrained(SAVE_DIR)
    tokenizer.save_pretrained(SAVE_DIR)
  6. Export your Hugging Face token as an environment variable:

    [cloud-user@localhost ~]$ export HF=<YOUR_HUGGING_FACE_TOKEN>
  7. Run the compression recipe by using the Red Hat AI Model Optimization Toolkit container:

    [cloud-user@localhost ~]$ sudo podman run -it \
      -v ~/model-opt:/opt/app-root/model-opt:z \
      -v /var/lib/rhaiis/models:/opt/app-root/models:z \
      --device nvidia.com/gpu=all \
      --workdir /opt/app-root/model-opt \
      -e HF_HOME=/opt/app-root/models \
      -e HF_TOKEN=$HF \
      --entrypoint python \
      registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.3 \
      example.py
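
The FP8_DYNAMIC scheme in example.py quantizes without calibration data. Recipes that calibrate against sample data follow the same pattern: build a list of modifiers and pass a calibration dataset to oneshot. The following is a minimal sketch of an INT8 weight-and-activation (W8A8) recipe, modeled on the upstream llm-compressor examples; the dataset name and sample counts are illustrative values, not part of this procedure:

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier
    from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
    
    MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
    
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    
    # Smooth activation outliers, then apply GPTQ-based INT8 quantization
    # to all Linear layers except the output head.
    recipe = [
        SmoothQuantModifier(smoothing_strength=0.8),
        GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
    ]
    
    # Calibrate on a small sample set; dataset and counts are illustrative.
    oneshot(
        model=model,
        dataset="open_platypus",
        recipe=recipe,
        max_seq_length=2048,
        num_calibration_samples=512,
    )
    
    # Save to disk in compressed-tensors format.
    SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-W8A8"
    model.save_pretrained(SAVE_DIR, save_compressed=True)
    tokenizer.save_pretrained(SAVE_DIR)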

Verification

  • Monitor the compression run for successful completion and any error messages. The quantization process outputs progress information and saves the compressed model to the TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic subdirectory of the model-opt folder. A sketch for loading the compressed checkpoint with vLLM follows the example output.

    Example output

    2025-11-12T21:09:20.276558+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
    Compressing model: 154it [00:02, 59.18it/s]
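
To confirm that the compressed checkpoint loads for inference, you can point vLLM at the saved directory. The following is a minimal sketch, assuming vLLM is available in your serving environment (for example, inside the Red Hat AI Inference Server container with the model-opt directory mounted); adjust the path to where the compressed model was saved:

    from vllm import LLM, SamplingParams
    
    # Path to the directory written by example.py (the default SAVE_DIR).
    llm = LLM(model="model-opt/TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic")
    
    # Generate a short completion to verify that the FP8 weights load.
    outputs = llm.generate(["Hello my name is"], SamplingParams(max_tokens=20))
    print(outputs[0].outputs[0].text)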
