LLM Compressor


Red Hat AI Inference Server 3.0

Compressing large language models with the LLM Compressor library

Red Hat AI Documentation Team

Abstract

Describes the LLM Compressor library and how you can use it to optimize and compress large language models before inferencing.

Preface

LLM Compressor is an open source library that incorporates the latest research in model compression, allowing you to generate compressed models with minimal effort.

The LLM Compressor framework leverages the latest quantization, sparsity, and general compression techniques to improve generative AI model efficiency, scalability, and performance while maintaining accuracy. With native Hugging Face and vLLM support, you can seamlessly integrate optimized models with deployment pipelines for faster, cost-saving inference at scale, powered by the compressed-tensors model format.

Important

LLM Compressor is a Developer Preview feature only. Developer Preview features are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Developer Preview features for production or business-critical workloads. Developer Preview features provide early access to upcoming product features in advance of their possible inclusion in a Red Hat product offering, enabling customers to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Developer Preview features without an associated SLA.

As AI applications mature and new compression algorithms are published, there is a need for unified tools that can apply the compression algorithms suited to a user's inference needs and that are optimized to run on accelerated hardware.

Optimizing large language models (LLMs) involves balancing three key factors: model size, inference speed, and accuracy. Improving any one of these factors can negatively affect the others. For example, increasing model accuracy usually requires more parameters, which results in a larger model and potentially slower inference. Managing this tradeoff is a core challenge when serving LLMs.

LLM Compressor allows you to apply model optimization techniques such as quantization, sparsity, and compression to reduce memory use and model size and to improve inference speed, with minimal effect on the accuracy of model responses. LLM Compressor supports the following compression methodologies:

Quantization
Converts model weights and activations to lower-bit formats such as INT8, reducing memory usage.
Sparsity
Sets a portion of model weights to zero, often in fixed patterns, allowing for more efficient computation.
Compression
Shrinks the saved model file size, ideally with minimal impact on performance.

Use these methods together to deploy models more efficiently on resource-limited hardware.
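
As a concrete illustration, a data-free quantization pass can be expressed as a single recipe applied to a Hugging Face model. The following sketch assumes a recent llmcompressor release; the model ID and output directory are placeholders, and exact import paths can differ between versions.

  # Sketch: data-free FP8 dynamic quantization with LLM Compressor.
  # The model ID and output directory are illustrative placeholders.
  from llmcompressor import oneshot
  from llmcompressor.modifiers.quantization import QuantizationModifier

  # Quantize all Linear layers to FP8, keeping the output head in full precision.
  recipe = QuantizationModifier(
      targets="Linear",
      scheme="FP8_DYNAMIC",
      ignore=["lm_head"],
  )

  # Apply the recipe and save the compressed model in compressed-tensors format.
  oneshot(
      model="meta-llama/Llama-3.2-1B-Instruct",
      recipe=recipe,
      output_dir="Llama-3.2-1B-Instruct-FP8-dynamic",
  )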

Chapter 2. Supported model compression workflows

LLM Compressor supports post-training quantization, a conversion technique that reduces model size and improves CPU and hardware accelerator latency without degrading model accuracy. A streamlined API applies quantization or sparsity based on a data set that you provide.
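
For example, a weight-only INT4 (GPTQ) pass calibrated on a small data set might look like the following sketch. The data set name, sample count, and model ID are illustrative, and argument names can vary between llmcompressor versions.

  # Sketch: calibrated W4A16 (GPTQ) quantization driven by a user-supplied data set.
  from llmcompressor import oneshot
  from llmcompressor.modifiers.quantization import GPTQModifier

  recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

  oneshot(
      model="meta-llama/Llama-3.2-1B-Instruct",  # placeholder model ID
      dataset="open_platypus",                   # calibration data set that you provide
      recipe=recipe,
      max_seq_length=2048,
      num_calibration_samples=512,
      output_dir="Llama-3.2-1B-Instruct-W4A16",
  )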

The following advanced model types and deployment workflows are supported:

  • Multimodal models: Includes vision-language models
  • Mixture of experts (MoE) models: Supports models like DeepSeek and Mixtral
  • Large model support: Uses the Hugging Face accelerate library for multi-GPU and CPU offloading

All workflows are Hugging Face–compatible, enabling models to be quantized, compressed, and deployed with vLLM for efficient inference. LLM Compressor supports several compression algorithms:

  • AWQ: Weight-only INT4 quantization
  • GPTQ: Weight-only INT4 quantization
  • FP8: Dynamic per-token quantization
  • SparseGPT: Post-training sparsity
  • SmoothQuant: Activation quantization

Each of these compression methods computes optimal scales and zero-points for weights and activations. Optimized scales can be per-tensor, per-channel, per-group, or per-token. The final result is a compressed model saved together with all of its applied quantization parameters.
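
As a simple illustration of what these algorithms compute, a per-tensor asymmetric INT8 scale and zero-point can be derived from the observed value range. The snippet below is a conceptual sketch in plain PyTorch, not the internal LLM Compressor implementation.

  # Conceptual sketch: per-tensor asymmetric INT8 quantization parameters.
  # This mirrors the scale/zero-point idea only; it is not LLM Compressor code.
  import torch

  def int8_scale_zero_point(x: torch.Tensor):
      qmin, qmax = -128, 127
      x_min, x_max = x.min().item(), x.max().item()
      # Scale maps the observed float range onto the INT8 range.
      scale = max((x_max - x_min) / (qmax - qmin), 1e-8)
      # Zero-point aligns float 0.0 with an integer value.
      zero_point = int(round(qmin - x_min / scale))
      return scale, zero_point

  weights = torch.randn(4096, 4096)
  scale, zp = int8_scale_zero_point(weights)
  quantized = torch.clamp(torch.round(weights / scale) + zp, -128, 127).to(torch.int8)
  dequantized = (quantized.float() - zp) * scale  # check reconstruction error
  print(scale, zp, (weights - dequantized).abs().max().item())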

Quantized and sparse models that you create with LLM Compressor are saved by using the compressed-tensors library, an extension of the Safetensors format. The compression format matches the model's quantization or sparsity type. These formats are natively supported in vLLM, enabling fast inference through optimized kernels when you serve models with Red Hat AI Inference Server or other inference providers.
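
Because compressed-tensors formats are natively understood by vLLM, a model saved by LLM Compressor can typically be loaded directly for inference. The following sketch uses the standard vLLM Python API; the model path is a placeholder for the output directory produced by an earlier quantization run.

  # Sketch: run a compressed model with vLLM. The path is a placeholder for the
  # output_dir produced by an LLM Compressor run.
  from vllm import LLM, SamplingParams

  llm = LLM(model="Llama-3.2-1B-Instruct-W4A16")
  params = SamplingParams(temperature=0.7, max_tokens=128)
  outputs = llm.generate(["What is model quantization?"], params)
  print(outputs[0].outputs[0].text)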

Chapter 4. Integration with Red Hat OpenShift AI

You can use Red Hat OpenShift AI and LLM Compressor to experiment with model training, fine-tuning, and compression. The OpenShift AI integration of LLM Compressor provides two introductory examples:

  • A workbench image and notebook that demonstrate the compression of a tiny model that you can run on a CPU, highlighting how calibrated compression can improve on data-free approaches.
  • A data science pipeline that extends the same workflow to a larger Llama 3.2 model, highlighting how you can build automated, GPU-accelerated experiments that can be shared with other stakeholders from a single URL.

Both are available in the Red Hat AI Examples repository.

Important

The OpenShift AI integration of LLM Compressor is a Developer Preview feature.

Legal Notice

Copyright © 2025 Red Hat, Inc.
The text of and illustrations in this document are licensed by Red Hat under a Creative Commons Attribution–Share Alike 3.0 Unported license ("CC-BY-SA"). An explanation of CC-BY-SA is available at http://creativecommons.org/licenses/by-sa/3.0/. In accordance with CC-BY-SA, if you distribute this document or an adaptation of it, you must provide the URL for the original version.
Red Hat, as the licensor of this document, waives the right to enforce, and agrees not to assert, Section 4d of CC-BY-SA to the fullest extent permitted by applicable law.
Red Hat, Red Hat Enterprise Linux, the Shadowman logo, the Red Hat logo, JBoss, OpenShift, Fedora, the Infinity logo, and RHCE are trademarks of Red Hat, Inc., registered in the United States and other countries.
Linux® is the registered trademark of Linus Torvalds in the United States and other countries.
Java® is a registered trademark of Oracle and/or its affiliates.
XFS® is a trademark of Silicon Graphics International Corp. or its subsidiaries in the United States and/or other countries.
MySQL® is a registered trademark of MySQL AB in the United States, the European Union and other countries.
Node.js® is an official trademark of Joyent. Red Hat is not formally related to or endorsed by the official Joyent Node.js open source or commercial project.
The OpenStack® Word Mark and OpenStack logo are either registered trademarks/service marks or trademarks/service marks of the OpenStack Foundation, in the United States and other countries and are used with the OpenStack Foundation's permission. We are not affiliated with, endorsed or sponsored by the OpenStack Foundation, or the OpenStack community.
All other trademarks are the property of their respective owners.