LLM Compressor
Compressing large language models with the LLM Compressor library
Preface
LLM Compressor is an open source library that incorporates the latest research in model compression, allowing you to generate compressed models with minimal effort.
The LLM Compressor framework leverages the latest quantization, sparsity, and general compression techniques to improve generative AI model efficiency, scalability, and performance while maintaining accuracy. With native Hugging Face and vLLM support, you can seamlessly integrate optimized models with deployment pipelines for faster, cost-saving inference at scale, powered by the compressed-tensors model format.
LLM Compressor is a Developer Preview feature only. Developer Preview features are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Developer Preview features for production or business-critical workloads. Developer Preview features provide early access to upcoming product features in advance of their possible inclusion in a Red Hat product offering, enabling customers to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Developer Preview features without an associated SLA.
Chapter 1. About large language model optimization
As AI applications mature and new compression algorithms are published, there is a need for unified tools that can apply compression algorithms suited to a user's inference needs and optimized to run on accelerated hardware.
Optimizing large language models (LLMs) involves balancing three key factors: model size, inference speed, and accuracy. Improving any one of these factors can have a negative effect on the other factors. For example, increasing model accuracy usually requires more parameters, which results in a larger model and potentially slower inference. The tradeoff between these factors is a core challenge when serving LLMs.
LLM Compressor allows you to apply model optimization techniques such as quantization, sparsity, and compression to reduce memory use and model size and to improve inference speed without affecting the accuracy of model responses. LLM Compressor supports the following compression methodologies:
- Quantization: Converts model weights and activations to lower-bit formats such as int8, reducing memory usage. (See the sketch after this list for the underlying arithmetic.)
- Sparsity: Sets a portion of model weights to zero, often in fixed patterns, allowing for more efficient computation.
- Compression: Shrinks the saved model file size, ideally with minimal impact on performance.
Use these methods together to deploy models more efficiently on resource-limited hardware.
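The following Python sketch illustrates the arithmetic behind the quantization and sparsity methods described above. It is a conceptual example only, not LLM Compressor code; the tensor size, 8-bit range, and 50% sparsity level are arbitrary choices for illustration.

```python
import torch

# Conceptual sketch only: affine INT8 quantization of one weight tensor.
# LLM Compressor computes parameters like these per tensor, channel, group,
# or token as part of its algorithms; this example just shows the arithmetic.
weights = torch.randn(4, 8)

qmin, qmax = -128, 127
scale = (weights.max() - weights.min()) / (qmax - qmin)
zero_point = torch.round(-weights.min() / scale) + qmin

# Quantize to int8, then dequantize to inspect the rounding error.
quantized = torch.clamp(torch.round(weights / scale) + zero_point, qmin, qmax).to(torch.int8)
dequantized = (quantized.float() - zero_point) * scale
print("max quantization error:", (weights - dequantized).abs().max().item())

# Unstructured sparsity: zero out the half of the weights with the smallest magnitude.
threshold = weights.abs().flatten().kthvalue(weights.numel() // 2).values
sparse_weights = torch.where(weights.abs() > threshold, weights, torch.zeros_like(weights))
print("fraction of zero weights:", (sparse_weights == 0).float().mean().item())
```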
Chapter 2. Supported model compression workflows
LLM Compressor supports post-training quantization, a conversion technique that reduces model size and improves latency on CPUs and hardware accelerators, without degrading model accuracy. A streamlined API applies quantization or sparsity based on a data set that you provide.
The following advanced model types and deployment workflows are supported:
- Multimodal models: Includes vision-language models
- Mixture of experts (MoE) models: Supports models like DeepSeek and Mixtral
- Large model support: Uses the Hugging Face accelerate library for multi-GPU and CPU offloading
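For large models, the compression pass typically starts from a standard Hugging Face checkpoint. The following sketch shows how a model can be loaded with Accelerate-style device mapping before compression; the model ID is a placeholder, and the exact loading options you need depend on your hardware.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID; substitute the checkpoint you want to compress.
model_id = "meta-llama/Llama-3.1-70B-Instruct"

# With device_map="auto", Hugging Face Accelerate shards layers across the
# available GPUs and offloads the remainder to CPU memory.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```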
All workflows are Hugging Face–compatible, enabling models to be quantized, compressed, and deployed with vLLM for efficient inference. LLM Compressor supports several compression algorithms:
- AWQ: Weight-only INT4 quantization
- GPTQ: Weight-only INT4 quantization
- FP8: Dynamic per-token quantization
- SparseGPT: Post-training sparsity
- SmoothQuant: Activation quantization
Each of these compression methods computes optimal scales and zero-points for weights and activations. Optimized scales can be per tensor, channel, group, or token. The final result is a compressed model saved with all its applied quantization parameters.
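The following sketch, adapted from the upstream llm-compressor examples, shows a calibrated weight-only INT4 (W4A16) GPTQ run against a small calibration data set. The model ID, dataset name, and calibration settings are illustrative, and the exact import paths and recipe options can differ between LLM Compressor releases, so treat this as an outline rather than a definitive invocation.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Quantize the Linear layers to 4-bit weights (activations stay at 16-bit),
# leaving the output head in full precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # illustrative model ID
    dataset="open_platypus",                     # illustrative calibration dataset
    recipe=recipe,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The output directory contains the compressed weights together with the recorded quantization parameters, ready to load with Hugging Face or vLLM.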
Chapter 3. Integration with Red Hat AI Inference Server and vLLM
Quantized and sparse models that you create with LLM Compressor are saved using the compressed-tensors library (an extension of Safetensors). The compression format matches the model’s quantization or sparsity type. These formats are natively supported in vLLM, enabling fast inference through optimized deployment kernels by using Red Hat AI Inference Server or other inference providers.
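As a sketch of the deployment side, the following example loads a compressed-tensors checkpoint with the vLLM Python API. The model path is a placeholder for a directory or Hugging Face repository produced by LLM Compressor; the same checkpoint can also be served over HTTP with the vllm serve command.

```python
from vllm import LLM, SamplingParams

# Placeholder path to a compressed-tensors checkpoint produced by LLM Compressor.
# vLLM reads the quantization configuration from the model files and selects
# matching optimized kernels automatically.
llm = LLM(model="TinyLlama-1.1B-Chat-v1.0-W4A16")

outputs = llm.generate(
    ["What is model quantization?"],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```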
Chapter 4. Integration with Red Hat OpenShift AI
You can use Red Hat OpenShift AI and LLM Compressor to experiment with model training, fine-tuning, and compression. The OpenShift AI integration of LLM Compressor provides two introductory examples:
- A workbench image and notebook that demonstrate the compression of a tiny model that you can run on a CPU, highlighting how calibrated compression can improve over data-free approaches.
- A data science pipeline that extends the same workflow to a larger Llama 3.2 model, highlighting how users can build automated, GPU-accelerated experiments that can be shared with other stakeholders from a single URL.
Both are available in the Red Hat AI Examples repository.
The OpenShift AI integration of LLM Compressor is a Developer Preview feature.