Chapter 2. Supported model compression workflows
LLM Compressor supports post-training quantization, a conversion technique that reduces model size and lowers latency on CPUs and hardware accelerators with minimal loss of model accuracy. A streamlined API applies quantization or sparsity based on a calibration data set that you provide.
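For example, a minimal one-shot quantization run might look like the following sketch. The checkpoint name, data set, and parameter values are illustrative, and exact import paths can vary by LLM Compressor version:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# One-shot post-training quantization: calibrate on a provided data set,
# then save the compressed model together with its quantization parameters.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # illustrative checkpoint
    dataset="open_platypus",                      # illustrative calibration data set
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
)
```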
The following advanced model types and deployment workflows are supported:
- Multimodal models: Includes vision-language models
- Mixture of experts (MoE) models: Supports models like DeepSeek and Mixtral
- Large model support: Uses the Hugging Face accelerate library for multi-GPU and CPU offloading, as shown in the loading sketch after this list
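The following sketch shows how a large model can be loaded across multiple GPUs, with overflow offloaded to CPU, before it is passed to the compression API. The checkpoint name is illustrative; `device_map="auto"` is provided by the Hugging Face accelerate library:

```python
from transformers import AutoModelForCausalLM

# Shard the model across available GPUs and offload the remainder to CPU.
# device_map="auto" requires the Hugging Face accelerate library.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",  # illustrative large checkpoint
    device_map="auto",
    torch_dtype="auto",
)
# The loaded model object can then be passed to the compression API
# in place of a model name string.
```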
All workflows are Hugging Face–compatible, enabling models to be quantized, compressed, and deployed with vLLM for efficient inference. LLM Compressor supports several compression algorithms:
- AWQ: Weight-only INT4 quantization
- GPTQ: Weight-only INT4 quantization
- FP8: Dynamic per-token quantization (see the sketch after this list)
- SparseGPT: Post-training sparsity
- SmoothQuant: Activation quantization
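In contrast to the calibration-based methods above, FP8 dynamic quantization requires no calibration data. The following is a minimal sketch; the checkpoint name and output directory are illustrative:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8 dynamic per-token quantization: weights are quantized statically,
# activation scales are computed at runtime, so no data set is needed.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # illustrative checkpoint
    recipe=QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    ),
    output_dir="TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic",
)
```

The saved output directory is a Hugging Face–compatible checkpoint that vLLM can load directly for inference.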
Each of these compression methods computes optimal scales and zero-points for weights and activations. Scales can be optimized per tensor, per channel, per group, or per token. The final result is a compressed model saved together with all of its applied quantization parameters.
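To make the scale and zero-point computation concrete, the following sketch implements a simple asymmetric per-group INT4 scheme. It illustrates the general idea only; it is not LLM Compressor's internal implementation, which further optimizes these values per algorithm:

```python
import torch

def int4_group_quant_params(weights: torch.Tensor, group_size: int = 128):
    """Compute per-group scale and zero-point for asymmetric INT4 quantization.

    Each group of `group_size` weights is mapped onto the integer range
    [0, 15]; the scale and zero-point recover the original value range.
    Assumes weights.numel() is divisible by group_size (illustration only).
    """
    groups = weights.reshape(-1, group_size)
    w_min = groups.min(dim=1, keepdim=True).values
    w_max = groups.max(dim=1, keepdim=True).values
    scale = ((w_max - w_min) / 15.0).clamp(min=1e-8)  # one scale per group
    zero_point = torch.round(-w_min / scale)          # one zero-point per group
    q = torch.clamp(torch.round(groups / scale) + zero_point, 0, 15)
    dequant = (q - zero_point) * scale                # approximate reconstruction
    return scale, zero_point, q, dequant
```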