Chapter 2. Supported model compression workflows
LLM Compressor supports post-training quantization, a conversion technique that reduces model size and lowers latency on CPUs and hardware accelerators with minimal loss of model accuracy. A streamlined API applies quantization or sparsity based on a calibration data set that you provide.
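For example, a minimal one-shot quantization run might look like the following sketch. The checkpoint name, data set, and parameter values are illustrative, and exact import paths can vary by LLM Compressor version:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# One-shot post-training quantization: calibrate on a provided data set,
# then save the compressed model together with its quantization parameters.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # illustrative checkpoint
    dataset="open_platypus",                      # illustrative calibration data set
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="TinyLlama-1.1B-Chat-v1.0-W4A16",
)
```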
The following advanced model types and deployment workflows are supported:
- Multimodal models: Includes vision-language models
- Mixture of experts (MoE) models: Supports models like DeepSeek and Mixtral
- Large model support: Uses the Hugging Face accelerate library for multi-GPU and CPU offloading, as shown in the loading sketch after this list
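The following sketch shows how a large model can be loaded across multiple GPUs, with overflow offloaded to CPU, before it is passed to the compression API. The checkpoint name is illustrative; `device_map="auto"` is provided by the Hugging Face accelerate library:

```python
from transformers import AutoModelForCausalLM

# Shard the model across available GPUs and offload the remainder to CPU.
# device_map="auto" requires the Hugging Face accelerate library.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B-Instruct",  # illustrative large checkpoint
    device_map="auto",
    torch_dtype="auto",
)
# The loaded model object can then be passed to the compression API
# in place of a model name string.
```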
All workflows are Hugging Face–compatible, enabling models to be quantized, compressed, and deployed with vLLM for efficient inference. LLM Compressor supports several compression algorithms:
- AWQ: Weight-only INT4 quantization
- GPTQ: Weight-only INT4 quantization
- FP8: Dynamic per-token quantization (see the sketch after this list)
- SparseGPT: Post-training sparsity
- SmoothQuant: Activation quantization
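In contrast to the calibration-based methods above, FP8 dynamic quantization requires no calibration data. The following is a minimal sketch; the checkpoint name and output directory are illustrative:

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8 dynamic per-token quantization: weights are quantized statically,
# activation scales are computed at runtime, so no data set is needed.
oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # illustrative checkpoint
    recipe=QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    ),
    output_dir="TinyLlama-1.1B-Chat-v1.0-FP8-Dynamic",
)
```

The saved output directory is a Hugging Face–compatible checkpoint that vLLM can load directly for inference.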
Each of these compression methods computes optimal scales and zero-points for weights and activations. Scales can be optimized per tensor, per channel, per group, or per token. The final result is a compressed model saved together with all of its applied quantization parameters.
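To make the scale and zero-point computation concrete, the following sketch implements a simple asymmetric per-group INT4 scheme. It illustrates the general idea only; it is not LLM Compressor's internal implementation, which further optimizes these values per algorithm:

```python
import torch

def int4_group_quant_params(weights: torch.Tensor, group_size: int = 128):
    """Compute per-group scale and zero-point for asymmetric INT4 quantization.

    Each group of `group_size` weights is mapped onto the integer range
    [0, 15]; the scale and zero-point recover the original value range.
    Assumes weights.numel() is divisible by group_size (illustration only).
    """
    groups = weights.reshape(-1, group_size)
    w_min = groups.min(dim=1, keepdim=True).values
    w_max = groups.max(dim=1, keepdim=True).values
    scale = ((w_max - w_min) / 15.0).clamp(min=1e-8)  # one scale per group
    zero_point = torch.round(-w_min / scale)          # one zero-point per group
    q = torch.clamp(torch.round(groups / scale) + zero_point, 0, 15)
    dequant = (q - zero_point) * scale                # approximate reconstruction
    return scale, zero_point, q, dequant
```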