Chapter 3. Supported model compression workflows
LLM Compressor supports post-training quantization, a conversion technique that reduces model size and improves inference latency on CPUs and hardware accelerators without degrading model accuracy. A streamlined API applies quantization or sparsity based on a calibration data set that you provide.
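As a rough illustration of that streamlined API, the following sketch applies a weight-only INT4 recipe with a one-shot calibration pass. The model name, dataset, and parameter values are placeholders to replace with your own; treat this as a recipe fragment modeled on the upstream LLM Compressor examples, not a definitive invocation. Running it downloads the model and performs calibration, so it is not something to execute casually.

```python
# Hedged usage sketch: one-shot post-training quantization with LLM Compressor.
# Model, dataset, and sample counts below are illustrative placeholders.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Recipe: weight-only INT4 (W4A16) GPTQ on all Linear layers,
# leaving the output head unquantized.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    dataset="open_platypus",                      # placeholder calibration set
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="TinyLlama-1.1B-W4A16",
)
```

The saved output directory contains the compressed weights plus the quantization parameters, in a Hugging Face-compatible layout that vLLM can load directly.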
The following advanced model types and deployment workflows are supported:
- Multimodal models: Includes vision-language models
- Mixture of experts (MoE) models: Supports models such as DeepSeek and Mixtral, with calibration support that includes NVFP4 quantization
- Large model support: Uses the Hugging Face accelerate library for multi-GPU and CPU offloading
- Multiple quantization schemes applied to a single model: Support for non-uniform quantization, such as combining NVFP4 and FP8 quantization
All workflows are Hugging Face–compatible, enabling models to be quantized, compressed, and deployed with vLLM for efficient inference. LLM Compressor supports several compression algorithms:
- AWQ: Weight-only INT4 quantization
- GPTQ: Weight-only INT4 quantization
- FP8: Dynamic per-token quantization and DeepSeek V3-style block quantization
- SparseGPT: Post-training sparsity
- SmoothQuant: Activation quantization
- QuIP transforms: Weight and activation quantization
- SpinQuant transforms: Weight and activation quantization
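To make the FP8 scheme's "dynamic per-token quantization" concrete, the sketch below computes a fresh scale for each token's activation vector at runtime. It is a minimal pure-Python illustration, not LLM Compressor's implementation: real FP8 uses the float8 E4M3 format, which is stood in for here by a symmetric 8-bit integer range.

```python
# Hedged sketch of dynamic per-token quantization: each token gets its own
# scale, computed from that token's activation range at inference time.
QMAX = 127  # stand-in for the FP8 representable maximum

def quantize_per_token(activations):
    """Quantize each token's activation vector with its own dynamic scale.

    activations: list of tokens, each a list of floats.
    Returns (quantized, scales), one scale per token.
    """
    quantized, scales = [], []
    for token in activations:
        amax = max(abs(v) for v in token) or 1.0  # avoid division by zero
        scale = amax / QMAX
        quantized.append([round(v / scale) for v in token])
        scales.append(scale)
    return quantized, scales

def dequantize(quantized, scales):
    """Recover approximate activations from quantized values and scales."""
    return [[q * s for q in token] for token, s in zip(quantized, scales)]

# Two tokens with very different magnitudes: per-token scales keep both
# within the representable range without one token's outliers hurting the other.
acts = [[0.5, -1.0, 0.25], [10.0, -20.0, 5.0]]
q, s = quantize_per_token(acts)
deq = dequantize(q, s)
```

Because the scale is derived per token at runtime, no calibration statistics for activations need to be stored with the model for this part of the scheme.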
Each of these compression methods computes optimal scales and zero-points for weights and activations. Optimized scales can be per tensor, channel, group, or token. The final result is a compressed model saved with all its applied quantization parameters.
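The difference between these granularities can be sketched with a simple min-max scheme. The code below derives a scale and zero-point for asymmetric INT4 weight quantization at two of the granularities mentioned above, per tensor and per group; it is an illustrative toy, not LLM Compressor's optimization code, and the weight values and group size are made up.

```python
# Hedged sketch: min-max scale and zero-point at per-tensor vs per-group
# granularity for asymmetric INT4 quantization. Illustrative values only.

def scale_and_zero_point(values, num_bits=4):
    """Min-max asymmetric quantization parameters for a block of weights."""
    qmin, qmax = 0, (1 << num_bits) - 1          # 0..15 for INT4
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)          # range must include zero
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, num_bits=4):
    qmin, qmax = 0, (1 << num_bits) - 1
    return [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]

weights = [-0.4, 0.1, 0.7, 1.2, -0.9, 0.05, 0.3, 2.0]

# Per tensor: one (scale, zero_point) pair covers the whole tensor.
s, z = scale_and_zero_point(weights)
q_tensor = quantize(weights, s, z)

# Per group: one pair per contiguous group (group size 4 here), which
# tracks local ranges more tightly at the cost of storing more parameters.
group_size = 4
q_groups = []
for i in range(0, len(weights), group_size):
    g = weights[i:i + group_size]
    gs, gz = scale_and_zero_point(g)
    q_groups.append(quantize(g, gs, gz))
```

Finer granularities (per channel, per group, per token) trade extra stored parameters for lower quantization error, which is why the choice of scale granularity is part of each scheme's design.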