Chapter 4. Integration with Red Hat AI Inference Server and vLLM
Quantized and sparse models that you create with Red Hat AI Model Optimization Toolkit are saved using the compressed-tensors library (an extension of Safetensors). The compression format matches the model's quantization or sparsity type. These formats are natively supported in vLLM, which enables fast inference with optimized kernels when you deploy the model with Red Hat AI Inference Server or other vLLM-based inference providers.
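The following is a minimal sketch of loading a compressed-tensors checkpoint for offline inference with vLLM's Python API. The model ID shown is hypothetical; substitute the path or Hugging Face ID of a model you optimized. No quantization flag is needed because vLLM reads the compression scheme from the checkpoint's configuration:

```python
from vllm import LLM, SamplingParams

# vLLM detects the compressed-tensors format from the checkpoint config
# and selects the matching optimized inference kernels automatically.
llm = LLM(model="org/My-Model-W4A16")  # hypothetical model ID

params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["Explain quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also be served as an OpenAI-compatible endpoint, for example with `vllm serve org/My-Model-W4A16`, which is the deployment path that Red Hat AI Inference Server builds on.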