Chapter 2. Large language model optimization
As AI applications mature and new compression algorithms are published, there is a growing need for unified tools that can apply various compression algorithms suited to a user's inference needs and optimized to run on accelerated hardware.
Optimizing large language models (LLMs) involves balancing three key factors: model size, inference speed, and accuracy. Improving any one of these factors can have a negative effect on the others. For example, increasing model accuracy usually requires more parameters, which results in a larger model and potentially slower inference. The tradeoff between these factors is a core challenge when serving LLMs.
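To make the size side of this tradeoff concrete, the following back-of-the-envelope sketch computes the weight memory of a hypothetical 7-billion-parameter model at several numeric precisions. The parameter count and formats are illustrative assumptions, not figures from this document, and real serving memory also includes activations and the KV cache.

```python
# Illustrative only: weight memory for a hypothetical 7B-parameter model.
PARAMS = 7_000_000_000
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

weight_gb = {fmt: PARAMS * b / 1e9 for fmt, b in BYTES_PER_PARAM.items()}
for fmt, gb in weight_gb.items():
    print(f"{fmt}: {gb:.1f} GB")  # fp32: 28.0 GB ... int4: 3.5 GB
```

At these assumed sizes, int8 quantization alone cuts weight memory by 4x relative to fp32, which is why lower-bit formats are central to the techniques described below.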
Red Hat AI Model Optimization Toolkit allows you to apply model optimization techniques such as quantization, sparsity, and compression to reduce memory use and model size and to improve inference speed, with minimal impact on the accuracy of model responses. The following compression methodologies are supported by Red Hat AI Model Optimization Toolkit:
- Quantization: Converts model weights and activations to lower-bit formats, such as int8, reducing memory usage.
- Sparsity: Sets a portion of model weights to zero, often in fixed patterns, allowing for more efficient computation.
- Compression: Shrinks the saved model file size, ideally with minimal impact on performance.
Use these methods together to deploy models more efficiently on resource-limited hardware.
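As a sketch of how two of these methods combine, the snippet below applies unstructured magnitude pruning (a simple form of sparsity) and then int8 quantization to a toy weight matrix. The helper names, the 50% sparsity target, and the toy data are illustrative assumptions; production tools typically use structured patterns (for example 2:4) and calibration data.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Unstructured magnitude pruning: zero the smallest-magnitude weights."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.partition(np.abs(w), k - 1, axis=None)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization."""
    scale = np.abs(w).max() / 127.0
    return np.clip(np.round(w / scale), -127, 127).astype(np.int8), scale

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8)).astype(np.float32)
w_sparse = magnitude_prune(w, 0.5)   # half the weights become exactly zero
q, scale = quantize_int8(w_sparse)   # zeros survive quantization as int8 zeros

# The resulting tensor is both 4x smaller per element and at least 50% sparse.
print((w_sparse == 0).mean())  # 0.5
```

Pruning first and quantizing second preserves the zero pattern, so a sparse-aware kernel can skip the zeroed weights while still reading compact int8 values.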