Chapter 1. About large language model optimization
As AI applications mature and new compression algorithms are published, there is a growing need for unified tools that can apply the compression algorithms suited to a user's inference needs and that are optimized to run on accelerated hardware.
Optimizing large language models (LLMs) involves balancing three key factors: model size, inference speed, and accuracy. Improving any one of these factors can have a negative effect on the other factors. For example, increasing model accuracy usually requires more parameters, which results in a larger model and potentially slower inference. The tradeoff between these factors is a core challenge when serving LLMs.
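The link between parameter count, numeric precision, and model size can be made concrete with some back-of-the-envelope arithmetic. The following is a minimal sketch, not part of LLM Compressor: the helper name `weight_memory_gb` and the 7-billion-parameter model are hypothetical, and the estimate covers weight storage only (no activations, KV cache, or runtime overhead).

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 10**9 bytes).

    Hypothetical helper for illustration; counts weights only.
    """
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 7-billion-parameter model (an assumption, not from the text).
params = 7_000_000_000

for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions, halving the number of bits per weight halves the weight storage, which is why lower-bit formats are attractive when memory is the limiting factor.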
LLM Compressor allows you to apply model optimization techniques such as quantization, sparsity, and compression to reduce memory use and model size and to improve inference performance, with minimal impact on the accuracy of model responses. The following compression methodologies are supported by LLM Compressor:
- Quantization
- Converts model weights and activations to lower-bit formats such as int8, reducing memory usage.
- Sparsity
- Sets a portion of model weights to zero, often in fixed patterns, allowing for more efficient computation.
- Compression
- Shrinks the saved model file size, ideally with minimal impact on performance.
Use these methods together to deploy models more efficiently on resource-limited hardware.
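To illustrate the ideas behind quantization and sparsity, the following is a minimal pure-Python sketch. It does not use the LLM Compressor API. The function names, the symmetric per-tensor int8 scheme, the 2-of-4 pruning pattern, and the sample weights are all assumptions chosen for illustration; real tools typically operate on tensors and use more sophisticated calibration.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to integers in [-127, 127].

    Illustrative scheme (an assumption): one scale for the whole list,
    derived from the largest absolute value.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map int8 values back to approximate floating-point weights."""
    return [v * scale for v in q]


def prune_2_of_4(weights):
    """Zero the two smallest-magnitude weights in each group of four.

    This is one example of a fixed sparsity pattern; a trailing group
    with fewer than four elements is left unchanged.
    """
    pruned = list(weights)
    for start in range(0, len(pruned) - len(pruned) % 4, 4):
        group = range(start, start + 4)
        smallest = sorted(group, key=lambda i: abs(pruned[i]))[:2]
        for i in smallest:
            pruned[i] = 0.0
    return pruned


# Made-up sample weights for demonstration only.
weights = [0.50, -1.27, 0.03, 0.80, -0.10, 0.25, 0.90, -0.40]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

sparse = prune_2_of_4(weights)

print("int8 values:", q)
print("max round-trip error:", max_error)
print("2-of-4 sparse:", sparse)
```

In this sketch, each quantized weight needs one byte instead of two or four, at the cost of a rounding error bounded by half the scale, and the pruned list has exactly half of its entries set to zero in a regular pattern that hardware can exploit.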