Chapter 1. About AI Inference
Red Hat AI Inference provides enterprise-grade stability and security for serving large language models across hybrid cloud and edge environments. Built on the open source vLLM project, AI Inference delivers optimized standalone inference and Kubernetes-native distributed inference to meet the demands of production AI workloads.
The Red Hat AI Inference standalone container uses continuous batching to process requests as they arrive instead of waiting for a full batch to accumulate. It also uses tensor parallelism to distribute LLM workloads across multiple GPUs. Together, these features reduce latency and increase throughput.
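The scheduling difference behind continuous batching can be illustrated with a toy simulation. The sketch below is not the vLLM scheduler. The `Request` class, both functions, and the "one token per step" cost model are assumptions made purely for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival_step: int   # step at which the request becomes available
    tokens_needed: int  # number of decode steps the request requires


def continuous_batching(requests, max_batch_size):
    """Return {name: finish_step} when free slots are refilled before every step."""
    pending = deque(sorted(requests, key=lambda r: r.arrival_step))
    running = {}   # name -> tokens still to generate
    finished = {}
    step = 0
    while pending or running:
        # Admit requests that have arrived into any free slots.
        while (pending and len(running) < max_batch_size
               and pending[0].arrival_step <= step):
            req = pending.popleft()
            running[req.name] = req.tokens_needed
        # One decode step: every running request produces one token.
        step += 1
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                finished[name] = step
                del running[name]
    return finished


def static_batching(requests, batch_size):
    """Return {name: finish_step} when a batch must fill up and then fully drain."""
    pending = sorted(requests, key=lambda r: r.arrival_step)
    finished = {}
    step = 0
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        # The batch cannot start until its last member has arrived.
        start = max(step, max(r.arrival_step for r in batch))
        for r in batch:
            finished[r.name] = start + r.tokens_needed
        # Slots stay occupied until the longest request in the batch is done.
        step = start + max(r.tokens_needed for r in batch)
    return finished


requests = [
    Request("a", 0, 2),
    Request("b", 0, 6),
    Request("c", 1, 2),
    Request("d", 2, 2),
]
print("continuous:", continuous_batching(requests, max_batch_size=2))
print("static:    ", static_batching(requests, batch_size=2))
```

In this toy workload, the short request `c` finishes at step 4 under continuous batching, because it takes over the slot that `a` frees, and at step 8 under static batching, because it must wait for the long request `b` to drain first.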
To reduce the cost of serving models, AI Inference uses paged attention. LLMs use a mechanism called attention to relate each new token to the earlier tokens in a conversation. Conventional implementations reserve a large contiguous region of memory for each request's attention state, and much of that memory goes unused. Paged attention addresses this waste by allocating the memory in small blocks on demand, similar to the way that virtual memory works in operating systems. This approach consumes less memory and lowers costs.
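The virtual-memory analogy can be made concrete with a small block allocator. This is a minimal sketch of the idea only, not the vLLM implementation. The `PagedKVCache` class, its method names, and the block size of 16 tokens are hypothetical choices for illustration.

```python
class PagedKVCache:
    """Toy allocator: each sequence maps its logical blocks to physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # A new block is needed when the sequence has none yet or its last one is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def blocks_in_use(self):
        return sum(len(table) for table in self.block_tables.values())


cache = PagedKVCache(num_blocks=8, block_size=16)
for _ in range(20):
    cache.append_token("s1")   # 20 tokens occupy 2 blocks
for _ in range(5):
    cache.append_token("s2")   # 5 tokens occupy 1 block
print("blocks in use:", cache.blocks_in_use())
```

Here two sequences holding 25 tokens in total occupy three 16-token blocks. A scheme that reserved space for a fixed maximum length per sequence up front would hold far more memory idle, and the blocks of a finished sequence can be reused immediately after `free` is called.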
Distributed Inference with llm-d extends AI Inference with Kubernetes-native distributed inference for serving large language models at scale. Distributed Inference with llm-d provides intelligent inference scheduling with prefix-cache-aware routing, KV cache management, and support for advanced deployment patterns such as prefill-decode disaggregation.
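To show what prefix-cache-aware routing means in principle, the following sketch routes each request to the replica that already holds the longest matching prompt prefix and falls back to the least-loaded replica otherwise. It is an assumption-laden illustration, not the llm-d scheduler. The `Replica` class, the `route` function, and the block size of 4 tokens are hypothetical.

```python
def prefix_keys(tokens, block_size):
    """Return one key per full block, each covering the prompt up to that block."""
    return [tuple(tokens[:end])
            for end in range(block_size, len(tokens) + 1, block_size)]


class Replica:
    def __init__(self, name):
        self.name = name
        self.cached = set()   # prefix keys assumed to be in this replica's KV cache
        self.load = 0         # number of requests routed here so far


def route(replicas, tokens, block_size=4):
    """Pick a replica for the prompt and record the prefixes it will now cache."""
    keys = prefix_keys(tokens, block_size)

    def matched_blocks(replica):
        count = 0
        for key in keys:
            if key not in replica.cached:
                break
            count += 1
        return count

    # Prefer the longest cached prefix; break ties by the lowest load.
    best = max(replicas, key=lambda r: (matched_blocks(r), -r.load))
    best.cached.update(keys)
    best.load += 1
    return best.name


replicas = [Replica("a"), Replica("b")]
system_prompt = list(range(8))   # token ids shared by several requests
print(route(replicas, system_prompt + [100, 101, 102, 103]))
print(route(replicas, system_prompt + [200, 201, 202, 203]))
print(route(replicas, list(range(900, 912))))
```

The second request shares its first eight tokens with the first, so it goes to the same replica and can reuse the cached prefix. The third request shares nothing, so it goes to the idle replica instead.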
Red Hat AI Inference supports a wide range of AI accelerators, including NVIDIA CUDA GPUs, AMD Instinct GPUs, Intel Xeon and AMD EPYC CPUs, Google TPUs, AWS Inferentia, Intel Gaudi, and IBM Spyre accelerators.
Red Hat AI Inference is available as a container image from the Red Hat container registry. To browse the available images, search for "AI Inference" in the Red Hat Ecosystem Catalog.
Additional resources
- vLLM project
- Red Hat Ecosystem Catalog
- Serving and inferencing with AI Inference
- Validating Red Hat AI Inference benefits using key metrics