Chapter 1. About AI Inference


AI Inference builds on the open source vLLM project, adding enterprise-grade stability and security to its state-of-the-art inferencing features.

AI Inference uses continuous batching to process requests as they arrive instead of waiting for a full batch to accumulate. It also uses tensor parallelism to distribute LLM workloads across multiple GPUs. Together, these features reduce latency and increase throughput.
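
Both features are exposed through vLLM, which AI Inference packages. As a minimal sketch, the following Python example uses vLLM's offline API; the model name and GPU count are illustrative and depend on your environment. Continuous batching is handled internally by the engine, so it needs no configuration here.

    # Minimal sketch using vLLM's offline Python API.
    # The model name and tensor_parallel_size are illustrative.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        tensor_parallel_size=2,  # shard the model weights across 2 GPUs
    )

    # The engine schedules requests per iteration (continuous batching),
    # so new prompts join in-flight work instead of waiting for a full batch.
    sampling = SamplingParams(temperature=0.8, max_tokens=128)
    outputs = llm.generate(["What is paged attention?"], sampling)
    for output in outputs:
        print(output.outputs[0].text)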

To reduce the cost of inferencing models, AI Inference uses paged attention. LLMs use a mechanism called attention to understand conversations with users, and its key-value cache consumes a significant amount of GPU memory. When that cache is reserved as one contiguous region per request, much of the memory is wasted. Paged attention addresses this waste by allocating cache memory in small, fixed-size blocks on demand, similar to the way that virtual memory pages work in operating systems. This approach consumes less memory and lowers costs.
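
As a conceptual sketch only, not vLLM's actual implementation, the following Python example illustrates the block-table idea behind paged attention: cache memory is handed out in small fixed-size blocks on demand, so at most one partially filled block is wasted per request.

    # Toy model of paged KV-cache allocation. All names and sizes are
    # illustrative; a real engine manages GPU memory, not Python lists.
    BLOCK_SIZE = 16  # tokens per block

    class PagedKVCache:
        def __init__(self, num_blocks):
            self.free_blocks = list(range(num_blocks))  # pool of physical blocks
            self.block_tables = {}  # request id -> list of physical block ids
            self.seq_lens = {}      # request id -> number of tokens cached

        def append_token(self, seq_id):
            """Reserve cache space for one new token of a request."""
            length = self.seq_lens.get(seq_id, 0)
            if length % BLOCK_SIZE == 0:  # last block is full, or none exists yet
                block = self.free_blocks.pop()  # allocate a block on demand
                self.block_tables.setdefault(seq_id, []).append(block)
            self.seq_lens[seq_id] = length + 1

        def release(self, seq_id):
            """Return a finished request's blocks to the pool for reuse."""
            self.free_blocks.extend(self.block_tables.pop(seq_id, []))
            self.seq_lens.pop(seq_id, None)

    cache = PagedKVCache(num_blocks=1024)
    for _ in range(40):
        cache.append_token("req-1")
    # 40 tokens fit in 3 blocks; a contiguous scheme would have had to
    # reserve space for the maximum possible sequence length up front.
    print(len(cache.block_tables["req-1"]))  # 3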

Important

Red Hat AI Inference is available as a container image from the Red Hat container registry. You can browse available images in the Red Hat Ecosystem Catalog.

To find Red Hat AI Inference container images in the Red Hat Ecosystem Catalog, search for "AI Inference".

To verify cost savings and performance gains with AI Inference, complete the following procedures:

  1. Serving and inferencing with AI Inference
  2. Validating Red Hat AI Inference benefits using key metrics