Chapter 1. Enterprise-grade inference serving
The Distributed Inference with llm-d framework provides enterprise-grade large language model (LLM) inference serving on Openshift Container Platform and managed Kubernetes clusters on public clouds such as Azure Kubernetes Service (AKS) and CoreWeave Kubernetes Service (CKS).
Enterprise platform engineering and infrastructure teams can use Distributed Inference with llm-d to build generative AI model services for internal and external use cases. Cloud service providers can also use it to build Models-as-a-Service (MaaS) offerings. Common use cases include:
- Enterprise-wide Models-as-a-Service (MaaS) for generative AI
- A central platform team provides generative AI and LLM capabilities as a managed service to business units across the organization. Rather than each team provisioning its own inference infrastructure, the platform team uses Distributed Inference with llm-d to offer standardized model serving with consistent performance, cost control, and security.
- Production-ready inference at scale
- An organization deploys a generative AI application in a limited production environment, such as A/B testing with a small user group or a soft launch. The deployment must be production-ready with reliable performance and security, while remaining flexible enough to scale to provider-grade inference as demand grows.