Chapter 1. About model serving
When you serve a model, you upload a trained model into Red Hat OpenShift AI for querying, which allows you to integrate your trained models into intelligent applications.
You can upload a model to S3-compatible object storage, a persistent volume claim, or an Open Container Initiative (OCI) image. You can then access and train the model from your project workbench. After training the model, you can serve or deploy the model using a model-serving platform.
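For example, the following is a minimal sketch of uploading a trained model artifact to S3-compatible object storage with the boto3 library. The endpoint URL, credentials, bucket, and object path are placeholders; in practice, these values typically come from a data connection configured in your project.

```python
# Minimal sketch: upload a trained model artifact to S3-compatible object storage.
# The endpoint, credentials, bucket, and key below are placeholders for illustration.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.com",      # placeholder S3-compatible endpoint
    aws_access_key_id="YOUR_ACCESS_KEY",        # placeholder credentials
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# Upload the serialized model under a prefix that a model server can later point to.
s3.upload_file(
    Filename="model.onnx",                      # local model artifact
    Bucket="models",                            # placeholder bucket name
    Key="my-model/1/model.onnx",                # placeholder object path
)
```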
Serving or deploying the model makes the model available as a service, or model runtime server, that you can access using an API. You can then access the inference endpoints for the deployed model from the dashboard and see predictions based on data inputs that you provide through API calls. Querying the model through the API is also called model inferencing.
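As an illustration, the following is a minimal sketch of querying the inference endpoint of a deployed model over REST with the Python requests library. The endpoint URL, model name, token, and input tensor are placeholders, and the exact request format depends on the serving runtime that hosts the model; the payload shown assumes a runtime that implements the Open Inference Protocol (v2).

```python
# Minimal sketch: query a deployed model's inference endpoint over REST.
# The endpoint URL, model name, token, and input data are placeholders.
import requests

INFER_URL = "https://my-model-my-project.apps.example.com/v2/models/my-model/infer"  # placeholder

payload = {
    "inputs": [
        {
            "name": "input-0",                  # placeholder input tensor name
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [6.8, 2.8, 4.8, 1.4],
        }
    ]
}

response = requests.post(
    INFER_URL,
    json=payload,
    headers={"Authorization": "Bearer <token>"},  # needed if token authentication is enabled
    timeout=30,
)
print(response.json())  # predictions returned by the inference service
```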
You can serve models on one of the following model-serving platforms:
- Single-model serving platform
- Multi-model serving platform
- NVIDIA NIM model serving platform
The model-serving platform that you choose depends on your business needs:
- If you want to deploy each model on its own runtime server, or want to use a serverless deployment, select the single-model serving platform. The single-model serving platform is recommended for production use.
- If you want to deploy multiple models with only one runtime server, select the multi-model serving platform. This option is best if you are deploying more than 1,000 small and medium models and want to reduce resource consumption.
- If you want to use NVIDIA Inference Microservices (NIM) to deploy a model, select the NVIDIA NIM model serving platform.
1.1. Single-model serving platform
You can deploy each model from a dedicated model server on the single-model serving platform. Deploying models from a dedicated model server can help you deploy, monitor, scale, and maintain models that require increased resources. This model serving platform is ideal for serving large models. The single-model serving platform is based on the KServe component.
The single-model serving platform is helpful for use cases such as:
- Large language models (LLMs)
- Generative AI
For more information about setting up the single-model serving platform, see Installing the single-model serving platform.
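To illustrate what a deployment on the single-model serving platform looks like at the Kubernetes level, the following is a minimal sketch that creates a KServe InferenceService custom resource with the Python Kubernetes client. The names, namespace, model format, runtime, and storage URI are placeholders; in practice, the OpenShift AI dashboard creates an equivalent resource for you when you deploy a model.

```python
# Minimal sketch: create a KServe InferenceService, the custom resource that backs a
# deployment on the single-model serving platform. All names and URIs are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in the cluster
api = client.CustomObjectsApi()

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "my-model", "namespace": "my-project"},
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "onnx"},            # placeholder model format
                "runtime": "my-serving-runtime",            # placeholder ServingRuntime name
                "storageUri": "s3://models/my-model/1/",    # placeholder model location
            }
        }
    },
}

api.create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="my-project",
    plural="inferenceservices",
    body=inference_service,
)
```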
1.2. Multi-model serving platform
You can deploy multiple models from the same model server on the multi-model serving platform. Each of the deployed models shares the server resources. Deploying multiple models from the same model server can be advantageous on OpenShift clusters that have finite compute resources or pods. This model serving platform is ideal for serving small and medium models in large quantities. The multi-model serving platform is based on the ModelMesh component.
For more information about setting up the multi-model serving platform, see Installing the multi-model serving platform.
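For comparison with the single-model sketch above, the following is a minimal sketch of an InferenceService definition that targets ModelMesh. The deploymentMode annotation routes the model onto a shared model server; the names, namespace, model format, and storage URI are placeholders, and the resource can be created the same way as in the previous sketch.

```python
# Minimal sketch: with ModelMesh, each model is still described by an InferenceService,
# but the deploymentMode annotation places it on a shared model server so that many
# models share the same pods. All names and URIs below are placeholders.
inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {
        "name": "small-model-42",
        "namespace": "my-project",
        "annotations": {"serving.kserve.io/deploymentMode": "ModelMesh"},
    },
    "spec": {
        "predictor": {
            "model": {
                "modelFormat": {"name": "sklearn"},              # placeholder model format
                "storageUri": "s3://models/small-model-42/",     # placeholder model location
            }
        }
    },
}
# Creating many such resources reuses the same shared model server, which is how the
# multi-model serving platform reduces resource consumption.
```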
1.3. NVIDIA NIM model serving platform
You can deploy models using NVIDIA Inference Microservices (NIM) on the NVIDIA NIM model serving platform.
NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of microservices designed for secure, reliable deployment of high-performance AI model inferencing across clouds, data centers, and workstations.
NVIDIA NIM inference services are helpful for use cases such as:
- Using GPU-accelerated containers to run inference on models optimized by NVIDIA
- Deploying generative AI for virtual screening, content generation, and avatar creation
The NVIDIA NIM model serving platform is based on the single-model serving platform. To use the NVIDIA NIM model serving platform, you must first install the single-model serving platform. For more information, see Installing the single-model serving platform.
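As an illustration, the following is a minimal sketch of querying a large language model deployed on the NVIDIA NIM model serving platform. LLM NIM containers expose an OpenAI-compatible REST API; the endpoint URL, model identifier, and token shown here are placeholders.

```python
# Minimal sketch: query an LLM deployed with NVIDIA NIM through its OpenAI-compatible API.
# The endpoint URL, model identifier, and token are placeholders.
import requests

NIM_URL = "https://my-nim-model-my-project.apps.example.com/v1/chat/completions"  # placeholder

payload = {
    "model": "meta/llama-3.1-8b-instruct",   # placeholder NIM model identifier
    "messages": [{"role": "user", "content": "Summarize what model serving means."}],
    "max_tokens": 128,
}

response = requests.post(
    NIM_URL,
    json=payload,
    headers={"Authorization": "Bearer <token>"},  # needed if token authentication is enabled
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])
```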