このコンテンツは選択した言語では利用できません。

Chapter 4. Customizing model deployments


You can customize a model’s deployment to suit your specific needs, for example, to deploy a particular family of models or to enhance an existing deployment. You can modify the runtime configuration for a specific deployment by setting additional serving runtime arguments and environment variables.

These customizations apply only to the selected model deployment and do not change the default runtime configuration. You can set these parameters when you first deploy a model or by editing an existing deployment.

4.1. Customizing the parameters of a deployed model-serving runtime

You might need additional parameters beyond the default ones to deploy specific models or to enhance an existing model deployment. In such cases, you can modify the parameters of an existing runtime to suit your deployment needs.

Note

Customizing the parameters of a runtime only affects the selected model deployment.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
  • You have deployed a model.

Procedure

  1. From the OpenShift AI dashboard, click AI hub Deployments.

    The Deployments page opens.

  2. Click Stop next to the name of the model you want to customize.
  3. Click the action menu (⋮) and select Edit.

    The Configuration parameters section shows predefined serving runtime parameters, if any are available.

  4. Customize the runtime parameters in the Configuration parameters section:

    1. Modify the values in Additional serving runtime arguments to define how the deployed model behaves.
    2. Modify the values in Additional environment variables to define variables in the model’s environment.

      Note

      Do not modify the port or model serving runtime arguments, because they require specific values to be set. Overwriting these parameters can cause the deployment to fail.

      Note

      Set VLLM_CPU_KVCACHE_SPACE to define the KV cache size for vLLM. For example, VLLM_CPU_KVCACHE_SPACE=40 allocates 40 GiB of memory to the KV cache. Increase this value to enable vLLM to handle more parallel requests. Choose a value that matches your hardware capacity and memory management requirements. The default is 0. When set to 0, vLLM does not reserve dedicated KV cache memory and instead allocates from available system memory at runtime, which can result in out-of-memory errors.

  5. After you are done customizing the runtime parameters, click Redeploy to save.
  6. Click Start to deploy the model with your changes.

Verification

  • Confirm that the deployed model is shown on the Deployments tab for the project, and on the Deployments page of the dashboard with a checkmark in the Status column.
  • Confirm that the arguments and variables that you set appear in spec.predictor.model.args and spec.predictor.model.env by one of the following methods:

    • Checking the InferenceService YAML from the OpenShift Console.
    • Using the following command in the OpenShift CLI:

      oc get -o json inferenceservice <inferenceservicename/modelname> -n <projectname>

4.2. Customizable model serving runtime parameters

You can modify the parameters of an existing model serving runtime to suit your deployment needs.

For more information about parameters for each of the supported serving runtimes, see the following table:

Expand
Serving runtimeResource

NVIDIA Triton Inference Server

NVIDIA Triton Inference Server: Model Parameters

OpenVINO Model Server

OpenVINO Model Server Features: Dynamic Input Parameters

vLLM NVIDIA GPU ServingRuntime for KServe

vLLM: Engine Arguments
OpenAI-Compatible Server

vLLM AMD GPU ServingRuntime for KServe

vLLM: Engine Arguments
OpenAI-Compatible Server

vLLM Intel Gaudi Accelerator ServingRuntime for KServe

vLLM: Engine Arguments
OpenAI-Compatible Server

vLLM Spyre ppc64le ServingRuntime for KServe

Recommended model inference settings for IBM Power with IBM Spyre AI accelerators

4.3. Customizing the vLLM model-serving runtime

In certain cases, you may need to add additional flags or environment variables to the vLLM ServingRuntime for KServe runtime to deploy a family of LLMs.

The following procedure describes customizing the vLLM model-serving runtime to deploy a Llama, Granite or Mistral model.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
  • For Llama model deployment, you have downloaded a meta-llama-3 model to your object storage.
  • For Granite model deployment, you have downloaded a granite-7b-instruct or granite-20B-code-instruct model to your object storage.
  • For Mistral model deployment, you have downloaded a mistral-7B-Instruct-v0.3 model to your object storage.
  • You have enabled the vLLM ServingRuntime for KServe runtime.
  • You have enabled GPU support in OpenShift AI and have installed and configured the Node Feature Discovery Operator on your cluster. For more information, see Installing the Node Feature Discovery Operator and Enabling NVIDIA GPUs

Procedure

  1. Follow the steps to deploy a model as described in Deploying models on the model serving platform.
  2. In the Serving runtime field, select vLLM ServingRuntime for KServe.
  3. If you are deploying a meta-llama-3 model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    –-distributed-executor-backend=mp 
    1
    
    --max-model-len=6144 
    2
    1
    Sets the backend to multiprocessing for distributed model workers
    2
    Sets the maximum context length of the model to 6144 tokens
  4. If you are deploying a granite-7B-instruct model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --distributed-executor-backend=mp 
    1
    1
    Sets the backend to multiprocessing for distributed model workers
  5. If you are deploying a granite-20B-code-instruct model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --distributed-executor-backend=mp 
    1
    
    –-tensor-parallel-size=4 
    2
    
    --max-model-len=6448 
    3
    1
    Sets the backend to multiprocessing for distributed model workers
    2
    Distributes inference across 4 GPUs in a single node
    3
    Sets the maximum context length of the model to 6448 tokens
  6. If you are deploying a mistral-7B-Instruct-v0.3 model, add the following arguments under Additional serving runtime arguments in the Configuration parameters section:

    --distributed-executor-backend=mp 
    1
    
    --max-model-len=15344 
    2
    1
    Sets the backend to multiprocessing for distributed model workers
    2
    Sets the maximum context length of the model to 15344 tokens
  7. Click Deploy.

Verification

  • Confirm that the deployed model is shown on the Deployments tab for the project, and on the Deployments page of the dashboard with a checkmark in the Status column.
  • For granite models, use the following example command to verify API requests to your deployed model:

    curl -q -X 'POST' \
        "https://<inference_endpoint_url>:443/v1/chat/completions" \
        -H 'accept: application/json' \
        -H 'Content-Type: application/json' \
        -d "{
        \"model\": \"<model_name>\",
        \"prompt\": \"<prompt>",
        \"max_tokens\": <max_tokens>,
        \"temperature\": <temperature>
        }"

4.4. Setting a default cluster-wide deployment strategy

You can set a default deployment strategy for new model deployments across the cluster.

Prerequisites

  • You have logged in to OpenShift AI as a user with OpenShift AI administrator privileges.
  • You have enabled model serving on your cluster.

Procedure

  1. In the dashboard, navigate to Settings Cluster settings.
  2. Click on the General settings tab.
  3. Scroll down to the Model deployment options section.
  4. In the Default deployment strategy, select the desired cluster default:

    • Rolling update
    • Recreate
  5. Click Save changes at the bottom of the page.

Verification

  • Follow the instructions to deploy a new model as described in Deploying models on the model serving platform.
  • In the Advanced settings page of the deployment wizard, locate the Deployment strategy section.
  • The preselected deployment strategy should match the new default you configured.
Red Hat logoGithubredditYoutubeTwitter

詳細情報

試用、購入および販売

コミュニティー

会社概要

Red Hat は、企業がコアとなるデータセンターからネットワークエッジに至るまで、各種プラットフォームや環境全体で作業を簡素化できるように、強化されたソリューションを提供しています。

多様性を受け入れるオープンソースの強化

Red Hat では、コード、ドキュメント、Web プロパティーにおける配慮に欠ける用語の置き換えに取り組んでいます。このような変更は、段階的に実施される予定です。詳細情報: Red Hat ブログ.

Red Hat ドキュメントについて

Legal Notice

Theme

© 2026 Red Hat
トップに戻る