
Chapter 3. Inference serving modelcar images with AI Inference Server in OpenShift Container Platform


Deploy a language model packaged as a modelcar container image on OpenShift Container Platform by configuring a registry secret, persistent storage, and a Deployment custom resource (CR) that uses Red Hat AI Inference Server to serve the model for inference.

Prerequisites

  • You have installed the OpenShift CLI (oc).
  • You have logged in as a user with cluster-admin privileges.
  • You have installed the Node Feature Discovery (NFD) Operator and the GPU Operator required for your AI accelerator hardware.
  • You have created a modelcar container image for the language model and pushed it to a container image registry.

Procedure

  1. Create the Docker secret so that the cluster can pull the Red Hat AI Inference Server image and the modelcar image from their container registries. For example, to create a Secret CR that contains the contents of your local ~/.docker/config.json file, run the following command:

    oc create secret generic docker-secret --from-file=.dockercfg=$HOME/.docker/config.json --type=kubernetes.io/dockercfg -n rhaiis-namespace
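
    Optionally, confirm that the secret exists before continuing. This check uses the docker-secret name and rhaiis-namespace namespace from this procedure:

    $ oc get secret docker-secret -n rhaiis-namespace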
  2. Create a PersistentVolumeClaim (PVC) custom resource (CR) and apply it in the cluster. The following example PVC CR provisions a persistent volume from the IBM VPC Block storage class.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: model-cache
      namespace: rhaiis-namespace
    spec:
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 20Gi
      storageClassName: ibmc-vpc-block-10iops-tier
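
    To create the PVC, save the manifest to a file and apply it, then check the claim status. The pvc.yaml file name is an example; depending on the storage class volume binding mode, the claim might remain Pending until the deployment pod first mounts it:

    $ oc apply -f pvc.yaml
    $ oc get pvc model-cache -n rhaiis-namespace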
    Note

    Configuring cluster storage to meet your requirements is outside the scope of this procedure. For more detailed information, see Configuring persistent storage.

  3. Create a Deployment custom resource (CR) that pulls the modelcar image and deploys the Red Hat AI Inference Server container. Reference the following example Deployment CR, which uses AI Inference Server to serve a modelcar image.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: rhaiis-oci-deploy
      namespace: rhaiis-namespace
      labels:
        app: granite
    spec:
      replicas: 0
      selector:
        matchLabels:
          app: rhaiis-oci-deploy
      template:
        metadata:
          labels:
            app: rhaiis-oci-deploy
        spec:
          imagePullSecrets:
            - name: docker-secret
          volumes:
            - name: model-volume
              persistentVolumeClaim:
                claimName: model-cache # 1
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "2Gi"
            - name: oci-auth
              secret:
                secretName: docker-secret
                items:
                  - key: .dockercfg
                    path: config.json
          initContainers: # 2
            - name: fetch-model
              image: ghcr.io/oras-project/oras:v1.2.0
              command: ["/bin/sh","-c"]
              args:
                - |
                  set -e
                  # Only pull if /model is empty
                  if [ -z "$(ls -A /model)" ]; then
                    echo "Pulling model…"
                    # Update with the modelcar container image registry URL
                    # The mounted docker-secret provides registry credentials at /auth/config.json
                    oras pull <YOUR_MODELCAR_REGISTRY_URL> \
                      --registry-config /auth/config.json \
                      --output /model
                  else
                    echo "Model already present, skipping pull"
                  fi
              volumeMounts:
                - name: model-volume
                  mountPath: /model
                - name: oci-auth
                  mountPath: /auth
                  readOnly: true
          containers:
            - name: granite
              image: 'registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:a6645a8e8d7928dce59542c362caf11eca94bb1b427390e78f0f8a87912041cd'
              imagePullPolicy: IfNotPresent
              env:
                - name: VLLM_SERVER_DEV_MODE
                  value: '1'
              command:
                - python
                - '-m'
                - vllm.entrypoints.openai.api_server
              args:
                - '--port=8000'
                - '--model=/model'
                - '--served-model-name=ibm-granite/granite-3.1-2b-instruct' # 3
                - '--tensor-parallel-size=1'
              resources:
                limits:
                  cpu: '10'
                  nvidia.com/gpu: '1'
                  memory: 16Gi
                requests:
                  cpu: '2'
                  memory: 6Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - name: model-volume
                  mountPath: /model
                - name: shm
                  mountPath: /dev/shm # 4
          restartPolicy: Always
    1. spec.template.spec.volumes.persistentVolumeClaim.claimName must match the name of the PVC that you created.
    2. This example deployment uses a simple initContainers configuration that runs before the main app container to download the required modelcar image. The model pull step is skipped if the model directory has already been populated, for example, from a previous deployment.
    3. Update the value for --served-model-name to match the model that you are deploying.
    4. The /dev/shm volume mount is required by the NVIDIA Collective Communications Library (NCCL). Tensor parallel vLLM deployments fail when the /dev/shm volume mount is not set.
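
    To create the deployment, save the manifest to a file and apply it. The deployment.yaml file name is an example. Because the example Deployment CR sets replicas: 0, no pod starts until you scale up the deployment in the next step:

    $ oc apply -f deployment.yaml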
  4. Increase the deployment replica count to the required number. For example, run the following command:

    oc scale deployment rhaiis-oci-deploy -n rhaiis-namespace --replicas=1
  5. Optional: Watch the deployment and ensure that it succeeds:

    $ oc get deployment -n rhaiis-namespace --watch

    Example output

    NAME                READY   UP-TO-DATE   AVAILABLE   AGE
    rhaiis-oci-deploy   0/1     1            0           2s
    rhaiis-oci-deploy   1/1     1            1           14s
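
    If the deployment does not become available, inspect the pod and the logs of the fetch-model init container to confirm that the modelcar image was pulled into the PVC. Replace <pod_name> with the pod name returned by the first command:

    $ oc get pods -n rhaiis-namespace -l app=rhaiis-oci-deploy
    $ oc logs <pod_name> -c fetch-model -n rhaiis-namespace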

  6. Create a Service CR for the model inference. For example:

    apiVersion: v1
    kind: Service
    metadata:
      name: rhaiis-oci-deploy
      namespace: rhaiis-namespace
    spec:
      selector:
        app: rhaiis-oci-deploy
      ports:
        - name: http
          port: 80
          targetPort: 8000
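
    Save the Service manifest to a file and apply it, then confirm that the service has an endpoint for the running pod. The service.yaml file name is an example:

    $ oc apply -f service.yaml
    $ oc get endpoints rhaiis-oci-deploy -n rhaiis-namespace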
  7. Optional: Create a Route CR to enable public access to the model. For example:

    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: rhaiis-oci-deploy
      namespace: rhaiis-namespace
    spec:
      to:
        kind: Service
        name: rhaiis-oci-deploy
      port:
        targetPort: http
  8. Get the URL for the exposed route. Run the following command:

    $ oc get route rhaiis-oci-deploy -n rhaiis-namespace -o jsonpath='{.spec.host}'

    Example output

    rhaiis-oci-deploy-rhaiis-namespace.apps.example.com
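
    Optionally, store the route host in a shell variable so that you can reuse it when you query the model in the verification step, for example:

    $ HOST=$(oc get route rhaiis-oci-deploy -n rhaiis-namespace -o jsonpath='{.spec.host}')
    $ echo "$HOST"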

Verification

Ensure that the deployment is successful by querying the model. Run the following command:

curl -v http://rhaiis-oci-deploy-rhaiis-namespace.apps.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ibm-granite/granite-3.1-2b-instruct",
    "messages": [{"role": "user", "content": "Hello?"}],
    "temperature": 0.1
  }' | jq

Example output

{
  "id": "chatcmpl-07b177360eaa40a3b311c24a8e3c7f43",
  "object": "chat.completion",
  "created": 1755189746,
  "model": "ibm-granite/granite-3.1-2b-instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "reasoning_content": null,
        "content": "Hello! How can I assist you today?",
        "tool_calls": []
      },
      "logprobs": null,
      "finish_reason": "stop",
      "stop_reason": null
    }
  ],
  "usage": {
    "prompt_tokens": 61,
    "total_tokens": 71,
    "completion_tokens": 10,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}
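
Optionally, confirm the served model name by listing the models that the server exposes through its OpenAI-compatible API. The route host matches the earlier example output:

curl http://rhaiis-oci-deploy-rhaiis-namespace.apps.example.com/v1/models | jq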
