
Chapter 5. Inference serving the model in the disconnected environment


Use Red Hat AI Inference Server deployed in a disconnected OpenShift Container Platform environment to inference serve the language model from cluster persistent storage.

Prerequisites

  • You have installed a mirror registry on the bastion host that is accessible to the disconnected cluster.
  • You have added the model and Red Hat AI Inference Server images to the mirror registry. An example image copy command follows this list.
  • You have installed the Node Feature Discovery Operator and NVIDIA GPU Operator in the disconnected cluster.
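
  For example, you can copy the AI Inference Server image to the mirror registry with skopeo. The mirror registry host name, port, and destination tag in this sketch are example values only:

    # Copy all architectures of the AI Inference Server image into the mirror registry
    $ skopeo copy --all \
        docker://registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:137ac606b87679c90658985ef1fc9a26a97bb11f622b988fe5125f33e6f35d78 \
        docker://mirror.example.com:8443/rhaiis/vllm-cuda-rhel9:latest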

Procedure

  1. In the disconnected cluster, configure persistent storage using Network File System (NFS) and make the model available in the persistent storage that you configure.

    Note

    For more information, see Persistent storage using NFS.
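
    For example, the following PersistentVolume and PersistentVolumeClaim sketch assumes an NFS server at nfs.example.com that exports /exports/granite-31-w8a8 containing the downloaded model files. The claim name matches the claimName that the Deployment in the next step references:

    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: granite-31-w8a8-pv
    spec:
      capacity:
        storage: 10Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      nfs:
        # Example NFS server and export path; replace with your own values
        server: nfs.example.com
        path: /exports/granite-31-w8a8
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: granite-31-w8a8
      namespace: rhaiis-namespace
    spec:
      accessModes:
        - ReadWriteMany
      # An empty storage class binds the claim to the statically provisioned PV
      storageClassName: ""
      resources:
        requests:
          storage: 10Gi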

  2. Create a Deployment custom resource (CR). For example, the following Deployment CR uses AI Inference Server to serve a Granite model on a CUDA accelerator.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: granite
      namespace: rhaiis-namespace
      labels:
        app: granite
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: granite
      template:
        metadata:
          labels:
            app: granite
        spec:
          containers:
            - name: granite
              image: 'registry.redhat.io/rhaiis/vllm-cuda-rhel9@sha256:137ac606b87679c90658985ef1fc9a26a97bb11f622b988fe5125f33e6f35d78'
              imagePullPolicy: IfNotPresent
              command:
                - python
                - '-m'
                - vllm.entrypoints.openai.api_server
              args:
                - '--port=8000'
                - '--model=/mnt/models' 1
                - '--served-model-name=granite-3.1-2b-instruct-quantized.w8a8'
                - '--tensor-parallel-size=1'
              resources:
                limits:
                  cpu: '10'
                  nvidia.com/gpu: '1'
                requests:
                  cpu: '2'
                  memory: 6Gi
                  nvidia.com/gpu: '1'
              volumeMounts:
                - name: cache-volume
                  mountPath: /mnt/models
                - name: shm
                  mountPath: /dev/shm 2
          volumes:
            - name: cache-volume
              persistentVolumeClaim:
                claimName: granite-31-w8a8
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 2Gi
          restartPolicy: Always
    1 The model that you downloaded must be available at this mount path in the configured persistent volume.
    2 The /dev/shm volume mount is required by the NVIDIA Collective Communications Library (NCCL). Tensor parallel vLLM deployments fail when the /dev/shm volume mount is not set.
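
    After you create the Deployment, verify that the server pod starts and reaches the Running state. The manifest file name in this sketch is an example:

    $ oc apply -f granite-deployment.yaml
    $ oc get pods -n rhaiis-namespace -l app=granite -w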
  3. Create a Service CR for the model inference. For example:

    apiVersion: v1
    kind: Service
    metadata:
      name: granite
      namespace: rhaiis-namespace
    spec:
      selector:
        app: granite
      ports:
        - protocol: TCP
          port: 80
          targetPort: 8000
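
    The Service makes the model reachable inside the cluster through the standard service DNS name. For example, from another pod in the cluster you can list the served models:

    $ curl http://granite.rhaiis-namespace.svc.cluster.local/v1/models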
  4. Optional: Create a Route CR to enable access to the model from outside the cluster. For example:

    apiVersion: route.openshift.io/v1
    kind: Route
    metadata:
      name: granite
      namespace: rhaiis-namespace
    spec:
      to:
        kind: Service
        name: granite
      port:
        targetPort: 80
  5. Get the URL for the exposed route:

    $ oc get route granite -n rhaiis-namespace -o jsonpath='{.spec.host}'

    Example output

    granite-rhaiis-namespace.apps.example.com

  6. Query the model by running the following command:

    curl -X POST http://granite-rhaiis-namespace.apps.example.com/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "granite-3.1-2b-instruct-quantized.w8a8",
        "messages": [{"role": "user", "content": "What is AI?"}],
        "temperature": 0.1
      }'
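
    To extract only the generated text from the JSON response, you can pipe the output through jq, assuming jq is installed on the workstation:

    curl -s -X POST http://granite-rhaiis-namespace.apps.example.com/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "granite-3.1-2b-instruct-quantized.w8a8",
        "messages": [{"role": "user", "content": "What is AI?"}],
        "temperature": 0.1
      }' | jq -r '.choices[0].message.content'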